Executing Jupyter Notebook with Docker
In this post, I will show a step-by-step tutorial for running a Jupyter Notebook using Docker. The tutorial will cover everything from installing Docker to executing a Python notebook. The goal is to demonstrate how Docker can be a useful solution for running data science applications in different environments and operating systems.
Introduction
Python and data science
Python is one of the most popular programming languages in the world of data science. It is widely used by data scientists, engineers, and researchers due to its ease of use and a wide range of available libraries. The pandas library, for example, is one of the most important for data manipulation and exploratory analysis, while the scikit-learn library is used for machine learning and statistical modeling.
Jupyter Notebook
Programming in a Read-Eval-Print Loop (REPL) is an interactive programming technique that is widely used in data science. It allows users to write and execute code incrementally, enabling quick and easy evaluation of results. Tools like Jupyter Notebook allow REPL programming to be used with additional features such as visualization of plots and exporting files in different formats.
Virtual environments
Managing dependencies in a Python application can be complicated and can lead to conflicts between libraries. To solve this problem, it is common to use virtual environments (often created with the venv module), which allow Python packages to be installed and managed in an environment isolated from the system-wide installation. This avoids conflicts between package versions and makes results reproducible across different machines.
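As an illustration, a typical venv workflow looks like the following sketch (the package name at the end is just an example of a project dependency):

```shell
# Create an isolated environment in the .venv directory
python3 -m venv .venv

# Activate it (on Windows, use .venv\Scripts\activate instead)
source .venv/bin/activate

# pip now points inside .venv, so installed packages stay isolated
# from the system Python
pip install pandas
```

Deleting the .venv directory removes the environment and everything installed into it, leaving the system Python untouched.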
Docker
Docker is a container platform that allows running applications in different environments. It is widely used in data science for dependency management and executing applications on different machines and operating systems. Using Docker for running a Jupyter Notebook can simplify the setup process and ensure compatibility between different libraries and dependencies.
Julia and R programming languages
While Python is the most widely used language in data science, other languages like Julia and R are also commonly used. Fortunately, Jupyter Notebook is compatible with these languages and allows for executing code and visualizations within a single interactive environment. This enables users to choose the most suitable language for the specific problem without the need to switch environments or platforms.
Container
Image
An easy way to run a Jupyter Notebook with Docker is by using the jupyter/datascience-notebook Docker image, which comes pre-configured with Jupyter Notebook and the major data science libraries. This image can be downloaded directly from Docker Hub and run in a container that exposes the Jupyter server to be accessed through a web browser.
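As a sketch, downloading and starting this image looks like the following (8888 is the Jupyter server's default port):

```shell
# Download the image from Docker Hub
docker pull jupyter/datascience-notebook

# Start a container and publish the Jupyter server port
docker run -it --rm -p 8888:8888 jupyter/datascience-notebook
```

The server prints a URL with an access token to the terminal, which you open in a browser.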
Local files
One advantage of using Docker is the ability to mount a local directory inside the container, allowing access to local files and persistence of data generated by the notebook. This can be done using the “-v” option when running the container, specifying the local directory to be mounted and the directory inside the container where it should be mounted.
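For example, a run command that mounts a local notebooks directory into the container's working directory might look like this (the paths are illustrative):

```shell
# Mount ./notebooks from the host into the container's work directory,
# so notebooks edited in Jupyter are saved on the host
docker run -it --rm -p 8888:8888 \
  -v "$(pwd)/notebooks":/home/jovyan/work \
  jupyter/datascience-notebook
```

Files created under /home/jovyan/work inside the container then persist in ./notebooks after the container stops.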
Dependencies
To automate the dependency management process, you can create a new Docker image from the ‘jupyter/datascience-notebook’ image using a Dockerfile. In this file, you can copy a requirements file containing the necessary dependencies and install those dependencies using pip. After building this custom image, you can run it in a container that will already have all the required dependencies installed. This makes the configuration and library installation process much easier and automated.
Example
To illustrate the use of Docker for running a Jupyter Notebook, we will create a sample project that uses the pandas library for data analysis.
Directory structure
The directory structure of the project will be as follows:
example_project/
├── Dockerfile
├── notebooks/
│ └── example_notebook.ipynb
└── requirements.txt
The file example_notebook.ipynb contains the code for the Jupyter Notebook, and the file requirements.txt lists the dependencies required to run it.
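For this project, a minimal requirements.txt could contain just the pandas dependency (the version pin below is illustrative):

```
pandas==2.0.3
```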
Dockerfile
The Dockerfile for creating a custom image based on the ‘jupyter/datascience-notebook’ image would be as follows:
FROM jupyter/datascience-notebook
COPY requirements.txt /tmp/
RUN pip install --upgrade pip && \
    pip install --requirement /tmp/requirements.txt
WORKDIR /home/jovyan/work/
In this Dockerfile, we copy the requirements file to the /tmp/ directory and install the dependencies using pip. Then, we set the default working directory for the Jupyter Notebook to /home/jovyan/work/.
Image creation
To create a custom image from the Dockerfile, you use the docker build command. It should be executed in the directory where the Dockerfile is located, passing the build context that contains the files needed to build the image.
The command to create the custom image from the Dockerfile in our example would be as follows:
docker build --tag jupyter-example .
Container execution
After building the image, we can run it in a container using the following docker run command:
docker run -it --rm -p 8888:8888 -v $(pwd)/notebooks:/home/jovyan/work/ jupyter-example
Note
In the default configuration, the container runs as the user jovyan, which prevents executing commands with superuser privileges (the user's password would be required). To run such commands, you can start the container as root and set environment variables that create a user and grant passwordless sudo:
docker run -it --rm \
-p 8888:8888 \
--user root \
-e NB_USER="myuser" \
-e CHOWN_HOME=yes \
-e GRANT_SUDO=yes \
-w "/home/myuser" \
-v $(pwd)/notebooks:/home/myuser \
jupyter-example
This command runs a container from the custom image, maps port 8888 on the host to port 8888 (the Jupyter server) in the container, and mounts the local notebooks directory onto the default Jupyter Notebook working directory inside the container.
With this setup, you can access the notebook through your browser using the URL provided by Docker in the terminal and start working with data using the pandas library.
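Inside the notebook, a first cell could look like the following sketch (the cities and population figures are made-up sample data for illustration):

```python
import pandas as pd

# Build a small example DataFrame to explore
df = pd.DataFrame({
    "city": ["Lisbon", "Porto", "Braga"],
    "population": [545_923, 231_962, 193_333],
})

# Summary statistics of the numeric columns
print(df.describe())

# Rows ordered by population, largest first
print(df.sort_values("population", ascending=False))
```

Since the notebooks directory is mounted from the host, the notebook containing this code is saved on your machine even after the container stops.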
Concluding Remarks
In this tutorial, we have seen how to use Docker to simplify the management of dependencies and execution of a Jupyter Notebook in a data science project. By using a custom image based on the jupyter/datascience-notebook image, we can automate the process of installing libraries and configuring the working environment.
Furthermore, we have demonstrated how to create a custom image with installed dependencies and how to use Docker to run a container and edit local files directly in Jupyter Notebook.
I hope this hands-on tutorial has been helpful and can contribute to the productivity of those working with data science in Python. Both Jupyter Notebook and Docker are powerful tools for creating more efficient and accurate data analysis solutions.