Jupyter Notebook Workflow

Jupyter notebooks are increasingly the hub of both Data Science and Machine Learning projects. All major vendors offer some form of Jupyter integration. Some tasks lean toward engineering and others toward science.

[Figure: Jupyter workflow]

A good example of a science-focused workflow is the traditional notebook-based Data Science workflow. First, data is collected; it could be anything from a SQL query to a CSV file hosted on GitHub. Next, the data is explored using visualization, statistics, and unsupervised machine learning. A model may be created, and then the results are shared via a conclusion.

[Figure: Jupyter Data Science workflow]
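
A compressed sketch of that workflow as it might appear in a notebook follows; the CSV URL and the choice of KMeans clustering are illustrative assumptions, not a prescription.

import pandas as pd
from sklearn.cluster import KMeans

# Collect: anything from a SQL query to a CSV file hosted on GitHub
df = pd.read_csv("https://example.com/data.csv")  # placeholder URL

# Explore: statistics and visualization
print(df.describe())
df.hist(figsize=(8, 6))

# Model: unsupervised machine learning, here a simple clustering
numeric = df.select_dtypes("number").dropna()
df.loc[numeric.index, "cluster"] = KMeans(n_clusters=3, n_init=10).fit_predict(numeric)

# Conclude: summarize the clusters for the write-up
print(df.groupby("cluster").mean(numeric_only=True))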

This often fits very well into a Markdown-based workflow where each section is a Markdown heading. Often that Jupyter notebook is then checked into source control. Is the notebook source code or a document? This is an important consideration, and it is best to treat it as both.

[Figure: Data Science workflow]

DevOps for Jupyter Notebooks

DevOps is a popular technology best practice, and it is often used in combination with Python. The center of the universe for DevOps is the build server, which enables automation. This automation includes linting, testing, reporting, building, and deploying code. The overall process is called continuous delivery.

[Figure: DevOps]

The benefits of continuous delivery are many. The code is automatically tested, and it is always in a deployable state. Automation of best practices creates a cycle of continuous improvement in a software project. If you are a data scientist, a question should crop up: isn't a Jupyter notebook source code too? Wouldn't it benefit from these same practices? The answer is yes.

The accompanying diagram shows a proposed best-practices directory structure for a Jupyter-based project in source control; a minimal sketch of the layout follows. The Makefile holds the recipes to build, run, and deploy the project via make commands: make test, and so on. The Dockerfile holds the actual runtime logic, which makes the project truly portable.
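
A minimal version of that layout might look like this sketch; the notebook filename is an assumption.

project/
├── Makefile          # recipes: make test, make install, ...
├── Dockerfile        # container runtime logic
├── requirements.txt  # project dependencies
└── notebook.ipynb    # the notebook under test

The Dockerfile from that layout: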

FROM python:3.7.3-stretch

# Working Directory
WORKDIR /app

# Copy source code to working directory
COPY . /app/

# Install packages from requirements.txt
# hadolint ignore=DL3013
RUN pip install --upgrade pip &&\
    pip install --trusted-host pypi.python.org -r requirements.txt

# Logic to run Jupyter could go here...
# Expose port 8888
#EXPOSE 8888

# Run Jupyter at container launch
#CMD ["jupyter", "notebook"]
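
A minimal Makefile sketch to drive the project could look like the following; the install and lint target names are assumptions, and the test recipe uses the nbval plugin described below.

install:
	pip install --upgrade pip &&\
		pip install -r requirements.txt

lint:
	hadolint Dockerfile

test:
	python -m pytest --nbval notebook.ipynb

all: install lint test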

[Figure: DevOps for Jupyter]

The Jupyter notebook itself can be tested via the nbval plugin as shown.

	python -m pytest --nbval notebook.ipynb
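
nbval works by re-executing each cell and comparing the fresh output against the output saved in the .ipynb file, so cells with deterministic output make the best test fixtures. A minimal example of such a cell:

# Notebook cell: nbval re-runs this code and checks that the new
# output matches the output stored in the .ipynb file.
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})
print(df["x"].sum())  # stored output: 6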

The requirements for the project are kept in a requirements.txt file. Every time the project is changed, the build server picks up the change and runs tests on the Jupyter notebook cells themselves.
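
A requirements.txt covering the tools used in this section might look like the following; the exact package set is an assumption, and in practice each line would be pinned to a specific version.

jupyter
pandas
scikit-learn
pytest
nbval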

DevOps isn't just for software projects. DevOps is a best practice that fits well with the ethos of Data Science. Why guess whether your notebook works, your data is reproducible, or your project can deploy, when automation can verify all three?

AWS SageMaker Elastic Architecture

[Figure: AWS SageMaker example architecture]

AWS SageMaker Reference Projects