Create a Docker image to run notebooks #300

Open
ckadner opened this issue Feb 16, 2022 · 0 comments
Labels
dependencies Pull requests that update a dependency file help wanted Extra attention is needed

Comments


ckadner commented Feb 16, 2022

Currently we use tensorflow/tensorflow:latest (or tensorflow/tensorflow:2.7.0 and tensorflow/tensorflow:2.3.0) to run our notebooks inside a Kubeflow Pipeline.

However, that image is very large and carries many dependencies, many of which are not required for the respective notebooks. Because of all those preinstalled apt packages, binaries, and Python packages, installing the required notebook dependencies on top of the image frequently fails. At the moment only 4 of our 8 sample notebooks can run.

We need to find a basic Python Docker image and install only the necessary requirements like papermill.
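As a rough sketch of what such an image could look like, the helper below prints a Dockerfile that starts from a small Python base and bakes in only the notebook runner. The function name `generate_dockerfile`, the default `python:3.9-slim` tag, the `/work` directory, and the addition of `ipykernel` (so that a Python 3 kernel is available to papermill) are assumptions for illustration, not something this issue has settled on.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print a Dockerfile for a minimal notebook-runner image.
# $1 is the base image (assumed default: python:3.9-slim).
generate_dockerfile() {
  local base_image="${1:-python:3.9-slim}"
  cat <<EOF
FROM ${base_image}

# Install the notebook runner once, at image build time.
# ipykernel is an assumption: papermill needs a kernel to execute notebooks.
RUN python3 -m pip install --no-cache-dir --upgrade pip \\
 && python3 -m pip install --no-cache-dir papermill ipykernel

WORKDIR /work
EOF
}

# Usage (requires Docker; the image tag is a made-up example):
#   generate_dockerfile | docker build -t notebook-runner:py39 -
```

Printing the Dockerfile instead of writing it to disk keeps the sketch easy to inspect and lets the base image be swapped, for example `generate_dockerfile python:3.10-slim`.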

To show (some of) the steps required to run a notebook on Kubernetes, take a look at this script from the katalog repo, which runs notebooks outside of a cluster:

https://github.com/machine-learning-exchange/katalog/blob/7fcd5ce/tools/bash/run_notebooks.sh#L58-L65

  # TODO: find a smaller Docker image
  IMAGE="tensorflow/tensorflow:latest"

  docker run -i --rm  --entrypoint "" "${IMAGE}" bash -c "
    # download the notebook
    wget -q -O notebook_in.ipynb '${NOTEBOOK_URL}' 2> /dev/null || curl -s -o notebook_in.ipynb '${NOTEBOOK_URL}'

    # update pip
    python3 -m pip install pip --upgrade --quiet --progress-bar=ascii

    # install Elyra requirements, may not all be required beyond papermill
    python3 -m pip install -r https://raw.githubusercontent.com/elyra-ai/elyra/master/etc/generic/requirements-elyra.txt --quiet --progress-bar on

    # if the notebook has requirements, install those
    [[ -n '${REQUIREMENTS}' ]] && python3 -m pip install ${REQUIREMENTS} --quiet --progress-bar=on

    # show the installed packages
    python3 -m pip list

    # run the notebook with papermill
    papermill --log-level CRITICAL --report-mode notebook_in.ipynb notebook_out.ipynb
  " >> "${LOG_FILE}" 2>&1  && echo OK || echo FAILED
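With a prebuilt runner image, the in-container script could shrink to roughly the following. This is only a sketch: `notebook_script` is a hypothetical helper, it assumes papermill is already installed in the image, and it downloads the notebook with the Python standard library because slim base images may ship neither wget nor curl. It also assumes the notebook URL contains no single quotes.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the script to run inside a prebuilt runner image.
# $1 = notebook URL, $2 = optional space-separated pip requirements.
notebook_script() {
  local notebook_url="$1"
  local requirements="${2:-}"

  # Stop at the first failing step so the caller sees a non-zero exit code.
  printf '%s\n' "set -e"

  # Download the notebook without relying on wget or curl.
  printf '%s\n' "python3 -c \"import urllib.request as u; u.urlretrieve('${notebook_url}', 'notebook_in.ipynb')\""

  # Only notebook-specific requirements remain to be installed at run time.
  if [[ -n "${requirements}" ]]; then
    printf '%s\n' "python3 -m pip install --quiet ${requirements}"
  fi

  # Same papermill invocation as in the script above.
  printf '%s\n' "papermill --log-level CRITICAL --report-mode notebook_in.ipynb notebook_out.ipynb"
}

# Usage (requires Docker; IMAGE would name the prebuilt runner image):
#   docker run -i --rm --entrypoint "" "${IMAGE}" \
#     bash -c "$(notebook_script "${NOTEBOOK_URL}" "${REQUIREMENTS}")"
```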

Some Considerations:

  • If we use a generic Docker image like python:3.9, then the pip install steps for the elyra-ai requirements have to be repeated every time a notebook is run.
  • If we create a custom notebook image, or maybe several, most of the pip install steps are done when the Docker image is built, which speeds up the actual notebook execution.
    • Although the Docker image will be bigger this way, once it has been pulled onto the cluster, it should get cached.
    • The same is not true for previously downloaded Python packages inside the container running the notebook.
    • And generally the increased time for downloading a bigger Docker image is a fraction of the increased time required to download pip packages and the time pip needs on top of that to resolve potential version conflicts.
    • We could use several specialized images for notebooks that have similar dependencies:
      • ART+AIF360
      • CodeNet
      • Quantum/Qiskit
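
One way to picture the specialized images is a small table from image variant to the packages baked into it. The sketch below prints (but does not run) one `docker build` command per variant. The variant tags, the repository name, the PyPI package names, and a Dockerfile that declares `ARG EXTRA_PACKAGES` are all assumptions for illustration; the CodeNet package list is left empty because it is not known here.

```shell
#!/usr/bin/env bash

# Hypothetical mapping from image variant to the pip packages baked into it.
variant_packages() {
  case "$1" in
    art-aif360) echo "adversarial-robustness-toolbox aif360" ;;  # assumed PyPI names
    qiskit)     echo "qiskit" ;;
    codenet)    echo "" ;;  # unknown here; take from the CodeNet notebooks
    *)          return 1 ;;
  esac
}

# Print one docker build command per variant without executing anything.
# Assumes a Dockerfile in the current directory that declares
# ARG EXTRA_PACKAGES and pip-installs it on top of papermill.
print_build_commands() {
  local repo="${1:-notebook-runner}"  # hypothetical image repository name
  local variant
  for variant in art-aif360 codenet qiskit; do
    printf 'docker build --build-arg EXTRA_PACKAGES="%s" -t %s:%s .\n' \
      "$(variant_packages "${variant}")" "${repo}" "${variant}"
  done
}

# Usage: review the commands first, then pipe them to a shell to build.
#   print_build_commands my-registry/notebook-runner
```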

Additional Information:

Also see this notebook runner component with sample pipeline in KFP:

@ckadner ckadner added dependencies Pull requests that update a dependency file help wanted Extra attention is needed labels Feb 16, 2022