
Binder for Reproducible Research

This final lesson is concerned with the topic of reproducibility. Nearly everyone agrees that reproducibility is an important principle for science: if results are not reproducible, they are not valid.

But how do we achieve reproducibility in practice? In computational and data science, a particular analysis, calculation, or notebook may depend on hundreds of different software packages, each available in many different versions. Reproducing a result depends on having the correct version of each one.

In the lesson on Python environments, we learned how to use conda to manage the packages in our environment. In this lesson, we will learn how to package our own code and notebooks together with an environment so that anyone, anywhere, can execute them using cloud computing.

From the Binder Documentation

Binder allows you to create custom computing environments that can be shared and used by many remote users. It is powered by BinderHub, which is an open-source tool that deploys the Binder service in the cloud. One-such deployment lives ... at mybinder.org and is free to use.

BinderHub uses a combination of open source technologies, including JupyterHub and Docker (a containerization service), to achieve this magic.

In addition to the http://mybinder.org deployment, the Pangeo project operates a BinderHub service at http://binder.pangeo.io. This BinderHub is customized to allow users to also launch Dask clusters in the cloud.

An Example Binder

All binders start with a GitHub repository. As an example, let's consider the official dask examples repo: https://github.com/dask/dask-examples

Contents

The repository should contain the following two elements:

  • Python code and / or notebooks (these can live in sub-directories)
  • An environment.yml or requirements.txt file to specify the package dependencies (can be at the top level or in a subdirectory called binder/)

The dask-examples repo contains about 10 different example notebooks.

To specify the environment, the dask-examples repo has the following file at binder/environment.yml:

channels:
  - conda-forge
dependencies:
  - python=3
  - bokeh=0.13
  - dask=0.20
  - dask-ml=0.10.0
  - distributed=1.24
  - jupyterlab=0.35.1
  - nodejs=8.9
  - numpy
  - pandas
  - pyarrow==0.10.0
  - scikit-learn=0.20
  - matplotlib
  - nbserverproxy
  - nomkl
  - h5py
  - xarray
  - bottleneck
  - py-xgboost
  - pip:
    - graphviz
    - dask_xgboost
    - seaborn
    - mimesis

This is a rather complex set of dependencies. In addition, there are other files in the binder/ directory that help further customize the environment. These customizations are described in the Binder Documentation.
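Several entries in the file above (numpy, pandas, xarray, and others) carry no version constraint, so the versions actually installed can change each time the environment is rebuilt. The following is a minimal sketch, not part of the dask-examples repo, that flags such entries. The helper name `split_spec` is hypothetical, and the parsing only handles the simple `name`, `name=version`, and `name==version` forms that appear in this file.

```python
def split_spec(spec):
    """Split a spec such as 'dask=0.20' into (name, version or None).

    Assumption: only plain 'name', 'name=version' and 'name==version'
    specs are handled; range constraints like '>=' are out of scope.
    """
    name, _, version = spec.strip().partition("=")
    version = version.lstrip("=")  # tolerate the '==' form
    return name, (version or None)


# A few of the conda dependencies listed in binder/environment.yml
specs = ["python=3", "dask=0.20", "numpy", "pandas", "pyarrow==0.10.0", "xarray"]

unpinned = [name for name, version in map(split_spec, specs) if version is None]
print(unpinned)  # ['numpy', 'pandas', 'xarray']
```

Pinning every package is not always desirable, but knowing which ones float makes it easier to explain why a binder that worked last year behaves differently today.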

Once the repository is ready, it's time to generate a link to the BinderHub. These links have the following structure:

https://mybinder.org/v2/<provider>/<org>/<repo>/<ref>?filepath=<path/to/notebook>

For the dask-examples repo, the link used is:

https://mybinder.org/v2/gh/dask/dask-examples/master?urlpath=lab

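In this case, the provider is gh (i.e. GitHub), the organization is dask, the repository is dask-examples, and the branch is master. Instead of pointing at a single file, this link uses the urlpath=lab query parameter, which opens the repository in JupyterLab.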
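The pieces of a launch link can be assembled programmatically. The function below is an illustrative sketch, assuming the link layout described above; `build_binder_url` and its parameter names are hypothetical and are not part of any Binder library.

```python
from urllib.parse import urlencode


def build_binder_url(provider, org, repo, ref, filepath=None, urlpath=None,
                     base="https://mybinder.org"):
    """Assemble a Binder launch link from its parts (hypothetical helper)."""
    url = f"{base}/v2/{provider}/{org}/{repo}/{ref}"
    query = {}
    if filepath is not None:
        query["filepath"] = filepath  # open a specific notebook
    if urlpath is not None:
        query["urlpath"] = urlpath    # e.g. "lab" for JupyterLab
    if query:
        url += "?" + urlencode(query)
    return url


print(build_binder_url("gh", "dask", "dask-examples", "master", urlpath="lab"))
# https://mybinder.org/v2/gh/dask/dask-examples/master?urlpath=lab
```

Passing a different `base` would target another BinderHub deployment, such as the Pangeo one mentioned earlier.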

The https://mybinder.org/ website has a nifty tool to automatically generate badges that can be placed on a website or markdown file to make it easy to launch the binder. For dask-examples, the markdown code looks like this:

[![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/dask/dask-examples/master?urlpath=lab)

It renders as a clickable badge labeled "launch binder".
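The badge markdown follows a fixed pattern: an image link to the badge, wrapped in a link to the launch URL. As a small sketch (the function name `binder_badge` is made up for illustration), the same string can be generated for any launch link:

```python
def binder_badge(launch_url, badge_url="https://mybinder.org/badge.svg"):
    """Return markdown for a Binder badge that links to launch_url."""
    return f"[![Binder]({badge_url})]({launch_url})"


print(binder_badge(
    "https://mybinder.org/v2/gh/dask/dask-examples/master?urlpath=lab"
))
```

For dask-examples this reproduces the markdown shown above; for your own repo, substitute your own launch link.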

Take some time to launch the binder and play around with it.

Creating your Own Binder

To create your own binder, you will need to have the same two ingredients:

  • Some code to share
  • An appropriate environment

The best way to develop a binder is probably to start on your personal computer, where you can test the environment before sharing it.

First, create or clone a GitHub repo with the following file and directory structure:

binder/environment.yml
some-notebook.ipynb
Readme.md

Note: you should use your final_project repository for this. Modify the environment.yml to specify the packages you think you will be using, and then create and activate the environment. (See the lesson on python environments to review how to do this.)
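If the repository does not have these files yet, the layout can be created with a few lines of Python. This is only a sketch under stated assumptions: `scaffold_binder_repo` is a hypothetical helper, the package list in the environment file is a placeholder to edit, and the notebook is an empty nbformat 4 document.

```python
import json
import tempfile
from pathlib import Path

# Placeholder environment; replace with the packages you actually need.
ENVIRONMENT_YML = """\
channels:
  - conda-forge
dependencies:
  - python=3
  - numpy
  - pandas
"""

# Minimal empty notebook in the nbformat 4 JSON layout.
EMPTY_NOTEBOOK = {"cells": [], "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


def scaffold_binder_repo(root):
    """Create the binder file layout under root without overwriting files."""
    root = Path(root)
    (root / "binder").mkdir(parents=True, exist_ok=True)
    files = {
        root / "binder" / "environment.yml": ENVIRONMENT_YML,
        root / "some-notebook.ipynb": json.dumps(EMPTY_NOTEBOOK, indent=1) + "\n",
        root / "Readme.md": "# My binder\n",
    }
    for path, text in files.items():
        if not path.exists():  # keep any existing work untouched
            path.write_text(text)
    return [path.relative_to(root).as_posix() for path in files]


# Demonstrated in a temporary directory; point it at your own repo instead.
with tempfile.TemporaryDirectory() as tmp:
    created = scaffold_binder_repo(tmp)
    print(created)
```

Because existing files are never overwritten, running this inside a repository that already has a Readme.md or a notebook leaves them as they are.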

Then add some basic content to the notebook. For example, try importing the packages you might want to use:

import numpy as np
import pandas as pd
import xarray as xr
import cartopy.crs as ccrs
from matplotlib import pyplot as plt
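A failed import stops the notebook at the first missing package, so it can take several rounds to find every gap in the environment. As an optional sketch (the helper `check_imports` is hypothetical), the cell below tries each module in turn and reports all of the missing ones at once:

```python
import importlib


def check_imports(module_names):
    """Try to import each module; return {name: True if it imported}."""
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = True
        except ImportError:
            results[name] = False
    return results


wanted = ["numpy", "pandas", "xarray", "cartopy.crs", "matplotlib.pyplot"]
for name, ok in check_imports(wanted).items():
    print(f"{name}: {'ok' if ok else 'MISSING'}")
```

Any module reported as MISSING points to a package that still needs to be added to binder/environment.yml.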

Once things are working locally, push the repo to GitHub. Then use https://mybinder.org to generate a binder badge and add it to your Readme.md file.

You should now be able to run your binder on mybinder.org!

Updating your Binder

An important thing to remember is that you cannot save changes from within a running binder. A running binder shuts down automatically after 10 minutes of inactivity, and any modifications you made to the notebooks are lost. (See the mybinder FAQ for more details.) This is very different from our research computing JupyterHub, where all changes are saved. Binder is meant for demonstrating and sharing finished projects, not for developing new ones.

To update your binder, go back to your personal copy of the repo, make your changes, commit them, and push to GitHub.