======================================================
OEP-0002: OGGM on the cloud with JupyterHub and Pangeo
======================================================
:Authors: Fabien Maussion
:Status: Partly implemented (proof of concept for hub.oggm.org)
:Created: 03.07.2019
Abstract
--------
We plan to **set-up and deploy the Open Global Glacier Model in a scalable
cloud or HPC environment, and make this computing environment available
to our users via a web browser**.
We envision an online platform where people can log-in and get access to a
fully functional computing environment where they can run the OGGM model.
This platform will allow exploratory experimenting and teaching, as well as
"serious work" and analyses. The resources will scale according to demand.
The user environment will be
personalized and persistent, so that our users can prepare some computations,
start them, log out, then log back in and still find the computing environment
he or she left earlier. The advantages for the users will be important:
scalable computing resources, no installation burden, no data download,
no version issues, a user-friendly development environment, all that in a web
browser.
Glossary
--------
There is quite a few tools and concepts scattered around this proposal, let's
list the most important ones:
- `Jupyter Notebooks`_ are an open-source web application that allows you to
create and share documents that contain live computer code, equations,
visualizations and narrative text.
- `JupyterLab`_ is the web development environment where the notebooks can be
run and edited.
- `JupyterHub`_ is the server which spawns JupyterLab in the containerized
environment. It also handles user authentication and many other things.
- `MyBinder`_ is a versatile server which automatises the process of
providing *any kind of environment* in a JupyterHub. MyBinder is a free to
use deployment of the open-source Binder tool - but anyone can deploy a
Binder server (e.g. Pangeo).
- `repo2docker`_ is used by MyBinder to specify the environments which need
to be "containerized". It relies on environment files (pip, conda, apt, etc.)
and is run as a command line tool.
- `docker`_ is the technology we use to create the environments where OGGM
can run. Think of it as "software capsules" that you can download an run.
- `kubernetes`_ is the tool which does the job of scaling in the background.
it needs to be installed on the cluster where JupyterHub is hosted, and
somehow it works: when JupyterHub users come in, the cluster grows.
Kubernetes is designed for cloud, I wonder if it's useful on HPC or not.
- `Helm`_ is the tool which makes it easier to install things on the kubernetes
cluster. It's not too relevant here.
- `Dask`_ is an ecosystem of tools providing parallism tools to python. It is
very cool, and sometimes a bit too fancy even.
- `Pangeo`_ is a a community of people who build scripts and tools that make
is possible to work with all the tools above and do real computations (i.e.
with big data and many processors).
.. _JupyterHub: https://jupyter.org/hub
.. _Pangeo: http://pangeo.io/
.. _JupyterLab: https://jupyterlab.readthedocs.io/en/stable/
.. _MyBinder: https://mybinder.org
.. _repo2docker: https://github.com/jupyter/repo2docker
.. _kubernetes: https://kubernetes.io
.. _Helm: https://helm.sh
.. _docker: https://www.docker.com/
.. _Dask: https://dask.org/
.. _Jupyter Notebooks: https://jupyter.org/
Motivation
----------
There is a general trend towards cloud computing in science, and we won't
repeat everything here. Let's discuss the main use cases here, from the
perspective of OGGM:
Big Data
This is the main motivation behind proprietary platforms like
`Google Earth Engine `_ or the
`Copernicus Data Store `_. Huge amounts
of data are available on some storage (cloud or HPC), and data providers
want the users to work on these data locally (via their browser) rather
than download them all. This is also the leitmotiv of the open
source project `Pangeo`_.
OGGM is not really a "big data problem". We do however rely on and provide
large amount of data, and we could imagine a cloud-workflow where all these
data exist on cloud or HPC and are available via JupyterHub.
Reproducible Science
Making results reproducible is not only sharing code, it also means sharing
the computing environment that was used to generate these results. This can
be achieved via containerization (we already do that), but containers are
still new and scary for non specialists (see "User friendlyness" below).
There is a growing demand from publishers and society for reproducible
computing workflows, and start-ups are flourishing on this new market
(e.g. `code ocean `_).
In OGGM we would like to use open-source tools, of course.
User friendliness and innovation
This is my main argument for web-based development workflows.
Traditional HPC environments require to learn certain skills which aren't in
the portfolio of many researchers: the linux command line, old-fashioned text
editors, job schedulers, environment variables... This slows-down
interactive, trial-and-error development processes and therefore innovation.
Scientists are used to the notebooks and interactive python environments,
where most of the development process takes place. If we had a way to combine
the friendliness of JupyterLab with the power of HPC/Cloud, it would be a
real win for both users and developers.
Environment control
This is closely related with "Reproducible Science", but from a model
developer and HPC provider perspective: with JupyterHub and containers, we
have full control on the environment that users are using. This makes
debugging and trouble shooting much easier, since we can exclude the
usual "flawed set-up" from the user side.
Collaborative Workflows
The self-documenting notebooks are easy to share across users, encouraging
team work. Another (rather vague at this point) goal would be to allow
"multiple users" environments where people can develop scripts and notebooks
collaboratively, but this is not top priority right now.
Status (03 Jul 2019)
--------------------
- We have two sorts of docker images for OGGM:
(1) `OGGM-Docker `_, which is a standard
docker image generated from scratch using ``pip install``. These images
are ligweight and well tested, we use
them on HPC with `Singularity `_. (2)
`repo2docker images `_,
which we generate for MyBinder and JupyterHub (both need a certain user
set-up which is best generated from repo2docker). They are not lightweight
at all (because of conda).
- We have a 15k$ grant from Google (about 13k€) for cloud resources, to be used
until June 2020. We plan to use this one year period as test phase to see if
this endeavor is worth pursuing.
- We use MyBinder for `OGGM-Edu`_'s educational material and tutorials. It works
very well. The only drawbacks are performance and the temporary nature of
MyBinder environments. This is a real problem for multi-day workshops or
classes.
- Thanks to the zero2jupyterhub instructions and with some trial-and-error,
we now have our own JupyterHub server running on google cloud:
`hub.oggm.org`_ which is a vanilla zero2jupyterhub setup with our own
images created with repo2docker. The Pangeo organisation folks can log-in
as well if you want to play with it.
- I've learned that all these things take time. Scattered around several weeks,
I still estimate to at least 10 full days of work invested from my side
(lower estimate). I learned a lot but I need some help if we want to have
this go further.
**We have a proof of concept**, which allows new users to try the
OGGM model without installation burden.
**It is not enough to do heavy work**. For more advanced use cases we need
dask, pangeo, and we need to put our data on cloud as well, or we need to
set-up Jupyterhub on HPC (see roadmap).
.. _OGGM-Edu: https://edu.oggm.org
.. _hub.oggm.org: https://hub.oggm.org
Big-picture roadmap
-------------------
Assuming that we want to achieve this goal (a running instance of OGGM
in a JupyterHub server for research applications), we can follow two main
strategies:
1. **Continue on cloud**. If we do so, we need pangeo and dask, and we need to
re-engineer parts of OGGM to work with dask multiproc and with cloud
buckets for the input data.
2. **Continue on HPC**, once we have access to the big computer in Bremen. The
tools in the background would be slightly different, but for the users it
would be exact same: "I log in, I request resources, I work".
The two strategies have many similarities, and are worth discussing.
Since we have no HPC yet (and received 15K from google), I'd like to follow-up
on the cloud idea a little more.
Detailed roadmap
----------------
**Scaling**. This is relatively independent of cloud or HPC and should be done
anyway.
- **refactor the multiprocessing workflow of OGGM to use dask**. Once OGGM can
run in the dask ecosystem, we will have access to all the nice tools that
come with it, such as the task scheduler, the jupyterlab extension, and
(most importantly) dask.distributed for automated scaling on both HPC and
cloud/kubernetes.
- **build our docker images from pangeo-base instead**. This will come with
dask pre-installed and allow a typical `pip install` workflow, i.e. we
can build upon our dockerfiles.
- **make hub.oggm.org point to these new images**
**Data management and I/O**. This is the hardest part and the one which
will be most different whether we use cloud or HPC resources.
- **Input on cloud**: we need to put the input data on a read-only bucket. In a
first step, we will make only pre-processed directories available. Ideally,
OGGM will be able to start from and extract from bucket without downloading
the data locally, i.e. the buckets look like a mounted disk and OGGM can
read from them. The performance aspect is going to be interesting.
- **Output on cloud**: probably the biggest issue on cloud, not easy to solve.
Disk space is quite expensive and users can easily generate huge amounts
of data with OGGM (we are not really optimizing for data volume currently).
I.e. we would have to provide tools to reduce the output data amount, force
the users to store their data elsewhere, etc. All that is not really
attractive currently.
- **Input/output on HPC**: I imagine something not so different from what we
have on HPC already.
**User environment**. Some things which are nice to have.
- make it possible to install OGGM via pip in JupyterHub. This is already
possible but only temporarily - i.e. install is lost at next login.
It would be great so that people can use their own development versions to
do runs.
- make a better splash screen for hub.oggm.org (see how pangeo is doing it or
use the pangeo one)
- documentation: use cases, examples, etc.