Performance, cluster environments and reproducibility#

If you plan to run OGGM on more than a handful of glaciers, you might be interested in using all processors available to you, whether you are working on your laptop or on a cluster (see Parallel computations for how to do this).

For regional or global computations you will need to run OGGM in Cluster environments. Here we provide a couple of guidelines based on our own experience with operational runs.

In Reproducibility with OGGM, we discuss certain aspects of scientific reproducibility with OGGM, and how we try to ensure that our results are reproducible (that’s not easy!).

Parallel computations#

OGGM is designed to use the available resources as well as possible. For single-node machines with more than one processor (e.g. personal computers), OGGM ships with a multiprocessing approach which is fairly simple to use. For cluster environments with more than one machine, you can use MPI.

Multiprocessing#

Most OGGM computations are embarrassingly parallel: they are standalone operations to be realized on one single glacier entity and therefore independent from each other (they are called entity tasks, as opposed to the non-parallelizable global tasks).

When given a list of Glacier directories on which to apply a given task, workflow.execute_entity_task() will distribute the operations across the available processors using Python’s multiprocessing module. You can control this behavior with the use_multiprocessing config parameter, and the number of processors with mp_processes. By default, OGGM does not use multiprocessing:

In [1]: from oggm import cfg

In [2]: cfg.initialize()

In [3]: cfg.PARAMS['use_multiprocessing']  # whether to use multiprocessing
Out[3]: False

In [4]: cfg.PARAMS['mp_processes']  # number of processors to use
Out[4]: -1

-1 means that all available processors will be used.
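The switch between serial and parallel execution described above can be sketched with the standard library alone. This is a minimal illustration, not OGGM's implementation of workflow.execute_entity_task(): the names execute_tasks and toy_entity_task are hypothetical, and only the two parameter names mirror the OGGM configuration.

```python
import multiprocessing


def execute_tasks(task, items, use_multiprocessing=False, mp_processes=-1):
    """Apply `task` to each item, serially or with a process pool.

    Hypothetical helper for illustration only; this is not OGGM code.
    """
    if not use_multiprocessing:
        # Serial fallback: same results, one item after the other
        return [task(item) for item in items]
    # Follow the convention described above: -1 means "all processors".
    # multiprocessing.Pool uses os.cpu_count() when given None.
    processes = None if mp_processes == -1 else mp_processes
    with multiprocessing.Pool(processes) as pool:
        # map() returns results in the same order as `items`
        return pool.map(task, items)


def toy_entity_task(glacier_id):
    """Stand-in for an entity task: it depends on one glacier only."""
    return glacier_id, len(glacier_id)
```

Because each call depends on a single item, the serial and parallel paths return the same list. When calling the parallel path from a script, place the call under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.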

The following environment variables will override these settings (see e.g. this info page on managing environment variables):

OGGM_USE_MULTIPROCESSING can be set to 1/True or 0/False to override the param files at initialisation
OGGM_TEST_MULTIPROC is used to run the workflow tests with or without multiprocessing (default: False)
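As an illustration of how a 1/True or 0/False environment variable can override a configured default, here is a small sketch. The helper env_flag is hypothetical and is not how OGGM parses these variables internally; only the variable name comes from the list above.

```python
import os

TRUE_VALUES = {"1", "true"}
FALSE_VALUES = {"0", "false"}


def env_flag(name, default):
    """Return the boolean value of environment variable `name`.

    Hypothetical helper: falls back to `default` when the variable is
    unset, and raises ValueError for values other than 1/True/0/False.
    """
    raw = os.environ.get(name)
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    raise ValueError(f"{name} must be 1/True or 0/False, got {raw!r}")


# The configured default is used unless the variable is set
use_mp = env_flag("OGGM_USE_MULTIPROCESSING", default=False)
```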

MPI#

OGGM can be run in a cluster environment, using standard MPI features.

Note

In our own cluster deployment (see below), we chose not to use MPI, for simplicity. Therefore, our MPI support is currently untested: it should work, but let us know if you encounter any issue.

OGGM depends on mpi4py in that case, which can be installed either via conda:

conda install -c conda-forge mpi4py

or pip:

pip install mpi4py

mpi4py itself depends on a working MPI environment, which is usually supplied by the maintainers of your cluster. On conda, it comes with its own copy of mpich, which is nice and easy for quick testing, but may be undesirable for the performance of actual runs.

For an actual run, invoke any script using oggm via mpiexec, and pass the --mpi parameter to the script itself:

mpiexec -n 10 python ./run_rgi_region.py --mpi

Be aware that the first process (rank 0) is the manager process: it does not do any calculations itself and is only used to distribute tasks. The actual number of working processes is therefore one lower than the number passed to mpiexec or to your cluster’s scheduler.
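The arithmetic is simple but easy to forget when requesting resources. A minimal sketch, with the hypothetical helper mpi_worker_count, assuming that at least one worker is needed next to the manager:

```python
def mpi_worker_count(n_processes):
    """Number of processes doing calculations when rank 0 only distributes tasks."""
    if n_processes < 2:
        # Assumption: one manager alone cannot process any glacier
        raise ValueError("need at least 2 MPI processes: 1 manager + 1 worker")
    return n_processes - 1


# `mpiexec -n 10 ...` leaves 9 processes for the actual computations
print(mpi_worker_count(10))  # 9
```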

Cluster environments#

Here we describe some of the ways to use OGGM in a cluster environment. We provide examples of our own set-up, but your use case might vary depending on the cluster type you are working with, who is administrating the cluster, etc.

Installation#

The installation procedure explained in Installing OGGM should also work in cluster environments. If you don’t have admin rights, installing with conda in your $HOME is probably the easiest option. Once OGGM is installed, you can use your scripts (like the ones provided in the tutorials). But you probably want to check that the tests pass and read our Data storage section below first!

If you are lucky, your cluster might support singularity containers, in which case we highly recommend their usage.

Singularity and docker containers#

For those not familiar with this concept, containers can be seen as a lightweight, downloadable operating system which can run programs for you. They are highly configurable, and come in many flavors.

Important

Containers may be unfamiliar to some of you, but they are the best way to ensure traceable, reproducible results with any numerical model. We highly recommend their use.

The OGGM team (mostly Timo) provides, maintains and updates a Docker container that can be used by Singularity as well. You can list and download all OGGM containers here. Our most important repositories are:

untested_base is a container based on Ubuntu and shipping with all OGGM dependencies installed on it. OGGM is not guaranteed to run on these, but we use them for our tests on GitHub Actions.
base is built upon untested_base, but is pushed online only after the OGGM tests have run successfully on it. Therefore, it provides a more secure base for the model, although we cannot guarantee that past or future versions of the model will always work on it.
oggm is built upon base each time that a new change is made to the OGGM codebase. These containers have OGGM installed, and are guaranteed to run the OGGM version they ship with. We cannot guarantee that past or future versions of the model will always work on them.

To ensure reproducibility over time and across machines (and avoid dependency update problems), we recommend using base or oggm for your own purposes. Use base if you want to install your own OGGM version (don’t forget to test it afterwards!), and use oggm if you know which OGGM version you want.

As an example, here is how we run a given fixed version of OGGM on our own cluster. First, pull the image you want to run from GitHub to somewhere on your system:

$ singularity pull docker://ghcr.io/oggm/oggm:20211115

This will store the image in your current directory and needs to be done only once per image.

Important

Please do NOT pull from ghcr.io in scheduled scripts. This is highly inefficient since it downloads the same file over and over again, and ghcr.io might put a cap on downloads if we do that too often.

Then, in your script, do something similar to:

# All commands in the EOF block run inside of the container
singularity exec /path/to/oggm/image/oggm_20211115.sif bash -s <<EOF
  set -e
  # Setup a fake home dir inside of our workdir, so we don't clutter the
  # actual shared homedir with potentially incompatible stuff
  export HOME="$OGGM_WORKDIR/fake_home"
  mkdir -p "\$HOME"
  # Create a venv that _does_ use system-site-packages, since everything is
  # already installed on the container. We cannot work on the container
  # itself, as the base system is immutable.
  python3 -m venv --system-site-packages "$OGGM_WORKDIR/oggm_env"
  source "$OGGM_WORKDIR/oggm_env/bin/activate"
  # OPTIONAL: make sure latest pip is installed
  pip install --upgrade pip setuptools
  # OPTIONAL: install another OGGM version (here provided by its git commit hash)
  pip install "git+https://github.com/OGGM/oggm.git@ce22ceb77f3f6ffc865be65964b568835617db0d"
  # Finally, you can test OGGM with pytest.oggm, or run your script:
  YOUR_RUN_SCRIPT_HERE
EOF

Some explanations:

singularity exec uses Singularity to execute a series of commands in a singularity container, which here simply is taken from our Docker container base (singularity can run docker containers). Singularity is preferred over Docker in cluster environments, mostly for security and performance reasons. On our cluster, we use the SLURM manager to run a number of glaciers (an RGI region for example), and the script above is then run on a node. You can also use and run singularity with srun -n 1 -c X singularity exec ...: this might vary on your cluster.
we fix the container version we want to use to a certain tag. With this, we are guaranteed to always use the same software versions across runs.
then follow a number of commands to make sure we don’t mess around with the system settings. Here we use an $OGGM_WORKDIR variable which is probably not available in your case: it points to a directory you can write to, and where OGGM will work (for example, it might also be the directory you are working in with OGGM, cfg.PATHS['working_dir']). We suggest replacing this variable with what works for you.
the oggm docker images ship with an OGGM version guaranteed to work on this container. Sometimes, you may want to use another OGGM version, for example with newer developments on it. You might also add your own flavor or parameterization to OGGM into the environment. For this you can use pip and install the version you want. Here we show an example where we install a specific OGGM version, specified by its git hash (you can use a git tag as well). If you do that, you might want to run the tests once first to make sure that it works as expected. You can do that by replacing YOUR_RUN_SCRIPT_HERE with pytest.oggm --run-slow!
finally, the YOUR_RUN_SCRIPT_HERE is the actual command you want to run from this container! Most of the time, it will be a call to your python script.

We recommend keeping these scripts alongside your code and data, so that you can trace them later on.

Data storage#

‣ Input

OGGM needs a certain amount of data to run (see OGGM Shop). Regardless of whether you are using pre-processed directories or raw data, you will need to have access to them from your environment. The default in OGGM is to download the data and store it in a folder specified in the $HOME/.oggm_config file (see dl_cache_dir in First step: system settings for input data).

The structure of this folder follows the URLs from which the data are obtained. You can either let OGGM fill it up at run time by downloading the data (recommended if you do regional runs, i.e. you don’t need the entire data set), or pre-download everything using wget or equivalent. OGGM will use the data as long as the URL structure is OK.
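To make the idea of a folder that mirrors the download URLs concrete, here is a minimal sketch. The helper url_to_cache_path is hypothetical, the URL is a placeholder, and the exact layout (host name followed by the URL path) is an assumption for illustration; check your own cache folder for the layout OGGM actually uses.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def url_to_cache_path(url, cache_dir):
    """Map a download URL to a file path below `cache_dir` (assumed layout)."""
    parsed = urlparse(url)
    # Drop the leading "/" so the URL path is appended to the cache
    # directory instead of replacing it
    return PurePosixPath(cache_dir) / parsed.netloc / parsed.path.lstrip("/")


# Placeholder URL and cache directory, for illustration only
print(url_to_cache_path("https://example.org/data/file.tar", "/scratch/oggm_cache"))
# /scratch/oggm_cache/example.org/data/file.tar
```

With such a mapping, a pre-downloaded mirror created with wget can be reused as long as the directory tree matches the URL structure.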

System administrators can mark this folder as being “read only”, in which case OGGM will run only if the data is already there and exit with an error otherwise.

‣ Output

Warning

An OGGM run can write a significant amount of data. In particular, it writes a very large number of folders and files. This makes certain operations like copying or even deleting working directory folders quite slow.

There are two ways to reduce the amount of data (and the number of files) you have to deal with:

the easiest way is to simply delete the glacier directories after a run and keep only the aggregated statistics files generated with the compile_ tasks (see Input/Output). A typical workflow would be to start from pre-processed directories, do the run, aggregate the results, copy the aggregated files for long-term storage, and delete the working directory.
the method above does not allow you to go back to a single glacier for plotting or restarting a run, or to have a more detailed look at the glacier geometry evolution. If you want to do these things, you’ll need to store the glacier directories as well. In order to reduce the number of files you’ll have to deal with in this case, you can use the utils.gdir_to_tar() and utils.base_dir_to_tar() functions to create compressed, aggregated files of your directories. You can later initialize new directories from these tar files with the from_tar keyword argument in workflow.init_glacier_directories(). See our dedicated tutorials on the topic.
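The principle behind the second option, replacing many small files by one compressed archive, can be illustrated with Python's tarfile module. This is a generic sketch and not the implementation of utils.gdir_to_tar(): the helper dir_to_tar and the directory and file names below are made up for the example.

```python
import tarfile
import tempfile
from pathlib import Path


def dir_to_tar(directory):
    """Pack `directory` into `<directory>.tar.gz` and return the archive path."""
    directory = Path(directory)
    archive = directory.with_name(directory.name + ".tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps the paths in the archive relative to the directory name
        tar.add(directory, arcname=directory.name)
    return archive


# Demonstration in a temporary folder with a made-up directory content
with tempfile.TemporaryDirectory() as tmp:
    gdir = Path(tmp) / "RGI60-11.00897"
    gdir.mkdir()
    (gdir / "diagnostics.json").write_text("{}")
    archive = dir_to_tar(gdir)
    with tarfile.open(archive) as tar:
        archived_names = tar.getnames()
```

A file system then has to handle a single file per glacier (or per group of glaciers) instead of a whole directory tree, which makes copying and deleting much faster.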

Run per RGI region, not globally#

For performance and data handling reasons, we recommend running the model on single RGI regions (or smaller regional entities) independently. This is a good compromise between performance (parallelism) and output file size as well as other workflow considerations.

On our cluster, we use the following parallelization strategy: we use an array of jobs to submit as many jobs as RGI regions (or experiments, if you are running experiments on a single region for example), and each job is run on one node only. This way, we avoid using MPI and do not require communication between nodes, while still using our cluster at near 100%.
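A first step towards such a job array is to split a list of glaciers by region, so that each job receives one group. The sketch below uses a hypothetical helper, group_by_rgi_region, and assumes RGI version 6 identifiers of the form RGI60-11.00897, where the two digits after the dash are the region code.

```python
from collections import defaultdict


def group_by_rgi_region(rgi_ids):
    """Group RGI ids such as 'RGI60-11.00897' by their two-digit region code."""
    groups = defaultdict(list)
    for rgi_id in rgi_ids:
        # 'RGI60-11.00897' -> '11.00897' -> region '11'
        region = rgi_id.split("-")[1].split(".")[0]
        groups[region].append(rgi_id)
    return dict(groups)


ids = ["RGI60-11.00897", "RGI60-01.10689", "RGI60-11.00787"]
jobs = group_by_rgi_region(ids)
# One job per key: {'11': ['RGI60-11.00897', 'RGI60-11.00787'], '01': ['RGI60-01.10689']}
```

Each job can then run on a single node with multiprocessing, without any communication between nodes.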

Reproducibility with OGGM#

Reproducibility has become an important topic recently, and we scientists have to do our best to make sure that our research findings are “findable, accessible, interoperable, and reusable” (FAIR).

Within OGGM, we do our best to follow the FAIR principles.

Source code and version control#

The source code of OGGM is located on GitHub. The entire history of the codebase (and of the tests and documentation) is documented in the form of git commits.

When some development milestones are reached, we release a new version of the model using a so-called “tag” (version number). We try to follow our own semantic versioning convention for release numbers. We use MAJOR.MINOR.PATCH, with:

The PATCH version number increases when the changes to the codebase are small increments or harmless bug fixes, and when we are confident that the model output is not affected by these changes.
The MINOR version number increases when we add functionality or bug fixes which do not affect the model behavior in a significant way. However, it is possible that the model results are affected in some unpredictable ways that we estimated to be “small enough” to justify a minor release instead of a major one. Unlike the original convention, we cannot always guarantee backwards compatibility in the OGGM syntax yet, because it is too costly. We’ll try not to break things at each release, though.
The MAJOR version number increases when we significantly change the OGGM syntax and/or the model results, for example by relying on a new default parametrization.
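When comparing runs made with different releases, it can help to check which part of the version number changed. The following sketch is not part of OGGM: parse_release and bump_level are hypothetical helpers that read the leading MAJOR.MINOR.PATCH numbers and ignore any development suffix.

```python
import re

# Leading MAJOR.MINOR.PATCH; anything after it (e.g. '.dev7+g...') is ignored
VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)")


def parse_release(version):
    """Return (major, minor, patch) as integers."""
    match = VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {version!r}")
    return tuple(int(part) for part in match.groups())


def bump_level(old, new):
    """Return the first component that differs: 'major', 'minor', 'patch' or 'none'."""
    names = ("major", "minor", "patch")
    for name, a, b in zip(names, parse_release(old), parse_release(new)):
        if a != b:
            return name
    return "none"


print(bump_level("1.6.0", "1.6.1"))  # patch
```

Following the convention above, a 'patch' difference should leave model output unchanged, while a 'minor' or 'major' difference is a reason to re-check results.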

The current OGGM model version is:

In [5]: import oggm

In [6]: oggm.__version__
Out[6]: '1.6.4.dev7+g5959b9ed8.d20260701'

We document the changes we make to the model on GitHub, and in the Version history.

Dependencies#

OGGM relies on a large number of external python packages (dependencies). Many of them have complex dependencies themselves, often compiled binaries (for example rasterio, which relies on a C package: GDAL).

The complexity of this dependency tree, as well as the permanent updates of both OGGM and its dependencies, has led to several unfortunate situations in the past, which involved a lot of maintenance work for the OGGM developers that had little or nothing to do with the model itself.

Furthermore, while the vast majority of the dependency updates are without consequences, some might change the model results. As an example, updates in the interpolation routines of GDAL/rasterio can change the glacier topography in a non-traceable way for OGGM. This is an obstacle to reproducible science, and we should try to avoid these situations.

Important

The short answer is: use our docker/singularity containers for the most reproducible workflows. Refer to Singularity and docker containers for how to do that.
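Whether or not you use containers, recording the installed package versions next to your results makes later comparisons easier. Here is a minimal sketch using only the standard library; the helper snapshot_versions and the list of package names are illustrative choices, not an OGGM feature.

```python
import platform
from importlib import metadata


def snapshot_versions(packages):
    """Return a dict with the Python version and the installed package versions.

    Packages that are not installed are recorded as None.
    """
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Illustrative selection; extend it with the packages relevant to your runs
snapshot = snapshot_versions(["oggm", "rasterio", "numpy"])
```

The resulting dictionary can be written to a JSON file in the output folder of each run.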

Dependence on hardware and input data#

The OGGM model will always be dependent on the input data (topography, climate, outlines…). Be aware that while certain results are robust (like interannual variability of surface mass balance), other results are highly sensitive to small changes in the boundary conditions. Some examples include:

the ice thickness inversion at a specific location is highly sensitive to the local slope
the equilibrium volume of a glacier under a constant climate is highly sensitive to small changes in the ELA or the bed topography
more generally: growing large glaciers on longer periods are “more sensitive” to boundary conditions than shrinking small glaciers on shorter periods.

We haven’t really tested the dependency of OGGM on hardware, but we expect it to be low, as glaciers are not chaotic systems like the atmosphere.

Tools to monitor OGGM results#

We have developed a series of checks to monitor the changes in OGGM. They are not perfect, but we constantly seek to improve them:
