A computational workflow records the automated steps to transform data and code into research outputs.

Workflows help document and aid the reproduction (e.g. running) of the processes which are used for the following:

  • data collection
  • preparation, analysis and modelling
  • simulation of data (Gobel et al (2020))

For example, a tool or a model can be considered a computational workflow, as they can support the data analysis.

Workflow

Defining and creating a workflow can be divided into three stages.

Stage 1

Identify and describe the process and steps needed to characterise the relationship between the data inputs and outputs.

Stage 2

Collate the data, scripts and codes that are need to perform the tasks. Determine the sequence in which the scripts and codes are to be executed.

Stage 3

Clearly define the different steps that need to be executed, so that the workflow can be easily implemented with tests to check that it is working properly. Tests will help to identify bugs and issues that could arise later on as the code is modified. Because the workflow is code, it can be held under version control, and released.

There are several possible options to support and facilitate the work:

  1. The simplest, but still useful, approach of documenting a workflow is to write down the sequence of steps required to reproduce the output in a text file or a readme, and share this with your data and code.

  2. If the analysis is defined using the Python programming language, a Jupyter Notebook (which allows for inserting comments along the script or code) can be used to describe the steps, execute functions and plot some of the results. This approach is attractive as the code is self-documenting - to reproduce the analysis, you run each cell in the notebook.

  3. For generic computational workflows, Snakemake is a workflow management systems tool which can handle different programs. It executes the required steps (“rules”) through a Snakefile until the final output is obtained. A Snakemake workflow is scalable to server, cluster, grid and cloud environment without needing to modify the Snakefile. An advantage of this more advanced tool is that steps can be run in parallel, and Snakemake determines which steps need to be rerun when a source file is updated.

Reproducible and FAIR

When developing computational workflows your work become more reusable and reproducible for others and this in turn can help increasing collaboration. Well defined workflows contribute to the FAIR data as they provide traceability of the results and data used to produce an output. If the workflow is complex then also time and energy can be saved by creating a workflow.

Example of collaborative workflow

The PyPSA-Eur: An Open Optimisation Model of the European Transmission System is a good example of how to work on collaborative workflows. The workflow is written in Python and executed using snakemake. Each release is deposited on Zenodo and ascribed a DOI. The documentation is available on ReadTheDocs and there is also guidance that helps users identify if the available data is suitable for their needs, together with links to find more information. Furthermore, is each release documented in a academic short paper and archived as a pre-print in arxiv.

Useful resources

The Turing Way handbook for reproducible, ethical and collaborative data science has a chapter covering workflows Make which can be an additional useful resource.

Camilo Ramirez has written a description of a real-life scientific workflow in his post Developing a scientific workflow, the case of OnStove.


This material is derived from the CCG review of good enough practices, released under a CC-BY 4.0 license.