DVC: Pipelines Made Reproducible

Make data workflows as simple as D-V-C.

Tags: r, version control, reproducibility, collaboration
Author: Gavin Masterson

Published: February 24, 2022

Photo: Gavin Masterson

The Journey of a Project

One of the projects that I am currently working on in my spare time has become large enough that it now spans five Git repositories. I worked on the project for approximately 5 months in late 2020 and early 2021 before starting a full-time position as a data scientist. Past-Gavin was in the groove and had the mental flow to accommodate all the moving parts of the project while also identifying the next important steps to take. Present-Day-Gavin is not so gifted. The code is clear, but the ‘path’ leading to and from each script or function in the project is less so. Not surprisingly, I feel slightly intimidated by the project’s complexity when I consider diving back into it again.

If you have managed any project containing code and data then you have almost certainly experienced this challenge.

To Travel is to Learn

“There is only one thing more painful than learning from experience and that is not learning from experience.”
— Archibald MacLeish

Past-Gavin had flow, but Present-Day-Gavin has the benefit of hindsight! The pain of struggling to get the project moving forward again led me on a journey into the world of pipeline management tools, e.g. targets (an R package), among many others. During this journey of exploration, a work project at Fathom Data required me to get familiar with dvc (Data Version Control).

In this blog post, I want to share how dvc can help us all to avoid the painful experience of coming back to an important project and feeling frustrated enough to quit immediately.

DVC

Briefly, dvc is a command line interface (CLI) tool that works like git to manage project objects, i.e. large data files, model outputs or any other file that we don’t want to manage using git. While the ‘data version control’ feature was the first application I used dvc for, it is the ability to quickly create a project pipeline that now seems the most useful to me.

Conveniently, a project pipeline that streamlines the onboarding of collaborators (including our future selves) is easy to demonstrate and to experience for yourself.

A Journey of Simplicity

In this blog post, I offer you two experiences:

1. Dive into my dvc-pipeline-demo repo on GitHub (here) and follow the instructions in the README to experience the simplicity of reproducing the exact workflow and outputs that I built the repo for.

2. Continue reading this post to learn how to create a project pipeline using dvc.

Collaborating Made Simple

When you clone my repo to your local machine, you are missing many of the dependencies and outputs of the pipeline. This is not a mistake on your part. I did not push these files to the remote repo. In some cases the files that our workflow depends on may be so large that tracking them with Git is not ideal. In other cases the raw data used by our workflow may be regularly updated, thereby rendering the previous iterations of our workflow invalid.

Getting Oriented

In the same way that we can use git status to check the current state of our git-tracked files, we can use dvc status to check the state of our repo in relation to the pipeline that is contained in dvc.yaml.

dvc status
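Based on the stages and files described in this post, the output should look something like this (the exact entries depend on the current state of your clone):

    prepare:
        changed outs:
            deleted:            data/penguins.csv
    analyse:
        changed deps:
            deleted:            data/penguins.csv
        changed outs:
            deleted:            data/outputs.RData
    report:
        changed deps:
            modified:           report_template.Rmd
            deleted:            data/outputs.RData
        changed outs:
            deleted:            report.html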

The stages of our pipeline are prepare, analyse, and report. For each stage we see the word changed followed by either “outs” (outputs) or “deps” (dependencies). Reading line by line, we see that each stage has some form of missing or modified dependencies/outputs.

Running the Pipeline

To understand how dvc status works, we need to look at the dvc.yaml file located in the root folder of the repo. In your cloned repo you can run:

cat dvc.yaml

The dvc.yaml file contains our complete pipeline. Each stage is populated with a cmd (command) to execute, as well as the deps (dependencies) which must be present before the stage can execute, and the outs (outputs) which the stage will generate when the cmd completes.
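Based on the stages and files discussed in this post, the file might be structured something like this (a sketch only; the analysis.R script name and the exact commands are my guesses, so check the actual file in the repo):

    stages:
      prepare:
        cmd: Rscript data_prep.R
        deps:
          - data_prep.R
        outs:
          - data/penguins.csv
      analyse:
        cmd: Rscript analysis.R
        deps:
          - analysis.R
          - data/penguins.csv
        outs:
          - data/outputs.RData
      report:
        cmd: Rscript -e 'rmarkdown::render("report_template.Rmd", output_file = "report.html")'
        deps:
          - report_template.Rmd
          - data/outputs.RData
        outs:
          - report.html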

If we spend a few moments looking for the deps and outs listed for each stage, we will find some of them but not others. For example, data/penguins.csv and data/outputs.RData are missing, but report_template.Rmd is present (although modified). This exactly matches the information that dvc status gave us.

In some cases, we might need to pull a data file from a remote storage location tracked by dvc. However, if we look closely at the prepare stage, we see that data/penguins.csv is an output of the data_prep.R script. This means that we just need to run the pipeline using:

dvc repro

Then you can sit back and watch the magic…

Each stage of the pipeline is executed in sequence.

At the end of the report stage, our report.html file is created and we are told that dvc.lock has been updated.
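The output follows a pattern roughly like this (the stage commands shown are the hypothetical ones from the dvc.yaml sketch above, and each stage also prints the output of its own script):

    Running stage 'prepare':
    > Rscript data_prep.R
    Updating lock file 'dvc.lock'

    Running stage 'analyse':
    > Rscript analysis.R
    Updating lock file 'dvc.lock'

    Running stage 'report':
    > Rscript -e 'rmarkdown::render("report_template.Rmd", output_file = "report.html")'
    Updating lock file 'dvc.lock'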

Note: The dvc.lock file can be tracked by git to make it easier for collaborators to know when they should rerun the dvc pipeline. After a git pull that updates dvc.lock, dvc status will tell us that our repo does not contain the latest versions of the dvc-tracked files.
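A collaborator’s catch-up routine might therefore look something like this (a sketch, assuming dvc.lock is tracked by git):

    git pull      # fetch the latest code, dvc.yaml and dvc.lock
    dvc status    # compare local files against the updated dvc.lock
    dvc repro     # rerun only the stages required to catch up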

But… How?

If it feels magical to be able to execute the entire project pipeline with a single dvc repro call, I completely understand. But how does it work?

In short, when we run the dvc repro command, the files tracked in the pipeline defined in dvc.yaml are compared against their state listed in the dvc.lock file.

The dvc.lock file contains an md5 hash and size information for the state of each file tracked in the dvc pipeline.

cat dvc.lock
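For the prepare stage, the contents look something like this (the hashes and sizes here are illustrative, not the real values):

    schema: '2.0'
    stages:
      prepare:
        cmd: Rscript data_prep.R
        deps:
        - path: data_prep.R
          md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d
          size: 1523
        outs:
        - path: data/penguins.csv
          md5: 9f8e7d6c5b4a39281706f5e4d3c2b1a0
          size: 13478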

If we make changes to any of these files and run dvc status, then the file states are compared against the dvc.lock information and the output reflects any changes detected.

A Tactical Operator

One important thing to note is that dvc will only run the stages that are affected by a change. A call to dvc repro skips any stage whose deps and outs are unchanged.

For example, if we delete data/outputs.RData then dvc status tells us that only the analyse and report stages have been affected.

When we run dvc repro, we see that prepare is skipped, that data/outputs.RData is retrieved (‘checked out’) from the dvc cache to complete the analyse stage, and that the report stage is also skipped, because the data/outputs.RData used to generate report.html has not changed.
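The messages printed by dvc repro in this scenario look roughly like this (the exact wording may vary between dvc versions):

    Stage 'prepare' didn't change, skipping
    Stage 'analyse' is cached - skipping run, checking out outputs
    Stage 'report' didn't change, skipping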

In this way, dvc is able to get your repo into working order with the minimum processor time required. That’s pretty cool, right?

Final Thoughts

dvc is a powerful CLI tool that can be used to version control large data files outside of our Git repo and to manage the pipeline that processes these files in our workflow. In this post, I have focussed on the latter use of dvc.

If you want to have some fun, try changing or deleting lines of code in any of the files tracked in dvc.yaml. Then run dvc status and use the information provided to guess what will happen when you run dvc repro. There is no better way to see dvc in action.