DVC: Pipelines Made Reproducible
The Journey of a Project
One of the projects that I am currently working on in my spare time has become large enough that it now spans five Git repositories. I worked on the project for approximately 5 months in late 2020 and early 2021 before starting a full-time position as a data scientist. Past-Gavin was in the groove and had the mental flow to accommodate all the moving parts of the project while also identifying the next important steps to take. Present-Day-Gavin is not so gifted. The code is clear, but the ‘path’ leading to and from each script or function in the project is less so. Not surprisingly, I feel slightly intimidated by the project’s complexity when I consider diving back into it again.
If you have managed any project containing code and data then you have almost certainly experienced this challenge.
To Travel is to Learn
“There is only one thing more painful than learning from experience and that is not learning from experience.”
Past-Gavin had flow, but Present-Day-Gavin has the benefit of hindsight!!
The pain of struggling to get the project moving forward again led me on a journey into the world of pipeline management tools, e.g., targets (an R package) and many more.
In this journey of exploration, a work project at Fathom Data required me to get familiar with dvc (Data Version Control).
In this blog post, I want to share how dvc can help us all to avoid the painful experience of coming back to an important project and feeling frustrated enough to quit immediately.
dvc is a command line interface (CLI) tool that works like git to manage project objects, i.e., large data files, model outputs or any other file that we don’t want to manage using git.
While the ‘data version control’ feature was the first application I used dvc for, it is the ability to quickly create a project pipeline with dvc that seems the most useful to me now.
Conveniently, the topic of a project pipeline which streamlines the onboarding of collaborators (including our future selves) is trivially easy to demonstrate/experience for oneself.
A Journey of Simplicity
In this blog post, I offer you two experiences:
- Dive into my dvc-pipeline-demo repo on GitHub and follow the instructions in the README to experience the simplicity of reproducing the exact workflow and outputs that I built the repo for.
- Continue reading this post to learn how to create a project pipeline using dvc.
Collaborating Made Simple
When you clone my repo to your local machine, you are missing many of the dependencies and outputs of the pipeline. This is not a mistake on your part. I did not push these files to the remote repo. In some cases the files that our workflow depends on may be so large that tracking them with Git is not ideal. In other cases the raw data used by our workflow may be regularly updated, thereby rendering the previous iterations of our workflow invalid.
In the same way that we can use git status to check the current state of our git-tracked files, we can use dvc status to check the state of our repo in relation to the pipeline that is contained in dvc.yaml.
The stages of our pipeline are prepare, analyse, and report. In the output of dvc status, each stage shows the word changed followed by either “outs” (outputs) or “deps” (dependencies). Reading line by line, we see that each stage has issues with some form of missing or modified dependencies/outputs.
Running the Pipeline
To understand how dvc status works, we need to look at the dvc.yaml file located in the root folder of the repo.
In your cloned repo you can open this file to see its contents.
The dvc.yaml file contains our complete pipeline.
Each stage is populated with a cmd (command) to execute, as well as deps which the stage requires to be present before it can execute and outs which the stage will generate upon completion of the cmd.
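To make the structure concrete, here is a rough sketch of what a dvc.yaml with these three stages could look like. It is not the file from the demo repo: the script names (prepare.R, analyse.R, report.R) are hypothetical, and the deps and outs are only inferred from the files mentioned in this post. The snippet writes the sketch to a temporary folder so that it cannot overwrite a real dvc.yaml.

```shell
# Write an illustrative dvc.yaml to a scratch folder (not the real repo file).
demo_dir="$(mktemp -d)"

cat > "$demo_dir/dvc.yaml" <<'EOF'
stages:
  prepare:
    cmd: Rscript prepare.R        # hypothetical script name
    deps:
      - prepare.R
    outs:
      - data/penguins.csv
  analyse:
    cmd: Rscript analyse.R        # hypothetical script name
    deps:
      - analyse.R
      - data/penguins.csv
    outs:
      - data/outputs.RData
  report:
    cmd: Rscript report.R         # hypothetical script name
    deps:
      - report.R
      - report_template.Rmd
      - data/outputs.RData
    outs:
      - report.html
EOF

# Print the sketch so we can read it.
cat "$demo_dir/dvc.yaml"
```

Notice how the stages are chained together: analyse lists data/penguins.csv as a dep, and that file is an out of prepare, so dvc knows that prepare has to run first.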
If we spend a few moments looking for the deps and outs listed for each stage, we will find some of them but not others. Outputs such as data/outputs.RData are missing, but report_template.Rmd is present/modified.
This exactly matches the information that dvc status gave us.
In some cases, we might need to pull a data file from a remote storage location tracked by dvc. However, if we look closely at the prepare stage, we see that data/penguins.csv is an output of that stage, so the pipeline generates it for us. This means that we just need to run the pipeline using dvc repro.
Then you can sit back and watch the magic…
Each stage of the pipeline is executed in sequence.
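In the terminal, the whole sequence might look something like the sketch below. It assumes that dvc is installed and that you are in the root folder of the cloned repo; the guard at the top is my own addition so that the snippet does nothing harmful when run elsewhere.

```shell
# Only run the pipeline if dvc is available and a dvc.yaml is present.
if command -v dvc >/dev/null 2>&1 && [ -f dvc.yaml ]; then
  dvc status   # list the stages with changed deps or outs
  dvc repro    # run the stages of the pipeline in dependency order
  dvc status   # check the state of the pipeline again after the run
  pipeline_ran="yes"
else
  echo "dvc or dvc.yaml not found; run this from the root of the cloned repo."
  pipeline_ran="no"
fi
```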
At the end of the report stage, our report.html file is created and we are told that dvc.lock has been updated.
dvc.lock can be tracked by git to make it easier for collaborators to know when they should rerun the pipeline. A git pull of the updated dvc.lock file will tell us via dvc status that our repo does not contain the latest versions of the tracked files.
If it feels magical to be able to execute the entire project pipeline with a single dvc repro call, I completely understand. But how does it work?
In short, when we run the dvc repro command, the files tracked in the pipeline defined in dvc.yaml are compared against their state listed in the dvc.lock file. The dvc.lock file contains an md5 hash and size information for the state of each file tracked in the pipeline.
If we make changes to any of these files and run dvc status, then the file states are compared against the dvc.lock information and the output reflects any changes detected.
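The idea behind this comparison can be sketched with plain shell tools. This is only an illustration of hash-based change detection, not how dvc is implemented internally: the file and its contents are made up, and md5sum is assumed to be available (it ships with GNU coreutils; macOS uses a differently named md5 tool).

```shell
# Create a toy data file in a scratch folder.
work_dir="$(mktemp -d)"
printf 'a,b\n1,2\n' > "$work_dir/toy.csv"

# Record the "locked" state: md5 hash and size in bytes.
locked_md5="$(md5sum "$work_dir/toy.csv" | cut -d ' ' -f 1)"
locked_size="$(wc -c < "$work_dir/toy.csv" | tr -d ' ')"

# Change the file, as if we had edited the data.
printf '3,4\n' >> "$work_dir/toy.csv"

# Recompute the current state and compare it with the locked state.
current_md5="$(md5sum "$work_dir/toy.csv" | cut -d ' ' -f 1)"
current_size="$(wc -c < "$work_dir/toy.csv" | tr -d ' ')"

if [ "$current_md5" = "$locked_md5" ]; then
  file_state="unchanged"
else
  file_state="changed"
fi
echo "toy.csv is $file_state (size $locked_size -> $current_size bytes)"
```

Because the hash depends on the file contents, any edit to the file produces a different hash, which is enough to flag the file as changed.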
A Tactical Operator
One important thing to note is that dvc will only run stages that have been affected by the change. If there are stages that are unaffected, a call of dvc repro will skip the stages with unchanged deps and outs.
For example, if we delete data/outputs.RData, dvc status tells us that only the analyse and report stages have been affected. When we run dvc repro, we see that prepare is skipped, data/outputs.RData is retrieved (‘checked out’) from the dvc cache to complete the analyse stage, and the report stage is then also skipped because there was no change in data/outputs.RData, which was used to generate report.html.
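If you would like to try this experiment yourself, the steps might look like the sketch below. It assumes that dvc is installed, that you are in the root folder of the cloned repo, and that the pipeline has already been run once so that the output is in the dvc cache. The guard is my own addition so that nothing is deleted outside a dvc project.

```shell
# Only run the experiment inside a folder that contains a dvc.yaml.
if command -v dvc >/dev/null 2>&1 && [ -f dvc.yaml ]; then
  rm -f data/outputs.RData   # remove one of the pipeline outputs
  dvc status                 # see which stages are reported as changed
  dvc repro                  # rerun only what is needed to restore the output
  experiment_ran="yes"
else
  echo "dvc or dvc.yaml not found; run this from the root of the cloned repo."
  experiment_ran="no"
fi
```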
In this way, dvc is able to get your repo into working order with the minimum processor time required.
That’s pretty cool, right?
dvc is a powerful CLI tool that can be used to version control large data files outside of our Git repo and manage the pipeline that processes these files in our workflow.
In this post, I have focussed on the latter implementation of dvc.
If you want to have some fun, try changing / deleting lines of code from any of the files tracked in dvc.yaml. Then run dvc status and use the information provided to guess what will happen when you run dvc repro.
There is no better way to see dvc in action.