If you have spent time working with machine learning, one thing is clear: it’s an iterative process. Machine learning is about rapid experimentation, and each experiment consists of several parts:
- the data you use,
- hyperparameters,
- learning algorithm,
- architecture,
- and the optimal combination of all of those
Throughout this iterative process, your accuracy will vary from experiment to experiment, and without keeping track of your experiment history you won’t be able to learn much from it. Versioning lets you keep track of all of your experiments and their different components.
How to Version Control ML Projects?
Git is one of the most popular tools for version control, and it is really good at what it does. But in the ML domain, you can’t keep all the data produced by your experiments in Git. One workaround is to store the datasets on a cloud service like Amazon S3 and keep all the code in Git. That seems like a good choice, but given the iterative and experimental nature of ML projects, this approach creates confusion and leads to a mess in the long run.

Here comes Data Version Control (DVC): an open-source version control system for ML projects, which allows versioning data files and directories, intermediate results, and ML models using Git, but without storing the file contents in the Git repository.
It is hardly possible in real life to develop a good machine learning model in a single pass. ML modeling is an iterative process and it is extremely important to keep track of your steps, dependencies between the steps, dependencies between your code and data files and all code running arguments.
— Dmitry Petrov, Creator of DVC
What is DVC and Why Choose It?
Data Version Control, or DVC, is a type of experiment management software that has been built on top of the existing engineering toolset that you’re already used to, and particularly on a source code version control system (currently Git). DVC reduces the gap between existing tools and ML needs, allowing users to take advantage of experiment management software while reusing existing skills and intuition.
DVC Core Features:
- DVC works on top of Git repositories and has a similar command-line interface and Git workflow.
- It makes data science projects reproducible by creating lightweight pipelines using implicit dependency graphs.
- Large data file versioning works by creating pointers in your Git repository to the cache, typically stored on a local hard drive.
- DVC is programming-language agnostic (Python, R, Julia, shell scripts, etc.) as well as ML-library agnostic (Keras, TensorFlow, PyTorch, SciPy, etc.).
- It’s open-source and self-serve: DVC is free and doesn’t require any additional services.
- DVC supports cloud storage (Amazon S3, Azure Blob Storage, and Google Cloud Storage) for data sources and pre-trained model sharing.

How Does DVC Work?
Let’s see how we can version control our ML project using DVC.
Installation
DVC works on Linux, Windows, and macOS. On any of them, you can install DVC simply using pip or conda; other options are available too.
pip install dvc
Or
conda install -c conda-forge dvc
Workspace Initialization
Let’s start by creating a workspace we can version with Git, then run dvc init inside it to create the DVC project:
git init
dvc init
git commit -m "Initialize DVC project"
After DVC initialization, a new .dvc/ directory is created with the config and .gitignore files, as well as a cache/ directory.
.dvc/cache is one of the most important DVC directories: it holds the contents of all tracked data files. Note that .dvc/.gitignore lists this directory, which means the cache directory is not under Git control. This is a local cache, and you cannot git push it.
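As a rough sketch, the freshly initialized workspace looks like this (exact contents may vary between DVC versions):

```
.dvc/
├── .gitignore   # lists cache/ and other local-only files
├── config       # project configuration, e.g. remotes
└── cache/       # content-addressed storage for tracked data files
```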
Configure
Once you install DVC, you’ll be able to start using it (in its local setup) immediately. However, remote storage should be set up if you need to share data or models outside of the context of a personal project.
Adding Amazon S3 remote:
dvc remote add mynewremote s3://mybucket/myproject
DVC currently supports seven types of remotes:
- local: local directory
- s3: Amazon Simple Storage Service
- gs: Google Cloud Storage
- azure: Azure Blob Storage
- ssh: Secure Shell
- hdfs: Hadoop Distributed File System
- http: HTTP and HTTPS protocols
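For illustration, if the remote above were also marked as the project default (dvc remote add supports a -d flag for this), the .dvc/config file would contain something roughly like the following sketch:

```ini
[core]
remote = mynewremote
['remote "mynewremote"']
url = s3://mybucket/myproject
```

Since .dvc/config is a plain text file under Git control, the remote setup is shared with everyone who clones the repository.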
Getting Data
dvc get can use any DVC project hosted on a Git repository to find the appropriate remote storage and download data artifacts from it.
dvc get https://github.com/iterative/dataset-registry \
get-started/data.xml -o data/data.xml
Adding Data
To take a file (or a directory) under DVC control without checking the file contents into Git, just run dvc add on it.
dvc add data/data.xml
DVC stores information about the added data in a special file called a DVC-file. Committing DVC-files with Git allows us to track different versions of the project data as it evolves with the source code under Git control.
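A DVC-file is a small, human-readable YAML file. For data/data.xml it would look roughly like this (the checksums below are hypothetical placeholders):

```yaml
md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d
outs:
- md5: 9e8f7a6b5c4d3e2f1a0b9c8d7e6f5a4b
  path: data/data.xml
  cache: true
```

Git tracks only this lightweight file, while the actual data contents live in the DVC cache.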
Storing and Sharing Data
You can push your data files from your repository to the default remote storage, just as you would with Git, using the push command.
dvc push
Retrieving Data
You can retrieve data files into the workspace on your local machine, also just as with Git, using the pull command.
dvc pull
Connecting Code and Data
To achieve full reproducibility, we’ll have to connect code and configuration with the data it processes to produce the result.
Suppose we have a src/prepare.py script in our repo that splits our data into train and test sets. The following command transforms it into a reproducible stage of the ML pipeline (discussed in the next section) we’re building:
dvc run -f prepare.dvc \
-d src/prepare.py -d data/data.xml \
-o data/prepared \
python src/prepare.py data/data.xml
dvc run generates the prepare.dvc DVC-file, which holds information about the data/prepared output (a directory where two files, train.tsv and test.tsv, will be written) and about the Python command that is required to build it. You don’t need to run dvc add to place the output files (prepared/train.tsv and prepared/test.tsv) under DVC control; dvc run takes care of this.
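For illustration, the generated prepare.dvc would look roughly like the following sketch (the checksums are hypothetical placeholders):

```yaml
cmd: python src/prepare.py data/data.xml
deps:
- md5: 22a1a2931c8370d3aeedd7183606fd7f
  path: src/prepare.py
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d
  path: data/data.xml
outs:
- md5: 6836f797f3924fb46fcfd6b9f6aa6416.dir
  path: data/prepared
  cache: true
```

The deps and outs sections are what let DVC decide, later on, which stages need to be re-executed when something changes.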
Pipelines and Reproducing The Experiments
By using dvc run multiple times, and specifying outputs of one command (stage) as dependencies of another, we can describe a sequence of commands that gets to the desired result. This is what we call a data pipeline or dependency graph.
Let’s create a second stage (after prepare.dvc, created in the previous section) to perform feature extraction:
dvc run -f featurize.dvc \
-d src/featurization.py -d data/prepared \
-o data/features \
python src/featurization.py \
data/prepared data/features
And the third stage for training:
dvc run -f train.dvc \
-d src/train.py -d data/features \
-o model.pkl \
python src/train.py data/features model.pkl
Now that we have built our pipeline, we can visualize its stages:
dvc pipeline show --ascii train.dvc
+-------------------+
| data/data.xml.dvc |
+-------------------+
*
*
*
+-------------+
| prepare.dvc |
+-------------+
*
*
*
+---------------+
| featurize.dvc |
+---------------+
*
*
*
+-----------+
| train.dvc |
+-----------+
And the commands:
dvc pipeline show --ascii train.dvc --commands
+-------------------------------------+
| python src/prepare.py data/data.xml |
+-------------------------------------+
*
*
*
+---------------------------------------------------------+
| python src/featurization.py data/prepared data/features |
+---------------------------------------------------------+
*
*
*
+---------------------------------------------+
| python src/train.py data/features model.pkl |
+---------------------------------------------+
And even outputs:
dvc pipeline show --ascii train.dvc --outs
+---------------+
| data/data.xml |
+---------------+
*
*
*
+---------------+
| data/prepared |
+---------------+
*
*
*
+---------------+
| data/features |
+---------------+
*
*
*
+-----------+
| model.pkl |
+-----------+
It’s now extremely easy for you or your colleagues to reproduce the result end-to-end:
dvc repro train.dvc
Conclusion
In this post, we discussed the problem of version control for ML projects and introduced DVC, a version control system for ML built on top of Git, the version control system you already know and love.
Did you know that we use these and other AI technologies in our app? See what you’ve just read applied in action: try our Almeta News app. You can download it from Google Play or Apple’s App Store.