Versioning Code, Data and Models
Intuition
We learned how to version our code, but there are several other very important classes of artifacts that we need to track and version: config, data and models. It's important that we version everything so that we can reproduce the exact same application anytime. And we're going to do this by using a Git commit as a snapshot of the code, config and data used to produce a specific model. Here are the key elements we'll need to incorporate to make our application entirely reproducible:
- our repository should store pointers to large data and model artifacts living in blob storage.
- use commits to store snapshots of the code, config, data and model, and be able to update and roll back versions.
- expose configurations so we can see and compare parameters.

Application
There are many tools available for versioning our artifacts (GitLFS, Dolt, Pachyderm, etc.) but we'll be using the Data Version Control (DVC) library for its simplicity, rich features and, most importantly, modularity. DVC has lots of other useful features (metrics, experiments, etc.) so be sure to explore those as well.
We'll be using DVC to version our datasets and model weights and store them in a local directory which will act as our blob storage. We could use remote blob storage options such as S3, GCP, Google Drive, DAGsHub, etc. but we're going to replicate the same actions locally so we can develop and analyze everything and see exactly how the data is stored. We'll continue this local-first approach for other storage components as well, such as feature stores, just as we've been doing with our local model registry.
Set up
Let's start by installing DVC and initializing it to create a .dvc directory.
# Initialization
pip install dvc==2.43.1
dvc init
Be sure to add this package and version to our requirements.txt file.
Remote storage
After initializing DVC, we can establish where our remote storage will be. We'll be creating and using the stores/blob directory as our remote storage, but in a production setting this would be something like S3. We'll define our blob store in our config/config.py file:
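A minimal sketch of what this could look like (BASE_DIR and STORES_DIR are assumptions here; the exact constants in config/config.py may differ):
# config/config.py (sketch)
from pathlib import Path

# Directories
BASE_DIR = Path(__file__).parent.parent.absolute()
STORES_DIR = Path(BASE_DIR, "stores")          # assumed location of our local stores
BLOB_STORE = Path(STORES_DIR, "blob")          # local directory acting as blob storage
BLOB_STORE.mkdir(parents=True, exist_ok=True)  # created when we run the script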
We'll quickly run the config script so this storage is created:
python config/config.py
and we should see the blob storage:
stores/
├── blob
└── model
We need to notify DVC about this storage location so it knows where to save the data assets:
dvc remote add -d storage stores/blob
Setting 'storage' as a default remote.
Note
We can also use remote blob storage options such as S3, GCP, Google Drive, DAGsHub, etc. if we're collaborating with other developers. For example, here's how we would set up an S3 bucket to hold our versioned data:
# Create bucket: https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html
# Add credentials: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html
dvc remote add -d storage s3://<BUCKET_NAME>
dvc remote modify storage access_key_id ${AWS_ACCESS_KEY_ID}
dvc remote modify storage secret_access_key ${AWS_SECRET_ACCESS_KEY}
Add data
Now we're ready to add our data to our remote storage. This will automatically add the respective data assets to a .gitignore file (a new one will be created inside the data directory) and create pointer files which will point to where the data assets are actually stored (our remote storage). But first, we need to remove the data directory from our .gitignore file (otherwise DVC will throw a git-ignored error).
# Inside our .gitignore
logs/
stores/
# data/ # remove or comment this line
and now we're ready to add our data assets:
# Add artifacts
dvc add data/projects.csv
dvc add data/tags.csv
dvc add data/labeled_projects.csv
We should now see the automatically created data/.gitignore file:
# data/.gitignore
/projects.csv
/tags.csv
/labeled_projects.csv
and all the pointer files that were created for each data artifact we added:
data
├── .gitignore
├── labeled_projects.csv
├── labeled_projects.csv.dvc
├── projects.csv
├── projects.csv.dvc
├── tags.csv
└── tags.csv.dvc
Each pointer file contains the md5 hash, size and location (with respect to the data directory) of the data asset, and it's these pointer files that we'll check into our git repository.
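For example, a pointer file would look something like this (the md5 and size values below are illustrative placeholders):
# data/labeled_projects.csv.dvc (illustrative values)
outs:
- md5: 3e173e183b81085ff2d2dc3f137020ba
  size: 266632
  path: labeled_projects.csv
When we push, DVC stores each file's contents under a directory named after the first two characters of its md5 hash (e.g. 3e/173e183b81085ff2d2dc3f137020ba), which is the layout we'll see in our blob store below.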
Note
In terms of versioning our model artifacts, we aren't pushing anything to our blob storage because our model registry already takes care of all that. Instead, we expose the run ID, parameters and performance inside the config directory so we can easily view results and compare them with other local runs. For very large applications or in the case of multiple models in production, these artifacts would be stored in a metadata or evaluation store where they'll be indexed by model run IDs.
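A rough sketch of what exposing these artifacts could look like (the function and file names below are illustrative, not part of our existing scripts):
# Sketch: write a run's metadata to the config directory so it's versioned with git
import json
from pathlib import Path

def save_run_artifacts(run_id, params, performance, config_dir="config"):
    """Save the run ID, parameters and performance to the config directory."""
    config_dir = Path(config_dir)
    config_dir.mkdir(parents=True, exist_ok=True)
    (config_dir / "run_id.txt").write_text(run_id)
    (config_dir / "args.json").write_text(json.dumps(params, indent=2))
    (config_dir / "performance.json").write_text(json.dumps(performance, indent=2))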
Push
Now we're ready to push our artifacts to our blob store:
dvc push
3 files pushed
If we inspect our storage (stores/blob), we can see that the data is efficiently stored:
# Remote storage
stores
└── blob
├── 3e
│ └── 173e183b81085ff2d2dc3f137020ba
├── 72
│ └── 2d428f0e7add4b359d287ec15d54ec
...
Note
In case we forget to add or push our artifacts, we can add it as a pre-commit hook so it happens automatically when we try to commit. If there are no changes to our versioned files, nothing will happen.
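DVC provides hooks for the pre-commit framework; a sketch of wiring them into a .pre-commit-config.yaml file might look like this (the rev and options below are illustrative):
# .pre-commit-config.yaml (sketch)
repos:
- repo: https://github.com/iterative/dvc
  rev: 2.43.1
  hooks:
  - id: dvc-pre-commit              # DVC's pre-commit stage hook
    additional_dependencies: [".[all]"]
    stages: [commit]
  - id: dvc-pre-push                # pushes tracked artifacts before git pushes
    additional_dependencies: [".[all]"]
    stages: [push]
Alternatively, running dvc install sets up the equivalent git hooks directly in our repository.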
Pull
When someone else wants to pull our data assets, they can use the pull command to fetch files from our remote storage into their local directories. All they need to do is first ensure that they have the latest pointer files (via git pull) and then pull from the remote storage.
dvc pull
We can quickly test this by deleting our data files (the .csv files, not the .dvc pointers) and running dvc pull to load the files from our blob store.
Operations
When we pull data from source or compute features, should we save the data itself or just the operations?
- Version the data
- This is okay if (1) the data is manageable, (2) our team is small or at an early ML stage, or (3) changes to the data are infrequent.
- But what happens as data becomes larger and larger and we keep making copies of it?
- Version the operations
- We could keep snapshots of the data (separate from our projects) and, provided we have the operations and the timestamp, we can execute those operations on the snapshots to recreate the precise data artifact used for training (a sketch of this idea follows the list below). Many data systems use time-travel to achieve this efficiently.
- But eventually this also results in data storage bulk. What we need is an append-only data source where all changes are kept in a log instead of directly changing the data itself. Then we can use the data system with the logs to produce versions of the data as they were, without having to store separate snapshots of the data itself.
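As a toy sketch of what versioning the operations could look like (the snapshot path, column names and operations below are all hypothetical), we store a reference to a data snapshot plus the ordered list of operations and replay them to recreate the exact artifact:
# Sketch: recreate a training artifact from a snapshot + logged operations
# (the snapshot path, column names and operations are hypothetical)
import pandas as pd

def recreate_artifact(snapshot_path, operations):
    """Replay an ordered list of operations on a data snapshot."""
    df = pd.read_csv(snapshot_path)
    for operation in operations:
        df = operation(df)
    return df

# Operations logged alongside a specific model run
logged_operations = [
    lambda df: df.dropna(subset=["tag"]),            # hypothetical cleaning step
    lambda df: df[df["created_on"] < "2021-01-01"],  # hypothetical time filter
]
df = recreate_artifact("snapshots/projects_2020_12_31.csv", logged_operations)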