Skip to content

Versioning Code, Data and Models


Versioning code, data and models to ensure reproducible behavior in ML systems.
Goku Mohandas
Goku Mohandas
· · ·
Repository

📬  Receive new lessons straight to your inbox (once a month) and join 30K+ developers in learning how to responsibly deliver value with ML.

Intuition

We learned how to version our code but there are several other very important class of artifacts that we need track and version: config, data and models. It's important that we version everything so that we can reproduce the exact same application anytime. And we're going to do this by using a Git commit as a snapshot of the code, config, data used to produce a specific model. Here are the key elements we'll need to incorporate to make our application entirely reproducible:

  • repository should store pointers to large data and model artifacts living in blob storage.
  • use commits to store snapshots of the code, config, data and model and be able to update and rollback versions.
  • expose configurations so we can see and compare parameters.
data versioning

Application

There are many tools available for versioning our artifacts (GitLFS, Dolt, Pachyderm, etc.) but we'll be using the Data Version Control (DVC) library for it's simplicity, rich features and most importantly modularity. DVC has lots of other useful features (metrics, experiments, etc.) so be sure to explore those as well.

We'll be using DVC to version our datasets and model weights and store them in a local directory which will act as our blob storage. We could use remote blob storage options such as S3, GCP, Google Drive, DAGsHub, etc. but we're going to replicate the same actions locally so we can see how the data is stored.

We'll be using a local directory to act as our blob storage so we can develop and analyze everything locally. We'll continue to do this for other storage components as well such as feature stores and like we have been doing with our local model registry.

Set up

Let's start by installing DVC and initializing it to create a .dvc directory.

# Initialization
pip install dvc==2.10.2
dvc init

Be sure to add this package and version to our requirements.txt file.

Remote storage

After initializing DVC, we can establish where our remote storage will be. We'll be creating and using the stores/blob directory as our remote storage but in a production setting this would be something like S3. We'll define our blob store in our config/config.py file:

1
2
3
# Inside config/config.py
BLOB_STORE = Path(STORES_DIR, "blob")
BLOB_STORE.mkdir(parents=True, exist_ok=True)

We'll quickly run the config script so this storage is created:

python config/config.py

and we should see the blob storage:

stores/
├── blob
└── model

We need to notify DVC about this storage location so it knows where to save the data assets:

dvc remote add -d storage stores/blob
Setting 'storage' as a default remote.

Note

We can also use remote blob storage options such as S3, GCP, Google Drive, DAGsHub, etc. if we're collaborating with other developers. For example, here's how we would set up an S3 bucket to hold our versioned data:

# Create bucket: https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html
# Add credentials: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html
dvc remote modify storage access_key_id ${AWS_ACCESS_KEY_ID}
dvc remote modify storage secret_access_key ${AWS_SECRET_ACCESS_KEY}
dvc remote add -d storage s3://<BUCKET_NAME>

Add data

Now we're ready to add our data to our remote storage. This will automatically add the respective data assets to a .gitignore file (a new one will be created inside the data directory) and create pointer files which will point to where the data assets are actually stores (our remote storage). But first, we need to remove the data directory from our .gitignore file (otherwise DVC will throw a git-ignored error).

# Inside our .gitignore
logs/
stores/
# data/  # remove or comment this line

and now we're ready to add our data assets:

# Add artifacts
dvc add data/projects.csv
dvc add data/tags.csv
dvc add data/labeled_projects.csv

We should now see the automatically created data/.gitignore file:

# data/.gitignore
/projects.csv
/tags.csv
/labeled_projects.csv

and all the pointer files that were created for each data artifact we added:

data
├── .gitignore
├── labeled_projects.csv
├── labeled_projects.csv.dvc
├── projects.csv
├── projects.csv.dvc
├── tags.csv
└── tags.csv.dvc

Each pointer file will contain the md5 hash, size and the location (with respect to the data directory) which we'll be checking into our git repository.

1
2
3
4
5
# data/projects.csv.dvc
outs:
- md5: b103754da50e2e3e969894aa94a490ee
  size: 266992
  path: projects.csv

Note

In terms of versioning our model artifacts, we aren't pushing anything to our blob storage because our model registry already takes care of all that. Instead we expose the run ID, parameters and performance inside the config directory so we can easily view results and compare them with other local runs. For very large applications or in the case of multiple models in production, these artifacts would be stored in a metadata or evaluation store where they'll be indexed by model run IDs.

Push

Now we're ready to push our artifacts to our blob store:

dvc push

3 files pushed

If we inspect our storage (stores/blob), we'll can see that the data is efficiently stored:

# Remote storage
stores
└── blob
    ├── 3e
    │   └── 173e183b81085ff2d2dc3f137020ba
    ├── 72
    │   └── 2d428f0e7add4b359d287ec15d54ec
    ...

Note

In case we forget to add or push our artifacts, we can add it as a pre-commit hook so it happens automatically when we try to commit. If there are no changes to our versioned files, nothing will happen.

1
2
3
4
5
6
7
# Makefile
.PHONY: dvc
dvc:
    dvc add data/projects.csv
    dvc add data/tags.csv
    dvc add data/labeled_projects.csv
    dvc push
1
2
3
4
5
6
7
8
9
# .pre-commit-config.yaml
- repo: local
  hooks:
    - id: dvc
      name: dvc
      entry: make
      args: ["dvc"]
      language: system
      pass_filenames: false

Pull

When someone else wants to pull our data assets, we can use the pull command to fetch from our remote storage to our local directories. All we need is to first ensure that we have the latest pointer files (via git pull) and then pull from the remote storage.

dvc pull

We can quickly test this by deleting our data files (the .json files not the .dvc pointers) and run dvc pull to load the files from our blob store.

Operations

When we pull data from source or compute features, should they save the data itself or just the operations?

  • Version the data
    • This is okay if (1) the data is manageable, (2) if our team is small/early stage ML or (3) if changes to the data are infrequent.
    • But what happens as data becomes larger and larger and we keep making copies of it.
  • Version the operations
    • We could keep snapshots of the data (separate from our projects) and provided the operations and timestamp, we can execute operations on those snapshots of the data to recreate the precise data artifact used for training. Many data systems use time-travel to achieve this efficiently.
    • But eventually this also results in data storage bulk. What we need is an append-only data source where all changes are kept in a log instead of directly changing the data itself. So we can use the data system with the logs to produce versions of the data as they were without having to store separate snapshots of the the data itself.

To cite this content, please use:

1
2
3
4
5
6
@article{madewithml,
    author       = {Goku Mohandas},
    title        = { Versioning - Made With ML },
    howpublished = {\url{https://madewithml.com/}},
    year         = {2022}
}