Modeling Baselines
Intuition
Baselines are simple benchmarks which pave the way for iterative development:
- Rapid experimentation via hyperparameter tuning thanks to low model complexity.
- Discovery of data issues, false assumptions, bugs in code, etc. since the model itself is not complex.
- Pareto's principle: we can achieve decent performance with minimal initial effort.
Process
Here is the high level approach to establishing baselines:
- Start with the simplest possible baseline to compare subsequent development with. This is often a random (chance) model.
- Develop a rule-based approach (when possible) using IFTTT, auxiliary data, etc.
- Slowly add complexity by addressing limitations and motivating representations and model architectures.
- Weigh tradeoffs (performance, latency, size, etc.) between performant baselines.
- Revisit and iterate on baselines as your dataset grows.
Tradeoffs to consider
When choosing what model architecture(s) to proceed with, what are important tradeoffs to consider? And how can we prioritize them?
Prioritization of these tradeoffs depends on your context.
- performance: consider both coarse-grained and fine-grained (ex. per-class) performance.
- latency: how quickly does your model respond for inference?
- size: how large is your model and can you support its storage?
- compute: how much will it cost ($, carbon footprint, etc.) to train your model?
- interpretability: does your model need to explain its predictions?
- bias checks: does your model pass key bias checks?
- time to develop: how long do you have to develop the first version?
- time to retrain: how long does it take to retrain your model? This is very important to consider if you need to retrain often.
- maintenance overhead: who and what will be required to maintain your model versions? The real work with ML begins after deploying v1, and you can't just hand the model off to your site reliability team to maintain it like many teams do with traditional software.
Iterate on the data
We can also iterate on the data with a fixed model. Instead of using a fixed dataset and iterating on the models, we can choose a good baseline and iterate on the dataset:
- remove or fix data samples (false positives & negatives)
- prepare and transform features
- expand or consolidate classes
- incorporate auxiliary datasets
- identify unique slices to boost
Distributed training
All the training we need to do for our application happens on one worker with one accelerator (CPU/GPU). However, we'll want to consider distributed training for very large models or when dealing with large datasets. Distributed training can involve:
- data parallelism: workers receive different slices of the larger dataset.
    - synchronous training uses AllReduce to aggregate gradients and update all the workers' weights at the end of each batch.
    - asynchronous training uses a universal parameter server to update weights as each worker trains on its slice of data.
- model parallelism: all workers use the same dataset but the model is split amongst them (more difficult to implement compared to data parallelism because it's difficult to isolate and combine signal from backpropagation).
There are lots of options for applying distributed training, such as PyTorch's distributed package, Ray, Horovod, etc.
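As an illustration (not part of this application's pipeline), a minimal sketch of synchronous data parallelism with PyTorch's DistributedDataParallel might look like the following; the model and dataset here are placeholders, and the script assumes it is launched with torchrun:

```python
# Launch with: torchrun --nproc_per_node=2 train_ddp.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group(backend="gloo")  # use "nccl" for GPUs

    # Placeholder dataset; each worker receives a different slice via the sampler
    dataset = TensorDataset(torch.randn(1024, 10), torch.randint(0, 4, (1024,)))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    model = DDP(torch.nn.Linear(10, 4))  # gradients are AllReduced across workers
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    criterion = torch.nn.CrossEntropyLoss()

    for epoch in range(5):
        sampler.set_epoch(epoch)  # reshuffle each worker's slice every epoch
        for X, y in loader:
            optimizer.zero_grad()
            criterion(model(X), y).backward()  # synchronous gradient AllReduce
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```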
Optimization
Distributed training strategies are great for when our data or models are too large for training but what about when our models are too large to deploy? The following model compression techniques are commonly used to make large models fit within existing infrastructure:
- Pruning: remove weights (unstructured) or entire channels (structured) to reduce the size of the network. The objective is to preserve the model's performance while increasing its sparsity.
- Quantization: reduce the memory footprint of the weights by reducing their precision (ex. 32 bit to 8 bit). We may lose some precision, but it shouldn't affect performance too much (sketched after this list).
- Distillation: training smaller networks to "mimic" larger networks by having them reproduce the larger network's layers' outputs.
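For example (an illustration, not part of this course's pipeline), post-training dynamic quantization in PyTorch stores the weights of Linear layers in int8 and dequantizes them on the fly at inference time; the model below is a placeholder:

```python
import torch

# Placeholder model; only Linear layers are dynamically quantized to int8
model = torch.nn.Sequential(torch.nn.Linear(512, 128), torch.nn.ReLU(), torch.nn.Linear(128, 4))
quantized_model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
print(quantized_model)
```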

Baselines
Each application's baseline trajectory varies based on the task. For our application, we're going to follow this path: random → rule-based → simple machine learning (TF-IDF + logistic regression).
We'll motivate the need for slowly adding complexity to both the representation (ex. text vectorization) and architecture (ex. logistic regression), as well as address the limitations at each step of the way.
If you're unfamiliar with any of the modeling concepts here, be sure to check out the Foundations lessons.
Note
The specific model we use is irrelevant for this MLOps course since the main focus is on all the components required to put a model in production and maintain it. So feel free to choose any model as we continue to the other lessons after this notebook.
We'll first set up some functions that we'll be using across the different baseline experiments.
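The notebook's setup code isn't reproduced here, but a minimal sketch of the kinds of utilities we'd define (seeds for reproducibility and a 70/15/15 train/validation/test split) could look like the following; the function names and split sizes are assumptions:

```python
import random

import numpy as np
from sklearn.model_selection import train_test_split

def set_seeds(seed=42):
    """Set seeds for reproducibility."""
    np.random.seed(seed)
    random.seed(seed)

def get_data_splits(X, y, train_size=0.7):
    """Create 70/15/15 train/validation/test splits, stratified by label."""
    X_train, X_, y_train, y_ = train_test_split(X, y, train_size=train_size, stratify=y)
    X_val, X_test, y_val, y_test = train_test_split(X_, y_, train_size=0.5, stratify=y_)
    return X_train, X_val, X_test, y_train, y_val, y_test
```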
Our dataset is small, so we'll train using the whole dataset, but for larger datasets we should always start by testing on a small subset (after shuffling when necessary) so we aren't wasting time on compute.
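As a sketch (assuming the labeled projects live in a pandas DataFrame df from the earlier data lessons):

```python
# Shuffle the DataFrame of labeled projects (df is assumed from the data lessons)
df = df.sample(frac=1).reset_index(drop=True)
```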
Do we need to shuffle?
Why is it important that we shuffle our dataset?
We need to shuffle our data since it's chronologically organized. The latest projects may have certain features or tags that are prevalent compared to earlier projects. If we don't shuffle before creating our data splits, then our model will only be trained on the earlier signals and fail to generalize. However, in other scenarios (ex. time-series forecasting), shuffling would lead to data leaks.
Random
motivation: We want to know what random (chance) performance looks like. All of our efforts should be well above this baseline.
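The <LabelEncoder(num_classes=4)> repr below comes from the course's own LabelEncoder helper; a sketch with scikit-learn's equivalent (same class ordering) would be:

```python
from sklearn.preprocessing import LabelEncoder

# Fit a label encoder on the tags and encode all splits
label_encoder = LabelEncoder()
label_encoder.fit(y_train)
y_train = label_encoder.transform(y_train)
y_val = label_encoder.transform(y_val)
y_test = label_encoder.transform(y_test)
print(label_encoder, label_encoder.classes_)
```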
<LabelEncoder(num_classes=4)> ['computer-vision', 'mlops', 'natural-language-processing', 'other']
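A sketch of generating one random class per test sample:

```python
import numpy as np

# Random (chance) predictions over the test split
y_pred = np.random.randint(low=0, high=len(label_encoder.classes_), size=len(y_test))
print(y_pred.shape, y_pred[:5])
```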
(144,) [0 0 0 1 3]
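And evaluating with precision/recall/f1 (a sketch; the weighted averaging is an assumption):

```python
import json

from sklearn.metrics import precision_recall_fscore_support

# Evaluate the random baseline
metrics = precision_recall_fscore_support(y_test, y_pred, average="weighted")
performance = {"precision": metrics[0], "recall": metrics[1], "f1": metrics[2]}
print(json.dumps(performance, indent=2))
```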
{ "precision": 0.31684880006233446, "recall": 0.2361111111111111, "f1": 0.2531624273393283 }
We made the assumption that there is an equal probability for every class. Let's use the train split to figure out what the true probability is.
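A sketch of computing the empirical class distribution:

```python
import numpy as np

# Empirical class probabilities from the train split
counts = np.bincount(y_train)
class_probabilities = (counts / counts.sum()).tolist()
print(class_probabilities)
```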
[0.375, 0.08333333333333333, 0.4027777777777778, 0.1388888888888889]
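And sampling random predictions according to that distribution (sketch):

```python
import json

import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Weighted random predictions using the train split's class distribution
y_pred = np.random.choice(a=len(class_probabilities), size=len(y_test), p=class_probabilities)
metrics = precision_recall_fscore_support(y_test, y_pred, average="weighted")
performance = {"precision": metrics[0], "recall": metrics[1], "f1": metrics[2]}
print(json.dumps(performance, indent=2))
```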
{ "precision": 0.316412540257649, "recall": 0.3263888888888889, "f1": 0.31950372012322 }
limitations: we didn't use the tokens in our input to affect our predictions so nothing was learned.
Rule-based
motivation: we want to use signals in our inputs (along with domain expertise and auxiliary data) to determine the labels.
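The original rules aren't shown here; a minimal sketch (the aliases dictionary below is an assumption, not the course's actual rules) could be:

```python
# A small (assumed) set of aliases mapped to tags
aliases_by_tag = {
    "computer-vision": ["cv", "vision", "image"],
    "mlops": ["mlops", "production"],
    "natural-language-processing": ["nlp", "nlproc", "natural language", "text"],
}
aliases = {alias: tag for tag, tag_aliases in aliases_by_tag.items() for alias in tag_aliases}

def get_tag(text, aliases=aliases):
    """Return the first tag whose alias appears in the text, else None."""
    for alias, tag in aliases.items():
        if alias in text.lower():
            return tag
    return None

print(get_tag("Transfer learning with transformers for text classification."))
```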
'natural-language-processing'
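Evaluating this over the test split might look like the sketch below; how unmatched (None) predictions are scored here is an assumption:

```python
import json

from sklearn.metrics import precision_recall_fscore_support

# Evaluate rule-based predictions; unmatched samples (None) are mapped to -1 so they count as wrong
y_pred = [get_tag(text) for text in X_test]
y_pred = [label_encoder.transform([tag])[0] if tag else -1 for tag in y_pred]
metrics = precision_recall_fscore_support(y_test, y_pred, average="weighted")
performance = {"precision": metrics[0], "recall": metrics[1], "f1": metrics[2]}
print(json.dumps(performance, indent=2))
```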
{ "precision": 0.9097222222222222, "recall": 0.18055555555555555, "f1": 0.2919455654201417 }
Why is recall so low?
How come our precision is high but our recall is so low?
Only relying on the aliases can prove catastrophic when those particular aliases aren't used in our input signals. To improve this, we can build a bag of words of related terms, for example mapping terms such as text classification and named entity recognition to the natural-language-processing tag, but building this is a non-trivial task. Not to mention, we'll need to keep updating these rules as the data landscape matures.
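For example (sketch, using the assumed aliases above), an input with no alias match returns nothing:

```python
# No alias matches this input, so the rule-based approach returns None
print(get_tag("Interpretability and explainability for machine learning models."))
```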
None
Tip
We could also use stemming to further refine our rule-based process:
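A sketch using NLTK's PorterStemmer (the specific stemmer is an assumption):

```python
from nltk.stem import PorterStemmer

# Different surface forms map to the same stem
stemmer = PorterStemmer()
print(stemmer.stem("democracy"), stemmer.stem("democracies"))
```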
democraci democraci
But these rule-based approaches can only yield labels with high certainty when there is an absolute condition match, so it's best not to spend too much more effort on this approach.
limitations: we failed to generalize or learn any implicit patterns to predict the labels because we treat the tokens in our input as isolated entities.
Vectorization
motivation:
- representation: use term frequency-inverse document frequency (TF-IDF) to capture the significance of a token to a particular input with respect to all the inputs, as opposed to treating the words in our input text as isolated tokens.
- architecture: we want our model to meaningfully extract the encoded signal to predict the output labels.
So far we've treated the words in our input text as isolated tokens and we haven't really captured any meaning between tokens. Let's use TF-IDF (via scikit-learn's TfidfVectorizer) to capture the significance of a token to a particular input with respect to all the inputs.
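The standard TF-IDF weight for a term in a document is its term frequency scaled by the (log) inverse document frequency:

\[ w_{i, j} = \text{tf}_{i, j} \cdot \log \frac{N}{\text{df}_i} \]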
| Variable | Description |
|---|---|
| \(w_{i, j}\) | tf-idf weight for term \(i\) in document \(j\) |
| \(\text{tf}_{i, j}\) | # of times term \(i\) appears in document \(j\) |
| \(N\) | total # of documents |
| \(\text{df}_i\) | # of documents with term \(i\) |
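A sketch of fitting the vectorizer on the train split only and transforming all splits (the analyzer and n-gram range below are assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit TF-IDF on the train split only, then transform all splits
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 7))
X_train = vectorizer.fit_transform(X_train)
X_val = vectorizer.transform(X_val)
X_test = vectorizer.transform(X_test)
print(X_train.shape)
```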
tao large scale benchmark tracking object diverse dataset tracking object tao consisting 2 907 high resolution videos captured diverse environments half minute long (668, 99664)
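A sketch of inspecting the class balance and inverse-frequency class weights:

```python
import numpy as np

# Class counts in the train split and inverse-frequency class weights
counts = np.bincount(y_train)
class_weights = {i: 1.0 / count for i, count in enumerate(counts)}
print(f"class counts: {counts}, class weights: {class_weights}")
```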
class counts: [249 55 272 92], class weights: {0: 0.004016064257028112, 1: 0.01818181818181818, 2: 0.003676470588235294, 3: 0.010869565217391304}
Data imbalance
With our datasets, we may often notice a data imbalance problem where a range of continuous values (regression) or certain classes (classification) may have insufficient amounts of data to learn from. This becomes a major issue when training because the model will learn to generalize to the data available and perform poorly on regions where the data is sparse. There are several techniques to mitigate data imbalance, including resampling, incorporating class weights, augmentation, etc. Though the ideal solution is to collect more data for the minority classes!
We'll use the imblearn package to ensure that we oversample our minority classes to be equal to the majority class (tag with most samples).
pip install imbalanced-learn==0.8.1 -q
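A sketch of oversampling the train split with imblearn's RandomOverSampler:

```python
from imblearn.over_sampling import RandomOverSampler

# Oversample minority classes in the train split up to the size of the majority class
oversample = RandomOverSampler(sampling_strategy="all")
X_over, y_over = oversample.fit_resample(X_train, y_train)
```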
Warning
It's important that we apply oversampling only on the train split so we don't introduce data leaks into the other data splits.
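Checking the balance again (sketch):

```python
import numpy as np

# All classes now match the majority class count
counts = np.bincount(y_over)
class_weights = {i: 1.0 / count for i, count in enumerate(counts)}
print(f"class counts: {counts}, class weights: {class_weights}")
```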
class counts: [272 272 272 272], class weights: {0: 0.003676470588235294, 1: 0.003676470588235294, 2: 0.003676470588235294, 3: 0.003676470588235294}
limitations:
- representation: TF-IDF representations don't encapsulate much signal beyond frequency but we require more fine-grained token representations.
- architecture: we want to develop models that can use better represented encodings in a more contextual manner.
Machine learning
We're going to use a stochastic gradient descent classifier (SGDClassifier) as our model, with log loss so that it's effectively logistic regression trained with SGD.
We're doing this because we want more control over the training process (epochs) rather than relying on scikit-learn's default second-order optimization methods (ex. LBFGS) for logistic regression.
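A sketch of the training loop (the hyperparameter values are assumptions); we use partial_fit so we can control the epochs and log the train/validation losses:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss

# Logistic regression trained with SGD, one epoch at a time
model = SGDClassifier(
    loss="log_loss",  # use loss="log" on older scikit-learn versions
    penalty="l2", alpha=1e-4, learning_rate="constant", eta0=1e-1)

classes = np.unique(y_over)
for epoch in range(100):
    model.partial_fit(X_over, y_over, classes=classes)
    if not epoch % 10:
        train_loss = log_loss(y_over, model.predict_proba(X_over), labels=classes)
        val_loss = log_loss(y_val, model.predict_proba(X_val), labels=classes)
        print(f"Epoch: {epoch:02d} | train_loss: {train_loss:.5f}, val_loss: {val_loss:.5f}")
```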
Epoch: 00 | train_loss: 1.16930, val_loss: 1.21451 Epoch: 10 | train_loss: 0.46116, val_loss: 0.65903 Epoch: 20 | train_loss: 0.31565, val_loss: 0.56018 Epoch: 30 | train_loss: 0.25207, val_loss: 0.51967 Epoch: 40 | train_loss: 0.21740, val_loss: 0.49822 Epoch: 50 | train_loss: 0.19615, val_loss: 0.48529 Epoch: 60 | train_loss: 0.18249, val_loss: 0.47708 Epoch: 70 | train_loss: 0.17330, val_loss: 0.47158 Epoch: 80 | train_loss: 0.16671, val_loss: 0.46765 Epoch: 90 | train_loss: 0.16197, val_loss: 0.46488
We could further optimize our training pipeline with functionality such as early stopping, where we would use the validation set that we created. But we want to keep this model-agnostic course simplified during the modeling stage.
Warning
The SGDClassifier has an early_stopping flag where you can specify a portion of the training set to be used for validation. Why would this be a bad idea in our case? Because we already applied oversampling on our training set, so we would be introducing data leaks if we did this.
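Evaluating on the test split (sketch):

```python
import json

from sklearn.metrics import precision_recall_fscore_support

# Evaluate the trained model on the test split
y_pred = model.predict(X_test)
metrics = precision_recall_fscore_support(y_test, y_pred, average="weighted")
performance = {"precision": metrics[0], "recall": metrics[1], "f1": metrics[2]}
print(json.dumps(performance, indent=2))
```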
{ "precision": 0.8753577441077441, "recall": 0.8680555555555556, "f1": 0.8654096949533866 }
Tip
Scikit-learn has a concept called a Pipeline, which allows us to combine transformations and training steps into a single estimator.
We can create a pipeline from scratch:
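A sketch (the component choices mirror the ones above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Combine vectorization and classification into one estimator
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("model", SGDClassifier(loss="log_loss"))])
# pipe.fit(train_texts, y_train)  # takes the raw text inputs directly
```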
or make one with trained components:
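Or wrap the already-fitted vectorizer and model (sketch):

```python
from sklearn.pipeline import make_pipeline

# Reuse the fitted vectorizer and model as one pipeline
pipe = make_pipeline(vectorizer, model)
```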
limitations:
- representation: TF-IDF representations don't encapsulate much signal beyond frequency but we require more fine-grained token representations that can account for the significance of the token itself (embeddings).
- architecture: we want to develop models that can use better represented encodings in a more contextual manner.
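For example (the input text below is an assumption):

```python
# Predict the tag for a new project description
text = "Transfer learning with transformers for text classification."
print(label_encoder.inverse_transform(pipe.predict([text])))
```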
['natural-language-processing']
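And the per-class probabilities (sketch):

```python
# Per-class probabilities for the same input
y_prob = pipe.predict_proba([text])[0]
print({tag: prob for tag, prob in zip(label_encoder.classes_, y_prob)})
```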
{'computer-vision': 0.023672281234089494, 'mlops': 0.004158589896756235, 'natural-language-processing': 0.9621906411391856, 'other': 0.009978487729968667}
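On an input that doesn't clearly belong to any tag, the prediction is less certain (sketch; the example text is an assumption and differs from the original notebook's):

```python
# A more ambiguous project description
text = "A general framework for building and shipping data-driven applications."
print(label_encoder.inverse_transform(pipe.predict([text])))
```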
['natural-language-processing']
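with a much flatter probability distribution (sketch):

```python
# The probability mass is spread more evenly across the classes
y_prob = pipe.predict_proba([text])[0]
print({tag: prob for tag, prob in zip(label_encoder.classes_, y_prob)})
```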
{'computer-vision': 0.13150802188532523, 'mlops': 0.11198040241517894, 'natural-language-processing': 0.584025872986128, 'other': 0.17248570271336786}
We're going to create a custom predict function where if the majority class is not above a certain softmax score, then we predict the other
class. In our objectives, we decided that precision is really important for us and that we can leverage the labeling and QA workflows to improve the recall during subsequent manual inspection.
Warning
Our models can suffer from overconfidence so applying this limitation may not be as effective as we'd imagine, especially for larger neural networks. See the confident learning section of the evaluation lesson for more information.
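One way to pick the threshold (a sketch; using the mean of the max predicted probability on the validation split is an assumption, not necessarily the notebook's exact method):

```python
import numpy as np

# Confidence threshold derived from the validation split
y_prob = model.predict_proba(X_val)
threshold = np.mean(np.max(y_prob, axis=1))
print(threshold)
```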
0.6742890218960005
Warning
It's very important that we do this on our validation split so we aren't inflating the value using the train split or leaking information prior to evaluation on the test split.
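A sketch of the custom predict function (the helper names and example text are assumptions):

```python
import numpy as np

def custom_predict(y_prob, threshold, index):
    """Predict the most likely class unless its probability is below the
    threshold, in which case predict the `other` class index."""
    y_pred = [np.argmax(p) if np.max(p) > threshold else index for p in y_prob]
    return np.array(y_pred)

other_index = label_encoder.transform(["other"])[0]  # index of the `other` class
text = "A general framework for building and shipping data-driven applications."
y_prob = pipe.predict_proba([text])
print(label_encoder.inverse_transform(custom_predict(y_prob, threshold, other_index)))
```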
['other']
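And evaluating with these thresholded predictions on the test split (sketch):

```python
import json

from sklearn.metrics import precision_recall_fscore_support

# Thresholded predictions trade recall for precision
y_prob = model.predict_proba(X_test)
y_pred = custom_predict(y_prob, threshold, other_index)
metrics = precision_recall_fscore_support(y_test, y_pred, average="weighted")
performance = {"precision": metrics[0], "recall": metrics[1], "f1": metrics[2]}
print(json.dumps(performance, indent=2))
```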
{ "precision": 0.9116161616161617, "recall": 0.7569444444444444, "f1": 0.7929971988795519 }
Tip
We could've even used per-class thresholds, especially since we have some data imbalance which can impact how confident the model is regarding some classes.
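A sketch of what per-class thresholds could look like (the threshold values are assumptions):

```python
import numpy as np

# One threshold per class instead of a single global value
class_thresholds = {0: 0.7, 1: 0.7, 2: 0.7, 3: 0.4}

def custom_predict_per_class(y_prob, thresholds, index):
    """Fall back to `other` only if the winning class is below its own threshold."""
    y_pred = []
    for p in y_prob:
        class_ = int(np.argmax(p))
        y_pred.append(class_ if p[class_] > thresholds[class_] else index)
    return np.array(y_pred)
```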
This MLOps course is actually model-agnostic (as long as it produces probability distributions) so feel free to use more complex representations (embeddings) with more sophisticated architectures (CNNs, transformers, etc.). We're going to use this basic logistic regression model throughout the rest of the lessons because it's easy, fast and actually has comparable performance (<10% f1 diff compared to state-of-the-art pretrained transformers).