Evaluating Machine Learning Models
Repository · Notebook
Intuition
Evaluation is an integral part of modeling and it's one that's often glossed over. Evaluation is often reduced to computing accuracy or other global metrics, but for many real-world applications a much more nuanced evaluation process is required. However, before evaluating our model, we always want to:
- be clear about what metrics we are prioritizing
- be careful not to over-optimize on any one metric because it may mean we're compromising something else
Coarse-grained
While we were iteratively developing our baselines, our evaluation process involved computing coarse-grained metrics such as overall precision, recall and f1.
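As a minimal sketch (assuming `y_test` and `y_pred` hold the encoded true and predicted labels for our test split), the overall metrics can be computed with scikit-learn:

```python
from sklearn.metrics import precision_recall_fscore_support

# Overall (weighted) precision, recall and f1 across all classes
precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred, average="weighted")
performance = {"precision": precision, "recall": recall, "f1": f1, "num_samples": float(len(y_test))}
print(performance)
```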
{ "precision": 0.8990934378802025, "recall": 0.8194444444444444, "f1": 0.838280325954406, "num_samples": 144.0 }
Note
The precision_recall_fscore_support() function from scikit-learn has an input parameter called average which has the following options. We'll be using the different averaging methods for different metric granularities.
- `None`: metrics are calculated for each unique class.
- `binary`: used for binary classification tasks where the `pos_label` is specified.
- `micro`: metrics are calculated using global TP, FP and FN.
- `macro`: per-class metrics which are averaged without accounting for class imbalance.
- `weighted`: per-class metrics which are averaged by accounting for class imbalance.
- `samples`: metrics are calculated at the per-sample level.
Fine-grained
Inspecting these coarse-grained, overall metrics is a start, but we can go deeper by evaluating the same metrics at a finer granularity, such as per class.
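A minimal sketch of per-class metrics, assuming `y_test`/`y_pred` as above and a `classes` list ordered to match the encoded label indices:

```python
from collections import OrderedDict
from sklearn.metrics import precision_recall_fscore_support

# Per-class metrics (no averaging)
metrics = precision_recall_fscore_support(y_test, y_pred, average=None)
class_metrics = OrderedDict()
for i, _class in enumerate(classes):
    class_metrics[_class] = {
        "precision": metrics[0][i],
        "recall": metrics[1][i],
        "f1": metrics[2][i],
        "num_samples": float(metrics[3][i]),  # support for this class
    }
```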
{ "precision": 0.9803921568627451, "recall": 0.8620689655172413, "f1": 0.9174311926605505, "num_samples": 58.0 }
[ "natural-language-processing", { "precision": 0.9803921568627451, "recall": 0.8620689655172413, "f1": 0.9174311926605505, "num_samples": 58.0 } ] [ "mlops", { "precision": 0.9090909090909091, "recall": 0.8333333333333334, "f1": 0.8695652173913043, "num_samples": 12.0 } ] [ "computer-vision", { "precision": 0.975, "recall": 0.7222222222222222, "f1": 0.8297872340425532, "num_samples": 54.0 } ] [ "other", { "precision": 0.4523809523809524, "recall": 0.95, "f1": 0.6129032258064516, "num_samples": 20.0 } ]
Due to our custom predict function, we're able to achieve high precision for all categories except for `other`. Based on our product design, we decided that it's more important to be precise about our explicit ML categories (nlp, cv and mlops) and that we would have a manual labeling workflow to recall any misclassifications in the `other` category. Over time, our model will become better in this category as well.
Confusion matrix
Besides just inspecting the metrics for each class, we can also identify the true positives, false positives and false negatives. Each of these will give us insight about our model beyond what the metrics can provide.
- True positives (TP): learn about where our model performs well.
- False positives (FP): potentially identify samples which may need to be relabeled.
- False negatives (FN): identify the model's less performant areas to oversample later.
It's good to have our FP/FN samples feed back into our annotation pipelines in the event we want to fix their labels and have those changes be reflected everywhere.
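A minimal sketch of collecting these sample indices for a single class (assuming `y_test` and `y_pred` are sequences of tag names and we're inspecting the `mlops` class):

```python
import numpy as np

tag = "mlops"  # class to inspect
y_true, y_hat = np.array(y_test), np.array(y_pred)

# Indices of true positives, false positives and false negatives for this class
tp = np.where((y_true == tag) & (y_hat == tag))[0].tolist()
fp = np.where((y_true != tag) & (y_hat == tag))[0].tolist()
fn = np.where((y_true == tag) & (y_hat != tag))[0].tolist()
print(tp, fp, fn)
```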
[1, 3, 4, 41, 47, 77, 94, 127, 138] [14, 88] [30, 71, 106]
pytest pytest framework makes easy write small tests yet scales support complex functional testing
true: mlops
pred: mlops
=== True positives ===
pytest pytest framework makes easy write small tests yet scales support complex functional testing
true: mlops  pred: mlops

test machine learning code systems minimal examples testing machine learning correct implementation expected learned behavior model performance
true: mlops  pred: mlops

continuous machine learning cml cml helps organize mlops infrastructure top traditional software engineering stack instead creating separate ai platforms
true: mlops  pred: mlops

=== False positives ===
paint machine learning web app allows create landscape painting style bob ross using deep learning model served using spell model server
true: computer-vision  pred: mlops

=== False negatives ===
hidden technical debt machine learning systems using software engineering framework technical debt find common incur massive ongoing maintenance costs real world ml systems
true: mlops  pred: other

neptune ai lightweight experiment management tool fits workflow
true: mlops  pred: other
Tip
It's a good idea to do this kind of analysis with our rule-based approach as well, to catch obvious labeling errors.
Confident learning
While the confusion-matrix sample analysis was a coarse-grained process, we can also use fine-grained confidence-based approaches to identify potentially mislabeled samples. Here we're going to focus on the specific labeling quality as opposed to the final model predictions.
Simple confidence based techniques include identifying samples whose:
- Categorical
    - prediction is incorrect (also indicates TN, FP, FN)
    - confidence score for the correct class is below a threshold
    - confidence score for an incorrect class is above a threshold
    - standard deviation of confidence scores over top N samples is low
    - different predictions from the same model using different parameters
- Continuous
    - difference between predicted and ground-truth values is above some %
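For example, a minimal sketch of the first two categorical checks (assuming `y_prob` holds the predicted class probabilities per sample, `y_test` the encoded true labels, `texts` the raw inputs and `index_to_class` a hypothetical index-to-tag mapping; the threshold is illustrative):

```python
# Samples whose prediction is incorrect or whose confidence for the correct class is low
suspect_samples = []
for i, (true, prob) in enumerate(zip(y_test, y_prob)):
    pred = prob.argmax()
    if (pred != true) or (prob[true] < 0.5):
        suspect_samples.append({
            "text": texts[i],
            "true": index_to_class[true],
            "pred": index_to_class[pred],
            "prob": float(prob.max()),
        })
```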
[{'pred': 'other', 'prob': 0.41281721056332804, 'text': 'neptune ai lightweight experiment management tool fits workflow', 'true': 'mlops'}]
But these are fairly crude techniques because neural networks are easily overconfident and so their confidences cannot be used without calibrating them.

On Calibration of Modern Neural Networks
- Assumption: "the probability associated with the predicted class label should reflect its ground truth correctness likelihood."
- Reality: "modern (large) neural networks are no longer well-calibrated"
- Solution: apply temperature scaling (extension of Platt scaling) on model outputs
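A minimal sketch of temperature scaling on raw logits; the temperature `T` is normally tuned on a validation set by minimizing NLL, and the values here are illustrative:

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by a temperature before the softmax to soften overconfident probabilities."""
    scaled = np.asarray(logits, dtype=float) / temperature
    exp = np.exp(scaled - scaled.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

print(softmax_with_temperature([4.0, 1.0, 0.5], temperature=1.0))  # sharp (potentially overconfident)
print(softmax_with_temperature([4.0, 1.0, 0.5], temperature=3.0))  # softened (calibrated)
```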
Recent work on confident learning (cleanlab) focuses on identifying noisy labels (with calibration), which can then be properly relabeled and used for training.
pip install cleanlab==1.0.1 -q
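A sketch of the cleanlab v1.x API, assuming `y_test` holds the encoded labels and `y_prob` the predicted class probabilities for those samples:

```python
import cleanlab

# Identify potentially mislabeled samples, sorted by normalized margin
label_error_indices = cleanlab.pruning.get_noise_indices(
    s=y_test,     # given (possibly noisy) labels
    psx=y_prob,   # predicted class probabilities
    sorted_index_method="normalized_margin",
)
```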
Not all of these are necessarily labeling errors; some are simply situations where the predicted probabilities were not very confident. Therefore, it will be useful to attach the predicted outcomes alongside the results. This way, we can know if we need to relabel, upsample, etc. as mitigation strategies to improve our performance.
text: module 2 convolutional neural networks cs231n lecture 5 move fully connected neural networks convolutional neural networks
true: computer-vision
pred: other
The operations in this section can be applied to the entire labeled dataset to discover labeling errors via confident learning.
Manual slices
Just inspecting the overall and class metrics isn't enough to deploy our new version to production. There may be key slices of our dataset that we need to do really well on:
- Target / predicted classes (+ combinations)
- Features (explicit and implicit)
- Metadata (timestamps, sources, etc.)
- Priority slices / experience (minority groups, large customers, etc.)
An easy way to create and evaluate slices is to define slicing functions.
pip install snorkel==0.9.8 -q
Here we're using Snorkel's `slicing_function` to create our different slices. We can visualize our slices by applying a slicing function to a relevant DataFrame using `slice_dataframe`.
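A minimal sketch of two slicing functions (the slice definitions and the `test_df` DataFrame with `text` and `tag` columns are assumptions; the slice names match the ones used below):

```python
from snorkel.slicing import slicing_function, slice_dataframe

@slicing_function()
def nlp_cnn(x):
    """NLP projects that use convolutions."""
    nlp_project = "natural-language-processing" in x.tag
    convolution_project = "CNN" in x.text or "convolution" in x.text
    return nlp_project and convolution_project

@slicing_function()
def short_text(x):
    """Projects with very short descriptions."""
    return len(x.text.split()) < 8  # fewer than 8 words

# Visualize the data points that belong to a slice
slice_dataframe(test_df, nlp_cnn)
```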
We can define even more slicing functions and create a slices record array using the `PandasSFApplier`. The slices array has N (# of data points) items and each item has S (# of slicing functions) items, indicating whether that data point is part of that slice. Think of this record array as a masking layer for each slicing function on our data.
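A minimal sketch of building the record array with the slicing functions defined above:

```python
from snorkel.slicing import PandasSFApplier

# Apply every slicing function to each row of the test DataFrame
slicing_functions = [nlp_cnn, short_text]
applier = PandasSFApplier(slicing_functions)
slices = applier.apply(test_df)  # record array with one 0/1 field per slicing function
print(slices)
```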
rec.array([(0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0),
           (1, 0), (0, 0), (0, 1), (0, 0), (0, 0), (1, 0), (0, 0), (0, 0),
           (0, 1), (0, 0), ..., (0, 0), (0, 0), (0, 0), (0, 0), (0, 0),
           (0, 0), (0, 0), (0, 1), (0, 0), (0, 0)],
          dtype=[('nlp_cnn', '<i8'), ('short_text', '<i8')])
To calculate metrics for our slices, we could use snorkel.analysis.Scorer but we've implemented a version that will work for multiclass or multilabel scenarios.
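A hedged sketch of such a per-slice scorer, using the record array as a mask over `y_test`/`y_pred` (assumed here to be numpy arrays of encoded labels):

```python
from sklearn.metrics import precision_recall_fscore_support

slice_metrics = {}
for slice_name in slices.dtype.names:
    mask = slices[slice_name].astype(bool)
    if sum(mask):
        precision, recall, f1, _ = precision_recall_fscore_support(
            y_test[mask], y_pred[mask], average="micro")
        slice_metrics[slice_name] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "num_samples": int(sum(mask)),
        }
```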
{ "nlp_cnn": { "precision": 1.0, "recall": 1.0, "f1": 1.0, "num_samples": 1 }, "short_text": { "precision": 0.8, "recall": 0.8, "f1": 0.8000000000000002, "num_samples": 5 } }
Slicing can help identify sources of bias in our data. For example, our model has most likely learned to associate algorithms with certain applications, such as CNNs used for computer vision or transformers used for NLP projects. However, these algorithms are not limited to their initial use cases. We'd need to ensure that our model learns to focus on the application over the algorithm. This could be learned with:
- enough data (new or oversampling incorrect predictions)
- masking the algorithm (using text matching heuristics)
Generated slices
Manually creating slices is a massive improvement towards identifying problem subsets in our dataset compared to coarse-grained evaluation, but what if there are problematic slices of our dataset that we failed to identify? SliceLine is a recent work that uses a linear-algebra- and pruning-based technique to identify large slices (we specify a minimum slice size) that result in meaningful errors from the forward pass. Without pruning, automatic slice identification becomes computationally intensive because it involves enumerating through many combinations of data points to identify the slices. But with this technique, we can discover hidden underperforming subsets in our dataset that we weren't explicitly looking for!

Automated Data Slicing for Model Validation
Hidden stratification
What if the features to generate slices on are implicit/hidden?

To address this, there are recent clustering-based techniques to identify these hidden slices and improve the system.
- Estimate implicit subclass labels via unsupervised clustering
- Train new more robust model using these clusters

Model patching
Another recent work on model patching takes this another step further by learning how to transform between subgroups so we can train models on the augmented data:
- Learn subgroups
- Learn transformations (ex. CycleGAN) needed to go from one subgroup to another under the same superclass (label)
- Augment data with artificially introduced subgroup features
- Train new robust model on augmented data

Interpretability
Besides just comparing predicted outputs with ground truth values, we can also inspect the inputs to our models. What aspects of the input are more influential towards the prediction? If the focus is not on the relevant features of our input, then we need to explore if there is a hidden pattern we're missing or if our model has learned to overfit on the incorrect features. We can use techniques such as SHAP (SHapley Additive exPlanations) or LIME (Local Interpretable Model-agnostic Explanations) to inspect feature importance. On a high level, these techniques learn which features have the most signal by assessing the performance in their absence. These inspections can be performed on a global level (ex. per-class) or on a local level (ex. single prediction).
pip install lime==0.2.0.1 -q
It's easier to use LIME with scikit-learn pipelines so we'll combine our vectorizer and model into one construct.
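A minimal sketch with LIME's text explainer, assuming a `vectorizer`, `model` and a fitted `label_encoder` with a `classes` attribute exist from earlier lessons (those names are assumptions), and that the model exposes `predict_proba`:

```python
from lime.lime_text import LimeTextExplainer
from sklearn.pipeline import make_pipeline

# Combine the vectorizer and model so LIME can go from raw text to class probabilities
pipeline = make_pipeline(vectorizer, model)

# Explain a single prediction
text = "Using pretrained convolutional neural networks for object detection."
explainer = LimeTextExplainer(class_names=label_encoder.classes)
explainer.explain_instance(text, classifier_fn=pipeline.predict_proba, top_labels=1).show_in_notebook(text=True)
```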

We can also use model-specific approaches to interpretability, as we did in our embeddings lesson, to identify the most influential n-grams in our text.
Counterfactuals
Another way to evaluate our systems is to identify counterfactuals -- data points with similar features that belong to another class (classification) or whose prediction differs by more than a certain amount (regression). These points allow us to evaluate the model's sensitivity to certain features and feature values, which may be signs of overfitting. A great tool to identify and probe for counterfactuals (also great for slicing and fairness metrics) is the What-if tool.

For our task, this can involve projects that use algorithms that are typically reserved for a certain application area (such as CNNs for computer vision or transformers for NLP).
Behavioral testing
Besides just looking at metrics, we also want to conduct some behavioral sanity tests. Behavioral testing is the process of testing input data and expected outputs while treating the model as a black box. The tests don't necessarily have to be adversarial in nature; they can represent the types of perturbations we'll see in the real world once our model is deployed. A landmark paper on this topic is Beyond Accuracy: Behavioral Testing of NLP Models with CheckList, which breaks down behavioral testing into three types of tests:
- invariance: Changes should not affect outputs.

```python
# INVariance via verb injection (changes should not affect outputs)
tokens = ["revolutionized", "disrupted"]
texts = [f"Transformers applied to NLP have {token} the ML field." for token in tokens]
predict_tag(texts=texts)
```

['natural-language-processing', 'natural-language-processing']

- directional: Change should affect outputs.

```python
# DIRectional expectations (changes with known outputs)
tokens = ["text classification", "image classification"]
texts = [f"ML applied to {token}." for token in tokens]
predict_tag(texts=texts)
```

['natural-language-processing', 'computer-vision']

- minimum functionality: Simple combination of inputs and expected outputs.

```python
# Minimum Functionality Tests (simple input/output pairs)
tokens = ["natural language processing", "mlops"]
texts = [f"{token} is the next big wave in machine learning." for token in tokens]
predict_tag(texts=texts)
```

['natural-language-processing', 'mlops']
We'll learn how to systematically create tests in our testing lesson.
Evaluating evaluations
How can we know if our models and systems are performing better over time? Unfortunately, depending on how often we retrain or how quickly our dataset grows, it won't always be a simple decision where all metrics/slices are performing better than the previous version. In these scenarios, it's important to know what our main priorities are and where we can have some leeway:
- What criteria are most important?
- What criteria can/cannot regress?
- How much of a regression can be tolerated?
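For example, a hedged sketch of how such criteria could be expressed as assertions against the current and previous metrics (the dictionary keys and thresholds here are illustrative):

```python
# Hypothetical evaluation criteria checked before promoting a new version
assert metrics["overall"]["f1"] >= prev_metrics["overall"]["f1"] - 0.02  # small overall regression tolerated
assert metrics["class"]["natural-language-processing"]["precision"] >= 0.9  # priority class cannot drop below threshold
assert metrics["slices"]["short_text"]["f1"] >= prev_metrics["slices"]["short_text"]["f1"]  # key slice cannot regress
```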
And as we develop these criteria over time, we can systematically enforce them via CI/CD workflows to decrease the manual time in between system updates.
Seems straightforward, doesn't it?
With all these different evaluation methods, how can we choose "the best" version of our model if some versions are better for some evaluation criteria?
Show answer
The team needs to agree on what evaluation criteria are most important and what is the minimum performance required for each one. This will allow us to filter amongst all the different solutions by removing ones that don't satisfy all the minimum requirements and ranking amongst the remaining by which ones perform the best for the highest priority criteria.
Online evaluation
Once we've evaluated our model's ability to perform on a static dataset we can run several types of online evaluation techniques to determine performance on actual production data. It can be performed using labels or, in the event we don't readily have labels, proxy signals.
- manually label a subset of incoming data to evaluate periodically.
- ask the initial set of users viewing newly categorized content if it's correctly classified.
- allow users to report content misclassified by our model.
And there are many different experimentation strategies we can use to measure real-time performance before committing to replace our existing version of the system.
AB tests
AB testing involves sending production traffic to our current system (the control group) and the new version (the treatment group) and measuring if there is a statistically significant difference between the metric values for the two groups. There are several common issues with AB testing, such as accounting for different sources of bias (for example, the novelty effect of showing some users the new system). We also need to ensure that the same users continue to interact with the same systems so we can compare the results without contamination.

In many cases, if we're simply trying to compare the different versions for a certain metric, AB testing can take a while before we reach statistical significance since traffic is evenly split between the different groups. In this scenario, multi-armed bandits will be a better approach since they continuously assign traffic to the better-performing version.
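As an illustration of measuring that statistical difference for a rate-style metric (the success counts below are made up; any two-sample test appropriate for the metric could be used):

```python
from statsmodels.stats.proportion import proportions_ztest

# Number of "successful" outcomes (e.g. accepted suggestions) and users exposed per group
successes = [435, 478]    # [control, treatment]
exposures = [5000, 5000]
z_stat, p_value = proportions_ztest(count=successes, nobs=exposures)
print(f"z={z_stat:.2f}, p={p_value:.3f}")  # reject the null hypothesis of equal rates if p < 0.05
```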
Canary tests
Canary tests involve sending most of the production traffic to the currently deployed system but sending traffic from a small cohort of users to the new system we're trying to evaluate. Again we need to make sure that the same users continue to interact with the same system as we gradually roll out the new system.

Shadow tests
Shadow testing involves sending the same production traffic to the different systems. We don't have to worry about system contamination and it's very safe compared to the previous approaches since the new system's results are not served. However, we do need to ensure that we're replicating as much of the production system as possible so we can catch issues that are unique to production early on. But overall, shadow testing makes it easy to monitor the new system, validate operational consistency, etc.

What can go wrong?
If shadow tests allow us to test our updated system without having to actually serve the new results, why doesn't everyone adopt it?
Show answer
With shadow deployment, we'll miss out on any live feedback signals (explicit/implicit) from our users since users are not directly interacting with the product using our new version.
We also need to ensure that we're replicating as much of the production system as possible so we can catch issues that are unique to production early on. This is rarely possible because, while your ML system may be a standalone microservice, it ultimately interacts with an intricate production environment that has many dependencies.
Model CI
An effective way to evaluate our systems is to encapsulate the evaluations as a collection (suite) and use them for continuous integration. We would continue to add to our evaluation suites and they would be executed whenever we are experimenting with changes to our system (new models, data, etc.). Problematic slices of data identified during monitoring are often added to the evaluation test suite to avoid repeating the same regressions in the future.
Capability vs. alignment
We've seen the many different metrics that we'll want to calculate when it comes to evaluating our model but not all metrics mean the same thing. And this becomes very important when it comes to choosing the "best" model(s).
- capability: the ability of our model to perform a task, measured by the objective function we optimize for (ex. log loss)
- alignment: desired behavior of our model, measured by metrics that are not differentiable or that don't account for misclassifications and probability differences (ex. accuracy, precision, recall, etc.)
While capability (ex. loss) and alignment (ex. accuracy) metrics may seem to be aligned, their differences can indicate issues in our data:
- ↓ accuracy, ↑ loss = large errors on lots of data (worst case)
- ↓ accuracy, ↓ loss = small errors on lots of data, distributions are close but tipped towards misclassifications (misaligned)
- ↑ accuracy, ↑ loss = large errors on some data (incorrect predictions have very skewed distributions)
- ↑ accuracy, ↓ loss = no/few errors on some data (best case)
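A small illustration (made-up probabilities) of how accuracy and log loss can diverge: both sets of predictions below are 100% accurate, but the barely-confident one has a much higher loss.

```python
import numpy as np
from sklearn.metrics import accuracy_score, log_loss

y_true = [0, 0, 1, 1]
probs_confident = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])       # confident and correct
probs_barely = np.array([[0.55, 0.45], [0.51, 0.49], [0.45, 0.55], [0.48, 0.52]])  # correct, but barely

for name, probs in [("confident", probs_confident), ("barely correct", probs_barely)]:
    preds = probs.argmax(axis=1)
    print(f"{name}: accuracy={accuracy_score(y_true, preds):.2f}, log loss={log_loss(y_true, probs):.2f}")
```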
Resources
- Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning
- On Calibration of Modern Neural Networks
- Confident Learning: Estimating Uncertainty in Dataset Labels
- Automated Data Slicing for Model Validation
- SliceLine: Fast, Linear-Algebra-based Slice Finding for ML Model Debugging
- Distributionally Robust Neural Networks for Group Shifts
- No Subclass Left Behind: Fine-Grained Robustness in Coarse-Grained Classification Problems
- Model Patching: Closing the Subgroup Performance Gap with Data Augmentation