
Evaluating ML Models


Evaluating ML models by assessing overall, per-class and slice performance.
Goku Mohandas
Repository · Notebook


Intuition

Evaluation is an integral part of modeling and it's one that's often glossed over. We'll often find evaluation to involve simply computing the accuracy or other global metrics, but for many real-world applications a much more nuanced evaluation process is required. However, before evaluating our model, we always want to:

  • be clear about what metrics we are prioritizing
  • be careful not to over-optimize on any one metric, because that may mean we're compromising something else

# Metrics
metrics = {"overall": {}, "class": {}}
# Data to evaluate
device = torch.device("cuda")
loss_fn = nn.BCEWithLogitsLoss(weight=class_weights_tensor)
trainer = Trainer(model=model.to(device), device=device, loss_fn=loss_fn)
test_loss, y_true, y_prob = trainer.eval_step(dataloader=test_dataloader)
y_pred = np.array([np.where(prob >= threshold, 1, 0) for prob in y_prob])

Overall metrics

While we were iteratively developing our baselines, our evaluation process involved computing coarse-grained metrics such as overall precision, recall and f1.

# Evaluate
# Overall metrics
overall_metrics = precision_recall_fscore_support(y_test, y_pred, average="weighted")
metrics["overall"]["precision"] = overall_metrics[0]
metrics["overall"]["recall"] = overall_metrics[1]
metrics["overall"]["f1"] = overall_metrics[2]
metrics["overall"]["num_samples"] = np.float64(len(y_true))
print (json.dumps(metrics["overall"], indent=4))
{
    "precision": 0.7896647806486397,
    "recall": 0.5965665236051502,
    "f1": 0.6612830799421741,
    "num_samples": 218.0
}

Note

The precision_recall_fscore_support() function from scikit-learn has an input parameter called average with the following options. We'll be using the different averaging methods for different metric granularities (a toy comparison follows the list below).

  • None: metrics are calculated for each unique class.
  • binary: used for binary classification tasks where the pos_label is specified.
  • micro: metrics are calculated using global TP, FP, and FN.
  • macro: per-class metrics which are averaged without accounting for class imbalance.
  • weighted: per-class metrics which are averaged by accounting for class imbalance.
  • samples: metrics are calculated at the per-sample level.
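
To make these options concrete, here's a quick toy comparison (using made-up multilabel arrays, not our project's data) of how the averaging choices change the reported numbers:

import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Toy multilabel arrays (3 samples, 3 classes), purely for illustration
y_true_toy = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0]])
y_pred_toy = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 1]])

# "binary" is omitted since it only applies to binary tasks
for average in [None, "micro", "macro", "weighted", "samples"]:
    p, r, f1, _ = precision_recall_fscore_support(y_true_toy, y_pred_toy, average=average)
    print (f"{str(average):>8}: precision={p}, recall={r}, f1={f1}")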

Per-class metrics

Inspecting these coarse-grained, overall metrics is a start, but we can go deeper by evaluating the same metrics at the fine-grained, per-class level.

# Per-class metrics
class_metrics = precision_recall_fscore_support(y_test, y_pred, average=None)
for i, _class in enumerate(label_encoder.classes):
    metrics["class"][_class] = {
        "precision": class_metrics[0][i],
        "recall": class_metrics[1][i],
        "f1": class_metrics[2][i],
        "num_samples": np.float64(class_metrics[3][i]),
    }
# Metrics for a specific class
tag = "transformers"
print (json.dumps(metrics["class"][tag], indent=2))

  "precision": 0.6428571428571429,
  "recall": 0.6428571428571429,
  "f1": 0.6428571428571429,
  "num_samples": 28.0
}

As a general rule, classes with fewer samples will have lower performance, so we should always work to identify the classes (or fine-grained slices) of data that our model needs to see more samples of in order to learn.

# Number of training samples per class
num_samples = np.sum(y_train, axis=0).tolist()
# Number of samples vs. performance (per class)
f1s = [metrics["class"][_class]["f1"]*100. for _class in label_encoder.classes]
sorted_lists = sorted(zip(num_samples, f1s, label_encoder.classes))  # sort by num. of training samples
num_samples, f1s, classes = map(list, zip(*sorted_lists))
# Plot
n = 7 # num. top classes to label
fig, ax = plt.subplots()
ax.set_xlabel("# of training samples")
ax.set_ylabel("test performance (f1)")
fig.set_size_inches(25, 5)
ax.plot(num_samples, f1s, "bo-")
for x, y, label in zip(num_samples[-n:], f1s[-n:], classes[-n:]):
    ax.annotate(label, xy=(x,y), xytext=(-5, 5), ha="right", textcoords="offset points")

There are, of course, nuances to this general rule, such as the complexity of distinguishing between certain classes, where we may not need as many samples for easier sub-tasks. In our case, classes with over 100 training samples consistently score above 0.6 f1, whereas performance for the remaining classes is mixed.

Confusion matrix sample analysis

Besides just inspecting the metrics for each class, we can also identify the true positives, false positives and false negatives. Each of these will give us insight about our model beyond what the metrics can provide.

  • True positives: learn about where our model performs well.
  • False positives: potentially identify samples which may need to be relabeled.
  • False negatives: identify the model's less performant areas to oversample later.

It's good to have our FP/FN samples feed back into our annotation pipelines in the event we want to fix their labels and have those changes reflected everywhere.

# TP, FP, FN samples
index = label_encoder.class_to_index[tag]
tp, fp, fn = [], [], []
for i in range(len(y_test)):
    true = y_test[i][index]
    pred = y_pred[i][index]
    if true and pred:
        tp.append(i)
    elif not true and pred:
        fp.append(i)
    elif true and not pred:
        fn.append(i)
print (tp)
print (fp)
print (fn)

[4, 9, 27, 38, 40, 52, 58, 74, 79, 88, 97, 167, 174, 181, 186, 191, 194, 195]
[45, 54, 98, 104, 109, 137, 146, 152, 162, 190]
[55, 59, 63, 70, 87, 93, 125, 144, 166, 201]
index = tp[0]
print (X_test_raw[index])
print (f"true: {label_encoder.decode([y_test[index]])[0]}")
print (f"pred: {label_encoder.decode([y_pred[index]])[0]}\n")
simple transformers transformers classification ner qa language modeling language generation t5 multi modal conversational ai
true: ['language-modeling', 'natural-language-processing', 'question-answering', 'transformers']
pred: ['attention', 'huggingface', 'language-modeling', 'natural-language-processing', 'transformers']

# Sorted tags
sorted_tags_by_f1 = OrderedDict(sorted(
        metrics["class"].items(), key=lambda tag: tag[1]["f1"], reverse=True))
# Samples
num_samples = 3
if len(tp):
    print ("\n=== True positives ===")
    for i in tp[:num_samples]:
        print (f"  {X_test_raw[i]}")
        print (f"    true: {label_encoder.decode([y_test[i]])[0]}")
        print (f"    pred: {label_encoder.decode([y_pred[i]])[0]}\n")
if len(fp):
    print ("=== False positives === ")
    for i in fp[:num_samples]:
        print (f"  {X_test_raw[i]}")
        print (f"    true: {label_encoder.decode([y_test[i]])[0]}")
        print (f"    pred: {label_encoder.decode([y_pred[i]])[0]}\n")
if len(fn):
    print ("=== False negatives ===")
    for i in fn[:num_samples]:
        print (f"  {X_test_raw[i]}")
        print (f"    true: {label_encoder.decode([y_test[i]])[0]}")
        print (f"    pred: {label_encoder.decode([y_pred[i]])[0]}\n")

class = 'transformers'

{
  "precision": 0.6428571428571429,
  "recall": 0.6428571428571429,
  "f1": 0.6428571428571429,
  "num_samples": 28.0
}

=== True positives ===
  simple transformers transformers classification ner qa language modeling language generation t5 multi modal conversational ai
    true: ['language-modeling', 'natural-language-processing', 'question-answering', 'transformers']
    pred: ['attention', 'huggingface', 'language-modeling', 'natural-language-processing', 'transformers']

  bertviz tool visualizing attention transformer model bert gpt 2 albert xlnet roberta ctrl etc
    true: ['attention', 'interpretability', 'natural-language-processing', 'transformers']
    pred: ['attention', 'natural-language-processing', 'transformers']

  summary transformers models high level summary differences model huggingfacetransformer library
    true: ['huggingface', 'natural-language-processing', 'transformers']
    pred: ['huggingface', 'natural-language-processing', 'transformers']

=== False positives ===
  help read text summarization using flask huggingface text summarization translation questions answers generation using huggingface deployed using flask streamlit detailed guide github
    true: ['huggingface', 'natural-language-processing']
    pred: ['huggingface', 'natural-language-processing', 'transformers']

  silero models pre trained enterprise grade stt models silero speech text models provide enterprise grade stt compact form factor several commonly spoken languages
    true: ['pytorch', 'tensorflow']
    pred: ['natural-language-processing', 'transformers']

  evaluation metrics language modeling article focus traditional intrinsic metrics extremely useful process training language model
    true: ['language-modeling', 'natural-language-processing']
    pred: ['language-modeling', 'natural-language-processing', 'transformers']

=== False negatives ===
  t5 fine tuning colab notebook showcase fine tune t5 model various nlp tasks especially non text 2 text tasks text 2 text approach
    true: ['natural-language-processing', 'transformers']
    pred: ['natural-language-processing']

  universal adversarial triggers attacking analyzing nlp create short phrases cause specific model prediction concatenated input dataset
    true: ['natural-language-processing', 'transformers']
    pred: ['natural-language-processing']

  tempering expectations gpt 3 openai api closer look magic behind gpt 3 caveats aware
    true: ['natural-language-processing', 'transformers']
    pred: []

What can we do with this?

How can we leverage this type of inspection to improve on our performance?


We can use the FPs to identify potentially mislabeled samples and use the FNs to see which aspects of the input data our model fails to capture.

Slices

Just inspecting the overall and class metrics isn't enough to deploy our new version to production. There may be key slices of our dataset that we expect to do really well on (e.g. minority groups, large customers, etc.) and we need to ensure that their metrics are also improving. An easy way to create and evaluate slices is to define slicing functions.

from snorkel.slicing import PandasSFApplier
from snorkel.slicing import slice_dataframe
from snorkel.slicing import slicing_function
@slicing_function()
def cv_transformers(x):
    """Projects with the `computer-vision` and `transformers` tags."""
    return all(tag in x.tags for tag in ["computer-vision", "transformers"])
@slicing_function()
def short_text(x):
    """Projects with short titles and descriptions."""
    return len(x.text.split()) < 7  # less than 7 words

Here we're using Snorkel's slicing_function to create our different slices. We can visualize our slices by applying this slicing function to a relevant DataFrame using slice_dataframe.

short_text_df = slice_dataframe(test_df, short_text)
short_text_df[["text", "tags"]].head()
text tags
44 flask sqlalchemy adds sqlalchemy support flask [flask]
69 scikit lego extra blocks sklearn pipelines [scikit-learn]
83 simclr keras tensorflow keras implementation s... [keras, self-supervised-learning, tensorflow]
215 introduction autoencoders look autoencoders re... [autoencoders, representation-learning]

We can define even more slicing functions and create a slices record array using the PandasSFApplier. The slices array has N (# of data points) items and each item has S (# of slicing functions) items, indicating whether that data point is part of that slice. Think of this record array as a masking layer for each slicing function on our data.

# Slices
slicing_functions = [cv_transformers, short_text]
applier = PandasSFApplier(slicing_functions)
slices = applier.apply(test_df)
print (slices)
rec.array([(0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0),
           (1, 0), (0, 0), (0, 1), (0, 0), (0, 0), (1, 0), (0, 0), (0, 0), (0, 1), (0, 0),
           ...
           (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 1),
           (0, 0), (0, 0)],
    dtype=[('cv_transformers', 'i8'), ('short_text', 'i8')])

If our task were multiclass instead of multilabel, we could have used snorkel.analysis.Scorer to retrieve our slice metrics directly, but we've implemented a naive version for our multilabel task based on it (a sketch of the Scorer approach follows the output below).

# Score slices
metrics["slices"] = {}
for slice_name in slices.dtype.names:
    mask = slices[slice_name].astype(bool)
    if sum(mask):
        slice_metrics = precision_recall_fscore_support(
            y_test[mask], y_pred[mask], average="micro"
        )
        metrics["slices"][slice_name] = {}
        metrics["slices"][slice_name]["precision"] = slice_metrics[0]
        metrics["slices"][slice_name]["recall"] = slice_metrics[1]
        metrics["slices"][slice_name]["f1"] = slice_metrics[2]
        metrics["slices"][slice_name]["num_samples"] = len(y_true[mask])
print(json.dumps(metrics["slices"], indent=2))
{
  "pytorch_transformers": {
    "precision": 0.9230769230769231,
    "recall": 0.8571428571428571,
    "f1": 0.888888888888889,
    "num_samples": 3
  },
  "short_text": {
    "precision": 0.8,
    "recall": 0.5714285714285714,
    "f1": 0.6666666666666666,
    "num_samples": 4
  }
}
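
For reference, had this been a multiclass task, we could have retrieved these slice metrics directly with Snorkel's Scorer instead of the loop above. A minimal sketch, assuming 1D arrays of class indices (not our multilabel arrays) and Snorkel's score_slices() interface:

# Sketch for a multiclass setup (not runnable as-is on our multilabel data)
from snorkel.analysis import Scorer
scorer = Scorer(metrics=["precision", "recall", "f1"])
scorer.score_slices(S=slices, golds=y_test, preds=y_pred, probs=y_prob, as_dataframe=True)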

In our testing lesson, we'll cover another way to evaluate our model known as behavioral testing, which we'll also include as part of our performance report.

How can we choose?

With all these different ways to evaluate, how can we choose "the best" option when some solutions are better on certain evaluation criteria than others?


You and your team need to agree on which evaluation criteria are most important and what the minimum performance required for each one is. This allows you to filter out the solutions that don't satisfy all the minimum requirements and then rank the remaining ones by how well they perform on the highest-priority criteria.
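
As a hypothetical sketch of that filter-then-rank process (the solution names, metrics and thresholds below are all made up):

# Hypothetical candidate solutions and minimum requirements
solutions = {
    "baseline": {"f1": 0.66, "latency_ms": 50},
    "candidate_a": {"f1": 0.72, "latency_ms": 90},
    "candidate_b": {"f1": 0.75, "latency_ms": 400},
}
requirements = {"min_f1": 0.70, "max_latency_ms": 200}

# Filter out solutions that don't satisfy all minimum requirements
viable = {
    name: s for name, s in solutions.items()
    if s["f1"] >= requirements["min_f1"] and s["latency_ms"] <= requirements["max_latency_ms"]
}

# Rank the remaining solutions by the highest-priority criterion (here, f1)
best = max(viable, key=lambda name: viable[name]["f1"])
print (best)  # candidate_a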


To cite this lesson, please use:

@article{madewithml,
    author       = {Goku Mohandas},
    title        = { Evaluation - Made With ML },
    howpublished = {\url{https://madewithml.com/}},
    year         = {2021}
}