
Data Labeling

Labeling our data with intention before using it to construct our ML systems.
Goku Mohandas


Labeling (or annotation) is the process of identifying the inputs and outputs that are worth modeling (not just what could be modeled).

  • use objective as a guide to determine the necessary signals.
  • explore creating new signals (via combining features, collecting new data, etc.).
  • iteratively add more features to justify complexity and effort.


Be careful not to include features in the dataset that will not be available during prediction, causing data leaks.
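
One simple guard against this kind of leak is to keep an explicit allow-list of the fields that will exist at prediction time and drop everything else before training. The sketch below is only an illustration: the field names, including the `moderator_decision` field, are hypothetical and not part of our dataset.

```python
# Hypothetical allow-list: fields we assume are available when serving predictions.
AVAILABLE_AT_PREDICTION = {"title", "description"}

def drop_leaky_features(record, allowed=AVAILABLE_AT_PREDICTION):
    """Keep only the fields that will also exist at prediction time."""
    return {key: value for key, value in record.items() if key in allowed}

record = {
    "title": "Awesome Graph Classification",
    "description": "A collection of important graph embedding papers.",
    "moderator_decision": "approved",  # hypothetical field, only known after the fact
}
features = drop_leaky_features(record)
print(features)
```

Keeping the allow-list in one place makes it easier to review whenever a new signal is added.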

What else can we learn?

It's not just about identifying and labeling our initial dataset. What else can we learn from it?

Show answer

It's also the phase where we can use our deep understanding of the problem, processes, constraints and domain expertise to:

- augment the training data split
- enhance using auxiliary data
- simplify using constraints
- remove noisy samples
- improve the labeling process


Regardless of whether we have a custom labeling platform or we choose a generalized platform, the process of labeling and all its related workflows (QA, data import/export, etc.) follow a similar approach.

Preliminary steps

  • [WHAT] Decide what needs to be labeled:
    • identify natural labels you may already have (ex. time-series)
    • consult with domain experts to ensure you're labeling the appropriate signals
    • decide on the appropriate labels (and hierarchy) for your task
  • [WHERE] Design the labeling interface:
    • intuitive, data modality dependent and quick (keybindings are a must!)
    • avoid option paralysis by allowing the labeler to dig deeper or suggesting likely labels
    • measure and resolve inter-labeler discrepancy
  • [HOW] Compose labeling instructions:
    • examples of each labeling scenario
    • course of action for discrepancies
labeling view
Multi-label text classification for our task using Prodigy (labeling + QA)
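
One common way to measure inter-labeler discrepancy is Cohen's kappa, which corrects the raw agreement rate between two labelers for the agreement expected by chance. This is a minimal sketch with made-up labels from two hypothetical labelers, assuming each sample receives a single label. It is not necessarily the metric a given labeling platform reports.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Agreement between two labelers, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both labelers must label the same non-empty set of samples.")
    n = len(labels_a)
    # Fraction of samples where both labelers chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each labeler picked labels at their own base rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labelers used one identical label throughout
    return (observed - expected) / (1 - expected)

# Made-up labels from two hypothetical labelers over the same six samples.
labeler_1 = ["natural-language-processing", "computer-vision", "computer-vision",
             "mlops", "natural-language-processing", "computer-vision"]
labeler_2 = ["natural-language-processing", "computer-vision", "natural-language-processing",
             "mlops", "natural-language-processing", "computer-vision"]

kappa = cohen_kappa(labeler_1, labeler_2)
# Indices of samples to escalate according to the labeling instructions.
disagreements = [i for i, (a, b) in enumerate(zip(labeler_1, labeler_2)) if a != b]
print(f"kappa: {kappa:.3f}, disagreements: {disagreements}")
```

The samples where the labelers disagree are the ones to route through the course of action defined in the labeling instructions.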

Workflow setup

  • Establish data pipelines:
    • [IMPORT] new data for annotation
    • [EXPORT] annotated data for QA, testing, modeling, etc.
  • Create a quality assurance (QA) workflow:
    • separate from labeling workflow (no bias)
    • communicates with labeling workflow to escalate errors
labeling workflow

Iterative setup

  • Implement strategies to reduce labeling efforts
    • identify subsets of the data to label next using active learning
    • auto-label entire or parts of a dataset using weak supervision
    • focus labeling efforts on long tail of edge cases over time

Check out the data-centric AI lesson to learn more about the nuances of how labeling plays a crucial part in the data-driven development process.


  • projects.json: projects with title, description and tag.
  • tags.json: auxiliary information on the tags we care about for our platform.

Recall that our objective was to classify incoming content so that the community can discover it.


We'll first load our dataset from the JSON file.

from collections import Counter
import ipywidgets as widgets
import itertools
import json
import pandas as pd
from urllib.request import urlopen

Traditionally, our data assets will be stored, versioned and updated in a database, warehouse, etc. We'll learn more about these different data management systems later, but for now, we'll load our data as a JSON file from our repository.

# Load projects
url = ""
projects = json.loads(urlopen(url).read())
print (f"{len(projects)} projects")
print (json.dumps(projects[0], indent=2))
955 projects
{
  "id": 6,
  "created_on": "2020-02-20 06:43:18",
  "title": "Comparison between YOLO and RCNN on real world videos",
  "description": "Bringing theory to experiment is cool. We can easily train models in colab and find the results in minutes.",
  "tag": "computer-vision"
}

Now we can load our data into a Pandas DataFrame.

# Create dataframe
df = pd.DataFrame(projects)
print (f"{len(df)} projects")
df.head()

    id   created_on           title                                              description                                        tag
0   6    2020-02-20 06:43:18  Comparison between YOLO and RCNN on real world...  Bringing theory to experiment is cool. We can ...  computer-vision
1   7    2020-02-20 06:47:21  Show, Infer & Tell: Contextual Inference for C...  The beauty of the work lies in the way it arch...  computer-vision
2   9    2020-02-24 16:24:45  Awesome Graph Classification                       A collection of important graph embedding, cla...  graph-learning
3   15   2020-02-28 23:55:26  Awesome Monte Carlo Tree Search                    A curated list of Monte Carlo tree search pape...  reinforcement-learning
4   19   2020-03-03 13:54:31  Diffusion to Vector                                Reference implementation of Diffusion2Vec (Com...  graph-learning
# Most common tags
tags = Counter(df.tag.values)
tags.most_common()
[('natural-language-processing', 388),
 ('computer-vision', 356),
 ('mlops', 79),
 ('reinforcement-learning', 56),
 ('graph-learning', 45),
 ('time-series', 31)]

We'll address the data imbalance after splitting into our train split and prior to training our model.


We're also going to be using an auxiliary dataset which contains a collection of all the tags that are currently relevant to us.

# Load tags
url = ""
tags_dict = {}
for item in json.loads(urlopen(url).read()):
    key = item.pop("tag")
    tags_dict[key] = item
print (f"{len(tags_dict)} tags")
4 tags
def display_tag_details(tag="computer-vision"):
    print (json.dumps(tags_dict[tag], indent=2))
"computer-vision": {
  "aliases": [

It's important that this auxiliary information about our tags resides in a separate location so that everyone uses the same source of truth. This asset can also be versioned and kept up-to-date.


We're going to apply several constraints on labeling our data:

  • if a data point has a tag that we currently don't support, we'll replace it with other
  • if a certain tag doesn't have enough samples, we'll replace it with other

# Out of scope (OOS) tags
oos_tags = [item for item in df.tag.unique() if item not in tags_dict.keys()]
oos_tags
['reinforcement-learning', 'time-series']
# Samples with OOS tags
oos_indices = df[df.tag.isin(oos_tags)].index
df.loc[oos_indices].head()
    id   created_on           title                                       description                                        tag
3   15   2020-02-28 23:55:26  Awesome Monte Carlo Tree Search             A curated list of Monte Carlo tree search pape...  reinforcement-learning
37  121  2020-03-24 04:56:38  Deep Reinforcement Learning in TensorFlow2  deep-rl-tf2 is a repository that implements a ...  reinforcement-learning
67  218  2020-04-06 11:29:57  Distributional RL using TensorFlow2         Implementation of various Distributional Rei...    reinforcement-learning
74  239  2020-04-06 18:39:48  Prophet: Forecasting At Scale               Tool for producing high quality forecasts for ...  time-series
95  277  2020-04-07 00:30:33  Curriculum for Reinforcement Learning       Curriculum learning applied to reinforcement l...  reinforcement-learning
# Replace this tag with other
df.tag = df.tag.apply(lambda x: "other" if x in oos_tags else x)
# OOS samples should be "other"
df.loc[oos_indices].head()
    id   created_on           title                                       description                                        tag
3   15   2020-02-28 23:55:26  Awesome Monte Carlo Tree Search             A curated list of Monte Carlo tree search pape...  other
37  121  2020-03-24 04:56:38  Deep Reinforcement Learning in TensorFlow2  deep-rl-tf2 is a repository that implements a ...  other
67  218  2020-04-06 11:29:57  Distributional RL using TensorFlow2         Implementation of various Distributional Rei...    other
74  239  2020-04-06 18:39:48  Prophet: Forecasting At Scale               Tool for producing high quality forecasts for ...  other
95  277  2020-04-07 00:30:33  Curriculum for Reinforcement Learning       Curriculum learning applied to reinforcement l...  other

We're also going to restrict the mapping to only tags that are above a certain frequency threshold. The tags that don't have enough projects will not have enough samples to model their relationships.

# Minimum frequency required for a tag
min_tag_freq = 75
tags = Counter(df.tag.values)
# Tags that just made / missed the cut
@widgets.interact(min_tag_freq=(0, tags.most_common()[0][1]))
def separate_tags_by_freq(min_tag_freq=min_tag_freq):
    tags_above_freq = Counter(tag for tag in tags.elements()
                                    if tags[tag] >= min_tag_freq)
    tags_below_freq = Counter(tag for tag in tags.elements()
                                    if tags[tag] < min_tag_freq)
    print ("Most popular tags:\n", tags_above_freq.most_common(3))
    print ("\nTags that just made the cut:\n", tags_above_freq.most_common()[-3:])
    print ("\nTags that just missed the cut:\n", tags_below_freq.most_common(3))
Most popular tags:
 [('natural-language-processing', 388), ('computer-vision', 356), ('other', 87)]

Tags that just made the cut:
 [('computer-vision', 356), ('other', 87), ('mlops', 79)]

Tags that just missed the cut:
 [('graph-learning', 45)]
def filter_tag(tag, include=None):
    """Return the tag if it is in `include`, otherwise None."""
    # Named filter_tag to avoid shadowing the built-in filter();
    # include defaults to None instead of a mutable list.
    if include is None or tag not in include:
        return None
    return tag
# Filter tags that have fewer than <min_tag_freq> occurrences
tags_above_freq = Counter(tag for tag in tags.elements()
                          if (tags[tag] >= min_tag_freq))
df.tag = df.tag.apply(filter_tag, include=list(tags_above_freq.keys()))
# Fill None with other
df.tag = df.tag.fillna("other")
# Remove projects with no relevant tags
df = df[df.tag.notnull()]
print (f"{len(df)} projects")
955 projects
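
For reuse outside the notebook, the two constraints can be folded into a single function. This is a standard-library sketch rather than the lesson's implementation: the function name `apply_tag_constraints`, its parameters and the toy tags below are all assumptions made for illustration.

```python
from collections import Counter

def apply_tag_constraints(tags, supported, min_freq, fallback="other"):
    """Map unsupported or infrequent tags to a fallback label."""
    # Constraint 1: tags we don't currently support become the fallback.
    tags = [tag if tag in supported else fallback for tag in tags]
    # Constraint 2: tags without enough samples become the fallback.
    counts = Counter(tags)
    return [tag if counts[tag] >= min_freq else fallback for tag in tags]

# Toy tags (hypothetical): "rl" is unsupported and "graph" is too rare.
raw = ["nlp"] * 5 + ["cv"] * 4 + ["graph"] * 2 + ["rl"] * 3
clean = apply_tag_constraints(raw, supported={"nlp", "cv", "graph"}, min_freq=3)
print(Counter(clean))
```

As in the steps above, frequencies are counted after the out-of-scope replacement, so the fallback class itself counts toward the threshold and the number of samples never changes.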

We'll save our clean, labeled data as a separate asset so we don't have to repeat these steps later on and so we don't alter the integrity of our raw data assets.

# Save clean labeled data
with open("labeled_projects.json", "w") as fp:
    json.dump(df.to_dict("records"), fp, indent=4)


We could have used the user-provided tags as our labels, but what if a user added a wrong tag or forgot to add a relevant one? To remove this dependency on the user to provide the gold standard labels, we can leverage labeling tools and platforms. These tools allow for quick and organized labeling of the dataset to ensure its quality. And instead of starting from scratch and asking our labeler to provide all the relevant tags for a given project, we can provide the author's original tags and ask the labeler to add / remove as necessary. The specific labeling tool may need to be custom built or may be something we leverage from the ecosystem.

As our platform grows, so too will our dataset and labeling needs, so it's imperative to use the proper tooling that supports the workflows we'll depend on.


  • Labelbox: the data platform for high quality training and validation data for AI applications.
  • Scale AI: data platform for AI that provides high quality training data.
  • Label Studio: a multi-type data labeling and annotation tool with standardized output format.
  • Universal Data Tool: collaborate and label any type of data, images, text, or documents in an easy web interface or desktop app.
  • Prodigy: a fully scriptable annotation tool, with recipes for common annotation workflows.
  • Superintendent: an ipywidget-based interactive labelling tool for your data to enable active learning.

Natural language processing

  • Doccano: an open source text annotation tool for text classification, sequence labeling and sequence to sequence tasks.
  • BRAT: a rapid annotation tool for all your textual annotation needs.

Computer vision

  • LabelImg: a graphical image annotation tool for labeling object bounding boxes in images.
  • CVAT: a free, online, interactive video and image annotation tool for computer vision.
  • VoTT: an electron app for building end-to-end object detection models from images and videos.
  • a free to use online tool for labelling photos.
  • remo: an app for annotations and images management in computer vision.
  • Labelai: an online tool designed to label images, useful for training AI models.


  • Audino: an open source audio annotation tool for voice activity detection (VAD), diarization, speaker identification, automated speech recognition, emotion recognition tasks, etc.
  • audio-annotator: a JavaScript interface for annotating and labeling audio files.
  • EchoML: a web app to play, visualize, and annotate your audio files for machine learning.


  • MedCAT: a medical concept annotation tool that can extract information from Electronic Health Records (EHRs) and link it to biomedical ontologies like SNOMED-CT and UMLS.

Generalized labeling solutions

What criteria should we use to evaluate what labeling platform to use?


It's important to pick a generalized platform that has all the major labeling features for your data modality with the capability to easily customize the experience.

  • how easy is it to connect to our data sources (DB, QA, etc.)?
  • how easy is it to make changes (new features, labeling paradigms)?
  • how securely is our data treated (on-prem, trust, etc.)?

However, as an industry trend, this balance between generalization and specificity is difficult to strike, so many teams put in the upfront effort to create bespoke labeling platforms or use industry-specific, niche labeling tools.

Active learning

Even with a powerful labeling tool and established workflows, it's easy to see how involved and expensive labeling can be. Therefore, many teams employ active learning to iteratively label the dataset and evaluate the model.

  1. Label a small, initial dataset to train a model.
  2. Ask the trained model to predict on some unlabeled data.
  3. Determine which new data points to label from the unlabeled data based on:
    • entropy over the predicted class probabilities
    • samples with lowest predicted, calibrated, confidence (uncertainty sampling)
    • discrepancy in predictions from an ensemble of trained models
  4. Repeat until the desired performance is achieved.

This can be significantly more cost-effective and faster than labeling the entire dataset.
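
Step 3 above can be sketched in a few lines. The example ranks unlabeled samples by the entropy of their predicted class probabilities, with least confidence as an alternative score. The sample ids and probabilities are made up, and the probabilities are assumed to be calibrated already.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def least_confidence(probs):
    """Uncertainty as one minus the highest predicted probability."""
    return 1.0 - max(probs)

def select_for_labeling(predictions, k, score=entropy):
    """Return the ids of the k unlabeled samples with the highest uncertainty."""
    ranked = sorted(predictions,
                    key=lambda sample_id: score(predictions[sample_id]),
                    reverse=True)
    return ranked[:k]

# Made-up predicted class probabilities for four unlabeled samples.
predictions = {
    "a": [0.98, 0.01, 0.01],
    "b": [0.40, 0.35, 0.25],
    "c": [0.70, 0.20, 0.10],
    "d": [0.34, 0.33, 0.33],
}
to_label = select_for_labeling(predictions, k=2)
print(to_label)
```

The selected samples are sent to the labelers, added to the training set, and the loop repeats with a retrained model.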

active learning


  • modAL: a modular active learning framework for Python.
  • libact: pool-based active learning in Python.
  • ALiPy: active learning python toolbox, which allows users to conveniently evaluate, compare and analyze the performance of active learning methods.

Weak supervision

If we have samples that need labeling, or if we simply want to validate existing labels, we can use weak supervision to generate labels as opposed to hand labeling all of them. We could utilize weak supervision via labeling functions to label our existing and new data, where we can create constructs based on keywords, pattern expressions, knowledge bases, etc. We can add to the labeling functions over time and even mitigate conflicts among the different labeling functions. We'll use these labeling functions to create and evaluate slices of our data in the evaluation lesson.

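from snorkel.labeling import labeling_function

@labeling_function()
def contains_tensorflow(text):
    """Label the text as tensorflow if it mentions the library, otherwise abstain."""
    condition = any(tag in text.lower() for tag in ("tensorflow", "tf"))
    return "tensorflow" if condition else None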
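
To illustrate how conflicts among labeling functions might be mitigated, here is a plain-Python sketch that combines several keyword heuristics with a majority vote and abstains on ties. It is a stand-in for illustration only and does not use Snorkel's label model. The three labeling functions and their keywords are hypothetical.

```python
from collections import Counter

ABSTAIN = None

# Hypothetical keyword-based labeling functions.
def lf_tensorflow(text):
    return "tensorflow" if "tensorflow" in text.lower() else ABSTAIN

def lf_keras(text):
    # Assumption for this sketch: a mention of Keras suggests the tensorflow label.
    return "tensorflow" if "keras" in text.lower() else ABSTAIN

def lf_pytorch(text):
    return "pytorch" if "torch" in text.lower() else ABSTAIN

def majority_vote(text, labeling_functions):
    """Combine labeling function votes; abstain when there are no votes or a tie."""
    votes = [lf(text) for lf in labeling_functions]
    votes = [vote for vote in votes if vote is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting heuristics: leave the sample for a human labeler
    return ranked[0][0]

lfs = [lf_tensorflow, lf_keras, lf_pytorch]
for text in ("Image models with Keras and TensorFlow",
             "Porting a TensorFlow model to PyTorch",
             "A curated list of graph papers"):
    print(text, "->", majority_vote(text, lfs))
```

Abstaining on conflicts keeps the weak labels conservative. The samples left unlabeled are natural candidates for manual labeling.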

An easy way to validate our labels (before modeling) is to use the aliases in our auxiliary datasets to create labeling functions for the different classes. Then we can look for false positives and negatives to identify potentially mislabeled samples. We'll actually implement a similar kind of inspection approach, but using a trained model as a heuristic, in our dashboards lesson.
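
A rough sketch of this alias-based validation follows. It assumes the tag metadata is shaped like `tags_dict` above, with an `aliases` list per tag. The alias values, the sample projects and the helper names `make_alias_lf` and `find_suspects` are hypothetical.

```python
import re

# Hypothetical alias table, shaped like tags_dict (tag -> {"aliases": [...]}).
tag_aliases = {
    "computer-vision": {"aliases": ["cv", "vision"]},
    "natural-language-processing": {"aliases": ["nlp", "nlproc"]},
}

def make_alias_lf(tag, aliases):
    """Build a labeling function that fires when the tag or one of its aliases appears."""
    terms = [tag, tag.replace("-", " "), *aliases]
    pattern = re.compile(r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
                         re.IGNORECASE)
    def lf(text):
        return tag if pattern.search(text) else None
    return lf

def find_suspects(samples, lfs):
    """Flag samples where some heuristic fires but none agrees with the assigned tag."""
    suspects = []
    for sample in samples:
        text = f"{sample['title']} {sample['description']}"
        predicted = {lf(text) for lf in lfs} - {None}
        if predicted and sample["tag"] not in predicted:
            suspects.append((sample["id"], sample["tag"], sorted(predicted)))
    return suspects

lfs = [make_alias_lf(tag, info["aliases"]) for tag, info in tag_aliases.items()]
samples = [  # made-up projects for illustration
    {"id": 1, "title": "Comparison between YOLO and RCNN",
     "description": "Computer vision models on real world videos", "tag": "computer-vision"},
    {"id": 2, "title": "NLP with transformers",
     "description": "Fine-tuning for text classification", "tag": "computer-vision"},
    {"id": 3, "title": "Awesome Graph Classification",
     "description": "A collection of graph embedding papers", "tag": "graph-learning"},
]
suspects = find_suspects(samples, lfs)
print(suspects)
```

Flagged samples are only candidates for review, since keyword heuristics can be wrong in both directions.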


Labeling isn't just a one-time event or something we repeat identically. As new data is available, we'll want to strategically label the appropriate samples and improve slices of our data that are lacking in quality. In fact, there's an entire workflow related to labeling that is initiated when we want to iterate. We'll learn more about this iterative labeling process in our continual learning and data-centric AI lessons.


To cite this lesson, please use:

    author       = {Goku Mohandas},
    title        = { Labeling - Made With ML },
    howpublished = {\url{}},
    year         = {2021}