
Data Labeling

Labeling our data with intention before using it to construct our ML systems.
Goku Mohandas


Labeling (or annotation) is the process of identifying the inputs and outputs that are worth modeling (not just what could be modeled).

  • use objective as a guide to determine the necessary signals
  • explore creating new signals (via combining features, collecting new data, etc.)
  • iteratively add more features to control complexity and effort


Be careful not to include features in the dataset that will not be available at inference time; doing so causes data leakage.
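One way to guard against this, sketched here under assumptions, is to keep an explicit allow-list of features that are known to exist at inference time and drop everything else before training. The sample record, the `num_likes` field and the `select_inference_features` helper are hypothetical illustrations, not part of our dataset or pipeline.

```python
# Hypothetical allow-list: features we expect to have when a prediction is requested.
INFERENCE_FEATURES = {"title", "description"}


def select_inference_features(sample, allowed=INFERENCE_FEATURES, label_key="tags"):
    """Keep only allow-listed features (plus the label) and report what was dropped."""
    kept = {k: v for k, v in sample.items() if k in allowed or k == label_key}
    dropped = sorted(set(sample) - set(kept))
    return kept, dropped


# Made-up sample: `num_likes` only exists after a project has been published,
# so training on it would leak information we won't have at inference time.
sample = {
    "title": "Example project",
    "description": "A short description of the project.",
    "num_likes": 42,
    "tags": ["code"],
}
kept, dropped = select_inference_features(sample)
print(dropped)
```

Reporting the dropped fields, rather than silently discarding them, makes it easier to notice when a new column sneaks into the dataset.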

What else can we learn?

It's not just about identifying and labeling our initial dataset. What else can we learn from it?


It's also the phase where we can use our deep understanding of the problem, processes, constraints and domain expertise to:

- augment the training data split
- enhance using auxiliary data
- simplify using constraints
- remove noisy samples
- figure out where the labeling process can be improved


Regardless of whether we have a custom labeling platform or choose a generalized platform, the process of managing labeling and all its related workflows (QA, data import/export, etc.) follows a similar approach.

Preliminary steps

  • Decide what needs to be labeled
    • consult with domain experts to ensure you're labeling the appropriate signals
    • consolidate the appropriate labels for your task (consider hierarchical labeling)
  • Design the interface where labeling will be done
    • intuitive, data modality dependent and quick (keybindings are a must!)
    • avoid option paralysis by allowing labeler to dig deeper or suggest likely labels
    • account for measuring and resolving inter-labeler discrepancy
  • Compose highly detailed labeling instructions for annotators
    • examples of each labeling scenario
    • course of action for discrepancies
labeling view
Multi-label text classification for our task using Prodigy (labeling + QA)
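To make "measuring inter-labeler discrepancy" concrete, here is a minimal sketch of Cohen's kappa, which scores the agreement between two labelers after correcting for the agreement expected by chance. It assumes a single label per item for simplicity (our task is multi-label, where this would typically be computed per tag), and the label lists are made up.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two labelers, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both labelers must label the same, non-empty set of items.")
    n = len(labels_a)
    # Fraction of items where both labelers chose the same class.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each labeler picked classes at their own base rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labelers used one identical class throughout
    return (observed - expected) / (1 - expected)


# Made-up annotations from two labelers over the same six items.
labeler_1 = ["nlp", "cv", "nlp", "cv", "nlp", "cv"]
labeler_2 = ["nlp", "cv", "cv", "cv", "nlp", "nlp"]
kappa = cohen_kappa(labeler_1, labeler_2)
print(round(kappa, 3))
```

A value near 1 suggests the instructions are being applied consistently, while a value near 0 suggests agreement is no better than chance and the instructions or label set may need revisiting.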

Workflow setup

  • Establish data import and export pipelines
    • need to know when new data is ready for annotation
    • need to know when annotated data is ready for QA, modeling, etc.
  • Create a quality assurance (QA) workflow
    • separate from labeling workflow
    • yet still communicates with labeling workflow for relabeling errors
labeling workflow

Iterative setup

  • Implement strategies to reduce labeling efforts
    • determine subsets of the data to label next using active learning
    • auto-label entire or parts of a data point using weak supervision
    • focus constrained labeling efforts on long tail of edge cases over time

Check out the data-centric AI lesson to learn more about how labeling plays a crucial part in the data-driven development process.


  • projects.json: projects with title, description and tags (cleaned by mods).
  • tags.json: tags used in dropdown to aid autocompletion.

Recall that our objective was to help authors add the appropriate tags to their projects so the community can discover them. So we want to use the metadata provided in each project to determine which tags are relevant. We'll start with the most influential features and iteratively experiment with additional ones.

Load data

We'll first load our dataset from the JSON file.

from collections import Counter, OrderedDict
import itertools
import json
from urllib.request import urlopen

import ipywidgets as widgets
import pandas as pd

# Load projects
url = ""
projects = json.loads(urlopen(url).read())
print(json.dumps(projects[-305], indent=2))

  "id": 324,
  "title": "AdverTorch",
  "description": "A Toolbox for Adversarial Robustness Research",
  "tags": [

Now we can load our data into a Pandas DataFrame.

# Create dataframe
df = pd.DataFrame(projects)
print(f"{len(df)} projects")
df.head(5)

   id  created_on           title                                              description                                        tags
0  1   2020-02-17 06:30:41  Machine Learning Basics                            A practical set of notebooks on machine learni...  [code, tutorial, keras, pytorch, tensorflow, d...
1  2   2020-02-17 06:41:45  Deep Learning with Electronic Health Record (E...  A comprehensive look at recent machine learnin...  [article, tutorial, deep-learning, health, ehr]
2  3   2020-02-20 06:07:59  Automatic Parking Management using computer vi...  Detecting empty and parked spaces in car parki...  [code, tutorial, video, python, machine-learni...
3  4   2020-02-20 06:21:57  Easy street parking using region proposal netw...  Get a text on your phone whenever a nearby par...  [code, tutorial, python, pytorch, machine-lear...
4  5   2020-02-20 06:29:18  Deep Learning based parking management system ...  Fastai provides easy to use wrappers to quickl...  [code, tutorial, fastai, deep-learning, parkin...

We want to add features iteratively because each one introduces more complexity and effort. We may have additional data about each project, such as author info, HTML from links in the description, etc. While these may carry meaningful signal, we want to introduce them slowly, after we close the loop.
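As a minimal sketch of starting small, we could combine only the `title` and `description` of each project into a single text input and leave everything else out for now. The `text` field name, the `to_text` helper and the sample records are assumptions made for illustration.

```python
def to_text(project):
    """Join the two starting features into one input string, skipping empty parts."""
    title = (project.get("title") or "").strip()
    description = (project.get("description") or "").strip()
    return " ".join(part for part in (title, description) if part)


# Made-up records with the same keys as our projects.
example_projects = [
    {"id": 1, "title": "Example project", "description": "A short description.", "tags": ["code"]},
    {"id": 2, "title": "Title only", "description": None, "tags": []},
]
features = [{"text": to_text(p), "tags": p["tags"]} for p in example_projects]
print(features[0]["text"])
```

Additional signals (author info, linked HTML, etc.) can later be appended to this input one at a time, so the effect of each addition can be measured on its own.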

Over time, our dataset will grow and we'll need to label new data. So far, a team of moderators has cleaned the existing data, but we'll need to establish a proper workflow to make this process easier and more reliable. Typically, we'll use collaborative UIs where annotators can fix errors, etc. and then use a tool like Airflow or Kubeflow Pipelines for workflow / pipeline orchestration to know when new data is ready to be labeled and also when it's ready to be used for QA and, eventually, modeling.

Auxiliary data

We're also going to be using an auxiliary dataset which contains a collection of all the tags with their aliases and parent/child relationships. This auxiliary dataset was used by our application to automatically add the relevant parent tags when the child tags were present.

# Load tags
url = ""
tags = json.loads(urlopen(url).read())

# Index each tag's details (aliases, parents) by tag name
tags_dict = {}
for item in tags:
    key = item.pop("tag")
    tags_dict[key] = item
print(f"{len(tags_dict)} tags")

def display_tag_details(tag="question-answering"):
    """Pretty-print the auxiliary details for a tag."""
    print(json.dumps(tags_dict[tag], indent=2))
"question-answering": {
    "aliases": [
    "parents": [
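To illustrate how parent tags could be added automatically when a child tag is present, here is a small sketch. It assumes each entry in `tags_dict` has a `parents` list (as in the output above) and that parents should be followed transitively; the application's actual logic isn't shown in this lesson, and the toy entries below are made up.

```python
def add_parent_tags(tags, tags_dict):
    """Return `tags` plus any ancestors listed under `parents`, without duplicates."""
    result, seen = [], set()
    stack = list(reversed(tags))  # process tags in their original order
    while stack:
        tag = stack.pop()
        if tag in seen:
            continue
        seen.add(tag)
        result.append(tag)
        # Unknown tags simply have no parents to add.
        parents = tags_dict.get(tag, {}).get("parents", [])
        stack.extend(reversed(parents))
    return result


# Toy auxiliary data with the same shape as `tags_dict` (values are hypothetical).
toy_tags_dict = {
    "question-answering": {"aliases": ["qa"], "parents": ["natural-language-processing"]},
    "natural-language-processing": {"aliases": ["nlp"], "parents": []},
}
expanded = add_parent_tags(["question-answering", "pytorch"], toy_tags_dict)
print(expanded)
```

Tracking already-seen tags keeps the expansion safe even if the auxiliary data contains a cycle or a tag reachable through several parents.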

Data imbalance

With our datasets, we may often notice a data imbalance problem where a range of continuous values (regression) or certain classes (classification) have too little data to learn from. This becomes a major issue during training because the model will learn to generalize to the data available and perform poorly on regions where the data is sparse. There are several techniques to mitigate data imbalance, including resampling (oversampling minority classes / undersampling majority classes), accounting for the data distribution via the loss function (since that drives the learning process), etc.
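As a rough sketch of the loss-function route, we can count how often each tag appears and derive inverse-frequency weights, so that rarer tags contribute more to the loss. The weighting formula is one common heuristic rather than the only option, and the tag lists and helper names below are made up for illustration.

```python
from collections import Counter
import itertools


def tag_counts(tag_lists):
    """Count occurrences of each tag across all samples (multi-label)."""
    return Counter(itertools.chain.from_iterable(tag_lists))


def inverse_frequency_weights(counts):
    """Weight each class by total / (num_classes * count), so rare classes weigh more."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {tag: total / (num_classes * count) for tag, count in counts.items()}


# Made-up tags for four projects.
toy_tags = [["code", "pytorch"], ["code"], ["code", "tensorflow"], ["code", "pytorch"]]
counts = tag_counts(toy_tags)
weights = inverse_frequency_weights(counts)
print(counts.most_common())
print(weights)
```

Inspecting `counts.most_common()` first is also a quick way to see how severe the imbalance is before deciding whether any mitigation is needed.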

How can we do better?

The techniques above indirectly address data imbalance by manipulating parts of the data / system. What's the best solution to data imbalance?


While these data imbalance mitigation techniques will allow our model to perform decently, the best long-term approach is to address the imbalance directly. Identify which areas of the data need more samples and go collect them! This is a more robust approach than forcing the model to learn from repeated samples or to ignore samples.


We could have used the user-provided tags as our labels, but what if the user added a wrong tag or forgot to add a relevant one? To remove this dependency on the user to provide the gold standard labels, we can leverage labeling tools and platforms. These tools allow for quick and organized labeling of the dataset to ensure its quality. And instead of starting from scratch and asking our labeler to provide all the relevant tags for a given project, we can provide the author's original tags and ask the labeler to add / remove tags as necessary. The specific labeling tool may need to be custom built, or we may be able to leverage something from the ecosystem.


  • Scale AI: the data platform for high quality training and validation data for AI applications.
  • Label Studio: a multi-type data labeling and annotation tool with standardized output format.
  • Universal Data Tool: collaborate and label any type of data, images, text, or documents in an easy web interface or desktop app.
  • Prodigy: recipes for Prodigy, a fully scriptable annotation tool.
  • Superintendent: an ipywidget-based interactive labelling tool for your data to enable active learning.

Natural language processing

  • Doccano: an open source text annotation tool for text classification, sequence labeling and sequence to sequence tasks.
  • BRAT: a rapid annotation tool for all your textual annotation needs.

Computer vision

  • LabelImg: a graphical image annotation tool for labeling object bounding boxes in images.
  • CVAT: a free, online, interactive video and image annotation tool for computer vision.
  • VoTT: an electron app for building end-to-end object detection models from images and videos.
  • a free to use online tool for labelling photos.
  • remo: an app for annotations and images management in computer vision.
  • Labelai: an online tool designed to label images, useful for training AI models.


  • Audino: an open source audio annotation tool for voice activity detection (VAD), diarization, speaker identification, automated speech recognition, emotion recognition tasks, etc.
  • audio-annotator: a JavaScript interface for annotating and labeling audio files.
  • EchoML: a web app to play, visualize, and annotate your audio files for machine learning.


  • MedCAT: a medical concept annotation tool that can extract information from Electronic Health Records (EHRs) and link it to biomedical ontologies like SNOMED-CT and UMLS.

Generalized labeling solutions

What criteria should we use to evaluate what labeling platform to use?


It's important to pick a generalized platform that has all the major labeling features for your data modality with the capability to easily customize the experience.

  • how easy is it to connect to our data sources (DB, QA, etc.)?
  • how easy is it to make changes (new features, labeling paradigms)?
  • how securely is our data treated (on-prem, trust, etc.)?

However, across the industry, this balance between generalization and specificity has been difficult to strike, so many teams put in the upfront effort to create bespoke labeling platforms or use industry-specific, niche labeling tools.

Active learning

Even with a powerful labeling tool and established workflows, it's easy to see how involved and expensive labeling can be. Therefore, many teams employ active learning to iteratively label the dataset and evaluate the model.

  1. Label a small, initial dataset to train a model.
  2. Ask the trained model to predict on some unlabeled data.
  3. Determine which new data points to label from the unlabeled data based on:
    • entropy over the predicted class probabilities
    • samples with lowest predicted, calibrated, confidence (uncertainty sampling)
    • discrepancy in predictions from an ensemble of trained models
  4. Repeat until the desired performance is achieved.

This can be significantly more cost-effective and faster than labeling the entire dataset.
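Step 3 above can be sketched with entropy-based selection: rank the unlabeled samples by the entropy of their predicted class probabilities and send the most uncertain ones to be labeled. The probabilities below are made-up stand-ins for a trained model's output, and `select_for_labeling` is a hypothetical helper.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(predictions, k):
    """Return the ids of the k samples whose predictions have the highest entropy."""
    ranked = sorted(predictions, key=lambda i: entropy(predictions[i]), reverse=True)
    return ranked[:k]


# Made-up predicted probabilities for four unlabeled samples over three classes.
predictions = {
    "a": [0.98, 0.01, 0.01],  # confident
    "b": [0.34, 0.33, 0.33],  # nearly uniform, most uncertain
    "c": [0.70, 0.20, 0.10],
    "d": [0.50, 0.49, 0.01],
}
to_label = select_for_labeling(predictions, k=2)
print(to_label)
```

In a full loop, the selected samples would be labeled, added to the training set, and the model retrained before the next round of selection.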

active learning


  • modAL: a modular active learning framework for Python.
  • libact: pool-based active learning in Python.
  • ALiPy: active learning python toolbox, which allows users to conveniently evaluate, compare and analyze the performance of active learning methods.

Weak supervision

If we have samples that need labeling, or if we simply want to validate existing labels, we can use weak supervision to generate labels as opposed to hand labeling all of them. We can apply weak supervision via labeling functions to label our existing and new data, creating constructs based on keywords, pattern expressions, knowledge bases, etc. We can add to the labeling functions over time and even mitigate conflicts among the different labeling functions. We'll use these labeling functions to create and evaluate slices of our data in the evaluation lesson.

from snorkel.labeling import labeling_function

@labeling_function()
def contains_tensorflow(text):
    # Note: substring matching, so "tf" also matches inside longer words
    condition = any(tag in text.lower() for tag in ("tensorflow", "tf"))
    return "tensorflow" if condition else None

An easy way to validate our labels (before modeling) is to use the aliases in our auxiliary dataset to create labeling functions for the different classes. Then we can look for false positives and negatives to identify potentially mislabeled samples. We'll actually implement a similar kind of inspection approach, but using a trained model as a heuristic, in our dashboards lesson.
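A minimal sketch of this validation idea, written in plain Python rather than Snorkel so it runs on its own: build a whole-word keyword check from a tag and its aliases, then compare where the check fires against the existing labels. The aliases, the samples and the helper names are hypothetical.

```python
import re


def make_alias_matcher(tag, aliases):
    """Build a whole-word, case-insensitive keyword check from a tag and its aliases."""
    terms = sorted({tag, *aliases}, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![\w-])(?:" + "|".join(map(re.escape, terms)) + r")(?![\w-])",
        re.IGNORECASE,
    )
    return lambda text: bool(pattern.search(text))


def find_suspects(samples, tag, matcher):
    """Treat the keyword check as a prediction and the existing tags as labels."""
    # Check fires but the tag is absent: the label may be missing.
    false_positives = [s["id"] for s in samples if matcher(s["text"]) and tag not in s["tags"]]
    # Check is silent but the tag is present: the label may be wrong.
    false_negatives = [s["id"] for s in samples if not matcher(s["text"]) and tag in s["tags"]]
    return false_positives, false_negatives


# Made-up samples and aliases for illustration.
samples = [
    {"id": 1, "text": "Question answering with transformers", "tags": ["question-answering"]},
    {"id": 2, "text": "A QA system for open-domain trivia", "tags": ["natural-language-processing"]},
    {"id": 3, "text": "Image segmentation tutorial", "tags": ["question-answering"]},
    {"id": 4, "text": "Aqaba travel photos classifier", "tags": ["computer-vision"]},
]
matcher = make_alias_matcher("question-answering", ["qa", "question answering"])
fp, fn = find_suspects(samples, "question-answering", matcher)
print(fp, fn)
```

The flagged samples are only candidates for review: a keyword check is a noisy heuristic, so a human should still confirm whether the existing label or the heuristic is at fault.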


Labeling isn't just a one-time event or something we repeat identically. As new data becomes available, we'll want to strategically label the appropriate samples and improve slices of our data that are lacking in quality. In fact, there's an entire workflow related to labeling that is initiated when we want to iterate. We'll learn more about this iterative labeling process in our continual learning and data-centric AI lessons.


To cite this lesson, please use:

    author       = {Goku Mohandas},
    title        = { Labeling - Made With ML },
    howpublished = {\url{}},
    year         = {2021}