
Data Labeling

Labeling our data with intention before using it to construct our ML systems.
Goku Mohandas
Repository · Notebook



Labeling (or annotation) is the process of identifying the inputs and outputs that are worth modeling (not just what could be modeled).

  • use objective as a guide to determine the necessary signals
  • explore creating new signals (via combining data, collecting new data, etc.)
  • iteratively add more features to control complexity and effort


Be careful not to include features in the dataset that will not be available at inference time, since doing so causes data leakage.

It's also the phase where we can use our deep understanding of the problem, processes, constraints and domain expertise to:

  • augment the training data split
  • enhance using auxiliary data
  • simplify using constraints

And it isn't just about identifying and labeling our initial dataset but also involves thinking about how to make the labeling process more efficient as our dataset grows.

  • who will label new (streaming) data
  • what tools will be used to accelerate the labeling process (e.g. labeling functions)
  • what workflows will be established to track the labeling process


You should have overlaps where different annotators are working on the same samples. A meaningful inter-labeler discrepancy (>2%) indicates that the labeling task is subjective and requires more explicit labeling criteria.
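
As a rough sketch of how that overlap could be checked, the snippet below computes the fraction of shared samples on which two annotators disagree. The function name `disagreement_rate`, the sample IDs and the tag sets are hypothetical, and treating any difference between two tag sets as a disagreement is an assumption; only the 2% threshold comes from the note above.

```python
def disagreement_rate(labels_a, labels_b):
    """Fraction of overlapping samples where two annotators assigned different tag sets."""
    shared = set(labels_a) & set(labels_b)
    if not shared:
        raise ValueError("Annotators have no overlapping samples.")
    disagreements = sum(
        1 for sample_id in shared
        if set(labels_a[sample_id]) != set(labels_b[sample_id])
    )
    return disagreements / len(shared)


# Hypothetical annotations: sample ID -> tags chosen by each annotator.
annotator_a = {1: ["pytorch"], 2: ["keras", "tensorflow"], 3: ["nlp"], 4: ["computer-vision"]}
annotator_b = {1: ["pytorch"], 2: ["tensorflow", "keras"], 3: ["transformers"], 5: ["nlp"]}

rate = disagreement_rate(annotator_a, annotator_b)
needs_clearer_criteria = rate > 0.02  # threshold taken from the note above
print(f"{rate:.1%} disagreement on overlapping samples")
```

Tag order is ignored here, so only sample 3 counts as a disagreement among the three shared samples.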


  • projects.json: projects with title, description and tags (cleaned by mods).
  • tags.json: tags used in dropdown to aid autocompletion.


We'll have a small GitHub Action that runs on a schedule (cron) to constantly update these datasets over time. We'll learn about how these work when we get to the CI/CD lesson.

Recall that our objective was to help authors add the appropriate tags to their projects so the community can discover them. So we want to use the metadata provided in each project to determine the relevant tags. We'll start with the highly influential features and iteratively experiment with additional ones.

Load data

We'll first load our dataset from the JSON file.

from collections import Counter, OrderedDict
import ipywidgets as widgets
import itertools
import json
import pandas as pd
from urllib.request import urlopen
# Load projects
url = ""
projects = json.loads(urlopen(url).read())
print (json.dumps(projects[-305], indent=2))

  "id": 324,
  "title": "AdverTorch",
  "description": "A Toolbox for Adversarial Robustness Research",
  "tags": [

Now we can load our data into a Pandas DataFrame.

# Create dataframe
df = pd.DataFrame(projects)
print (f"{len(df)} projects")

   id  created_on           title                                              description                                        tags
0  1   2020-02-17 06:30:41  Machine Learning Basics                            A practical set of notebooks on machine learni...  [code, tutorial, keras, pytorch, tensorflow, d...
1  2   2020-02-17 06:41:45  Deep Learning with Electronic Health Record (E...  A comprehensive look at recent machine learnin...  [article, tutorial, deep-learning, health, ehr]
2  3   2020-02-20 06:07:59  Automatic Parking Management using computer vi...  Detecting empty and parked spaces in car parki...  [code, tutorial, video, python, machine-learni...
3  4   2020-02-20 06:21:57  Easy street parking using region proposal netw...  Get a text on your phone whenever a nearby par...  [code, tutorial, python, pytorch, machine-lear...
4  5   2020-02-20 06:29:18  Deep Learning based parking management system ...  Fastai provides easy to use wrappers to quickl...  [code, tutorial, fastai, deep-learning, parkin...

We add features iteratively because each one introduces more complexity and effort. For example, extracting the relevant HTML from the URLs is not trivial, and recall that we want to close the loop with a simple solution first. We're going to use just the title and description because we hypothesize that the project's core concepts will be there, whereas the details may contain many other keywords.
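
A minimal sketch of that feature choice, assuming each project is a dictionary with `title` and `description` keys as in the JSON record printed earlier. The helper name `to_text` is hypothetical, and joining the two fields with a single space is an assumption.

```python
def to_text(project):
    """Join title and description into one input string, skipping missing fields."""
    parts = [project.get("title"), project.get("description")]
    return " ".join(part.strip() for part in parts if part)


# Record taken from the printed example above (tags omitted).
sample = {
    "id": 324,
    "title": "AdverTorch",
    "description": "A Toolbox for Adversarial Robustness Research",
}
text = to_text(sample)
print(text)
```

With the DataFrame above, the same idea could be written as `df["text"] = df.title + " " + df.description`.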


Over time, our dataset will grow and we'll need to label new data. So far, we had a team of moderators clean the existing data, but we'll need to establish proper workflows to make this process easier and more reliable. Typically, we'll use collaborative UIs where annotators can fix errors, etc. and then use a tool like Airflow or KubeFlow Pipelines for workflow / pipeline orchestration to know when new data is ready to be labeled and when it's ready to be used for modeling.

Auxiliary data

We're also going to be using an auxiliary dataset which contains a collection of all the tags with their aliases and parent/child relationships. This auxiliary dataset was used by our application to automatically add the relevant parent tags when the child tags were present.

# Load tags
url = ""
tags = json.loads(urlopen(url).read())
tags_dict = {}
for item in tags:
    key = item.pop("tag")
    tags_dict[key] = item
print (f"{len(tags_dict)} tags")
def display_tag_details(tag='question-answering'):
    print (json.dumps(tags_dict[tag], indent=2))
"question-answering": {
    "aliases": [
    "parents": [

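To illustrate how the parent/child relationships could be applied, here is a small sketch that adds parent tags whenever a child tag is present. It assumes each entry in `tags_dict` has a `parents` list, as the output above suggests. The function name `add_parent_tags` and the example entries are hypothetical and are not the application's actual logic.

```python
def add_parent_tags(tags, tags_dict):
    """Return the tags plus every ancestor reachable through `parents` (order preserved)."""
    expanded = list(tags)
    seen = set(tags)
    queue = list(tags)
    while queue:
        tag = queue.pop(0)
        # Tags missing from the auxiliary data simply contribute no parents.
        for parent in tags_dict.get(tag, {}).get("parents", []):
            if parent not in seen:
                seen.add(parent)
                expanded.append(parent)
                queue.append(parent)
    return expanded


# Hypothetical entries shaped like the auxiliary dataset (aliases + parents).
example_tags_dict = {
    "question-answering": {"aliases": ["qa"], "parents": ["natural-language-processing"]},
    "natural-language-processing": {"aliases": ["nlp"], "parents": []},
}
expanded = add_parent_tags(["question-answering", "pytorch"], example_tags_dict)
print(expanded)
```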

We could have used the user-provided tags as our labels, but what if the user added a wrong tag or forgot to add a relevant one? To remove this dependency on the user to provide the gold standard labels, we can leverage labeling tools and platforms. These tools allow for quick and organized labeling of the dataset to ensure its quality. And instead of starting from scratch and asking our labeler to provide all the relevant tags for a given project, we can provide the author's original tags and ask the labeler to add / remove tags as necessary. The specific labeling tool may need to be custom built, or it may be something that already exists in the ecosystem.


  • Scale AI: the data platform for high quality training and validation data for AI applications.
  • Label Studio: a multi-type data labeling and annotation tool with standardized output format.
  • Universal Data Tool: collaborate and label any type of data, images, text, or documents in an easy web interface or desktop app.
  • Prodigy: a fully scriptable annotation tool, extended through recipes.
  • Superintendent: an ipywidget-based interactive labelling tool for your data to enable active learning.

Natural language processing

  • Doccano: an open source text annotation tool for text classification, sequence labeling and sequence to sequence tasks.
  • BRAT: a rapid annotation tool for all your textual annotation needs.

Computer vision

  • LabelImg: a graphical image annotation tool for labeling object bounding boxes in images.
  • CVAT: a free, online, interactive video and image annotation tool for computer vision.
  • VoTT: an electron app for building end-to-end object detection models from images and videos.
  • a free to use online tool for labelling photos.
  • remo: an app for annotations and images management in computer vision.
  • Labelai: an online tool designed to label images, useful for training AI models.

Audio

  • Audino: an open source audio annotation tool for voice activity detection (VAD), diarization, speaker identification, automated speech recognition, emotion recognition tasks, etc.
  • audio-annotator: a JavaScript interface for annotating and labeling audio files.
  • EchoML: a web app to play, visualize, and annotate your audio files for machine learning.


  • MedCAT: a medical concept annotation tool that can extract information from Electronic Health Records (EHRs) and link it to biomedical ontologies like SNOMED-CT and UMLS.

Active learning

Even with a powerful labeling tool and established workflows, it's easy to see how involved and expensive labeling can be. Therefore, many teams employ active learning to iteratively label the dataset and evaluate the model.

In active learning, you first provide a small number of labelled examples. The model is trained on this "seed" dataset. Then, the model "asks questions" by selecting the unlabeled data points it is unsure about, so the human can "answer" the questions by providing labels for those points. The model updates again and the process is repeated until the performance is good enough. By having the human iteratively teach the model, it's possible to make a better model, in less time, with much less labelled data.

  1. Label a small, initial dataset to train a model.
  2. Ask the trained model to predict on some unlabeled data.
  3. Determine which new data points to label from the unlabeled data based on:
    • entropy over the predicted class probabilities
    • samples with the lowest predicted (calibrated) confidence
    • discrepancy in predictions from an ensemble of trained models
  4. Repeat until the desired performance is achieved
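
The selection step can be sketched with the first criterion, entropy over the predicted class probabilities. This is a minimal illustration in plain Python rather than the API of any particular library; the function names, sample IDs and probabilities are hypothetical.

```python
import math


def entropy(probabilities):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def select_most_uncertain(predictions, k):
    """Return the IDs of the k unlabeled samples with the highest predictive entropy."""
    ranked = sorted(
        predictions,
        key=lambda sample_id: entropy(predictions[sample_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical predicted class probabilities for four unlabeled samples.
predictions = {
    "a": [0.98, 0.01, 0.01],  # confident prediction, low entropy
    "b": [0.34, 0.33, 0.33],  # nearly uniform, high entropy
    "c": [0.70, 0.20, 0.10],
    "d": [0.50, 0.49, 0.01],
}
to_label = select_most_uncertain(predictions, k=2)
print(to_label)
```

The selected samples would be sent to a human labeler, added to the training set, and the model retrained before the next round.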


  • modAL: a modular active learning framework for Python.
  • libact: pool-based active learning in Python.
  • ALiPy: active learning python toolbox, which allows users to conveniently evaluate, compare and analyze the performance of active learning methods.

Labeling functions

We could utilize weak supervision via labeling functions to label our existing and new data. We can create constructs based on keywords, pattern expressions, knowledge bases and generalized models to create these labeling functions to label our data. We can add to the labeling functions over time and even mitigate conflicts amongst the different ones.

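from snorkel.labeling import labeling_function

# Register the function with Snorkel as a labeling function.
@labeling_function()
def contains_tensorflow(text):
    # Note: substring matching on "tf" also fires inside unrelated words.
    condition = any(tag in text.lower() for tag in ("tensorflow", "tf"))
    return "tensorflow" if condition else None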
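
One simple way to reconcile conflicting labeling functions is a majority vote over the functions that do not abstain. The sketch below uses plain Python functions rather than Snorkel's own label aggregation; the function names, the keywords and the tie-breaking rule (abstain on ties) are assumptions made for illustration.

```python
from collections import Counter


def lf_contains_pytorch(text):
    return "pytorch" if "pytorch" in text.lower() else None


def lf_contains_torch(text):
    return "pytorch" if "torch" in text.lower() else None


def lf_contains_tensorflow(text):
    return "tensorflow" if "tensorflow" in text.lower() else None


def majority_vote(text, labeling_functions):
    """Return the most common non-abstaining label, or None if there is no clear winner."""
    votes = [lf(text) for lf in labeling_functions]
    votes = [vote for vote in votes if vote is not None]
    if not votes:
        return None  # every function abstained
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie between the top labels: abstain
    return counts[0][0]


lfs = [lf_contains_pytorch, lf_contains_torch, lf_contains_tensorflow]
label = majority_vote("Porting a PyTorch model to TensorFlow", lfs)
print(label)
```

Abstaining on ties keeps ambiguous samples unlabeled so they can be routed to a human annotator instead.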


To cite this lesson, please use:

    author       = {Goku Mohandas},
    title        = { Labeling - Made With ML },
    howpublished = {\url{}},
    year         = {2021}