

Annotation is the process of identifying the inputs and outputs that are worth modeling (not just what could be modeled).

It’s also the phase where we can apply our deep understanding of the problem, its processes and constraints, and our domain expertise.

Annotation isn’t just about identifying and labeling our initial dataset; it also involves thinking about how to make the annotation process more efficient as our dataset grows.

Watch from 0:00 for a video walkthrough of this section.


We’ll have a small GitHub Action that runs on a schedule (cron) to continuously update these datasets over time. We’ll learn how these work when we get to the CI/CD lesson.

Recall that our objective was to augment authors to add the appropriate tags for their project so the community can discover them. So we want to use the metadata provided in each project to determine what the relevant tags are. We’ll want to start with the highly influential features and iteratively experiment with additional features:

{
  "id": 324,
  "title": "AdverTorch",
  "description": "A Toolbox for Adversarial Robustness Research",
  "tags": [
    ...
  ]
}
The reason we want to add features iteratively is that each additional feature introduces more complexity and effort. For example, extracting the relevant HTML from the project URLs is not trivial, but recall that we want to close the loop with a simple solution first. We’re going to use just the title and description because we hypothesize that the project’s core concepts will be captured there, whereas the full details may contain many other, less relevant keywords.
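As a minimal sketch of this first step (using the sample project above; the tag value is shortened for illustration), the title and description can simply be concatenated into a single input text:

```python
# Sample project metadata (from the example above; tags shortened for illustration)
project = {
    "id": 324,
    "title": "AdverTorch",
    "description": "A Toolbox for Adversarial Robustness Research",
    "tags": ["adversarial-learning"],
}

# Combine the highly influential features into one input text
text = f"{project['title']} {project['description']}"
print(text)  # AdverTorch A Toolbox for Adversarial Robustness Research
```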

We’re also going to be using a supplementary dataset which contains a collection of all the tags with their aliases and parent/child relationships.

"question-answering": {
    "aliases": [
        ...
    ],
    "parents": [
        ...
    ]
}
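One way to put the aliases to work (a sketch; the alias and parent values here are assumptions, not taken from the actual dataset) is to build a reverse lookup that maps any alias back to its canonical tag:

```python
# Hypothetical slice of the tags dictionary (alias/parent values are illustrative)
tags_dict = {
    "question-answering": {
        "aliases": ["qa"],
        "parents": ["natural-language-processing"],
    },
}

# Reverse lookup: alias (or canonical name) -> canonical tag
aliases_to_tag = {}
for tag, info in tags_dict.items():
    aliases_to_tag[tag] = tag
    for alias in info["aliases"]:
        aliases_to_tag[alias] = tag

print(aliases_to_tag["qa"])  # question-answering
```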

We’re going to include only these tags because they’re the tags we care about, whereas authors have been allowed to add any tag they want (which introduces noise). We’ll also exclude some general tags because they are automatically added whenever their child tags are present.

# Inclusion/exclusion criteria for tags
include = list(tags_dict.keys())
exclude = ['machine-learning', 'deep-learning', 'data-science',
           'neural-networks', 'python', 'r', 'visualization',
           'natural-language-processing', 'computer-vision']
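Applying these criteria to a project's tags might look like the following (a sketch; the `tags_dict` keys and sample tags are assumptions for illustration):

```python
# Hypothetical tags dictionary; only its keys matter for the include list
tags_dict = {"question-answering": {}, "time-series": {}, "natural-language-processing": {}}

include = list(tags_dict.keys())
exclude = ['machine-learning', 'deep-learning', 'data-science',
           'neural-networks', 'python', 'r', 'visualization',
           'natural-language-processing', 'computer-vision']

def filter_tags(tags, include=include, exclude=exclude):
    """Keep only tags we care about; drop general or unknown ones."""
    return [tag for tag in tags if tag in include and tag not in exclude]

print(filter_tags(["question-answering", "natural-language-processing", "made-up-tag"]))
# ['question-answering']
```

Here `natural-language-processing` is dropped by the exclude list (it would be auto-added via its children anyway), and `made-up-tag` is dropped because it isn't among the tags we've chosen to model.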

Keep in mind that because we’re constraining the output space here, we’ll want to monitor the prevalence of new tags over time so we can capture them.
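A minimal sketch of that monitoring (all tag names and counts here are hypothetical) is to count how often tags outside our accepted set appear in newly added projects, so rising tags can be promoted into the output space later:

```python
from collections import Counter

# Hypothetical tag lists from newly added projects
new_project_tags = [["llm", "pytorch"], ["llm", "transformers"], ["diffusion"]]

# Tags we currently model (assumed subset for illustration)
accepted_tags = {"pytorch", "transformers"}

# Count out-of-vocabulary tags so frequently occurring ones can be captured later
oov_counts = Counter(
    tag for tags in new_project_tags for tag in tags if tag not in accepted_tags
)
print(oov_counts.most_common())  # [('llm', 2), ('diffusion', 1)]
```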

We’re also going to restrict the mapping to only tags that are above a certain frequency threshold. The tags that don’t have enough projects will not have enough samples to model their relationships.
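A sketch of this frequency filter (the per-project tag lists and the threshold of 2 are illustrative; judging from the lists below, the lesson's actual cutoff sits around 30 projects per tag):

```python
from collections import Counter

# Hypothetical per-project tag lists (counts are illustrative)
projects_tags = [["pytorch"], ["pytorch", "flask"], ["fastai"]]

# Count tag frequencies across all projects
tag_counts = Counter(tag for tags in projects_tags for tag in tags)

# Keep only tags that appear at least min_tag_freq times
min_tag_freq = 2
frequent_tags = {tag for tag, count in tag_counts.items() if count >= min_tag_freq}
print(sorted(frequent_tags))  # ['pytorch']
```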

Most popular tags:
 [('pytorch', 258), ('tensorflow', 213), ('transformers', 196), ('attention', 120), ('convolutional-neural-networks', 106)]

Tags that just made the cut:
 [('time-series', 34), ('flask', 34), ('node-classification', 33), ('question-answering', 32), ('pretraining', 30)]

Tags that just missed the cut:
 [('model-compression', 29), ('fastai', 29), ('graph-classification', 29), ('recurrent-neural-networks', 28), ('adversarial-learning', 28)]

Watch from 8:14 to see what all of this looks like in code.

Over time, our dataset will grow and we’ll need to label new data. So far, a team of moderators has cleaned the existing data, but we’ll need to establish a proper workflow to make this process easier and more reliable. Typically, we’ll use collaborative UIs where annotators can fix labeling errors, and a tool like Airflow for workflow management, so we know when new data is ready to be annotated and when it’s ready to be used for modeling.

In the next section we’ll be performing exploratory data analysis (EDA) on our labeled dataset. However, the order of the annotation and EDA steps can be reversed depending on how well the problem is defined. If you’re unsure about what inputs and outputs are worth mapping, you can use EDA to figure it out.

Watch from 2:50 for a video walkthrough of this section.

