Annotating and labeling our data for exploration.
Repository · Notebook
Intuition
Annotation is the process of identifying the inputs and outputs that are worth modeling (not just what could be modeled).
- use objective as a guide to determine the necessary signals
- explore creating new signals (via combining data, collecting new data, etc.)
- iteratively add more features to control complexity and effort
Warning
Be careful not to include features in the dataset that will not be available during inference time, causing data leakage.
It's also the phase where we can use our deep understanding of the problem, processes, constraints and domain expertise to:
- augment the existing dataset
- enhance using auxiliary data
- simplify using constraints
And it isn't just about identifying and labeling our initial dataset but also involves thinking about how to make the annotation process more efficient as our dataset grows.
- who will annotate new (streaming) data
- what tools will be used to accelerate the annotation process (e.g. labeling functions)
- what workflows will be established to track the annotation process
Note
You should have overlaps where different annotators are working on the same samples. A meaningful inter-labeler discrepancy (>2%) indicates that the annotation task is subjective and requires more explicit labeling criteria.
Application
Datasets
- projects.json: projects with title, description and tags (cleaned by moderators).
- tags.json: tags used in dropdown to aid autocompletion.
Note
We'll have a small GitHub Action that runs on a schedule (cron) to constantly update these datasets over time. We'll learn about how these work when we get to the CI/CD lesson.
Recall that our objective was to augment authors by suggesting the appropriate tags for their projects so the community can discover them. So we want to use the metadata provided in each project to determine what the relevant tags are. We'll want to start with the highly influential features and iteratively experiment with additional features.
Load data
We'll first load our dataset from the JSON file.
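A minimal sketch of this step (assuming projects.json sits in the working directory; adjust the path to your setup):

```python
import json
from pathlib import Path

# Load projects from the JSON file
projects_fp = Path("projects.json")
with open(projects_fp) as fp:
    projects = json.load(fp)

# Inspect a sample project (the index here is arbitrary)
print(json.dumps(projects[0], indent=2))
```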
{ "id": 324, "title": "AdverTorch", "description": "A Toolbox for Adversarial Robustness Research", "tags": [ "code", "library", "security", "adversarial-learning", "adversarial-attacks", "adversarial-perturbations" ] }
Now we can load our data into a Pandas DataFrame.
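A sketch, assuming the projects list loaded above:

```python
import pandas as pd

# Each project dictionary becomes a row in the DataFrame
df = pd.DataFrame(projects)
df.head(5)
```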
The reason we want to add features iteratively is that each new feature introduces more complexity and effort. For example, extracting the relevant HTML from each project's URL is not trivial, and recall that we want to close the loop with a simple solution first. We're going to use just the title and description because we hypothesize that the project's core concepts will be there, whereas the details may contain many other keywords.
Auxiliary data
We're also going to be using an auxiliary dataset which contains a collection of all the tags with their aliases and parent/child relationships.
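A sketch, assuming tags.json maps each tag to its metadata:

```python
# Load auxiliary tags data (tag -> aliases, parent/child relationships)
tags_fp = Path("tags.json")
with open(tags_fp) as fp:
    tags_dict = json.load(fp)
print(len(tags_dict))
```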
400
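For example, we can inspect the entry for a specific tag:

```python
# Inspect the aliases and parents for one tag
key = "question-answering"
print(json.dumps({key: tags_dict[key]}, indent=2))
```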
"question-answering": { "aliases": [ "qa" ], "parents": [ "natural-language-processing" ] }
Features
We can also combine existing input features to create new meaningful signal (helping the model a bit). Here, we could use a project's title and description separately as features but we'll combine them to create one input feature.
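A sketch of that combination, assuming the df from above:

```python
# Combine title and description into a single input feature
df["text"] = df.title + " " + df.description
```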
Data augmentation
Depending on the tasks, there are many data augmentation libraries:
- Natural language processing (NLP)
- NLPAug: data augmentation for NLP.
- TextAttack: a framework for adversarial attacks, data augmentation, and model training in NLP.
- TextAugment: text augmentation library.
- Computer vision (CV)
- Imgaug: image augmentation for machine learning experiments.
- Albumentations: fast image augmentation library.
- Augmentor: image augmentation library in Python for machine learning.
- Kornia.augmentation: a module to perform data augmentation in the GPU.
- SOLT: data augmentation library for Deep Learning, which supports images, segmentation masks, labels and key points.
- Other
- Snorkel: system for generating training data with weak supervision.
- DeltaPy: tabular data augmentation and feature engineering.
- Audiomentations: a Python library for audio data augmentation.
- Tsaug: a Python package for time series augmentation.
Regardless of what tool we use, it's important to validate that we're not just augmenting for the sake of augmentation. For example, in many NLP data augmentation scenarios, the adjectives are replaced with other adjectives. We need to ensure that this generalized change doesn't affect key aspects of our dataset. For more fine-grained data augmentation, we can use concepts like transformation functions to apply specific types of augmentation to a subset of our dataset.
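As a concrete example, here's a minimal sketch of synonym substitution with NLPAug (the input text is illustrative; a replaced synonym could distort a technical term, which is exactly what we'd need to validate):

```python
import nlpaug.augmenter.word as naw

# Substitute words with WordNet synonyms
augmenter = naw.SynonymAug(aug_src="wordnet")
text = "A toolbox for adversarial robustness research"
print(augmenter.augment(text))
```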
Constraints
In the same vein, we can also reduce the size of our data by placing constraints on what data is worth annotating or labeling. Here we decide to keep only the tags above a certain frequency threshold, because tags with fewer samples won't be adequate for training.
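To apply such constraints, a small helper for filtering a list with inclusion/exclusion criteria comes in handy (a sketch; the lists themselves are defined below):

```python
def filter_items(l, include=(), exclude=()):
    """Filter a list using inclusion and exclusion lists of items."""
    return [item for item in l if item in include and item not in exclude]
```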
We're going to include only the tags from our auxiliary dataset because they're the tags we care about, and we've allowed authors to add any tag they want (noise). We'll also exclude some general tags because they're automatically added when their child tags are present.
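A sketch, assuming the tags_dict from our auxiliary dataset (the exclusion list is illustrative):

```python
# Inclusion/exclusion criteria for tags
include = list(tags_dict.keys())
exclude = ["machine-learning", "deep-learning"]  # hypothetical general (parent) tags
```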
Note
Since we're constraining the output space here, we'll want to monitor the prevalence of new tags over time so we can capture them.
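Applying the criteria to every project and tallying the remaining tags might look like:

```python
import itertools
from collections import Counter

# Filter each project's tags and count what remains
df.tags = df.tags.apply(filter_items, include=include, exclude=exclude)
tags = Counter(itertools.chain.from_iterable(df.tags.values))
```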
We're also going to restrict the mapping to only tags that are above a certain frequency threshold. The tags that don't have enough projects will not have enough samples to model their relationships.
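A sketch of that comparison (min_tag_freq is a value we'd choose by inspecting the lists it produces):

```python
# Compare tag distributions around a frequency threshold
min_tag_freq = 30
tags_above_freq = Counter(tag for tag in tags.elements() if tags[tag] >= min_tag_freq)
tags_below_freq = Counter(tag for tag in tags.elements() if tags[tag] < min_tag_freq)
print(f"Most popular tags: {tags_above_freq.most_common(5)}")
print(f"Tags that just made the cut: {tags_above_freq.most_common()[-5:]}")
print(f"Tags that just missed the cut: {tags_below_freq.most_common(5)}")
```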
Most popular tags: [('natural-language-processing', 429), ('computer-vision', 388), ('pytorch', 258), ('tensorflow', 213), ('transformers', 196)]
Tags that just made the cut: [('time-series', 34), ('flask', 34), ('node-classification', 33), ('question-answering', 32), ('pretraining', 30)]
Tags that just missed the cut: [('model-compression', 29), ('fastai', 29), ('graph-classification', 29), ('recurrent-neural-networks', 28), ('adversarial-learning', 28)]
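Then we keep only the above-threshold tags for each project:

```python
# Drop tags that fall below the frequency threshold
df.tags = df.tags.apply(lambda tag_list: [t for t in tag_list if t in tags_above_freq])
```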
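Finally, we drop projects that no longer have any tags:

```python
# Remove projects with no remaining tags
df = df[df.tags.map(len) > 0]
print(f"{len(df)} projects")
```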
1444 projects
Workflows
Over time, our dataset will grow and we'll need to label new data. So far we've had a team of moderators clean the existing data, but we'll need to establish a proper workflow to make this process easier and more reliable. Typically, we'll use collaborative UIs where annotators can fix errors, etc., and then use a tool like Airflow or Kubeflow for workflow orchestration so we know when new data is ready to be annotated and when it's ready to be used for modeling.
Labeling functions
We could utilize weak supervision via labeling functions to label our new data and validate our existing labels. We can construct labeling functions based on keywords, pattern expressions, knowledge bases and generalized models, and they can label our data even when there are conflicts amongst the different labeling functions.
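As a minimal sketch using Snorkel (the keyword, label index and function name are illustrative assumptions, and Snorkel's standard setup is single-label, so this simplifies our multi-label tags):

```python
from snorkel.labeling import labeling_function, PandasLFApplier

ABSTAIN = -1
COMPUTER_VISION = 0  # hypothetical label index for the "computer-vision" tag

@labeling_function()
def lf_contains_image(x):
    """Vote for computer-vision when image-related keywords appear in the text."""
    return COMPUTER_VISION if "image" in x.text.lower() else ABSTAIN

# Apply the labeling functions over our DataFrame (uses the `text` feature)
applier = PandasLFApplier(lfs=[lf_contains_image])
L_train = applier.apply(df=df)
```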
Note
In the next section we'll be performing exploratory data analysis (EDA) on our labeled dataset. However, the order of the annotation and EDA steps can be reversed depending on how well the problem is defined. If you're unsure about what inputs and outputs are worth mapping, you can use EDA to figure it out.
Resources
- Human in the Loop: Deep Learning without Wasteful Labelling
- Harnessing Organizational Knowledge for Machine Learning