Intuition
We'll often want to increase the size and diversity of our training data split through data augmentation. It involves using the existing samples to generate synthetic, yet realistic, examples.
Split the dataset. We want to split our dataset first because many augmentation techniques will introduce a form of data leakage if we allow the generated samples to be placed across different data splits.
For example, some augmentation involves generating synonyms for certain key tokens in a sentence. If we allow the generated sentences from the same origin sentence to go into different splits, we could potentially be leaking samples with nearly identical embedding representations across our different splits.
Augment the training split. We want to apply data augmentation only on the training split because our validation and testing splits should provide an accurate estimate of performance on actual data points (see the sketch after these steps).
Inspect and validate. It's useless to augment just for the sake of increasing our training sample size if the augmented data samples are not probable inputs that our model could encounter in production.
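To make this ordering concrete, here's a minimal sketch, assuming a df with text and tags columns and a hypothetical augment_fn that returns augmented variants of a text (neither is part of this lesson's code):

import pandas as pd
from sklearn.model_selection import train_test_split

# 1. Split first so that augmented variants of a sample can never leak across splits
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
train_df, val_df = train_test_split(train_df, test_size=0.25, random_state=42)

# 2. Augment only the training split (augment_fn: text -> list of augmented texts)
augmented_rows = [
    {"text": new_text, "tags": row.tags}
    for _, row in train_df.iterrows()
    for new_text in augment_fn(row.text)
]
train_df = pd.concat([train_df, pd.DataFrame(augmented_rows)], ignore_index=True)

# 3. Inspect and validate the augmented samples before training (see the Warning at the end)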
The exact method of data augmentation depends largely on the type of data and the application. For example, text can be augmented by substituting or inserting tokens (or by swapping domain-specific keywords with their aliases, as we'll do below), while images can be cropped, flipped, rotated, etc.
While the transformations on some data modalities, such as images, are easy to inspect and validate, others may introduce silent errors. For example, shifting the order of tokens in text can significantly alter the meaning ("this is really cool" → "is this really cool"). Therefore, it's important to measure the noise that our augmentation policies will introduce and to have granular control over the transformations that take place.
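One lightweight way to measure that noise is to compare each augmented sample against its source. Below is a small sketch (not part of the original project) that uses token-level Jaccard similarity, reusing the substitution example from the Libraries section:

def jaccard_similarity(a, b):
    """Token-level Jaccard similarity between two strings."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    return len(tokens_a & tokens_b) / max(len(tokens_a | tokens_b), 1)

# A low score suggests the augmentation drifted too far and should be inspected manually
original = "conditional image generation using variational autoencoders and gans."
augmented = "automated logic verification using variational transform and gans."
print(jaccard_similarity(original, augmented))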
Libraries
Depending on the feature types and tasks, there are many data augmentation libraries which allow us to extend our training data.
import nlpaug.augmenter.word as naw

# Load tokenizers and transformers
substitution = naw.ContextualWordEmbsAug(model_path="distilbert-base-uncased", action="substitute")
insertion = naw.ContextualWordEmbsAug(model_path="distilbert-base-uncased", action="insert")
text = "Conditional image generation using Variational Autoencoders and GANs."
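We can then generate a substituted version of the text (this call is inferred from the output shown below):

substitution.augment(text)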
automated logic verification using variational transform and gans.
Substitution doesn't seem like a great idea for us because there are certain keywords that provide strong signal for our tags so we don't want to alter those. Also, note that these augmentations are NOT deterministic and will vary every time we run them. Let's try insertion...
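Applying the insertion augmenter to the same text (again, the call is inferred from the output that follows):

insertion.augment(text)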
simplified conditional nonlinear image generation models using inverse variational autoencoders and gans.
A little better, but still quite fragile, and now it can potentially insert keywords that cause false-positive tags to appear. Maybe instead of substituting or inserting new tokens, let's try simply swapping machine learning related keywords with their aliases from our auxiliary data. We'll use Snorkel's transformation functions to easily achieve this.
# Tags that could be singular or plural
can_be_singular = ['animations', 'cartoons', 'autoencoders', ...,
                   'data streams', 'support vector machines', 'variational autoencoders']
can_be_plural = ['annotation', 'data annotation', 'continuous integration', ...,
                 'vqa', 'visualization', 'data visualization']
import inflect

engine = inflect.engine()  # inflect requires an engine instance to call singular_noun/plural_noun

# Add to flattened dict
for tag in can_be_singular:
    flat_tags_dict[engine.singular_noun(tag)] = flat_tags_dict[tag]
for tag in can_be_plural:
    flat_tags_dict[engine.plural_noun(tag)] = flat_tags_dict[tag]
# Doesn't perfectly match (ex. singular tag to singular alias)
# But good enough for data augmentation for char-level tokenization
# Could've also used stemming before swapping aliases
print(flat_tags_dict["gan"])
print(flat_tags_dict["gans"])
print(flat_tags_dict["generative adversarial network"])
print(flat_tags_dict["generative adversarial networks"])
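Note that swap_aliases below relies on a find_word helper that isn't defined in this section; a minimal sketch of what it might look like (an assumption, not necessarily the project's exact implementation) is a case-insensitive, word-boundary regex search:

import re

def find_word(word, text):
    """Return a regex match object for `word` as a whole word in `text` (case-insensitive)."""
    pattern = re.compile(fr"\b({re.escape(word)})\b", flags=re.IGNORECASE)
    return pattern.search(text)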
import random
from snorkel.augmentation import transformation_function

@transformation_function()
def swap_aliases(x):
    """Swap ML keywords with their aliases."""
    # Find all matches
    matches = []
    for i, tag in enumerate(flat_tags_dict):
        match = find_word(tag, x.text)
        if match:
            matches.append(match)
    # Swap a random match with a random alias
    if len(matches):
        match = random.choice(matches)
        tag = x.text[match.start():match.end()]
        x.text = f"{x.text[:match.start()]}{random.choice(flat_tags_dict[tag])}{x.text[match.end():]}"
    return x
# Swap
for i in range(3):
    sample_df = pd.DataFrame([{"text": "a survey of reinforcement learning for nlp tasks."}])
    sample_df.text = sample_df.text.apply(preprocess, lower=True, stem=False)
    print(swap_aliases(sample_df.iloc[0]).text)
# Undesired behavior (needs contextual insight)
for i in range(3):
    sample_df = pd.DataFrame([{"text": "Autogenerate your CV to apply for jobs using NLP."}])
    sample_df.text = sample_df.text.apply(preprocess, lower=True, stem=False)
    print(swap_aliases(sample_df.iloc[0]).text)
autogenerate vision apply jobs using nlp
autogenerate cv apply jobs using natural language processing
autogenerate cv apply jobs using nlproc
Now we'll define an augmentation policy to apply our transformation functions with certain rules (how many samples to generate, whether to keep the original data point, etc.).
from snorkel.augmentation import ApplyOnePolicy, PandasTFApplier

# Transformation function (TF) policy
policy = ApplyOnePolicy(n_per_original=5, keep_original=True)
tf_applier = PandasTFApplier([swap_aliases], policy)
train_df_augmented = tf_applier.apply(train_df)
train_df_augmented.drop_duplicates(subset=["text"], inplace=True)
train_df_augmented.head()
   text                                                tags
0  google stock price prediction using alpha vant...  [flask]
0  google stock price inference using alpha vanta...  [flask]
1  pifuhd high resolution 3d human digitization r...  [computer-vision]
1  pifuhd high resolution three dimensional human...  [computer-vision]
1  pifuhd high resolution 3 dimensional human dig...  [computer-vision]
len(train_df),len(train_df_augmented)
(1001, 1981)
For now, we'll skip data augmentation because it's quite fickle and, empirically, it doesn't improve performance much. But we can see how effective this can be once we can control what type of vocabulary to augment on and exactly what to augment with.
Warning
Regardless of what method we use, it's important to validate that we're not just augmenting for the sake of augmentation. We can do this by executing any existing data validation tests and even creating specific tests to apply to augmented data.
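For example, a minimal sketch of such a test (the checks are illustrative and assume the train_df / train_df_augmented DataFrames from above, with text and tags columns):

def test_augmented_data(train_df, train_df_augmented):
    """Basic sanity checks on the augmented training split."""
    # Augmentation should never shrink the training split
    assert len(train_df_augmented) >= len(train_df)
    # No empty or duplicated texts should slip through
    assert train_df_augmented.text.str.strip().str.len().min() > 0
    assert not train_df_augmented.text.duplicated().any()
    # Augmented rows should not introduce tags that never appear in the original split
    original_tags = {tag for tags in train_df.tags for tag in tags}
    augmented_tags = {tag for tags in train_df_augmented.tags for tag in tags}
    assert augmented_tags <= original_tags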