
Data Augmentation


Assessing data augmentation on our training data split to increase the number of quality training samples.
Goku Mohandas
Repository · Notebook



Intuition

We'll often want to increase the size and diversity of our training data split through data augmentation. It involves using the existing samples to generate synthetic, yet realistic, examples.

  1. Split the dataset. We want to split our dataset first because many augmentation techniques can cause a form of data leakage if we allow the generated samples to be placed across different data splits.

    For example, some augmentation involves generating synonyms for certain key tokens in a sentence. If we allow the generated sentences from the same origin sentence to go into different splits, we could be potentially leaking samples with nearly identical embedding representations across our different splits.

  2. Augment the training split. We want to apply data augmentation on only the training set because our validation and testing splits should be used to provide an accurate estimate on actual data points.

  3. Inspect and validate. It's useless to augment just for the sake of increasing our training sample size if the augmented data samples are not probable inputs that our model could encounter in production. (A minimal sketch of this split → augment → validate workflow follows this list.)

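To make the ordering concrete, here's a minimal sketch of the split → augment → validate workflow using a toy DataFrame and a placeholder "augmentation" (not the lesson's actual train_df; the real augmentation for our dataset is covered in the Implementation section below):

import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for our projects dataset (text + tag)
df = pd.DataFrame({
    "text": ["image classification with cnns", "sentiment analysis of reviews",
             "object detection pipeline", "question answering with transformers"],
    "tag": ["computer-vision", "natural-language-processing",
            "computer-vision", "natural-language-processing"],
})

# 1. Split first so augmented copies of a sample can't leak across splits
train_df, test_df = train_test_split(df, test_size=0.5, stratify=df.tag, random_state=42)

# 2. Augment only the training split (trivial placeholder augmentation here)
augmented = train_df.copy()
augmented["text"] = augmented["text"] + " (augmented)"
train_df = pd.concat([train_df, augmented], ignore_index=True)

# 3. Inspect and validate the augmented samples before training
assert train_df["text"].str.strip().str.len().gt(0).all()
print (len(train_df), len(test_df))
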
The exact method of data augmentation depends largely on the type of data and the application. Here are a few ways different modalities of data can be augmented:

Types of data augmentation:
  • General: normalization, smoothing, random noise, synthetic oversampling (SMOTE, sketched after this list), etc.
  • Natural language processing (NLP): substitutions (synonyms, tfidf, embeddings, masked models), random noise, spelling errors, etc.
  • Computer vision (CV): crop, flip, rotate, pad, saturate, increase brightness, etc.

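For example, the synthetic oversampling (SMOTE) listed under General can be sketched with the imbalanced-learn package (an assumption here, not used elsewhere in this lesson):

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE  # assumes imbalanced-learn is installed

# Toy imbalanced tabular dataset (~90/10 class split)
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=42)

# Synthesize minority-class samples by interpolating between nearest neighbors
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)
print (sum(y == 1), sum(y_resampled == 1))  # minority count before vs. after
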
Warning

While the transformations on some data modalities, such as images, are easy to inspect and validate, others may introduce silent errors. For example, shifting the order of tokens in text can significantly alter the meaning ("this is really cool" → "is this really cool"). Therefore, it's important to measure the noise that our augmentation policies will introduce and to have granular control over the transformations that take place.

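One rough way to quantify that noise (a hypothetical sketch, not part of this lesson's pipeline) is to score each augmented sample against its original, for example with token-set overlap, and flag or drop transformations that drift too far:

# Hypothetical noise check: token-set (Jaccard) overlap between an original and its augmentation
def jaccard_overlap(original, augmented):
    a, b = set(original.lower().split()), set(augmented.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 1.0

# Heavy substitutions show up as low overlap and could be filtered out
print (jaccard_overlap("conditional image generation", "hierarchical risk mapping"))  # 0.0
# But token reordering is invisible to lexical overlap, so order/semantics-aware checks are also needed
print (jaccard_overlap("this is really cool", "is this really cool"))  # 1.0
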
Libraries

Depending on the feature types and tasks, there are many data augmentation libraries which allow us to extend our training data.

Natural language processing (NLP)

  • NLPAug: data augmentation for NLP.
  • TextAttack: a framework for adversarial attacks, data augmentation, and model training in NLP.
  • TextAugment: text augmentation library.

Computer vision (CV)

  • Imgaug: image augmentation for machine learning experiments.
  • Albumentations: fast image augmentation library (see the sketch after this list).
  • Augmentor: image augmentation library in Python for machine learning.
  • Kornia.augmentation: a module to perform data augmentation on the GPU.
  • SOLT: data augmentation library for Deep Learning, which supports images, segmentation masks, labels and key points.

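As a quick illustration of how such libraries are typically used, here's a minimal sketch with Albumentations on a dummy image (not part of this lesson's NLP pipeline):

import albumentations as A
import numpy as np

# Dummy RGB image as a stand-in for real training data
image = np.random.randint(0, 255, size=(224, 224, 3), dtype=np.uint8)

# Compose a few of the transformations mentioned above
transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.Rotate(limit=15, p=0.5),
    A.RandomBrightnessContrast(p=0.3),
])
augmented_image = transform(image=image)["image"]
print (augmented_image.shape)  # (224, 224, 3)
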
Other

  • Snorkel: system for generating training data with weak supervision.
  • DeltaPy: tabular data augmentation and feature engineering.
  • Audiomentations: a Python library for audio data augmentation.
  • Tsaug: a Python package for time series augmentation.

Implementation

Let's use the nlpaug library to augment our dataset and assess the quality of the generated samples.

pip install nlpaug==1.1.0 transformers==3.0.2 -q
pip install snorkel==0.9.8 -q

import nlpaug.augmenter.word as naw
# Load tokenizers and transformers
substitution = naw.ContextualWordEmbsAug(model_path="distilbert-base-uncased", action="substitute")
insertion = naw.ContextualWordEmbsAug(model_path="distilbert-base-uncased", action="insert")
text = "Conditional image generation using Variational Autoencoders and GANs."

# Substitutions
substitution.augment(text)
hierarchical risk mapping using variational signals and gans.

Substitution doesn't seem like a great idea for us because there are certain keywords that provide strong signal for our tags so we don't want to alter those. Also, note that these augmentations are NOT deterministic and will vary every time we run them. Let's try insertion...

# Insertions
insertion.augment(text)
automated conditional inverse image generation algorithms using multiple variational autoencoders and gans.

A little better but still quite fragile, and now it can potentially insert keywords that lead to false-positive tags. Maybe instead of substituting or inserting new tokens, let's try simply swapping machine learning related keywords with their aliases. We'll use Snorkel's transformation functions to easily achieve this.

# Replace dashes from tags & aliases
def replace_dash(x):
    return x.replace("-", " ")
# Aliases
aliases_by_tag = {
    "computer-vision": ["cv", "vision"],
    "mlops": ["production"],
    "natural-language-processing": ["nlp", "nlproc"]
}
# Flatten dict
flattened_aliases = {}
for tag, aliases in aliases_by_tag.items():
    tag = replace_dash(x=tag)
    if len(aliases):
        flattened_aliases[tag] = aliases
    for alias in aliases:
        _aliases = aliases + [tag]
        _aliases.remove(alias)
        flattened_aliases[alias] = _aliases
print (flattened_aliases["natural language processing"])
print (flattened_aliases["nlp"])

['nlp', 'nlproc']
['nlproc', 'natural language processing']

For now, we'll use tags and aliases as they are in aliases_by_tag, but we could account for plurality of tags using the inflect package or apply stemming before replacing aliases, etc.

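For instance, a minimal sketch of handling plurality with the inflect package (an assumption, since we don't actually apply it here) might look like:

import inflect  # assumes the inflect package is installed

p = inflect.engine()
# Plural forms could be added as extra aliases, e.g. "gan" -> "gans"
print (p.plural("gan"))
print (p.plural("autoencoder"))
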
# We want to match with the whole word only
print ("gan" in "This is a gan.")      # True
print ("gan" in "This is gandalf.")    # True (substring match, not what we want)
import re

# \b matches word boundaries
def find_word(word, text):
    word = word.replace("+", r"\+")  # escape "+" so it isn't treated as a regex quantifier
    pattern = re.compile(fr"\b({word})\b", flags=re.IGNORECASE)
    return pattern.search(text)
# Correct behavior (single instance)
print (find_word("gan", "This is a gan."))
print (find_word("gan", "This is gandalf."))
<re.Match object; span=(10, 13), match='gan'>
None

Now let's use snorkel's transformation_function to systematically apply this transformation to our data.

import random

from snorkel.augmentation import transformation_function
@transformation_function()
def swap_aliases(x):
    """Swap ML keywords with their aliases."""
    # Find all matches
    matches = []
    for i, tag in enumerate(flattened_aliases):
        match = find_word(tag, x.text)
        if match:
            matches.append(match)
    # Swap a random match with a random alias
    if len(matches):
        match = random.choice(matches)
        tag = x.text[match.start():match.end()]
        x.text = f"{x.text[:match.start()]}{random.choice(flattened_aliases[tag])}{x.text[match.end():]}"
    return x
import pandas as pd

# Swap
for i in range(3):
    sample_df = pd.DataFrame([{"text": "a survey of reinforcement learning for nlp tasks."}])
    sample_df.text = sample_df.text.apply(clean_text, lower=True, stem=False)  # clean_text was defined in an earlier lesson
    print (swap_aliases(sample_df.iloc[0]).text)
# Undesired behavior (needs contextual insight)
for i in range(3):
    sample_df = pd.DataFrame([{"text": "Autogenerate your CV to apply for jobs using NLP."}])
    sample_df.text = sample_df.text.apply(clean_text, lower=True, stem=False)
    print (swap_aliases(sample_df.iloc[0]).text)
autogenerate vision apply jobs using nlp
autogenerate cv apply jobs using natural language processing
autogenerate cv apply jobs using nlproc

Now we'll define an augmentation policy to apply our transformation functions with certain rules (how many samples to generate, whether to keep the original data point, etc.).

from snorkel.augmentation import ApplyOnePolicy, PandasTFApplier
# Transformation function (TF) policy
policy = ApplyOnePolicy(n_per_original=5, keep_original=True)
tf_applier = PandasTFApplier([swap_aliases], policy)
train_df_augmented = tf_applier.apply(train_df)
train_df_augmented.drop_duplicates(subset=["text"], inplace=True)
train_df_augmented.head()
   text                                                  tags
0  laplacian pyramid reconstruction refinement se...     computer-vision
1  extract stock sentiment news headlines project...     natural-language-processing
2  big bad nlp database collection 400 nlp datasets...   natural-language-processing
2  big bad natural language processing database c...     natural-language-processing
2  big bad nlproc database collection 400 nlp dat...     natural-language-processing
len(train_df), len(train_df_augmented)
(668, 913)

For now, we'll skip the data augmentation because it's quite fickle and, empirically, it doesn't improve performance much. But we can see how this can be very effective once we can control what type of vocabulary to augment on and what exactly to augment with.

Warning

Regardless of what method we use, it's important to validate that we're not just augmenting for the sake of augmentation. We can do this by executing any existing data validation tests and even creating specific tests to apply on augmented data.

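For example, a few hypothetical sanity checks (not part of this lesson's test suite) on the augmented training split could look like:

# Hypothetical sanity checks on the augmented training split
def validate_augmented(train_df, train_df_augmented):
    # No empty or whitespace-only texts introduced by augmentation
    assert train_df_augmented.text.str.strip().str.len().gt(0).all()
    # Augmentation should only add rows (keep_original=True), never drop originals
    assert len(train_df_augmented) >= len(train_df)
    # Tags must come from the same label set as the original split
    assert set(train_df_augmented.tags).issubset(set(train_df.tags))

validate_augmented(train_df, train_df_augmented)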



To cite this content, please use:

@article{madewithml,
    author       = {Goku Mohandas},
    title        = { Data Augmentation - Made With ML },
    howpublished = {\url{https://madewithml.com/}},
    year         = {2023}
}