
Data Augmentation


Assessing data augmentation on our training data split to increase the number of quality training samples.
Goku Mohandas
Repository · Notebook

πŸ“¬  Receive new lessons straight to your inbox (once a month) and join 30K+ developers in learning how to responsibly deliver value with ML.

Intuition

We'll often want to increase the size and diversity of our training data split through data augmentation. It involves using the existing samples to generate synthetic, yet realistic, examples.

  1. Split the dataset. We want to split our dataset first because many augmentation techniques will cause a form of data leakage if we allow the generated samples to be placed across different data splits.

    For example, some augmentation involves generating synonyms for certain key tokens in a sentence. If we allow the generated sentences from the same origin sentence to go into different splits, we could be potentially leaking samples with nearly identical embedding representations across our different splits.

  2. Augment the training split. We want to apply data augmentation only on the training split because our validation and testing splits should provide an accurate estimate of performance on actual data points (see the sketch after this list).

  3. Inspect and validate. It's useless to augment just for the sake of increasing our training sample size if the augmented data samples are not probable inputs that our model could encounter in production.
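
Putting steps 1 and 2 together, here's a minimal sketch of the ordering, assuming a toy dataframe and a hypothetical augment_text function (scikit-learn's train_test_split stands in for whatever splitting strategy we actually use):

import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labeled dataset and a placeholder augmentation function (both hypothetical)
df = pd.DataFrame({
    "text": ["transformers for nlp", "gans for image generation",
             "cnns for object detection", "bert for question answering"],
    "tag": ["natural-language-processing", "computer-vision",
            "computer-vision", "natural-language-processing"],
})

def augment_text(text):
    """Placeholder augmentation (e.g. a synonym or alias swap)."""
    return text.replace("nlp", "natural language processing")

# 1. Split first so variants of the same sample can't leak across splits
train_df, test_df = train_test_split(df, test_size=0.25, random_state=42)

# 2. Augment only the training split; validation/test splits stay untouched
augmented_df = train_df.copy()
augmented_df["text"] = augmented_df["text"].apply(augment_text)
train_df = pd.concat([train_df, augmented_df]).drop_duplicates(subset=["text"])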

The exact method of data augmentation depends largely on the type of data and the application. Here are a few ways different modalities of data can be augmented:

  • General: normalization, smoothing, random noise, etc. can be used for audio, tabular and other forms of data (see the noise sketch after this list).
  • Natural language processing (NLP): substitutions (synonyms, tfidf, embeddings, masked models), random noise, spelling errors, etc.
  • Computer vision (CV): crop, flip, rotate, pad, saturate, increase brightness, etc.
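
For instance, the generic random noise idea above might look like the following for tabular features (a minimal sketch; the 5% noise scale is an arbitrary choice):

import numpy as np

# Hypothetical numeric feature matrix (rows = samples, columns = features)
X = np.array([[1.0, 200.0], [2.0, 180.0], [3.0, 220.0]])

# Jitter each feature with Gaussian noise scaled to that feature's spread
noise = np.random.normal(loc=0.0, scale=0.05 * X.std(axis=0), size=X.shape)
X_augmented = np.vstack([X, X + noise])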

Libraries

Depending on the feature types and tasks, there are many data augmentation libraries which allow us to extend our training data.

Natural language processing (NLP)

  • NLPAug: data augmentation for NLP.
  • TextAttack: a framework for adversarial attacks, data augmentation, and model training in NLP.
  • TextAugment: text augmentation library.

Computer vision (CV)

  • Imgaug: image augmentation for machine learning experiments.
  • Albumentations: fast image augmentation library (see the sketch after this list).
  • Augmentor: image augmentation library in Python for machine learning.
  • Kornia.augmentation: a module to perform data augmentation on the GPU.
  • SOLT: data augmentation library for Deep Learning, which supports images, segmentation masks, labels and key points.
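
To give a feel for what these libraries look like in practice, here's a minimal sketch with Albumentations (the specific transforms and probabilities are arbitrary choices, and a random array stands in for a real image):

import albumentations as A
import numpy as np

# Compose a pipeline of common image augmentations
transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.2),
    A.Rotate(limit=15, p=0.5),
])

# Apply the pipeline to an (H, W, C) uint8 image
image = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
augmented_image = transform(image=image)["image"]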

Other

  • Snorkel: system for generating training data with weak supervision.
  • DeltaPy: tabular data augmentation and feature engineering.
  • Audiomentations: a Python library for audio data augmentation.
  • Tsaug: a Python package for time series augmentation.

Application

Let's use the nlpaug library to augment our dataset and assess the quality of the generated samples.

!python -m pip install --upgrade pip
!pip install nlpaug==1.1.0 transformers==3.0.2 -q
!pip install snorkel==0.9.6 -q --use-feature=2020-resolver
import nlpaug.augmenter.word as naw

# Load tokenizers and transformers
substitution = naw.ContextualWordEmbsAug(model_path="distilbert-base-uncased", action="substitute")
insertion = naw.ContextualWordEmbsAug(model_path="distilbert-base-uncased", action="insert")
text = "Conditional image generation using Variational Autoencoders and GANs."
# Substitutions
augmented_text = substitution.augment(text)
print (augmented_text)
automated logic verification using variational transform and gans.

Substitution doesn't seem like a great idea for us because there are certain keywords that provide a strong signal for our tags, so we don't want to alter them. Also, note that these augmentations are NOT deterministic and will vary every time we run them. Let's try insertion...

# Insertions
augmented_text = insertion.augment(text)
print (augmented_text)
simplified conditional nonlinear image generation models using inverse variational autoencoders and gans.

A little better, but still quite fragile, and it can potentially insert keywords that cause false-positive tags to appear. Maybe instead of substituting or inserting new tokens, let's try simply swapping machine learning related keywords with their aliases from our auxiliary data. We'll use Snorkel's transformation functions to easily achieve this.

import inflect
from snorkel.augmentation import transformation_function
inflect = inflect.engine()
# Inflect
print (inflect.singular_noun("graphs"))
print (inflect.singular_noun("graph"))
print (inflect.plural_noun("graph"))
print (inflect.plural_noun("graphs"))
graph
False
graphs
graphss
def replace_dash(x):
    return x.replace("-", " ")
flat_tags_dict = {}
for tag, info in tags_dict.items():
    tag = tag.replace("-", " ")
    aliases = list(map(replace_dash, info["aliases"]))
    if len(aliases):
        flat_tags_dict[tag] = aliases
    for alias in aliases:
        _aliases = aliases + [tag]
        _aliases.remove(alias)
        flat_tags_dict[alias] = _aliases
# Tags that could be singular or plural
can_be_singular = [
    'animations',
    'cartoons',
    'autoencoders',
    ...
    'data streams',
    'support vector machines',
    'variational autoencoders'
]
can_be_plural = [
    'annotation',
    'data annotation',
    'continuous integration',
    ...
    'vqa',
    'visualization',
    'data visualization'
]
# Add to flattened dict
for tag in can_be_singular:
    flat_tags_dict[inflect.singular_noun(tag)] = flat_tags_dict[tag]
for tag in can_be_plural:
    flat_tags_dict[inflect.plural_noun(tag)] = flat_tags_dict[tag]
# Doesn't perfectly match (e.g. singular tag to singular alias)
# But good enough for data augmentation for char-level tokenization
# Could've also used stemming before swapping aliases
print (flat_tags_dict["gan"])
print (flat_tags_dict["gans"])
print (flat_tags_dict["generative adversarial network"])
print (flat_tags_dict["generative adversarial networks"])
['generative adversarial networks']
['generative adversarial networks']
['gan']
['gan']
# We want to match with the whole word only
print ("gan" in "This is a gan.")
print ("gan" in "This is gandalf.")
True
True
import re

def find_word(word, text):
    word = word.replace("+", r"\+")  # escape literal "+" so it isn't treated as a regex quantifier
    pattern = re.compile(fr"\b({word})\b", flags=re.IGNORECASE)
    return pattern.search(text)
# Correct behavior (single instance)
print (find_word("gan", "This is a gan."))
print (find_word("gan", "This is gandalf."))

None
import random

@transformation_function()
def swap_aliases(x):
    """Swap ML keywords with their aliases."""

    # Find all matches
    matches = []
    for tag in flat_tags_dict:
        match = find_word(tag, x.text)
        if match:
            matches.append(match)

    # Swap a random match with a random alias
    if len(matches):
        match = random.choice(matches)
        tag = x.text[match.start():match.end()]
        x.text = f"{x.text[:match.start()]}{random.choice(flat_tags_dict[tag])}{x.text[match.end():]}"
    return x
# Swap
for i in range(3):
    sample_df = pd.DataFrame([{"text": "a survey of reinforcement learning for nlp tasks."}])
    sample_df.text = sample_df.text.apply(preprocess, lower=True, stem=False)
    print (swap_aliases(sample_df.iloc[0]).text)
survey reinforcement learning nlproc tasks
survey rl nlp tasks
survey rl nlp tasks
# Undesired behavior (needs contextual insight)
for i in range(3):
    sample_df = pd.DataFrame([{"text": "Autogenerate your CV to apply for jobs using NLP."}])
    sample_df.text = sample_df.text.apply(preprocess, lower=True, stem=False)
    print (swap_aliases(sample_df.iloc[0]).text)
autogenerate vision apply jobs using nlp
autogenerate cv apply jobs using natural language processing
autogenerate cv apply jobs using nlproc

Now we'll define an augmentation policy to apply our transformation functions with certain rules (how many samples to generate, whether to keep the original data point, etc.).

from snorkel.augmentation import ApplyOnePolicy, PandasTFApplier
# Transformation function (TF) policy
policy = ApplyOnePolicy(n_per_original=5, keep_original=True)
tf_applier = PandasTFApplier([swap_aliases], policy)
train_df_augmented = tf_applier.apply(train_df)
train_df_augmented.drop_duplicates(subset=["text"], inplace=True)
train_df_augmented.head()
text tags
0 google stock price prediction using alpha vant... [flask]
0 google stock price inference using alpha vanta... [flask]
1 pifuhd high resolution 3d human digitization r... [computer-vision]
1 pifuhd high resolution three dimensional human... [computer-vision]
1 pifuhd high resolution 3 dimensional human dig... [computer-vision]
len(train_df), len(train_df_augmented)
(1001, 1981)

For now, we'll skip the data augmentation because it's quite fickle and, empirically, it doesn't improve performance much. But we can see how effective this can be once we can control what type of vocabulary to augment on and exactly what to augment with.

Warning

Regardless of what method we use, it's important to validate that we're not just augmenting for the sake of augmentation. We can do this by executing any existing data validation tests and even creating specific tests to apply on augmented data.
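
For example, a few lightweight checks on the augmented training split might look like this (a minimal sketch; the helper name and thresholds are hypothetical, and real tests would mirror whatever data validation suite we already have):

# Hypothetical sanity checks on the augmented training split
def validate_augmented(df, min_words=3, max_words=200):
    assert not df.text.isnull().any(), "empty text after augmentation"
    assert df.text.str.split().str.len().between(min_words, max_words).all(), "text length out of bounds"
    assert df.tags.map(len).gt(0).all(), "samples lost their tags"
    assert not df.duplicated(subset=["text"]).any(), "duplicate augmented samples"

validate_augmented(train_df_augmented)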
