Tips for Successfully Training Transformers on Small Datasets
It turns out that you can easily train transformers on small datasets when you use the right tricks (and have the patience to train for a very long time).
transformers small-datasets training ptb wikitext-2 dropout embeddings data-augmentation natural-language-processing tutorial
Objectives & Highlights

• The most dramatic performance gain comes from discrete embedding dropout: you embed as usual, but with probability p you zero the entire word vector. This is akin to masked language modeling, except the goal is not to predict the mask; it is just regular language modeling with an uncertain context.
• The second most important factor is regular input dropout: you take the embeddings and drop out individual elements with probability p. This also has a data-augmentation effect, very similar to dropping out random pixels in images (both variants are sketched in the code below). What is a good way to think about this?
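A minimal PyTorch sketch of the two dropout variants, under stated assumptions: the `WordDropout` module name, the dropout rates, and the tensor shapes are illustrative choices, not taken from the author's code.

```python
import torch
import torch.nn as nn


class WordDropout(nn.Module):
    """Discrete embedding dropout: zero entire word vectors with probability p."""

    def __init__(self, p: float = 0.1):
        super().__init__()
        self.p = p  # illustrative default, not a recommended value

    def forward(self, embedded: torch.Tensor) -> torch.Tensor:
        # embedded: (batch, seq_len, embed_dim)
        if not self.training or self.p == 0.0:
            return embedded
        # One Bernoulli draw per token, broadcast over the embedding dimension,
        # so a dropped word loses its whole vector; the objective stays regular
        # language modeling, just with an uncertain context.
        keep = torch.bernoulli(
            torch.full(embedded.shape[:2], 1.0 - self.p, device=embedded.device)
        ).unsqueeze(-1)
        # Inverted-dropout scaling keeps the expected activation unchanged.
        return embedded * keep / (1.0 - self.p)


# Usage sketch (hypothetical sizes):
vocab_size, embed_dim = 10_000, 256
embedding = nn.Embedding(vocab_size, embed_dim)
word_dropout = WordDropout(p=0.2)     # discrete embedding dropout (whole vectors)
input_dropout = nn.Dropout(p=0.2)     # regular input dropout (individual elements)

tokens = torch.randint(0, vocab_size, (8, 35))   # (batch, seq_len)
x = embedding(tokens)
x = word_dropout(x)    # drop entire word vectors with probability p
x = input_dropout(x)   # drop individual embedding elements with probability p
```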


Similar projects
Cycle Text-To-Image GAN with BERT
Generating images from their captions, building on state-of-the-art GAN architectures.
T5 fine-tuning
A Colab notebook showcasing how to fine-tune the T5 model on various NLP tasks (especially non-text-2-text tasks with a text-2-text approach).
Jukebox: A Generative Model for Music
We’re introducing Jukebox, a neural net that generates music, including rudimentary singing, as raw audio in a variety of genres and artist styles.
Rethinking Batch Normalization in Transformers
We found that NLP batch statistics exhibit large variance throughout training, which leads to poor BN performance.