Tips for Successfully Training Transformers on Small Datasets
It turns out that you can train transformers on small datasets fairly easily when you use the right tricks (and have the patience to train for a very long time).
Tags: transformers, small-datasets, training, ptb, wikitext-2, dropout, embeddings, data-augmentation, natural-language-processing, tutorial, article, code

• The most dramatic performance gain comes from discrete embedding dropout: you embed as usual, but with probability p you zero the entire word vector. This is akin to masked language modeling, but the goal is not to predict the mask; it is regular language modeling with an uncertain context.
• The second most important factor is regular input dropout: you take the embeddings and drop out individual elements with probability p. This also has a data-augmentation effect very similar to dropping out random pixels in images (see the sketch after this list). What is a good way to think about this?
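
To make both tricks concrete, here is a minimal PyTorch sketch. The module name, the default probabilities, and the choice to drop whole vectors per token occurrence (rather than per vocabulary word, as some embedding-dropout variants do) are assumptions for illustration, not details from the original write-up.

```python
import torch
import torch.nn as nn


class EmbeddingWithDropout(nn.Module):
    """Illustrative embedding layer combining the two kinds of dropout
    described above. Names and defaults are assumptions, not the
    author's implementation."""

    def __init__(self, vocab_size, embed_dim, word_dropout=0.1, input_dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.word_dropout = word_dropout                # p for zeroing whole word vectors
        self.input_dropout = nn.Dropout(input_dropout)  # element-wise dropout

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> embedded: (batch, seq_len, embed_dim)
        embedded = self.embedding(token_ids)

        if self.training and self.word_dropout > 0:
            # Discrete embedding dropout: with probability p, zero the entire
            # word vector. Unlike masked LM there is no mask token to predict;
            # the model just does regular LM with an uncertain context.
            keep = torch.rand(token_ids.shape, device=token_ids.device) >= self.word_dropout
            embedded = embedded * keep.unsqueeze(-1).type_as(embedded)
            # Rescale so the expected magnitude matches evaluation time.
            embedded = embedded / (1.0 - self.word_dropout)

        # Regular input dropout on individual embedding elements,
        # akin to dropping random pixels from an image.
        return self.input_dropout(embedded)


# Usage sketch
layer = EmbeddingWithDropout(vocab_size=10000, embed_dim=512)
tokens = torch.randint(0, 10000, (8, 35))  # (batch, seq_len)
out = layer(tokens)                        # (8, 35, 512)
```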


Similar projects
Linear Attention Transformer
A fully featured Transformer that mixes (QKᵀ)V local attention with Q(KᵀV) global attention (scales linearly with respect to sequence length).
PyTorch Transformers Tutorials
A set of annotated Jupyter notebooks that give users a template for fine-tuning transformer models on downstream NLP tasks such as classification and NER.
Anti-Patterns in NLP (8 types of NLP idiots)
A talk that discusses recurring industrial problems in building NLP solutions.
Transformers - Hugging Face
🤗 Transformers: State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch.