A Survey of Long-Term Context in Transformers
Over the past two years the NLP community has developed a veritable zoo of methods to combat the quadratic cost of multi-head self-attention.

In this post we'll focus on six promising approaches:

• Sparse Transformers
• Adaptive Span Transformers
• Transformer-XL
• Compressive Transformers
• Reformer
• Routing Transformer
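To make the cost these methods target concrete, here is a minimal sketch of vanilla multi-head self-attention (assuming a PyTorch setting, with learned projections omitted and names chosen for illustration): every head materializes a seq_len × seq_len score matrix, which is the quadratic term in time and memory that the six approaches above try to shrink.

```python
import torch
import torch.nn.functional as F

def multi_head_self_attention(x, num_heads):
    """Vanilla multi-head self-attention; the (seq_len x seq_len)
    score matrix per head is what makes the cost quadratic in length."""
    batch, seq_len, d_model = x.shape
    d_head = d_model // num_heads

    # For brevity, reuse x as queries, keys, and values (no learned projections).
    qkv = x.view(batch, seq_len, num_heads, d_head).transpose(1, 2)  # (B, H, L, d_head)
    q = k = v = qkv

    # Attention scores: one (L x L) matrix per head -- O(L^2) time and memory.
    scores = q @ k.transpose(-2, -1) / d_head ** 0.5   # (B, H, L, L)
    weights = F.softmax(scores, dim=-1)

    out = weights @ v                                   # (B, H, L, d_head)
    return out.transpose(1, 2).reshape(batch, seq_len, d_model)

x = torch.randn(1, 1024, 512)
y = multi_head_self_attention(x, num_heads=8)
print(y.shape)  # torch.Size([1, 1024, 512])
```

At a sequence length of 1,024 the score tensor above already holds over a million entries per head; doubling the context quadruples it, which is why sparsity, recurrence, compression, and hashing-based routing are all attractive ways to trade exact attention for longer reach.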
