The Reformer - Pushing the Limits of Language Modeling
An in-depth understanding of each of the key features of the Reformer.
Based on the paper *Reformer: The Efficient Transformer* (arXiv:2001.04451).

The memory improvements can be attributed to four features the Reformer authors introduced to the transformer world:

  • Reformer Self-Attention Layer - How to efficiently implement self-attention without being restricted to a local context?
  • Chunked Feed Forward Layers - How to get a better time-memory trade-off for large feed forward layers?
  • Reversible Residual Layers - How to drastically reduce memory consumption in training by a smart residual architecture?
  • Axial Positional Encodings - How to make positional encodings usable for extremely large input sequences?
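To make the second and third bullet points concrete, here is a minimal NumPy sketch of chunked feed forward layers. It is not the Reformer implementation; the function names, shapes, and chunk size are illustrative. The key observation is that the feed forward layer is applied to each position independently, so the sequence dimension can be processed chunk by chunk: the large intermediate activation of size `seq_len × d_ff` never has to exist in memory all at once, at the cost of a loop over chunks.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # standard two-layer feed forward block: expand to d_ff, ReLU, project back
    return np.maximum(x @ W1 + b1, 0) @ W2 + b2

def chunked_feed_forward(x, W1, b1, W2, b2, chunk_size):
    # process the sequence dimension in chunks; only a
    # (chunk_size x d_ff) intermediate is alive at any time
    chunks = [feed_forward(c, W1, b1, W2, b2)
              for c in np.split(x, x.shape[0] // chunk_size)]
    return np.concatenate(chunks)

# made-up toy dimensions for illustration
rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 8, 4, 16
x = rng.standard_normal((seq_len, d_model))
W1, b1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)

out_full = feed_forward(x, W1, b1, W2, b2)
out_chunked = chunked_feed_forward(x, W1, b1, W2, b2, chunk_size=2)
```

Because the computation per position is unchanged, the chunked output is numerically identical to the unchunked one; only the peak memory and the runtime trade off against each other.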

The goal of this blog post is to give the reader an in-depth understanding of each of the four Reformer features mentioned above. While the explanations focus on the Reformer, the reader should also gain a better intuition of the circumstances under which each of the four features can be effective for other transformer models. The four sections are only loosely connected and can therefore be read individually.
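The reversible residual idea from the list above can likewise be sketched in a few lines. This is a hedged toy sketch, not the Reformer's code: `F` and `G` stand in for the attention and feed forward sub-layers, and the shapes are made up. Because the inputs of a reversible block can be recomputed exactly from its outputs, no intermediate activations need to be stored for the backward pass, which is where the training-memory savings come from.

```python
import numpy as np

def F(x):
    # stand-in for the attention sub-layer
    return np.tanh(x)

def G(x):
    # stand-in for the feed forward sub-layer
    return np.tanh(2.0 * x)

def reversible_forward(x1, x2):
    # the input is split into two streams x1, x2
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def reversible_backward(y1, y2):
    # recompute the inputs from the outputs -- nothing was stored
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

rng = np.random.default_rng(0)
x1, x2 = rng.standard_normal((4, 3)), rng.standard_normal((4, 3))
y1, y2 = reversible_forward(x1, x2)
r1, r2 = reversible_backward(y1, y2)
```

Running the backward pass recovers `x1` and `x2` exactly, so during training the activations of each layer can be rebuilt on the fly instead of being cached.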

Author: @patrickvonplaten