A Survey of Long-Term Context in Transformers
Over the past two years the NLP community has developed a veritable zoo of methods to combat the high cost of multi-head self-attention.
Talking-Heads Attention
A variation on multi-head attention which includes linear projections across the attention-heads dimension, immediately before and after the softmax ...
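To make the description above concrete, here is a minimal NumPy sketch of the idea: attention logits are mixed across the heads dimension by one learned matrix just before the softmax, and the resulting weights are mixed again by a second matrix just after it. The function name, the argument names `pre_proj` and `post_proj`, the use of square (heads x heads) mixing matrices, and the 1/sqrt(d_k) scaling are all assumptions made for illustration; none of them come from the summary above.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def talking_heads_attention(q, k, v, pre_proj, post_proj):
    """Illustrative sketch, not a reference implementation.

    q: (heads, n_query, d_k)    k: (heads, n_key, d_k)    v: (heads, n_key, d_v)
    pre_proj, post_proj: (heads, heads) mixing matrices (hypothetical names;
    square shapes are a simplifying assumption).
    """
    d_k = q.shape[-1]
    # Per-head dot-product logits. The 1/sqrt(d_k) scaling is assumed here,
    # borrowed from standard scaled dot-product attention.
    logits = np.einsum("hqd,hkd->hqk", q, k) / np.sqrt(d_k)
    # Linear projection across the heads dimension, just before the softmax.
    logits = np.einsum("hqk,hg->gqk", logits, pre_proj)
    weights = softmax(logits, axis=-1)
    # Second projection across the heads dimension, just after the softmax.
    weights = np.einsum("gqk,go->oqk", weights, post_proj)
    # Weighted sum of values, one output per (mixed) head.
    return np.einsum("oqk,okd->oqd", weights, v)


rng = np.random.default_rng(0)
heads, n_query, n_key, dim = 4, 5, 6, 8
q = rng.normal(size=(heads, n_query, dim))
k = rng.normal(size=(heads, n_key, dim))
v = rng.normal(size=(heads, n_key, dim))
out = talking_heads_attention(
    q, k, v,
    pre_proj=rng.normal(size=(heads, heads)),
    post_proj=rng.normal(size=(heads, heads)),
)
print(out.shape)
```

With both mixing matrices set to the identity, this sketch reduces to ordinary multi-head attention, which is a convenient sanity check.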
An overview of attention in neural networks: soft attention, attention maps, local and global attention, and multi-head attention.
