Rethinking Batch Normalization in Transformers
normalization batch-normalization power-normalization transformers natural-language-processing tutorial research paper arxiv:2003.07845

Ever wondered why batch normalization (BN) is rarely used in NLP? We found that NLP batch statistics exhibit large variance throughout training, which leads to poor BN performance. To address this, we propose Power Normalization (PN), which achieves state-of-the-art results compared to LN and BN.
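The project page itself has no code, so the following is a minimal, hedged PyTorch sketch of the core idea: instead of normalizing by per-batch mean and variance, normalize by a quadratic mean (the "power" statistic) and keep a running estimate of it. The class name PowerNormSketch, the alpha default, and the plain moving-average update are illustrative assumptions, not the authors' exact implementation, which also handles backpropagation through the running statistic in a more careful, approximate way.

```python
# Simplified sketch of the Power Normalization idea (assumed names/defaults).
import torch
import torch.nn as nn


class PowerNormSketch(nn.Module):
    """Normalize by a running quadratic mean instead of per-batch mean/variance."""

    def __init__(self, num_features: int, alpha: float = 0.9, eps: float = 1e-5):
        super().__init__()
        self.alpha = alpha  # moving-average factor for the running statistic (assumed value)
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(num_features))  # learnable scale
        self.bias = nn.Parameter(torch.zeros(num_features))   # learnable shift
        # psi2 tracks the running quadratic mean (second moment) per feature
        self.register_buffer("psi2", torch.ones(num_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, num_features); note: no mean subtraction, unlike BN
        if self.training:
            batch_psi2 = x.pow(2).mean(dim=(0, 1))  # per-feature quadratic mean
            # Update the running statistic without tracking gradients through it.
            self.psi2 = self.alpha * self.psi2 + (1 - self.alpha) * batch_psi2.detach()
            denom = (batch_psi2 + self.eps).sqrt()  # use the batch statistic while training
        else:
            denom = (self.psi2 + self.eps).sqrt()   # use the running statistic at inference
        return self.weight * (x / denom) + self.bias


# Example usage: normalize a batch of token representations.
pn = PowerNormSketch(num_features=512)
out = pn(torch.randn(8, 128, 512))  # (batch=8, seq_len=128, features=512)
```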


Authors
NLP Beginner 🥑
Similar projects
Why Batch Norm Causes Exploding Gradients
Our beloved Batch Norm can actually cause exploding gradients, at least at initialization time.
Normalization Techniques for Training Very Deep Neural Networks
How can we efficiently train very deep neural network architectures? What are the best in-layer normalization options? Read on and find out.
EvoNorms: Evolving Normalization-Activation Layers
We use evolution to design new layers called EvoNorms, which outperform BatchNorm-ReLU on many tasks.
EvoNorm layers in TensorFlow 2
Presents implementations of EvoNormB0 and EvoNormS0 layers as proposed in Evolving Normalization-Activation Layers by Liu et al.