Rethinking Batch Normalization in Transformers
We found that NLP batch statistics exhibit large variance throughout training, which leads to poor BN performance.
normalization batch-normalization power-normalization transformers natural-language-processing tutorial research paper arxiv:2003.07845

Ever wondered why BN is rarely used in NLP? We found that NLP batch statistics exhibit large variance throughout training, which leads to poor BN performance. To address this, we propose Power Normalization (PowerNorm), which outperforms both LN and BN on a range of NLP tasks.
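Below is a minimal sketch of the core idea, assuming PyTorch and a Transformer-style input of shape (batch, seq_len, features). It only illustrates the forward computation: activations are rescaled by a (running) quadratic mean psi^2 = E[x^2] rather than by per-batch mean/variance as in BatchNorm. The class name `SimplePowerNorm` and the hyperparameters are illustrative; the authors' actual PowerNorm implementation differs (it also modifies the backward pass and uses a warm-up phase), so see the paper and official code for details.

```python
import torch
import torch.nn as nn


class SimplePowerNorm(nn.Module):
    """Illustrative sketch (not the official implementation): normalize by the
    running quadratic mean E[x^2] instead of per-batch mean/variance.

    Expects input of shape (batch, seq_len, num_features).
    """

    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        super().__init__()
        self.eps = eps
        self.momentum = momentum
        self.weight = nn.Parameter(torch.ones(num_features))
        self.bias = nn.Parameter(torch.zeros(num_features))
        # Running estimate of the per-feature quadratic mean psi^2.
        self.register_buffer("running_psi2", torch.ones(num_features))

    def forward(self, x):
        if self.training:
            # Quadratic mean over the batch and sequence dimensions.
            psi2 = x.pow(2).mean(dim=(0, 1))
            with torch.no_grad():
                self.running_psi2.mul_(1 - self.momentum).add_(self.momentum * psi2)
        else:
            psi2 = self.running_psi2
        x_hat = x / torch.sqrt(psi2 + self.eps)
        return x_hat * self.weight + self.bias


# Example usage (shapes are illustrative):
pn = SimplePowerNorm(num_features=512)
y = pn(torch.randn(8, 128, 512))
```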


Authors
NLP Beginner 🥑
Similar projects
EvoNorm layers in TensorFlow 2
Presents implementations of EvoNormB0 and EvoNormS0 layers as proposed in Evolving Normalization-Activation Layers by Liu et al.
EvoNorms: Evolving Normalization-Activation Layers
We use evolution to design new layers called EvoNorms, which outperform BatchNorm-ReLU on many tasks.
Why Batch Norm Causes Exploding Gradients
Our beloved Batch Norm can actually cause exploding gradients, at least at initialization time.
Gradient Centralization
An optimization technique that operates directly on gradients by centralizing the gradient vectors to have zero mean (see the sketch below).
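The following is a minimal PyTorch sketch of that centralization step, not the linked project's code: for any weight tensor with more than one dimension, the mean over all dimensions except the output dimension is subtracted from its gradient before the optimizer step. The helper name `centralize_gradient` and the toy model are assumptions for illustration.

```python
import torch


def centralize_gradient(grad):
    """Subtract the per-output-slice mean from a gradient tensor so that each
    slice has zero mean; tensors with a single dimension are left unchanged."""
    if grad.dim() > 1:
        mean = grad.mean(dim=tuple(range(1, grad.dim())), keepdim=True)
        grad = grad - mean
    return grad


# Illustrative usage: centralize gradients right before the optimizer step.
model = torch.nn.Linear(16, 4)
loss = model(torch.randn(8, 16)).pow(2).mean()
loss.backward()
for p in model.parameters():
    if p.grad is not None:
        p.grad.copy_(centralize_gradient(p.grad))
```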