Rethinking Batch Normalization in Transformers
We found that NLP batch statistics exhibit large variance throughout training, which leads to poor BN performance.
power-normalization batch-normalization transformers natural-language-processing tutorial
Objectives & Highlights

Ever wondered why BN is not used in NLP? We found that batch statistics in NLP exhibit large variance throughout training, which leads to poor BN performance. To address this, we propose PowerNorm, which achieves state-of-the-art results compared with LN and BN.
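To give a rough sense of the idea, below is a minimal, unofficial PyTorch sketch of a power-normalization-style layer: it scales activations by a running quadratic mean (E[x^2]) instead of per-batch mean/variance. The class name `SimplePowerNorm` and its arguments are illustrative, and this simplification omits parts of the paper's full method (e.g., its backward-pass corrections).

```python
import torch
import torch.nn as nn

class SimplePowerNorm(nn.Module):
    """Illustrative power-normalization sketch: normalize by the running
    quadratic mean psi^2 = E[x^2] rather than per-batch mean/variance.
    Not the authors' official implementation."""

    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        super().__init__()
        self.eps = eps
        self.momentum = momentum
        self.weight = nn.Parameter(torch.ones(num_features))
        self.bias = nn.Parameter(torch.zeros(num_features))
        # Running estimate of the per-feature quadratic mean psi^2.
        self.register_buffer("running_psi2", torch.ones(num_features))

    def forward(self, x):
        # x: (batch, seq_len, num_features)
        if self.training:
            psi2 = x.pow(2).mean(dim=(0, 1))  # batch quadratic mean
            with torch.no_grad():
                self.running_psi2.mul_(1 - self.momentum).add_(self.momentum * psi2)
        else:
            psi2 = self.running_psi2
        # No mean subtraction: scale by sqrt(psi^2 + eps) only.
        x_hat = x / torch.sqrt(psi2 + self.eps)
        return self.weight * x_hat + self.bias
```

A quick usage check: `SimplePowerNorm(512)(torch.randn(8, 32, 512))` returns a tensor of the same shape, and switching the module to `.eval()` makes it normalize with the running statistic instead of the current batch.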
