Rethinking Batch Normalization in Transformers
We found that batch statistics computed on NLP data exhibit large variance throughout training, which causes batch normalization (BN) to perform poorly.
normalization batch-normalization power-normalization transformers
