Why Batch Norm Causes Exploding Gradients
Our beloved Batch Norm can actually cause exploding gradients, at least at initialization time.

Inserting Batch Norm into a network means that in the forward pass, each neuron's activation is divided by its standard deviation, σ, computed over a minibatch of samples. In the backward pass, gradients flowing through that layer are divided by the same σ. In ReLU networks at initialization, we'll show that this standard deviation is less than 1; in fact, it can be approximated as σ ≈ √((π−1)/π) ≈ 0.82. Since the division happens at every layer, gradient norms are amplified by a factor of roughly 1/0.82 ≈ 1.21 per layer, so gradients in the early layers of a deep network grow exponentially with depth.
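The value σ ≈ 0.82 can be checked numerically. Below is a minimal NumPy sketch (not from the article): it treats the previous layer's Batch Norm outputs as standard Gaussians, applies ReLU, and feeds them through a single He-initialized linear unit (weight variance 2/n, an assumption for this sketch). The minibatch standard deviation of the resulting pre-activation lands near √((π−1)/π).

```python
import numpy as np

rng = np.random.default_rng(0)
n, batch = 4096, 4096

# Previous layer's outputs after Batch Norm: approximately standard Gaussian.
x = rng.standard_normal((batch, n))

# ReLU activation: each unit now has variance (pi - 1) / (2 * pi).
a = np.maximum(x, 0.0)

# He-initialized weights for one downstream neuron: variance 2/n, so ||w||^2 ~ 2.
w = rng.standard_normal(n) * np.sqrt(2.0 / n)

# Pre-activation of the next layer, and its standard deviation over the batch.
h = a @ w
sigma = h.std()

print(sigma)                              # close to 0.82
print(np.sqrt((np.pi - 1) / np.pi))       # predicted value, ~0.8256
```

Since Batch Norm's backward pass divides the incoming gradient by this σ < 1, stacking L such layers multiplies early-layer gradient norms by roughly (1/0.82)^L ≈ 1.21^L at initialization.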

