Inserting Batch Norm into a network means that in the forward pass each neuron's activation is divided by its standard deviation σ, computed over a minibatch of samples. In the backward pass, gradients are divided by the same σ. In ReLU nets, we'll show that this standard deviation is less than 1; in fact, we can approximate σ ≈ √((π−1)/π) ≈ 0.82. Since this division occurs at every layer, gradient norms in early layers are amplified by a factor of roughly 1/0.82 ≈ 1.21 per layer.
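To make this concrete, here is a minimal PyTorch sketch (not the post's code; the depth, width, batch size, and loss are arbitrary illustrative choices) that stacks Linear → BatchNorm → ReLU blocks with He-initialized weights, measures the per-neuron batch standard deviation of the pre-BN activations, and checks how fast gradient norms grow from the output back toward the input:

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions, not from the post).
depth, width, batch = 30, 512, 1024

blocks = nn.ModuleList()
for _ in range(depth):
    lin = nn.Linear(width, width, bias=False)
    nn.init.kaiming_normal_(lin.weight, nonlinearity='relu')  # He init
    blocks.append(nn.Sequential(lin, nn.BatchNorm1d(width, affine=False), nn.ReLU()))

x = torch.randn(batch, width)

# Forward pass: record each pre-BN activation so we can inspect both
# its batch statistics and, after backward, its gradient.
h, pre_bn = x, []
for blk in blocks:
    z = blk[0](h)          # Linear output, i.e. the pre-BN activation
    z.retain_grad()
    pre_bn.append(z)
    h = blk[2](blk[1](z))  # BN, then ReLU

# Skip layer 0: its input is raw Gaussian noise, not a BN+ReLU output.
stds = [z.std(dim=0).mean().item() for z in pre_bn[1:]]
print(f"mean pre-BN batch std: {sum(stds)/len(stds):.3f} "
      f"(predicted sqrt((pi-1)/pi) = {math.sqrt((math.pi-1)/math.pi):.3f})")

# Backward pass with an arbitrary loss. Each BN divides the gradient by
# its sigma ~ 0.82, so gradient norms should grow by roughly 1.21 per
# layer as we move toward the input.
h.pow(2).sum().backward()
g = [z.grad.norm().item() for z in pre_bn]
rate = (g[0] / g[-1]) ** (1 / (depth - 1))
print(f"per-layer gradient growth: {rate:.3f} "
      f"(predicted {math.sqrt(math.pi/(math.pi-1)):.3f})")
```

On average the other per-layer factors cancel in the backward pass (multiplying by the He-scaled Wᵀ doubles the expected squared gradient norm, while the ReLU mask halves it), so the 1/σ from each BN layer is what's left over as the ≈1.21 growth rate.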