Gradient Magnitude Through Layers (Log Scale)
How to read this chart: Gradients flow backward during backpropagation.
- Right side (L10) = Output layer — where gradients START (both methods begin equal)
- Left side (L0) = First layer — where gradients END UP (this is what matters!)
Watch how the lines diverge as you move left. Uniform's gradient decays exponentially while Normal's stays healthier. The bigger the gap at L0, the worse uniform performs for training early layers.
Uniform Layer-by-Layer
Normal Layer-by-Layer
Pre-activation (z) Distribution Across All Layers
What is this? Before each neuron applies tanh, it computes z = Σ(w × x). This "z" is the pre-activation.
Why it matters: tanh(z) saturates when |z| is large. At |z|=1, tanh' ≈ 0.42. At |z|=2, tanh' ≈ 0.07. We want z near 0.
Stats explained: Avg |z| = average magnitude. Std Dev = spread of values. Lower values = tighter distribution = better gradient flow.
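Both the quoted tanh' values and the Avg |z| / Std Dev stats can be checked directly. The sketch below evaluates tanh'(z) = 1 - tanh(z)² and computes the stats on simulated pre-activations; the layer width (256), uniform range (±0.5), and normal std (1/√fan_in) are illustrative assumptions:

```python
# Sketch: verify the tanh' saturation values quoted above and compute
# Avg |z| / Std Dev for simulated pre-activations under both inits.
import numpy as np

def tanh_grad(z):
    # tanh'(z) = 1 - tanh(z)^2: near 1 at z=0, tiny once tanh saturates.
    return 1.0 - np.tanh(z) ** 2

print(f"tanh'(1) = {tanh_grad(1.0):.2f}")  # ~0.42
print(f"tanh'(2) = {tanh_grad(2.0):.2f}")  # ~0.07

rng = np.random.default_rng(0)
width = 256                     # illustrative layer width
x = rng.standard_normal(width)  # layer input
stats = {}
for name, W in [
    ("uniform", rng.uniform(-0.5, 0.5, (width, width))),
    ("normal",  rng.standard_normal((width, width)) / np.sqrt(width)),
]:
    z = W @ x                   # pre-activation: z = sum(w * x)
    stats[name] = (np.abs(z).mean(), z.std())
    print(f"{name:7s}  Avg |z| = {stats[name][0]:.2f}  "
          f"Std Dev = {stats[name][1]:.2f}")
```

The uniform weights produce a much wider z distribution, pushing most pre-activations into tanh's flat regions, which is exactly the pattern the histogram panel visualizes.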
Colors: Red = Uniform, Teal = Normal