🧠 Neural Network Weight Initialization

Why randn() beats rand(), demonstrated with gradient flow through multiple layers

Key Insight: Neural networks learn by sending error signals backward through layers (called "backpropagation"). When a network has many layers, these signals get multiplied at each layer, so small weaknesses compound: if each layer weakens the signal by just 10%, after 10 layers only 0.9^10 ≈ 35% of your signal remains, a 65% loss! That's why weight initialization matters: randn() (normal distribution) keeps signals healthier than rand() (uniform distribution).
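The compounding arithmetic is easy to verify:

```python
# Signal remaining after a 10% per-layer loss compounds over 10 layers.
retained = 0.9 ** 10
print(f"retained: {retained:.0%}, lost: {1 - retained:.0%}")  # retained: 35%, lost: 65%
```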
  • Layers (default 10): hidden layers between input & output. Chart plots the Normal/Uniform ratio with Neurons=50 and Features=64, varying Layers from 1 to 20.
  • Neurons (default 50): neurons per hidden layer. Chart plots the ratio with Layers=10 and Features=64, varying Neurons from 1 to 100.
  • Features (default 64): input features to the network. Chart plots the ratio with Layers=10 and Neurons=50, varying Features from 1 to 128.

In every chart, above 1x = normal wins; below 1x = uniform wins.

[Interactive panel: 📊 Uniform (rand) vs 🎯 Normal (randn) final-layer gradient, updated live. Click "Run 1x" to start.]

πŸ“‰ Gradient Magnitude Through Layers (Log Scale)

πŸ“– How to read this chart: Gradients flow backward during backpropagation.

  • Right side (L10) = Output layer, where gradients START (both methods begin equal)
  • Left side (L0) = First layer, where gradients END UP (this is what matters!)

Watch how the lines diverge as you move left. Uniform's gradient decays exponentially while Normal's stays healthier. The bigger the gap at L0, the worse uniform performs for training early layers.
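This decay can be reproduced with a minimal NumPy sketch (an illustrative reconstruction, not the page's actual simulation code): run a forward pass through a stack of tanh layers under each initialization, then backpropagate a unit error signal and record the gradient norm reaching each layer.

```python
import numpy as np

def gradient_flow(init, n_layers=10, n_neurons=50, n_features=64, seed=0):
    """Forward pass through tanh layers, then backprop; returns per-layer gradient norms."""
    rng = np.random.default_rng(seed)
    sizes = [n_features] + [n_neurons] * n_layers
    weights, activations = [], []
    x = rng.standard_normal(n_features)
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        W = init(rng, fan_out, fan_in)
        x = np.tanh(W @ x)
        weights.append(W)
        activations.append(x)
    g = np.ones(n_neurons)  # unit error signal injected at the output
    norms = []
    for W, a in zip(reversed(weights), reversed(activations)):
        g = W.T @ (g * (1.0 - a ** 2))  # tanh'(z) = 1 - tanh(z)^2
        norms.append(float(np.linalg.norm(g)))
    return norms[::-1]  # norms[0] = gradient reaching the earliest layer

uniform = lambda rng, m, n: rng.random((m, n))           # rand(): U[0, 1)
normal = lambda rng, m, n: rng.standard_normal((m, n))   # randn(): N(0, 1)

u, nrm = gradient_flow(uniform), gradient_flow(normal)
print(f"gradient at first layer: uniform={u[0]:.3e}, normal={nrm[0]:.3e}")
```

The exact numbers depend on the seed and on how the rand() weights are scaled; the qualitative per-layer attenuation is what the chart visualizes.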

[Charts: Uniform Layer-by-Layer · Normal Layer-by-Layer]

πŸ“Š Pre-activation (z) Distribution Across All Layers

What is this? Before each neuron applies tanh, it computes z = Σ(w × x). This "z" is the pre-activation.

Why it matters: tanh(z) saturates when |z| is large. At |z|=1, tanh'(z) ≈ 0.42; at |z|=2, tanh'(z) ≈ 0.07. We want z near 0.

Stats explained: Avg |z| = average magnitude. Std Dev = spread of values. Lower values = tighter distribution = better gradient flow.

Colors: Red = Uniform, Teal = Normal
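The saturation numbers quoted above follow from the identity tanh'(z) = 1 - tanh(z)^2, which you can check directly:

```python
import numpy as np

def tanh_deriv(z):
    # tanh'(z) = 1 - tanh(z)^2; it shrinks quickly as |z| grows.
    return 1.0 - np.tanh(z) ** 2

for z in (0.0, 1.0, 2.0, 3.0):
    print(f"|z|={z}: tanh' = {tanh_deriv(z):.2f}")
# |z|=0.0: tanh' = 1.00
# |z|=1.0: tanh' = 0.42
# |z|=2.0: tanh' = 0.07
# |z|=3.0: tanh' = 0.01
```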

πŸ“Š Statistical Summary (Multiple Runs)

[Live counters: Simulations · Uniform Avg Final Grad · Normal Avg Final Grad · Normal Wins]

πŸ”¬ Why This Happens: The Mathematics

In backpropagation, gradients flow backwards through layers. At each layer with tanh activation, the gradient is multiplied by the local derivative:

∂L/∂W_k = ∂L/∂output × ∏_{i=k+1}^{n} (W_i × tanh'(z_i))

The problem with uniform initialization: Weights at the edges (near ±1) are just as likely as weights near 0. This creates pre-activations with higher variance, pushing more neurons into saturation where tanh'(z) → 0.

Why normal (randn) works better: The bell curve concentrates most weights near zero. This keeps pre-activations smaller, neurons stay in the linear region of tanh, and gradients flow more freely.

Uniform: Higher variance in pre-activations → More saturation → Weaker gradients per layer → Exponential decay

Normal: Controlled pre-activations → Less saturation → Healthier gradients → Better gradient flow

The compounding effect: Even if normal only gives 20% better gradients per layer, after 10 layers: 1.2^10 ≈ 6.2x stronger final gradient!
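Like the decay example, this advantage compounds multiplicatively and can be checked in one line:

```python
# A 20% per-layer gradient advantage compounds over 10 layers.
advantage = 1.2 ** 10
print(f"{advantage:.1f}x stronger")  # 6.2x stronger
```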
