🧠 Neural Network Weight Initialization

Why randn() beats rand(), demonstrated with gradient flow through multiple layers

Key Insight: Neural networks learn by sending error signals backward through layers (called "backpropagation"). When a network has many layers, these signals get multiplied at each layer, so small weaknesses compound: if each layer weakens the signal by just 10%, after 10 layers only 0.9^10 ≈ 35% of your signal remains, a 65% loss! That's why weight initialization matters: randn() (normal distribution) keeps signals healthier than rand() (uniform distribution).
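The compounding arithmetic is easy to verify:

```python
# Signal remaining after a 10% per-layer loss compounds over 10 layers.
retained = 0.9 ** 10
print(f"retained: {retained:.0%}, lost: {1 - retained:.0%}")  # retained: 35%, lost: 65%
```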
  • Layers (default 10): hidden layers between input & output. Chart plots the Normal/Uniform ratio with Neurons=50 and Features=64, varying Layers from 1 to 20.
  • Neurons (default 50): neurons per hidden layer. Chart plots the ratio with Layers=10 and Features=64, varying Neurons from 1 to 100.
  • Features (default 64): input features to the network. Chart plots the ratio with Layers=10 and Neurons=50, varying Features from 1 to 128.

In every chart, above 1x = normal wins; below 1x = uniform wins.

[Interactive panel: 📊 Uniform (rand) vs 🎯 Normal (randn) final-layer gradient, updated live. Click "Run 1x" to start.]

πŸ“‰ Gradient Magnitude Through Layers (Log Scale)

πŸ“– How to read this chart: Gradients flow backward during backpropagation.

  • Right side (L10) = Output layer, where gradients START (both methods begin equal)
  • Left side (L0) = First layer, where gradients END UP (this is what matters!)

Watch how the lines diverge as you move left. Uniform's gradient decays exponentially while Normal's stays healthier. The bigger the gap at L0, the worse uniform performs for training early layers.
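This decay can be reproduced with a minimal NumPy sketch (an illustrative reconstruction, not the page's actual simulation code): run a forward pass through a stack of tanh layers under each initialization, then backpropagate a unit error signal and record the gradient norm reaching each layer.

```python
import numpy as np

def gradient_flow(init, n_layers=10, n_neurons=50, n_features=64, seed=0):
    """Forward pass through tanh layers, then backprop; returns per-layer gradient norms."""
    rng = np.random.default_rng(seed)
    sizes = [n_features] + [n_neurons] * n_layers
    weights, activations = [], []
    x = rng.standard_normal(n_features)
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        W = init(rng, fan_out, fan_in)
        x = np.tanh(W @ x)
        weights.append(W)
        activations.append(x)
    g = np.ones(n_neurons)  # unit error signal injected at the output
    norms = []
    for W, a in zip(reversed(weights), reversed(activations)):
        g = W.T @ (g * (1.0 - a ** 2))  # tanh'(z) = 1 - tanh(z)^2
        norms.append(float(np.linalg.norm(g)))
    return norms[::-1]  # norms[0] = gradient reaching the earliest layer

uniform = lambda rng, m, n: rng.random((m, n))           # rand(): U[0, 1)
normal = lambda rng, m, n: rng.standard_normal((m, n))   # randn(): N(0, 1)

u, nrm = gradient_flow(uniform), gradient_flow(normal)
print(f"gradient at first layer: uniform={u[0]:.3e}, normal={nrm[0]:.3e}")
```

The exact numbers depend on the seed and on how the rand() weights are scaled; the qualitative per-layer attenuation is what the chart visualizes.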

[Charts: Uniform Layer-by-Layer · Normal Layer-by-Layer]

πŸ“Š Pre-activation (z) Distribution Across All Layers

What is this? Before each neuron applies tanh, it computes z = Σ(w × x). This "z" is the pre-activation.

Why it matters: tanh(z) saturates when |z| is large. At |z|=1, tanh'(z) ≈ 0.42; at |z|=2, tanh'(z) ≈ 0.07. We want z near 0.

Stats explained: Avg |z| = average magnitude. Std Dev = spread of values. Lower values = tighter distribution = better gradient flow.

Colors: Red = Uniform, Teal = Normal
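The saturation numbers quoted above follow from the identity tanh'(z) = 1 - tanh(z)^2, which you can check directly:

```python
import numpy as np

def tanh_deriv(z):
    # tanh'(z) = 1 - tanh(z)^2; it shrinks quickly as |z| grows.
    return 1.0 - np.tanh(z) ** 2

for z in (0.0, 1.0, 2.0, 3.0):
    print(f"|z|={z}: tanh' = {tanh_deriv(z):.2f}")
# |z|=0.0: tanh' = 1.00
# |z|=1.0: tanh' = 0.42
# |z|=2.0: tanh' = 0.07
# |z|=3.0: tanh' = 0.01
```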

πŸ“Š Statistical Summary (Multiple Runs)

[Live counters: Simulations · Uniform Avg Final Grad · Normal Avg Final Grad · Normal Wins]

πŸ”¬ Why This Happens: The Mathematics

In backpropagation, gradients flow backwards through layers. At each layer with tanh activation, the gradient is multiplied by the local derivative:

∂L/∂W_k = ∂L/∂output × ∏_{i=k+1}^{n} (W_i × tanh'(z_i))

The problem with uniform initialization: Weights at the edges (near ±1) are just as likely as weights near 0. This creates pre-activations with higher variance, pushing more neurons into saturation where tanh'(z) → 0.

Why normal (randn) works better: The bell curve concentrates most weights near zero. This keeps pre-activations smaller, neurons stay in the linear region of tanh, and gradients flow more freely.

Uniform: Higher variance in pre-activations → More saturation → Weaker gradients per layer → Exponential decay

Normal: Controlled pre-activations → Less saturation → Healthier gradients → Better gradient flow

The compounding effect: Even if normal only gives 20% better gradients per layer, after 10 layers: 1.2^10 ≈ 6.2x stronger final gradient!
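Like the decay example, this advantage compounds multiplicatively and can be checked in one line:

```python
# A 20% per-layer gradient advantage compounds over 10 layers.
advantage = 1.2 ** 10
print(f"{advantage:.1f}x stronger")  # 6.2x stronger
```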
