
ML Normalization: A Primer


TL;DR: Before normalization, training deep networks was like trying to stack cards in a hurricane—one small change would topple everything. BatchNorm (2015) changed the game by stabilizing training. LayerNorm (2016) made Transformers possible. RMSNorm (2019) proved simpler can be better. This is the story of how three techniques revolutionized deep learning.

These paper reviews are written more for me than for others. LLMs have been used to help with formatting.

The Dark Ages: When Deep Networks Refused to Train

The Vanishing Gradient Nightmare

Picture this: You’re in 2014, trying to train a 20-layer neural network. You set up your architecture, hit “train,” and watch as your loss… does absolutely nothing. For hours. Sometimes days.

Welcome to the pre-normalization era where training deep networks was more alchemy than science.

flowchart LR
    A["Layer 1<br/>Gradient: 1.0"] --> B["Layer 5<br/>Gradient: 0.1"] 
    B --> C["Layer 10<br/>Gradient: 0.01"]
    C --> D["Layer 15<br/>Gradient: 0.001"]
    D --> E["Layer 20<br/>Gradient: 0.0001 💀"]
    
    style A fill:#4caf50,stroke:#2e7d32,stroke-width:2px,color:#fff
    style B fill:#ff9800,stroke:#f57c00,stroke-width:2px,color:#fff
    style C fill:#ff5722,stroke:#d84315,stroke-width:2px,color:#fff
    style D fill:#795548,stroke:#5d4037,stroke-width:2px,color:#fff
    style E fill:#424242,stroke:#212121,stroke-width:2px,color:#fff
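You can reproduce this decay in a few lines. The following is a hypothetical illustration (not from any of the papers): a 20-layer sigmoid MLP with no normalization, where the gradient norm at the first layer comes out orders of magnitude smaller than at the last.

import torch
import torch.nn as nn

# A 20-layer sigmoid MLP without normalization: watch the gradient
# norm shrink as it propagates back toward the first layer.
torch.manual_seed(0)

layers = []
for _ in range(20):
    layers += [nn.Linear(64, 64), nn.Sigmoid()]
model = nn.Sequential(*layers)

x = torch.randn(32, 64)
loss = model(x).pow(2).mean()
loss.backward()

first, last = model[0], model[-2]   # first and last Linear layers
print(f"last-layer grad norm:  {last.weight.grad.norm().item():.2e}")
print(f"first-layer grad norm: {first.weight.grad.norm().item():.2e}")  # orders of magnitude smaller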

The Internal Covariate Shift Problem

Here’s what was happening: every time the first layer’s weights updated, the input distribution to the second layer changed. This forced layer 2 to constantly readjust. Layer 3 then had to adapt to layer 2’s changes, creating a cascade of instability.

It was like trying to aim at a target while riding a roller coaster in an earthquake.

The Band-Aid Solutions

Before normalization, practitioners relied on:

| Method | What It Did | Why It Failed |
|---|---|---|
| Careful Initialization | Xavier/He initialization | Only helped at the start, not during training |
| Tiny Learning Rates | 0.001 or smaller | Training took forever |
| Gradient Clipping | Manually cap gradient magnitude | Treated symptoms, not the disease |
| Shallow Networks | Stick to 3-5 layers | Limited representational power |
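In practice, a typical pre-normalization recipe combined several of these band-aids. A minimal sketch, assuming a toy classifier (my own illustration, not from any paper):

import torch
import torch.nn as nn

# Pre-normalization band-aids: He initialization, a tiny learning rate,
# and manual gradient clipping.
model = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 10))
for m in model.modules():
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')  # He init
        nn.init.zeros_(m.bias)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # tiny learning rate

def train_step(x, y):
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    # Manually cap the gradient magnitude
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()

loss = train_step(torch.randn(16, 128), torch.randint(0, 10, (16,)))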

Historical Note: ResNet (2015) was partially motivated by the difficulty of training deep networks. Skip connections were a clever workaround, but normalization was the real solution.


BatchNorm: The 2015 Revolution That Changed Everything

Ioffe and Szegedy’s Breakthrough

In 2015, Sergey Ioffe and Christian Szegedy dropped a paper that would transform deep learning forever. Their insight was deceptively simple:

“What if we normalize the inputs to each layer to have zero mean and unit variance?”

The Mathematical Breakthrough

For a mini-batch $\mathcal{B} = \{x_1, x_2, \ldots, x_m\}$:

\[\hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma^2_{\mathcal{B}} + \epsilon}}\]

Where:

  • $\mu_{\mathcal{B}} = \frac{1}{m} \sum_{i=1}^m x_i$ (batch mean)
  • $\sigma^2_{\mathcal{B}} = \frac{1}{m} \sum_{i=1}^m (x_i - \mu_{\mathcal{B}})^2$ (batch variance)

But here’s the genius part—they added learnable parameters:

\[y_i = \gamma \hat{x}_i + \beta\]

This allows the network to undo the normalization if needed. Brilliant!
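As a quick sanity check of the formulas, here is a toy example (my own sketch) that normalizes a batch of four examples with three features and confirms the result has roughly zero mean and unit variance per feature:

import torch

# Toy check of the BatchNorm equations on a batch of m=4 examples
# with 3 features; gamma and beta are the learnable scale and shift.
x = torch.randn(4, 3)
eps = 1e-5
mu = x.mean(dim=0)                      # batch mean, per feature
var = x.var(dim=0, unbiased=False)      # batch variance, per feature
x_hat = (x - mu) / torch.sqrt(var + eps)

gamma, beta = torch.ones(3), torch.zeros(3)   # initialized so y == x_hat
y = gamma * x_hat + beta

print(x_hat.mean(dim=0))                    # ~0 per feature
print(x_hat.std(dim=0, unbiased=False))     # ~1 per feature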

Why This Worked Like Magic

graph LR
    A["Unstable Activations<br/>Mean: random<br/>Variance: random"] --> B["BatchNorm<br/>Normalize → Scale → Shift"]
    B --> C["Stable Activations<br/>Mean: learnable<br/>Variance: learnable"]
    
    style A fill:#ffebee,stroke:#c62828,stroke-width:2px,color:#000
    style B fill:#fff3e0,stroke:#f57c00,stroke-width:2px,color:#000
    style C fill:#e8f5e8,stroke:#2e7d32,stroke-width:2px,color:#000

The immediate benefits:

  1. 10x higher learning rates became possible
  2. Deeper networks (50+ layers) started working
  3. Faster convergence - networks trained in hours, not days
  4. Less initialization sensitivity - networks became robust

The Implementation Reality

import torch
import torch.nn as nn

class BatchNorm2d(nn.Module):
    def __init__(self, num_features, momentum=0.9, eps=1e-5):
        super().__init__()
        # Learnable scale and shift
        self.gamma = nn.Parameter(torch.ones(num_features))
        self.beta = nn.Parameter(torch.zeros(num_features))
        
        # Running statistics (used at inference time)
        self.register_buffer('running_mean', torch.zeros(num_features))
        self.register_buffer('running_var', torch.ones(num_features))
        
        # Note: momentum here weights the OLD running value
        # (running = momentum * running + (1 - momentum) * batch),
        # the opposite convention of nn.BatchNorm2d's momentum argument.
        self.momentum = momentum
        self.eps = eps
    
    def forward(self, x):
        if self.training:
            # Batch statistics: per channel, over (N, H, W)
            batch_mean = x.mean([0, 2, 3])
            batch_var = x.var([0, 2, 3], unbiased=False)
            
            # Update running statistics without tracking gradients
            with torch.no_grad():
                self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * batch_mean
                self.running_var = self.momentum * self.running_var + (1 - self.momentum) * batch_var
            
            x_norm = (x - batch_mean.view(1, -1, 1, 1)) / torch.sqrt(batch_var.view(1, -1, 1, 1) + self.eps)
        else:
            # Use running statistics (deterministic at inference)
            x_norm = (x - self.running_mean.view(1, -1, 1, 1)) / torch.sqrt(self.running_var.view(1, -1, 1, 1) + self.eps)
        
        return self.gamma.view(1, -1, 1, 1) * x_norm + self.beta.view(1, -1, 1, 1)
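If you want to sanity-check the sketch above (my own addition, assuming the BatchNorm2d class is in scope), it should agree with torch.nn.BatchNorm2d in training mode:

import torch
import torch.nn as nn

# Parity check: the hand-rolled module vs PyTorch's built-in BatchNorm2d
torch.manual_seed(0)
x = torch.randn(8, 16, 32, 32)

custom = BatchNorm2d(16)            # defined above
builtin = nn.BatchNorm2d(16, eps=1e-5)

custom.train()
builtin.train()
print(torch.allclose(custom(x), builtin(x), atol=1e-5))  # True: same normalization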

The Dark Side of BatchNorm

But BatchNorm wasn’t perfect. It introduced new problems:

The Batch Size Dependency:

  • Small batches → noisy statistics → poor performance
  • Large batches → accurate statistics but less regularization
  • Sweet spot: 16-32 for most applications

The Train-Test Discrepancy:

  • Training: Uses batch statistics (stochastic)
  • Inference: Uses running averages (deterministic)
  • This gap could cause mysterious performance drops
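You can see the gap directly with a couple of forward passes (a small illustrative sketch using PyTorch defaults):

import torch
import torch.nn as nn

# The same input gives different outputs depending on mode, because
# train() normalizes with batch statistics while eval() uses the
# running averages accumulated so far.
torch.manual_seed(0)
bn = nn.BatchNorm2d(8)
x = torch.randn(4, 8, 16, 16)

bn.train()
out_train = bn(x)          # batch statistics (also updates running stats)

bn.eval()
out_eval = bn(x)           # running statistics

print((out_train - out_eval).abs().max())  # noticeably non-zero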

LayerNorm: The 2016 NLP Game Changer

The RNN Problem

By 2016, attention was shifting to sequence models and RNNs. But BatchNorm had a fatal flaw for NLP:

Variable sequence lengths made batch statistics unreliable. How do you compute statistics across a batch when every sentence has a different length?

Ba, Kiros, and Hinton’s Solution

Jimmy Lei Ba and his colleagues had a different idea:

“Instead of normalizing across the batch dimension, why not normalize across the feature dimension?”

flowchart TB
   subgraph BN ["BatchNorm"]
       A1["Sequence 1: Hello world<br/>2 tokens"]
       A2["Sequence 2: The quick brown fox<br/>4 tokens"]
       A3["Sequence 3: AI is revolutionary technology<br/>4 tokens"]
       A4["❌ Cannot compute meaningful<br/>batch statistics"]
   end
   
   subgraph LN ["LayerNorm"]
       B1["Each sentence normalized<br/>independently"]
       B2["✅ Works with any<br/>sequence length"]
       B3["✅ Works with any<br/>batch size"]
   end
   
   style A4 fill:#ffebee,stroke:#c62828,stroke-width:2px,color:#000
   style B2 fill:#e8f5e8,stroke:#2e7d32,stroke-width:2px,color:#000
   style B3 fill:#e8f5e8,stroke:#2e7d32,stroke-width:2px,color:#000

The Mathematical Shift

For LayerNorm, we normalize within each example:

\[\mu = \frac{1}{H} \sum_{i=1}^H x_i\] \[\sigma = \sqrt{\frac{1}{H} \sum_{i=1}^H (x_i - \mu)^2 + \epsilon}\] \[\hat{x}_i = \frac{x_i - \mu}{\sigma}\] \[y_i = \gamma_i \hat{x}_i + \beta_i\]

Key difference: Statistics computed per example, per layer, not across the batch.
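In code, the per-example statistics look like this (a minimal sketch of the equations above, with the scale and shift left at their defaults, checked against nn.LayerNorm):

import torch
import torch.nn as nn

# LayerNorm by hand: statistics over the last (feature) dimension,
# computed independently for every token of every sequence.
torch.manual_seed(0)
x = torch.randn(2, 5, 16)          # (batch, sequence length, hidden size H)
eps = 1e-5

mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
x_hat = (x - mu) / torch.sqrt(var + eps)   # gamma = 1, beta = 0 here

print(torch.allclose(x_hat, nn.LayerNorm(16)(x), atol=1e-5))  # True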

Why This Enabled Transformers

LayerNorm was absolutely crucial for the Transformer revolution:

import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
        self.norm2 = nn.LayerNorm(d_model)
    
    def forward(self, x):
        # Pre-norm architecture (modern preference): normalize before each sub-layer
        h = self.norm1(x)
        attn_out, _ = self.attention(h, h, h)
        x = x + attn_out  # Residual connection
        
        ff_out = self.feed_forward(self.norm2(x))
        x = x + ff_out    # Residual connection
        return x
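For contrast, the original Transformer applied LayerNorm after each residual addition (post-norm). A rough sketch of that ordering, written against the block above (my own illustration):

def post_norm_forward(block, x):
    # Post-norm ordering: sub-layer -> residual add -> LayerNorm.
    # Pre-norm (above) instead normalizes the input to each sub-layer,
    # which tends to train more stably in very deep stacks.
    attn_out, _ = block.attention(x, x, x)
    x = block.norm1(x + attn_out)
    x = block.norm2(x + block.feed_forward(x))
    return x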

Without LayerNorm:

  • Attention weights would become extreme
  • Deep transformers wouldn’t train
  • No BERT, no GPT, no ChatGPT

RMSNorm: The 2019 Simplification That Shocked Everyone

Zhang and Sennrich’s Radical Question

In 2019, researchers asked a provocative question:

“What if the re-centering step in LayerNorm is unnecessary? What if we only need the re-scaling?”

This seemed crazy. Surely you need both mean and variance normalization, right?

Wrong.

The Minimalist Approach

RMSNorm (Root Mean Square Normalization) strips LayerNorm down to its essence:

\[\text{RMS}(x) = \sqrt{\frac{1}{n} \sum_{i=1}^n x_i^2}\] \[\hat{x}_i = \frac{x_i}{\text{RMS}(x)}\] \[y_i = \gamma_i \hat{x}_i\]

What’s missing?

  • ❌ No mean subtraction ($\mu = 0$)
  • ❌ No bias parameter ($\beta$ removed)
  • ✅ Only RMS normalization and scaling

The Shocking Results

  • Performance: RMSNorm matched or exceeded LayerNorm across tasks
  • Speed: 10-15% faster computation
  • Memory: 50% fewer parameters (no $\beta$)
  • Simplicity: Easier to implement and debug

class RMSNorm(nn.Module):
    def __init__(self, d_model, eps=1e-8):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(d_model))
        self.eps = eps
    
    def forward(self, x):
        # Compute RMS (no mean subtraction!)
        rms = torch.sqrt(torch.mean(x**2, dim=-1, keepdim=True) + self.eps)
        
        # Scale only (no shift!)
        x_norm = x / rms
        return self.scale * x_norm

Why This Works: The Intuition

The insight: in many contexts, the re-centering step adds computation without adding much benefit.

  • Mean information can be valuable - removing it loses signal
  • Scale normalization provides most of the optimization benefits
  • Simpler computation = fewer numerical errors
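One way to see this (an illustrative check, not from the paper): when activations are already roughly zero-mean, LayerNorm and the RMSNorm module above produce nearly identical outputs, so dropping the centering step costs very little.

import torch
import torch.nn as nn

# Compare LayerNorm with the RMSNorm module above on zero-mean vs
# shifted inputs: when the mean is ~0 the outputs nearly coincide.
torch.manual_seed(0)
ln, rms = nn.LayerNorm(64), RMSNorm(64)

x_centered = torch.randn(8, 64)      # per-row mean close to 0
x_shifted = x_centered + 3.0         # large mean

print((ln(x_centered) - rms(x_centered)).abs().max())  # small
print((ln(x_shifted) - rms(x_shifted)).abs().max())    # much larger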

Modern Adoption

RMSNorm has been adopted by major models:

  • LLaMA: Meta’s large language models
  • PaLM: Google’s 540B parameter model
  • GPT-NeoX: EleutherAI’s models
  • T5: Recent variants use RMSNorm

The Normalization Dimension Wars

Understanding What We’re Normalizing

The dimension choice fundamentally changes what normalization does:

graph TB
    subgraph "Input Tensor [B, C, H, W]"
        A["Batch (B)<br/>Normalize across examples"]
        B["Channel (C)<br/>Normalize across features"] 
        C["Spatial (H,W)<br/>Normalize across positions"]
    end
    
    A --> D["BatchNorm<br/>Good for CNNs"]
    B --> E["LayerNorm<br/>Good for Transformers"] 
    C --> F["InstanceNorm<br/>Good for Style Transfer"]
    
    style D fill:#e3f2fd,stroke:#1976d2,stroke-width:2px,color:#000
    style E fill:#e8f5e8,stroke:#2e7d32,stroke-width:2px,color:#000
    style F fill:#fff3e0,stroke:#f57c00,stroke-width:2px,color:#000

The Great Comparison

| Normalization | Normalizes Across | Best For | Key Benefit |
|---|---|---|---|
| BatchNorm | Batch + Spatial | CNNs, Computer Vision | Training stability, higher LR |
| LayerNorm | Features | Transformers, NLP | Batch-size independence |
| InstanceNorm | Spatial (per channel) | Style transfer, GANs | Per-image statistics |
| GroupNorm | Channel groups + Spatial | Small batches, Object detection | Batch-size robustness |
| RMSNorm | Features (no centering) | Large language models | Computational efficiency |
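To make the dimension differences concrete, here is a small sketch applying PyTorch's built-in variants to the same [B, C, H, W] tensor (my own illustration; the group count of 8 is an arbitrary choice):

import torch
import torch.nn as nn

# Same activation tensor, four different normalization axes
x = torch.randn(16, 32, 28, 28)   # [B, C, H, W]

batch_norm = nn.BatchNorm2d(32)            # stats over (B, H, W), per channel
layer_norm = nn.LayerNorm([32, 28, 28])    # stats over (C, H, W), per example
instance_norm = nn.InstanceNorm2d(32)      # stats over (H, W), per example and channel
group_norm = nn.GroupNorm(8, 32)           # stats over channel groups + (H, W), per example

for norm in (batch_norm, layer_norm, instance_norm, group_norm):
    print(type(norm).__name__, norm(x).shape)   # shape unchanged; statistics differ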

The Performance Impact: Numbers That Matter

Training Speed Improvements

Before vs After Normalization:

| Network Depth | Without Normalization | With BatchNorm | Speedup |
|---|---|---|---|
| 10 layers | 24 hours | 4 hours | 6x faster |
| 50 layers | Doesn't converge | 8 hours | ∞ → finite |
| 152 layers | Impossible | 12 hours | Enables deep nets |

Common Pitfalls and How to Avoid Them

BatchNorm Gotchas

❌ Mistake 1: Wrong Eval Mode

# WRONG: Forget to switch modes
model.train()  # BatchNorm uses batch stats during inference
accuracy = evaluate(model, test_loader)  # Wrong results!

# CORRECT: Proper mode switching
model.eval()   # BatchNorm uses running stats during inference
accuracy = evaluate(model, test_loader)

❌ Mistake 2: Small Batch Catastrophe

import torch.nn as nn
import torchvision

# WRONG: BatchNorm with tiny batches
batch_size = 2  # batch statistics will be extremely noisy
model = torchvision.models.resnet50()  # uses BatchNorm by default

# CORRECT: swap in GroupNorm for small batches
# (torchvision calls norm_layer with the channel count, so bind the group count first)
if batch_size < 8:
    model = torchvision.models.resnet50(
        norm_layer=lambda channels: nn.GroupNorm(8, channels))

❌ Mistake 3: Forgetting Running Stats

# WRONG: Not letting running stats converge
for epoch in range(5):  # Too few epochs
    train(model)
# Running statistics haven't stabilized!

# CORRECT: Sufficient training
for epoch in range(50):  # Let stats converge
    train(model)

The Future: What’s Next for Normalization?

  • Normalization-Free Networks: NFNets achieve performance comparable to normalized networks without any normalization layers
  • Learnable Normalization
  • Hardware-Aware Normalization

TPU-optimized and mobile-optimized normalization techniques are emerging:

  • Quantized normalization: Lower precision for mobile deployment
  • Fused operations: Combine normalization with other operations
  • Sparse normalization: Skip normalization for specific patterns

The Decision Framework: Choosing Your Normalization

flowchart TD
    A["What's your domain?"] --> B["Computer Vision"]
    A --> C["NLP/Sequences"]
    A --> D["Audio/Time Series"]
    
    B --> E["Batch size always > 4?"]
    E -->|Yes| F["BatchNorm<br/>✅ Standard choice"]
    E -->|No| G["GroupNorm<br/>✅ Small batch robust"]
    
    C --> H["Need maximum efficiency?"]
    H -->|Yes| I["RMSNorm<br/>✅ 15% faster"]
    H -->|No| J["LayerNorm<br/>✅ Well-tested"]
    
    D --> K["Variable length sequences?"]
    K -->|Yes| L["LayerNorm<br/>✅ Length independent"]
    K -->|No| M["Consider BatchNorm<br/>if fixed-length"]
    
    style F fill:#e3f2fd,stroke:#1976d2,stroke-width:2px,color:#000
    style G fill:#fff3e0,stroke:#f57c00,stroke-width:2px,color:#000
    style I fill:#e8f5e8,stroke:#2e7d32,stroke-width:2px,color:#000
    style J fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#000
    style L fill:#fff3e0,stroke:#f57c00,stroke-width:2px,color:#000
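If you prefer code to flowcharts, the same heuristic can be written as a tiny helper (my own sketch of the decision logic above, not an established API):

def choose_norm(domain: str, batch_size: int = 32,
                need_max_efficiency: bool = False,
                variable_length: bool = True) -> str:
    # A rough encoding of the decision flowchart above.
    if domain == "vision":
        return "BatchNorm" if batch_size > 4 else "GroupNorm"
    if domain == "nlp":
        return "RMSNorm" if need_max_efficiency else "LayerNorm"
    if domain == "audio":
        return "LayerNorm" if variable_length else "BatchNorm"
    return "LayerNorm"  # sensible default

print(choose_norm("vision", batch_size=2))            # GroupNorm
print(choose_norm("nlp", need_max_efficiency=True))   # RMSNorm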

Conclusion: The Normalization Revolution

From the dark ages of careful initialization and tiny learning rates to today’s effortless training of 100+ layer networks, normalization techniques have fundamentally transformed deep learning.

The Evolution Timeline:

  • 2015: BatchNorm revolutionized computer vision
  • 2016: LayerNorm enabled the Transformer revolution
  • 2019: RMSNorm proved simpler can be better
  • 2024: Normalization-free networks emerging

Key Takeaways:

🎯 Domain Matters: CV → BatchNorm, NLP → LayerNorm/RMSNorm
⚡ Efficiency Counts: RMSNorm’s simplicity wins at scale
🔧 Placement Affects Training: Pre-norm usually beats post-norm
📊 Batch Size Influences Choice: Small batches need GroupNorm
🚀 Future is Adaptive: Networks learning their own normalization

The Bigger Picture:

Normalization didn’t just solve training instability—it democratized deep learning. Before BatchNorm, training deep networks required expertise, intuition, and luck. After normalization, anyone could train stable, deep networks with standard recipes.

What started as a solution to internal covariate shift became the foundation for:

  • ResNets enabling 1000+ layer networks
  • Transformers powering the AI revolution
  • Large language models with billions of parameters
  • Stable training at unprecedented scales

The next time you effortlessly train a deep network, remember: you’re standing on the shoulders of normalization giants who turned the art of deep learning into a reliable science.


Further Reading & References

Foundational Papers

  • Ioffe, S. & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.
  • Ba, J. L., Kiros, J. R. & Hinton, G. E. (2016). Layer Normalization.
  • Zhang, B. & Sennrich, R. (2019). Root Mean Square Layer Normalization.

Happy regularizing :)


This post is licensed under CC BY 4.0 by the author.