Layer Normalization
A technique that normalizes a layer's inputs across the feature dimension for each individual example (rather than across the batch), then applies a learned per-feature scale and shift. It stabilizes training in transformers and RNNs.
Why It Matters
Layer normalization is a critical component in transformers. Unlike batch normalization, which depends on statistics computed over the mini-batch, it uses only the current example's own features, so it behaves identically during training and inference, simplifying deployment.
Example
After each transformer layer computes its output, layer normalization adjusts the values to have zero mean and unit variance across the feature dimension.
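The operation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation; the optional `gamma` and `beta` arguments stand in for the learned scale and shift parameters, and `eps` is the usual small constant added for numerical stability.

```python
import numpy as np

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize each example across its feature dimension (last axis)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    if gamma is not None:          # learned per-feature scale
        x_hat = x_hat * gamma
    if beta is not None:           # learned per-feature shift
        x_hat = x_hat + beta
    return x_hat

x = np.array([[1.0, 2.0, 3.0],
              [10.0, 20.0, 30.0]])
out = layer_norm(x)
# Each row now has (approximately) zero mean and unit variance,
# regardless of the other rows in the batch.
```

Note that each row is normalized using only its own statistics, which is why the result for one example does not change when the batch around it changes.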
Think of it like...
Like a sound engineer who normalizes audio levels for each track independently, ensuring consistent volume regardless of the source.
Related Terms
Batch Normalization
A technique that normalizes the inputs to each layer in a neural network by adjusting and scaling each feature, computed across the mini-batch, to have zero mean and unit variance. This stabilizes and accelerates the training process.
Transformer
A neural network architecture introduced in 2017 that uses self-attention mechanisms to process sequential data in parallel rather than sequentially. Transformers are the foundation of modern LLMs like GPT, Claude, and Gemini.
Residual Connection
A shortcut that allows the input to a layer to bypass one or more layers and be added directly to the output. This enables training of much deeper networks by ensuring gradient flow.
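Layer normalization and residual connections typically appear together in a transformer sublayer. A minimal sketch of the common "pre-norm" arrangement, where the input is normalized before the sublayer function and then added back (the sublayer function `f` here is a hypothetical stand-in for attention or a feed-forward network):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Zero-mean, unit-variance normalization over the feature axis."""
    mean = x.mean(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def sublayer(x, f):
    """Pre-norm residual block: x + f(LayerNorm(x))."""
    return x + f(layer_norm(x))

x = np.array([[1.0, 2.0, 3.0]])
# With a zero sublayer, the residual path carries the input through unchanged,
# illustrating how the shortcut preserves gradient flow.
out = sublayer(x, lambda h: 0.0 * h)
```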