Transformer Architecture
The full stack of components that make up a transformer model: multi-head self-attention, feed-forward networks, layer normalization, residual connections, and positional encodings.
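How these components fit together can be sketched in a few lines of NumPy. This is an illustrative post-norm block in the style of the 2017 paper, with single-head attention and invented helper names; it is a sketch, not a production implementation:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each token's features to zero mean, unit variance
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def attention(X, Wq, Wk, Wv):
    # scaled dot-product self-attention (single head, for brevity)
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)          # softmax over keys
    return w @ V

def transformer_block(X, attn_w, ffn_w):
    # sub-layer 1: self-attention, wrapped in a residual connection and layer norm
    X = layer_norm(X + attention(X, *attn_w))
    # sub-layer 2: position-wise feed-forward network (ReLU), residual + norm
    W1, W2 = ffn_w
    return layer_norm(X + np.maximum(0.0, X @ W1) @ W2)
```

Stacking N copies of `transformer_block` (each with its own weights) on top of embedded, position-encoded tokens gives the encoder side of the architecture.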
Why It Matters
The transformer architecture is the foundation of virtually all modern large language models. Understanding its components is essential for anyone working with LLMs, whether training, fine-tuning, or interpreting them.
Example
The original 'Attention Is All You Need' paper described an encoder-decoder transformer with 6 layers each, 8 attention heads, and 512-dimensional embeddings.
Think of it like...
Like the blueprint of a skyscraper showing all the structural elements — steel beams (attention), floors (layers), elevators (skip connections), and the foundation (embeddings).
Related Terms
Transformer
A neural network architecture introduced in 2017 that uses self-attention mechanisms to process sequential data in parallel rather than sequentially. Transformers are the foundation of modern LLMs like GPT, Claude, and Gemini.
Self-Attention
A mechanism where each element in a sequence attends to all other elements to compute a representation, determining how much focus to place on each part of the input. It is the core innovation of the transformer.
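As a concrete sketch, scaled dot-product self-attention fits in a dozen lines of NumPy (the function name and shapes here are illustrative, not from any particular library):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # project each token's embedding into query, key, and value vectors
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # compare each query against every key; scaling by sqrt(d_k) keeps scores stable
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)   # softmax: each token's weights sum to 1
    # each output is a weighted mix of all tokens' value vectors
    return w @ V
```

Because the softmax weights for each token sum to 1, every output row is a convex combination of value vectors from all positions in the sequence, which is what "attending to all other elements" means in practice.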
Multi-Head Attention
An extension of attention where multiple attention mechanisms (heads) run in parallel, each learning to focus on different types of relationships in the data. The outputs are then combined.
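A hedged NumPy sketch of the split-attend-concatenate pattern (names are illustrative; real implementations batch this differently):

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    seq, d_model = X.shape
    d_head = d_model // n_heads
    # project, then split the model dimension into n_heads smaller subspaces
    split = lambda W: (X @ W).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split(Wq), split(Wk), split(Wv)
    # attention runs independently inside each head's subspace
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    heads = w @ V                                    # (n_heads, seq, d_head)
    # concatenate the heads back together and apply the output projection
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model)
    return concat @ Wo
```

In the original paper's configuration this would be d_model = 512 split across 8 heads of 64 dimensions each.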
Residual Connection
A shortcut that allows the input to a layer to bypass one or more layers and be added directly to the output. This enables training of much deeper networks by ensuring gradient flow.
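In code, a residual connection is nothing more than an addition around the sub-layer (a minimal illustrative sketch):

```python
def residual(x, sublayer):
    # the input skips around the sub-layer and is added to its output;
    # during backpropagation the gradient flows through the identity
    # path unchanged, which is why very deep stacks remain trainable
    return x + sublayer(x)

# e.g. for any transformation f, the wrapped output is x + f(x)
```

In a transformer, `sublayer` would be the attention or feed-forward computation, so each layer only has to learn a correction to its input rather than a full transformation.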
Layer Normalization
A normalization technique that normalizes the inputs across the features for each individual example (rather than across the batch). It stabilizes training in transformers and RNNs.
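A minimal NumPy sketch of the per-example normalization (gamma and beta stand in for the learned scale and shift parameters):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # statistics are computed per example over the feature axis,
    # so the result does not depend on other examples in a batch
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance per example
    return gamma * x_hat + beta               # learned scale and shift
```

This batch-independence is what distinguishes it from batch normalization and makes it well suited to variable-length sequences.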