Attention Window
The range of tokens that an attention mechanism can attend to in a single computation. Different attention patterns (local, global, sliding) use different window sizes.
Why It Matters
Attention window design determines the tradeoff between context length and computational efficiency — a key architectural decision for long-context models.
Example
With sliding-window attention and a window of 4,096 tokens, each token attends only to the 4,096 tokens nearest to it, while designated global-attention tokens can still provide coverage of the full sequence.
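In a causal decoder, a sliding window is typically implemented as a boolean attention mask. A minimal sketch, assuming NumPy (the function name and window size are illustrative, not from any specific model):

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    # mask[i, j] is True when token i may attend to token j:
    # j must be at or before i (causal) and within `window` positions.
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (i - j < window)

mask = sliding_window_mask(8, 3)
# Token 5 may attend only to tokens 3, 4, and 5:
print(mask[5].astype(int))  # [0 0 0 1 1 1 0 0]
```

Each row of the mask contains at most `window` True entries, so the attention computation scales with sequence length times window size rather than with the square of the sequence length.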
Think of it like...
Like the field of vision when driving — you focus on what is nearby (local attention) while periodically checking mirrors for the broader picture (global attention).
Related Terms
Attention Mechanism
A component in neural networks that allows the model to focus on the most relevant parts of the input when producing each part of the output. It assigns different weights to different input elements based on their relevance.
Context Window
The maximum amount of text (measured in tokens) that a language model can process in a single interaction. It includes both the input prompt and the generated output. Larger context windows allow models to handle longer documents.
Sparse Attention
A variant of attention where each token only attends to a subset of other tokens rather than all of them, reducing computational cost from O(n²) to O(n√n) or O(n log n).
Flash Attention
An optimized implementation of the attention mechanism that reduces memory usage and increases speed by tiling the computation and avoiding materializing the full attention matrix in memory.
Transformer
A neural network architecture introduced in 2017 that uses self-attention mechanisms to process sequential data in parallel rather than sequentially. Transformers are the foundation of modern LLMs like GPT, Claude, and Gemini.
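The sparse local-plus-global pattern described above (a sliding window combined with a few globally attending tokens, in the style of Longformer-like models) can be sketched as a single mask. A hedged illustration, assuming NumPy; the function name and parameters are hypothetical:

```python
import numpy as np

def local_global_mask(seq_len, window, global_idx):
    # Sparse attention pattern: a local band around the diagonal
    # plus a few tokens that attend (and are attended to) globally.
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    mask = np.abs(i - j) < window       # local sliding-window band
    mask[global_idx, :] = True          # global tokens attend to everything
    mask[:, global_idx] = True          # every token attends to global tokens
    return mask

m = local_global_mask(8, 2, [0])       # window of 2, token 0 is global
```

Because each row holds only the local band plus the global tokens, the cost per row is constant in sequence length, giving the roughly O(n·w) scaling that sparse attention trades against full O(n²) attention.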