Flash Attention
An I/O-aware, exact implementation of the attention mechanism that reduces memory usage and increases speed by tiling the computation so the full attention matrix is never materialized in memory.
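The tiling idea can be sketched in a few lines of NumPy. This is a minimal illustration of the "online softmax" trick at the heart of Flash Attention, not the real kernel (which also tiles over queries and runs in GPU SRAM): K and V are processed in blocks, so only a small score tile exists at any moment, yet the result matches standard attention exactly.

```python
import numpy as np

def naive_attention(Q, K, V):
    # Materializes the full n x n attention matrix.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def flash_attention(Q, K, V, block=4):
    # Tiled attention with an online softmax: only an (n, block)
    # score tile is ever in memory, never the full (n, n) matrix.
    n, d = Q.shape
    O = np.zeros((n, V.shape[-1]))
    m = np.full(n, -np.inf)   # running row-wise max of scores
    l = np.zeros(n)           # running softmax denominator
    for j in range(0, n, block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = Q @ Kj.T / np.sqrt(d)             # small score tile
        m_new = np.maximum(m, S.max(axis=-1))
        scale = np.exp(m - m_new)             # rescale old state
        P = np.exp(S - m_new[:, None])
        l = l * scale + P.sum(axis=-1)
        O = O * scale[:, None] + P @ Vj
        m = m_new
    return O / l[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 16)) for _ in range(3))
print(np.allclose(flash_attention(Q, K, V), naive_attention(Q, K, V)))  # True
```

The key design point is the running max and denominator: each new block rescales the partial output, so softmax normalization can be finished incrementally without ever holding all scores at once.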
Why It Matters
Flash Attention made long-context models practical by cutting attention's memory requirement from O(n²) to O(n) in sequence length and speeding up training by roughly 2-4x. It is now standard in LLM training.
Example
Training a model with a 128K-token context window could require on the order of 1 TB of memory for attention with a standard implementation, but only around 16 GB with Flash Attention.
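The 1 TB figure can be sanity-checked with back-of-envelope arithmetic. The head count below is an illustrative assumption (not stated in the example above): one fp16 score matrix at 128K tokens is 32 GiB, so a 32-head layer would materialize about 1 TiB of scores.

```python
n = 128 * 1024            # 128K-token context
bytes_fp16 = 2            # fp16 element size

# One n x n attention score matrix in fp16:
per_head = n * n * bytes_fp16
print(per_head / 2**30)   # 32.0 GiB per head

# With e.g. 32 heads (illustrative assumption), a single layer's
# score matrices alone reach 1 TiB:
print(32 * per_head / 2**40)  # 1.0 TiB
```

Flash Attention never allocates these matrices; its extra memory is linear in n (the running statistics and output), which is where the orders-of-magnitude saving comes from.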
Think of it like...
Like a smart chef who prepares ingredients in small batches instead of laying everything out at once — same final dish, but uses far less counter space.
Related Terms
Attention Mechanism
A component in neural networks that allows the model to focus on the most relevant parts of the input when producing each part of the output. It assigns different weights to different input elements based on their relevance.
Context Window
The maximum amount of text (measured in tokens) that a language model can process in a single interaction. It includes both the input prompt and the generated output. Larger context windows allow models to handle longer documents.
Transformer
A neural network architecture introduced in 2017 that uses self-attention mechanisms to process sequential data in parallel rather than sequentially. Transformers are the foundation of modern LLMs like GPT, Claude, and Gemini.