Distributed Training
Splitting model training across multiple GPUs or machines to handle larger models or datasets and reduce training time. Techniques include data parallelism and model parallelism.
Why It Matters
Distributed training is how frontier models are built — no single GPU can train a model like GPT-4. It requires orchestrating thousands of GPUs working in concert.
Example
Training a 70B-parameter model across 256 GPUs, combining data parallelism (each GPU processes different batches) with model parallelism (each GPU holds part of the model).
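The numbers in a setup like this multiply together: the 256 GPUs are factored into a model-parallel dimension and a data-parallel dimension. A minimal sizing sketch, with illustrative (assumed) degrees:

```python
# Hypothetical sizing sketch: how 256 GPUs might be partitioned into
# model-parallel shards and data-parallel replicas. All numbers are
# illustrative assumptions, not a prescription.
total_gpus = 256
model_parallel_degree = 8                                   # each model copy is sharded over 8 GPUs
data_parallel_degree = total_gpus // model_parallel_degree  # 32 independent model copies

per_replica_batch = 4                                       # micro-batch per model copy (assumed)
global_batch = per_replica_batch * data_parallel_degree

print(data_parallel_degree)  # 32
print(global_batch)          # 128
```

Each of the 32 model copies processes its own slice of the global batch, while the 8 GPUs inside a copy cooperate on a single forward/backward pass.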
Think of it like...
Like building a skyscraper with multiple construction crews working on different floors simultaneously — much faster than one crew doing everything sequentially.
Related Terms
GPU
Graphics Processing Unit — originally designed for rendering graphics, GPUs excel at the parallel mathematical operations needed for training and running AI models. They are the primary hardware for modern AI.
Compute
The computational resources (processing power, memory, time) required to train or run AI models. Compute is measured in FLOPs (floating-point operations) and is a primary constraint and cost in AI development.
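A widely used rule of thumb estimates training compute as roughly 6 FLOPs per parameter per training token (C ≈ 6·N·D). A back-of-the-envelope sketch, with an assumed token count:

```python
# Rough training-compute estimate using the common C ≈ 6 * N * D
# approximation (N = parameters, D = training tokens).
# The token count here is an assumption for illustration.
params = 70e9       # 70B-parameter model
tokens = 1.4e12     # assumed 1.4T training tokens
flops = 6 * params * tokens

print(f"{flops:.2e}")  # 5.88e+23
```

Estimates like this are what make compute the dominant line item in large training runs.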
Data Parallelism
A distributed training approach where the training data is split across multiple GPUs, each holding a complete copy of the model. Gradients are averaged across GPUs after each batch so every copy stays in sync.
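The key property of data parallelism is that averaging per-shard gradients recovers the full-batch gradient (for a mean-reduced loss). A minimal sketch simulating four "GPUs" with NumPy, using a simple least-squares model as a stand-in:

```python
import numpy as np

# Data parallelism sketch: each simulated "GPU" holds a full copy of
# the weights and computes the gradient on its own shard of the batch.
# Averaging the shard gradients matches the full-batch gradient.
rng = np.random.default_rng(0)
w = rng.normal(size=3)            # shared model weights (replicated)
X = rng.normal(size=(8, 3))       # full batch of 8 examples
y = rng.normal(size=8)

def grad(w, X, y):
    # Gradient of the mean squared error 0.5 * mean((Xw - y)^2)
    return X.T @ (X @ w - y) / len(y)

shards = np.split(np.arange(8), 4)  # 4 simulated GPUs, 2 examples each
shard_grads = [grad(w, X[idx], y[idx]) for idx in shards]
avg_grad = np.mean(shard_grads, axis=0)   # the "all-reduce" step

print(np.allclose(avg_grad, grad(w, X, y)))  # True
```

In real systems the averaging is done with a collective all-reduce across GPUs rather than an in-process mean.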
Model Parallelism
A distributed training approach where the model itself is split across multiple GPUs, with each GPU holding and computing a different portion of the model.
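In the simplest form (pipeline-style splitting by layer), each device holds only its own layer's weights and passes activations to the next device. A toy sketch with two simulated "devices", where the layer shapes are assumptions for illustration:

```python
import numpy as np

# Model parallelism sketch: a 2-layer MLP split across two simulated
# devices. Each "device" (a plain dict here) holds only its own layer;
# the activation is handed from device 0 to device 1.
rng = np.random.default_rng(1)
device0 = {"W": rng.normal(size=(4, 8))}   # layer 1 lives on "GPU 0"
device1 = {"W": rng.normal(size=(8, 2))}   # layer 2 lives on "GPU 1"

def forward(x):
    h = np.maximum(x @ device0["W"], 0)    # computed on device 0 (ReLU)
    return h @ device1["W"]                # activation "sent" to device 1

out = forward(rng.normal(size=(3, 4)))     # batch of 3 inputs
print(out.shape)  # (3, 2)
```

Real implementations also split individual layers across devices (tensor parallelism) and overlap the stages (pipeline parallelism) to keep all GPUs busy.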