Machine Learning

Distributed Training

Splitting model training across multiple GPUs or machines to handle larger models or datasets and reduce training time. Techniques include data parallelism and model parallelism.
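A minimal sketch of the data-parallel half, using PyTorch's DistributedDataParallel; the model, batch size, and learning rate here are placeholders, and the script assumes a launch via torchrun with NCCL available.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # A tiny stand-in model; in practice this is the full network.
    model = nn.Linear(1024, 1024).cuda(local_rank)
    # DDP keeps a full replica on each GPU and all-reduces gradients.
    model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = nn.MSELoss()

    for step in range(10):
        # Each rank draws a different batch: this is data parallelism.
        x = torch.randn(32, 1024, device=local_rank)
        y = torch.randn(32, 1024, device=local_rank)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()  # gradients are synchronized across ranks here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with, for example, `torchrun --nproc_per_node=8 train_ddp.py`, each of the 8 processes trains on its own slice of the data while DDP keeps the replicas in sync.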

Why It Matters

Distributed training is how frontier models are built: no single GPU has the memory or compute to train a model like GPT-4. Training at that scale means orchestrating thousands of GPUs working in concert.

Example

Training a 70B-parameter model across 256 GPUs using data parallelism (each GPU processes different batches) and model parallelism (each GPU holds part of the model); a toy version of such a split is sketched below.
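A toy sketch of the model-parallel half: the network is split across two GPUs so neither holds the whole model. The layer sizes and the two-stage split are illustrative only; real systems shard far more finely and overlap communication with compute.

```python
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    """Naive model parallelism: each GPU holds one stage of the network.

    Requires a machine with at least two CUDA devices.
    """
    def __init__(self):
        super().__init__()
        # The first stage lives on GPU 0 ...
        self.stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        # ... and the second on GPU 1, so the parameters are partitioned.
        self.stage1 = nn.Linear(4096, 1024).to("cuda:1")

    def forward(self, x):
        x = self.stage0(x.to("cuda:0"))
        # Activations are copied between devices at the split point.
        return self.stage1(x.to("cuda:1"))

model = TwoGPUModel()
out = model(torch.randn(32, 1024))
print(out.device)  # cuda:1
```

The cost of this approach is the activation transfer at every split point, which is why large-scale systems combine it with data parallelism rather than relying on it alone.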

Think of it like...

Like building a skyscraper with multiple construction crews working on different floors simultaneously — much faster than one crew doing everything sequentially.

Related Terms