Machine Learning

Data Parallelism

A distributed training approach where the training data is split across multiple GPUs, each holding a complete copy of the model. After each batch, gradients are averaged (all-reduced) across GPUs so that every replica applies the same update and the model copies stay in sync.

Why It Matters

Data parallelism is the simplest and most common way to speed up training. Throughput scales close to linearly with the number of GPUs, so 8 GPUs can train roughly 8x faster, though the communication cost of synchronizing gradients erodes this speedup at larger scales.

Example

Splitting a batch of 256 images across 8 GPUs (32 per GPU), each computing gradients on its shard independently, then averaging the gradients so every replica's copy of the model receives the identical update.
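The split-and-average step above can be sketched in plain Python. This is a toy simulation, not a real distributed setup: each "GPU" is just a shard of the batch, the model is a single weight `w` for `y = w * x`, and the function names (`gradient`, `data_parallel_step`) are illustrative. Real frameworks such as PyTorch's DistributedDataParallel perform the same averaging with an all-reduce across devices.

```python
def gradient(w, shard):
    """Mean gradient of the squared error (w*x - y)**2 over one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, batch, num_workers, lr):
    # Split the batch into equal shards, one per "GPU" (256 / 8 = 32).
    shard_size = len(batch) // num_workers
    shards = [batch[i * shard_size:(i + 1) * shard_size]
              for i in range(num_workers)]
    # Each worker computes its gradient independently on its shard.
    grads = [gradient(w, s) for s in shards]
    # The "all-reduce": average the per-worker gradients.
    avg_grad = sum(grads) / num_workers
    # Every replica applies the same update, keeping the copies in sync.
    return w - lr * avg_grad

# 256 examples drawn from y = 3x, split across 8 workers (32 each).
batch = [(x, 3.0 * x) for x in range(1, 257)]
w = data_parallel_step(0.0, batch, num_workers=8, lr=1e-5)
```

Because the shards are equal-sized, the average of the per-shard gradients equals the gradient over the full batch, which is why data parallelism reproduces ordinary large-batch training.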

Think of it like...

Like a teacher giving different practice problems to each student but using the same textbook — everyone learns from different data but contributes to the same understanding.

Related Terms