Gradient Accumulation
A technique that simulates larger batch sizes by accumulating gradients over multiple forward-and-backward passes before performing a single weight update. This enables large effective batch sizes on memory-limited hardware.
Why It Matters
Gradient accumulation lets you train with large effective batch sizes on a single GPU, approximating the training dynamics of expensive multi-GPU setups.
Example
Running 8 forward-and-backward passes with batch size 32 and summing the gradients before a single update gives an effective batch size of 256 on hardware that can only fit 32 examples at a time.
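The mechanic above can be sketched in a few lines of plain Python: split a batch into micro-batches, accumulate a scaled gradient from each, then apply one update. The function and variable names here are illustrative, not from any particular framework, and the toy one-parameter model stands in for a real network.

```python
# Minimal sketch of gradient accumulation for a one-parameter model
# trained with squared loss. Names are illustrative assumptions.

def grad(w, x, y):
    """Gradient of 0.5 * (w*x - y)^2 with respect to w."""
    return (w * x - y) * x

def accumulate_and_step(w, data, micro_batch_size, lr):
    """Accumulate gradients over micro-batches, then do ONE update."""
    accum = 0.0
    for i in range(0, len(data), micro_batch_size):
        micro = data[i:i + micro_batch_size]
        # Average gradient for this micro-batch (one forward/backward pass).
        g = sum(grad(w, x, y) for x, y in micro) / len(micro)
        # Scale so the running sum equals the full-batch average gradient.
        accum += g * len(micro) / len(data)
    return w - lr * accum  # single optimizer step

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w_accum = accumulate_and_step(0.0, data, micro_batch_size=2, lr=0.1)

# Full-batch update for comparison:
w_full = 0.0 - 0.1 * sum(grad(0.0, x, y) for x, y in data) / len(data)
print(abs(w_accum - w_full) < 1e-12)  # prints True
```

Because the scaled micro-batch gradients sum to the full-batch average gradient, the accumulated update is numerically identical to the one a single large batch would produce.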
Think of it like...
Like filling a swimming pool with a garden hose — each hose-fill is small, but by accumulating many fills, you achieve the same result as a fire hose.
Related Terms
Batch Size
The number of training examples processed together before the model updates its parameters. Batch size affects training speed, memory usage, and how smoothly the model learns.
Gradient Descent
An optimization algorithm that minimizes a model's error (loss) by iteratively adjusting parameters in the direction opposite the gradient, the direction of steepest loss decrease. It is the foundation of most methods for training machine learning models.
Distributed Training
Splitting model training across multiple GPUs or machines to handle larger models or datasets and reduce training time. Techniques include data parallelism and model parallelism.