Compute-Optimal Training
Allocating a fixed compute budget optimally between model size and training data quantity, based on scaling law research like the Chinchilla findings.
Why It Matters
Compute-optimal training prevents wasting millions on undertrained large models or overtrained small ones — getting the most capability per dollar.
Example
Given a $10M compute budget, determining whether to train a 70B model on 260B tokens or a 30B model on 600B tokens — Chinchilla's rule of thumb of roughly 20 training tokens per parameter says the latter wins.
Think of it like...
Like optimizing a fixed travel budget between flight quality and hotel quality — the best trip comes from balancing both, not spending everything on one.
Related Terms
Chinchilla Scaling
Research by DeepMind showing that many LLMs were significantly undertrained — for a given compute budget, training a smaller model on more data yields better performance.
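As an illustrative sketch (not the paper's exact method), the rule of thumb can be solved in closed form. It assumes two approximations: dense-transformer training compute is C ≈ 6ND, and the compute-optimal ratio is about 20 tokens per parameter — both rough heuristics, not exact constants.

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute for a dense transformer: C ≈ 6 * N * D."""
    return 6 * n_params * n_tokens

def chinchilla_split(flops_budget: float, tokens_per_param: float = 20.0):
    """Solve C = 6 * N * D with D = tokens_per_param * N for N and D.

    tokens_per_param ≈ 20 is the Chinchilla rule of thumb, not a law.
    """
    n = (flops_budget / (6 * tokens_per_param)) ** 0.5
    return n, tokens_per_param * n

# A ~1.1e23 FLOP budget (what training 70B params on 260B tokens would cost)
budget = train_flops(70e9, 260e9)
n_opt, d_opt = chinchilla_split(budget)
print(f"budget    ≈ {budget:.2e} FLOPs")
print(f"optimal N ≈ {n_opt / 1e9:.0f}B parameters")
print(f"optimal D ≈ {d_opt / 1e9:.0f}B tokens")
```

For this budget the solver lands near 30B parameters and 600B tokens — the same compute spent on a smaller model trained on more data.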
Scaling Laws
Empirical findings showing predictable relationships between model performance and factors like model size (parameters), dataset size, and compute budget. Loss falls off as a power law in each of these factors.
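The Chinchilla paper made this power-law relationship concrete with a parametric loss fit of the form L(N, D) = E + A/N^α + B/D^β. A minimal sketch follows; the constants are the published Hoffmann et al. (2022) fits, quoted here as indicative values rather than exact figures:

```python
def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Fitted power-law loss: L(N, D) = E + A / N**alpha + B / D**beta.

    Constants are the Hoffmann et al. (2022) parametric fit;
    treat them as indicative, not exact.
    """
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    return E + A / n_params**alpha + B / n_tokens**beta

# Scaling up either parameters or data lowers predicted loss, with
# diminishing returns as each power-law term flattens out.
print(chinchilla_loss(10e9, 200e9))   # 10B params on 200B tokens
print(chinchilla_loss(70e9, 1.4e12))  # Chinchilla-scale: 70B on 1.4T tokens
```

The E term is the irreducible loss floor: no amount of parameters or data pushes predicted loss below it, which is why both scaling terms show diminishing returns.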
Compute
The computational resources (processing power, memory, time) required to train or run AI models. Compute is measured in FLOPs (floating-point operations) and is a primary constraint and cost in AI development.
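For a rough sense of scale, a common back-of-envelope estimate for training FLOPs is C ≈ 6 × parameters × tokens (an approximation for dense transformers that ignores attention overhead and hardware utilization). Applied to a GPT-3-sized run:

```python
# Back-of-envelope training compute via C ≈ 6 * N * D — an approximation
# for dense transformers, not a measured figure.
n_params = 175e9   # GPT-3-scale parameter count
n_tokens = 300e9   # GPT-3-scale training token count
flops = 6 * n_params * n_tokens
print(f"C ≈ {flops:.2e} FLOPs")  # on the order of 3e23
```

That 10^23-FLOP order of magnitude is why compute, rather than data or parameters alone, is the binding constraint in frontier-model budgets.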
Parameter
Any learnable value in a machine learning model that is adjusted during training. Parameters include weights and biases in neural networks. Model size is often described by parameter count.