Artificial Intelligence

Quantization

The process of reducing the precision of a model's numerical weights (e.g., from 32-bit floats to 8-bit or 4-bit integers), making the model smaller and faster while accepting a small trade-off in accuracy.
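The core idea above can be sketched in a few lines. This is an illustrative example of symmetric per-tensor INT8 quantization in plain Python, not any particular library's implementation: each float weight is mapped to an integer in [-127, 127] using a single scale factor, and dequantization multiplies back by that scale.

```python
# Minimal sketch of symmetric INT8 quantization (illustrative only):
# floats in [-max_abs, max_abs] are mapped to integers in [-127, 127].

def quantize_int8(weights):
    """Return (int8-range values, scale) for a list of float weights."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0   # one float stored per tensor
    q = [round(w / scale) for w in weights]     # each value now fits in 8 bits
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from quantized values and the scale."""
    return [v * scale for v in q]

weights = [0.42, -1.3, 0.07, 0.9]
q, scale = quantize_int8(weights)
approx = dequantize_int8(q, scale)
# Each recovered weight is within half a quantization step (scale / 2)
# of the original -- this rounding gap is the accuracy trade-off.
```

The small gap between `weights` and `approx` is exactly the "small trade-off in accuracy" the definition mentions; storing one byte per weight instead of four is where the size reduction comes from.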

Why It Matters

Quantization can reduce model size by 4-8x and speed up inference, making it possible to run large models on edge devices, phones, and consumer hardware.

Example

Converting a 70B parameter model from FP16 (140GB) to INT4 (35GB) so it can run on a single GPU, with only minimal accuracy loss.
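The sizes in this example follow directly from bytes per parameter, as a quick back-of-envelope check shows (the 70B figure matches the example above):

```python
# Model size is roughly parameters x bytes per parameter.
params = 70e9                    # 70B parameters, as in the example
fp16_gb = params * 2 / 1e9       # FP16: 2 bytes per parameter
int4_gb = params * 0.5 / 1e9     # INT4: 4 bits = 0.5 bytes per parameter
print(fp16_gb, int4_gb)          # 140.0 35.0 -- a 4x reduction
```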

Think of it like...

Like converting a high-resolution photo to a smaller file — you lose some fine detail but the image is still perfectly usable and takes much less space.

Related Terms