Artificial Intelligence

Speculative Decoding

A technique that uses a small, fast draft model to propose several tokens ahead, then uses the large target model to verify all of them in a single parallel pass. Verified tokens are accepted in bulk; the first rejected token is replaced by the large model's own prediction, so inference speeds up without changing output quality.

Why It Matters

Speculative decoding can speed up LLM inference by 2-3x with no quality loss — one of the most impactful serving optimizations available.

Example

A 1B-parameter draft model quickly generates 10 candidate tokens; the 70B main model then verifies all 10 in a single forward pass — much faster than generating those 10 tokens one at a time.
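The draft-then-verify loop can be sketched with toy stand-ins for the two models (the function names and toy token rules below are hypothetical; real systems verify all positions in one batched forward pass and use rejection sampling so the target model's output distribution is preserved exactly):

```python
# Toy stand-ins for the two models: each maps a token sequence to the next
# token it would emit. In practice these would be a small draft LLM and a
# large target LLM; the modulus rules here are purely illustrative.
def draft_model(tokens):           # small and fast, occasionally wrong
    return (tokens[-1] + 1) % 50

def target_model(tokens):          # large and slow, authoritative
    return (tokens[-1] + 1) % 47

def speculative_step(prompt, k=10):
    """One round of greedy speculative decoding (simplified sketch).

    Accept the longest prefix of drafted tokens the target agrees with,
    then append the target's own token at the first disagreement, so at
    least one token is always gained per round.
    """
    # 1. Draft phase: the small model proposes k candidate tokens serially.
    ctx = list(prompt)
    drafted = []
    for _ in range(k):
        t = draft_model(ctx)
        drafted.append(t)
        ctx.append(t)

    # 2. Verify phase: the target scores every drafted position. A real
    #    implementation does this in a single batched pass; the loop here
    #    just simulates checking each position's expected token.
    accepted = list(prompt)
    for t in drafted:
        expected = target_model(accepted)
        if t == expected:
            accepted.append(t)         # draft matches: keep the free token
        else:
            accepted.append(expected)  # first mismatch: take the target's
            break                      # token and stop this round
    return accepted
```

With these toy rules, `speculative_step([0])` accepts all 10 drafted tokens, while `speculative_step([45])` rejects at the second position and falls back to the target's token — mirroring how acceptance rate, not draft length, determines the real-world speedup.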

Think of it like...

Like a junior associate drafting a contract for a senior partner to review — the senior only needs to check and approve rather than write from scratch.

Related Terms