Artificial Intelligence

Continuous Batching

A serving technique where new requests are added to an in-progress batch as existing requests complete, maximizing GPU utilization rather than waiting for an entire batch to finish.

Why It Matters

Continuous batching can improve LLM serving throughput by 2-5x, directly reducing per-query costs and improving response times.

Example

Instead of waiting for 32 requests to accumulate before processing, the server starts immediately and adds new requests to the batch as earlier ones finish generating.
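The scheduling idea above can be sketched in a toy simulation. This is an illustrative sketch, not a real serving engine: requests are modeled only by how many decode steps they have left, and the hypothetical `serve` function admits queued requests into free batch slots on every iteration rather than waiting for the whole batch to drain.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    tokens_left: int  # decode steps remaining until this request finishes

def serve(incoming, max_batch_size=4):
    """Toy continuous-batching loop: admit new work every step, not per batch."""
    queue = deque(incoming)
    batch, finished, steps = [], [], 0
    while queue or batch:
        # Key idea: fill free batch slots at every iteration, instead of
        # waiting for the entire batch to finish before admitting new requests.
        while queue and len(batch) < max_batch_size:
            batch.append(queue.popleft())
        # One decode step: every active request generates one token.
        for r in batch:
            r.tokens_left -= 1
        # Completed requests leave immediately, freeing their slots.
        finished += [r.rid for r in batch if r.tokens_left == 0]
        batch = [r for r in batch if r.tokens_left > 0]
        steps += 1
    return finished, steps

# Six requests with varying output lengths, batch capacity of 4.
reqs = [Request(i, n) for i, n in enumerate([3, 1, 5, 2, 4, 2])]
done, steps = serve(reqs, max_batch_size=4)
```

With these lengths, the continuous loop finishes all six requests in 5 decode steps; a static scheduler that processed them as two fixed batches of [3, 1, 5, 2] and [4, 2] would need 5 + 4 = 9 steps, since each batch runs as long as its slowest request.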

Think of it like...

Like a conveyor belt sushi restaurant versus a sit-down restaurant — dishes are continuously served as they are ready, rather than waiting for the entire table's order.

Related Terms