Continuous Batching
A serving technique in which new requests are added to an in-progress batch as existing requests complete, typically at the granularity of individual decoding iterations, so the GPU stays fully utilized instead of idling while it waits for an entire batch to finish.
Why It Matters
Continuous batching can improve LLM serving throughput by 2-5x, directly reducing per-query costs and improving response times.
Example
Instead of waiting for 32 requests to accumulate before processing, the server starts immediately and adds new requests to the batch as earlier ones finish generating.
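The scheduling loop described above can be sketched in a few lines of Python. This toy simulator is illustrative only (the function name `run_continuous_batching`, the batch limit, and the per-request token counts are made up, not taken from any real serving framework): waiting requests are admitted into the batch at every decoding iteration, rather than only after the whole batch drains.

```python
from collections import deque

def run_continuous_batching(requests, max_batch=4):
    """Simulate continuous batching.

    `requests` is a list of (request_id, tokens_to_generate) pairs.
    Returns the completion order and the number of decoding iterations.
    """
    waiting = deque(requests)
    active = {}      # request_id -> tokens still to generate
    finished = []    # completion order
    steps = 0
    while waiting or active:
        # Admit new requests whenever a slot frees up -- no waiting
        # for the current batch to finish.
        while waiting and len(active) < max_batch:
            rid, tokens = waiting.popleft()
            active[rid] = tokens
        # One decoding iteration: every active request emits one token.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]
                finished.append(rid)
        steps += 1
    return finished, steps
```

With a batch limit of 2 and five requests needing (2, 5, 3, 1, 4) tokens, this scheduler finishes in 9 iterations; a static scheduler that drains each batch of 2 before admitting more would need max(2,5) + max(3,1) + 4 = 12.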
Think of it like...
Like a conveyor belt sushi restaurant versus a sit-down restaurant — dishes are continuously served as they are ready, rather than waiting for the entire table's order.
Related Terms
Model Serving
The infrastructure and process of deploying trained ML models to production where they can receive requests and return predictions in real time. It includes scaling, load balancing, and version management.
Throughput
The number of requests or predictions a model can process in a given time period. High throughput means the system can serve many users simultaneously.
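Throughput is simple arithmetic over a measurement window. A minimal sketch, with made-up numbers (the request count, window length, and tokens-per-request figure are illustrative, not measurements from a real deployment):

```python
# Hypothetical measurement window -- illustrative numbers only.
requests_served = 1200
window_seconds = 60.0
avg_tokens_per_request = 150

throughput_rps = requests_served / window_seconds          # requests/second
throughput_tps = throughput_rps * avg_tokens_per_request   # tokens/second
```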
Latency
The time delay between sending a request to an AI model and receiving the response. In ML systems, latency includes data preprocessing, model inference, and network transmission time.
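The decomposition above can be written as a sum of per-stage delays. The timings below are hypothetical, chosen only to show how the components add up:

```python
# Hypothetical per-stage timings in milliseconds -- illustrative only.
preprocessing_ms = 15.0    # tokenization, input formatting
inference_ms = 320.0       # model forward passes
network_ms = 45.0          # request/response transmission

total_latency_ms = preprocessing_ms + inference_ms + network_ms
```

Note that inference usually dominates, which is why the optimization techniques in the next entry focus on it.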
Inference Optimization
Techniques for making AI model inference faster, cheaper, and more efficient. This includes quantization, batching, caching, speculative decoding, and hardware optimization.
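One of the techniques listed, caching, can be sketched with Python's standard `functools.lru_cache`. The model call here is a stand-in (`cached_infer` and its uppercase "prediction" are invented for illustration), but the pattern is real: repeated identical inputs skip inference entirely.

```python
from functools import lru_cache

CALLS = {"n": 0}  # counts how often the "model" actually runs

@lru_cache(maxsize=1024)
def cached_infer(prompt: str) -> str:
    # Stand-in for an expensive model forward pass. Thanks to the
    # cache, a repeated prompt returns the stored result instead of
    # re-running this body.
    CALLS["n"] += 1
    return prompt.upper()  # dummy "prediction"
```

A usage note: `cached_infer("hello")` called twice performs inference once; the second call is a cache hit.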
GPU
Graphics Processing Unit — originally designed for rendering graphics, GPUs excel at the parallel mathematical operations needed for training and running AI models. They are the primary hardware for modern AI.