Throughput
The number of requests or predictions a model can process in a given time period, typically measured in requests per second or tokens per second. High throughput means the system can serve many users simultaneously.
Why It Matters
Throughput determines how many users your AI application can support and directly impacts infrastructure costs and scalability.
Example
A model serving system processing 1,000 requests per second, or an LLM generating 100 tokens per second per user across 50 concurrent sessions.
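The arithmetic behind figures like these can be sketched in a few lines. The numbers below are the illustrative figures from the example, not benchmarks:

```python
# Illustrative figures from the example above (assumptions, not benchmarks).
requests_per_second = 1_000            # request-level throughput
tokens_per_second_per_user = 100       # per-session generation rate
concurrent_sessions = 50

# Aggregate token throughput across all concurrent sessions.
aggregate_tokens_per_second = tokens_per_second_per_user * concurrent_sessions

# Requests the system can absorb in one minute at this rate.
requests_per_minute = requests_per_second * 60

print(aggregate_tokens_per_second)  # 5000
print(requests_per_minute)          # 60000
```

Note that per-user token rate and aggregate throughput are different numbers: a system can have high aggregate throughput while each individual user sees only a modest generation speed.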
Think of it like...
Like a highway's capacity: it is not about how fast one car goes (latency) but how many cars can pass through per hour (throughput).
Related Terms
Latency
The time delay between sending a request to an AI model and receiving the response. In ML systems, latency includes data preprocessing, model inference, and network transmission time.
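Those stages can be timed individually to see where a request spends its time. A minimal sketch, where `preprocess` and `infer` are hypothetical stand-ins for real pipeline stages:

```python
import time

def preprocess(data):
    """Stand-in for real data preprocessing (illustrative only)."""
    return data.lower()

def infer(features):
    """Stand-in for real model inference (illustrative only)."""
    return {"label": "positive"}

def measure_latency(raw_input):
    """Time each stage of one request and report durations in milliseconds."""
    t0 = time.perf_counter()
    features = preprocess(raw_input)
    t1 = time.perf_counter()
    prediction = infer(features)
    t2 = time.perf_counter()
    return {
        "preprocess_ms": (t1 - t0) * 1000,
        "inference_ms": (t2 - t1) * 1000,
        "total_ms": (t2 - t0) * 1000,
        "prediction": prediction,
    }

timings = measure_latency("Great product!")
print(timings)
```

In production the network transmission time would be measured on the client side, since the server cannot observe it directly.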
Model Serving
The infrastructure and process of deploying trained ML models to production where they can receive requests and return predictions in real time. It includes scaling, load balancing, and version management.
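One of those concerns, version management, can be sketched in-process. This is a simplified illustration under assumed names (`ModelServer`, `register`); a real deployment would sit behind an HTTP server with load balancing and autoscaling:

```python
class ModelServer:
    """Toy model-serving router illustrating version management."""

    def __init__(self):
        self._versions = {}   # version string -> callable model
        self._default = None

    def register(self, version, model, default=False):
        """Deploy a model under a version; optionally make it the default."""
        self._versions[version] = model
        if default or self._default is None:
            self._default = version

    def predict(self, payload, version=None):
        """Route a request to the pinned version, or the default."""
        chosen = version or self._default
        return {"version": chosen, "prediction": self._versions[chosen](payload)}

server = ModelServer()
server.register("v1", lambda x: x * 2)
server.register("v2", lambda x: x * 3, default=True)

print(server.predict(10))                 # served by the default, v2
print(server.predict(10, version="v1"))   # pinned to the older v1
```

Keeping older versions registered is what makes safe rollbacks and A/B comparisons possible.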
Inference
The process of using a trained model to make predictions on new, unseen data. Inference is what happens when an AI model is deployed and actively serving results to users.
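At its core, inference is just applying a frozen model to fresh input. A minimal sketch, where the linear scorer and its weights are made-up stand-ins for a real trained model:

```python
# A toy "trained" model: a linear scorer with frozen weights.
# These weights are illustrative assumptions, not learned values.
WEIGHTS = {"price": -0.5, "rating": 2.0}
BIAS = -3.0

def predict(features):
    """Inference: apply the frozen model to new, unseen input."""
    score = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1 if score > 0 else 0

# Inputs the model never saw during training.
print(predict({"price": 1.0, "rating": 2.0}))  # score = 0.5  -> 1
print(predict({"price": 4.0, "rating": 0.0}))  # score = -5.0 -> 0
```

The key distinction from training is that the weights never change here; inference only reads the model, which is why it can be replicated across many servers to raise throughput.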