Streaming
Delivering LLM output token-by-token as it is generated rather than waiting for the complete response. Streaming dramatically improves perceived latency and user experience.
Why It Matters
Streaming cuts perceived wait time from full-response latency (often several seconds) down to time-to-first-token (often a few hundred milliseconds). Users see text appearing immediately rather than staring at a loading spinner.
Example
ChatGPT showing words appearing one at a time as they are generated, letting users start reading within 200ms rather than waiting 5 seconds for the complete response.
Think of it like...
Like reading a news ticker as it scrolls versus waiting for the entire news broadcast to finish before seeing anything.
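The idea above can be sketched in a few lines of Python. This is a minimal simulation, not a real LLM client: `generate_tokens` is a hypothetical stand-in for a streaming API, and the per-token delay is an assumed placeholder. The consumer measures time-to-first-token (what the user perceives) separately from total latency (what a non-streaming UI would make them wait for).

```python
import time

def generate_tokens(text, delay=0.01):
    """Simulate an LLM emitting tokens one at a time.

    Hypothetical stand-in for a real streaming API (e.g. one delivering
    chunks over server-sent events); the delay models per-token generation.
    """
    for token in text.split():
        time.sleep(delay)
        yield token + " "

def stream_response(token_iter):
    """Consume tokens as they arrive, tracking time-to-first-token (TTFT)
    and total latency -- streaming improves the first, not the second."""
    start = time.monotonic()
    first_token_at = None
    chunks = []
    for token in token_iter:
        if first_token_at is None:
            first_token_at = time.monotonic() - start
        chunks.append(token)  # in a real UI, render the token here
    total = time.monotonic() - start
    return "".join(chunks), first_token_at, total

text, ttft, total = stream_response(
    generate_tokens("Streaming improves perceived latency"))
```

Running this, `ttft` is roughly one token-delay while `total` is the sum of all of them; the gap between the two numbers is exactly the perceived-latency win that streaming delivers.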
Related Terms
Inference
The process of using a trained model to make predictions on new, previously unseen data. Inference is what happens when an AI model is deployed and actively serving results to users.
Latency
The time delay between sending a request to an AI model and receiving the response. In ML systems, latency includes data preprocessing, model inference, and network transmission time.
API
Application Programming Interface — a set of rules and protocols that allow different software applications to communicate with each other. In AI, APIs let developers integrate AI capabilities into their applications.
Model Serving
The infrastructure and process of deploying trained ML models to production where they can receive requests and return predictions in real time. It includes scaling, load balancing, and version management.