Model Serving
The infrastructure and process of deploying trained ML models to production, where they can receive requests and return predictions in real time. It also covers scaling, load balancing, and version management.
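At its simplest, model serving wraps inference behind a network endpoint. Below is a minimal sketch using only Python's standard library; the `predict` function is a hypothetical stand-in for a real trained model, and real serving stacks add batching, scaling, and monitoring on top of this pattern.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical stand-in for a trained model: a fixed linear scorer.
def predict(features):
    """Apply 'learned' weights to a feature vector (placeholder inference)."""
    weights = [0.4, 0.6]
    return sum(w * x for w, x in zip(weights, features))

class PredictHandler(BaseHTTPRequestHandler):
    """Serve POST requests with a JSON body like {"features": [1.0, 2.0]}."""
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        payload = json.dumps({"prediction": predict(body["features"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

# To actually serve (blocks forever):
# HTTPServer(("", 8000), PredictHandler).serve_forever()
```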
Why It Matters
Model serving determines the user experience — latency, reliability, and cost. A perfectly trained model is worthless if it cannot be served efficiently.
Example
Using a platform like AWS SageMaker or a framework like vLLM to host an LLM that handles thousands of concurrent user requests with sub-second response times.
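Handling many concurrent requests means the serving layer multiplexes them across workers rather than processing one at a time. A toy illustration of that idea, with a placeholder `handle` function standing in for real model inference:

```python
from concurrent.futures import ThreadPoolExecutor

def handle(request_id):
    """Placeholder inference for a single request."""
    return {"id": request_id, "prediction": request_id * 2}

# A serving layer fans many concurrent requests out over a worker pool.
with ThreadPoolExecutor(max_workers=8) as pool:
    responses = list(pool.map(handle, range(1000)))
```

Production systems like vLLM go much further (continuous batching, GPU memory paging), but the fan-out shape is the same.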
Think of it like...
Like running a restaurant kitchen — you need to efficiently take orders, prepare dishes (run inference), and serve them quickly without the kitchen backing up.
Related Terms
Inference
The process of using a trained model to make predictions on new, unseen data. Inference is what happens when a deployed AI model is actively serving results to users.
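The key distinction: training updates a model's parameters, while inference only applies them. A minimal sketch, with hypothetical parameters assumed to have been learned offline:

```python
# Hypothetical parameters "learned" during an earlier training phase.
WEIGHTS = [0.8, -0.3]
BIAS = 0.1

def infer(x):
    """Forward pass only: applies fixed weights, never updates them."""
    score = BIAS + sum(w * xi for w, xi in zip(WEIGHTS, x))
    return 1 if score > 0 else 0
```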
Latency
The time delay between sending a request to an AI model and receiving the response. In ML systems, latency includes data preprocessing, model inference, and network transmission time.
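Because total latency is the sum of several stages, it helps to time each one separately. A sketch with placeholder stages (the real preprocessing and model would be far heavier):

```python
import time

def timed(fn, *args):
    """Run fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - start

# Placeholder pipeline stages for illustration.
def preprocess(raw): return [float(v) for v in raw.split(",")]
def model(x): return sum(x)          # stand-in for real inference
def postprocess(y): return {"prediction": y}

feats, t_pre = timed(preprocess, "1.0,2.0,3.0")
pred, t_inf = timed(model, feats)
resp, t_post = timed(postprocess, pred)
total_ms = (t_pre + t_inf + t_post) * 1000  # excludes network time
```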
Throughput
The number of requests or predictions a model can process in a given time period. High throughput means the system can serve many users simultaneously.
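Throughput is measured by dividing completed requests by elapsed wall-clock time. A minimal sketch:

```python
import time

def run_batch(fn, requests):
    """Process a batch of requests and report throughput in requests/sec."""
    start = time.perf_counter()
    results = [fn(r) for r in requests]
    elapsed = time.perf_counter() - start
    return results, len(requests) / elapsed
```

Note that throughput and latency trade off: batching requests usually raises throughput while adding per-request latency.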
API
Application Programming Interface — a set of rules and protocols that allow different software applications to communicate with each other. In AI, APIs let developers integrate AI capabilities into their applications.
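Most AI APIs follow the same shape: a JSON body sent over HTTPS with an authorization header. A sketch using the standard library; the endpoint URL and field names here are hypothetical, not any particular provider's schema:

```python
import json
import urllib.request

def build_request(endpoint, prompt, api_key):
    """Assemble a JSON request for a hypothetical text-generation API."""
    body = json.dumps({"prompt": prompt, "max_tokens": 64}).encode()
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
    )

# To send (requires a live endpoint):
# with urllib.request.urlopen(build_request(...)) as resp:
#     print(json.load(resp))
```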
Deployment
The process of making a trained ML model available for use in production applications. Deployment involves packaging the model, setting up serving infrastructure, and establishing monitoring.
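The packaging step means serializing the trained artifact so the serving process can restore it exactly. A bare-bones sketch (real deployments typically use format-specific tools such as ONNX or framework checkpoints rather than raw pickle):

```python
import os
import pickle
import tempfile

# Hypothetical trained artifact: just parameters here.
model = {"weights": [0.4, 0.6], "bias": 0.1}

# Package: serialize the artifact with a versioned filename.
path = os.path.join(tempfile.mkdtemp(), "model-v1.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Deploy-time load: the serving process restores the same artifact.
with open(path, "rb") as f:
    loaded = pickle.load(f)
```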
MLOps
Machine Learning Operations — the set of practices that combine ML, DevOps, and data engineering to deploy and maintain ML models in production reliably and efficiently.