Model Serving
The infrastructure and process of deploying trained ML models to production, where they can receive requests and return predictions in real time. It also covers scaling, load balancing, and version management.
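At its simplest, model serving wraps inference behind a network endpoint. Below is a minimal sketch using only Python's standard library; the `predict` function is a hypothetical stand-in for a real trained model, and real serving stacks add batching, scaling, and monitoring on top of this pattern.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical stand-in for a trained model: a fixed linear scorer.
def predict(features):
    """Apply 'learned' weights to a feature vector (placeholder inference)."""
    weights = [0.4, 0.6]
    return sum(w * x for w, x in zip(weights, features))

class PredictHandler(BaseHTTPRequestHandler):
    """Serve POST requests with a JSON body like {"features": [1.0, 2.0]}."""
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        payload = json.dumps({"prediction": predict(body["features"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

# To actually serve (blocks forever):
# HTTPServer(("", 8000), PredictHandler).serve_forever()
```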
Why It Matters
Model serving determines the user experience — latency, reliability, and cost. A perfectly trained model is worthless if it cannot be served efficiently.
Example
Using a platform like AWS SageMaker or a framework like vLLM to host an LLM that handles thousands of concurrent user requests with sub-second response times.
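Handling many concurrent requests means the serving layer multiplexes them across workers rather than processing one at a time. A toy illustration of that idea, with a placeholder `handle` function standing in for real model inference:

```python
from concurrent.futures import ThreadPoolExecutor

def handle(request_id):
    """Placeholder inference for a single request."""
    return {"id": request_id, "prediction": request_id * 2}

# A serving layer fans many concurrent requests out over a worker pool.
with ThreadPoolExecutor(max_workers=8) as pool:
    responses = list(pool.map(handle, range(1000)))
```

Production systems like vLLM go much further (continuous batching, GPU memory paging), but the fan-out shape is the same.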
Think of it like...
Like running a restaurant kitchen — you need to efficiently take orders, prepare dishes (run inference), and serve them quickly without the kitchen backing up.
Related Terms
Inference
The process of using a trained model to make predictions on new, unseen data. Inference is what happens when a deployed AI model is actively serving results to users.
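The key distinction: training updates a model's parameters, while inference only applies them. A minimal sketch, with hypothetical parameters assumed to have been learned offline:

```python
# Hypothetical parameters "learned" during an earlier training phase.
WEIGHTS = [0.8, -0.3]
BIAS = 0.1

def infer(x):
    """Forward pass only: applies fixed weights, never updates them."""
    score = BIAS + sum(w * xi for w, xi in zip(WEIGHTS, x))
    return 1 if score > 0 else 0
```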
Latency
The time delay between sending a request to an AI model and receiving the response. In ML systems, latency includes data preprocessing, model inference, and network transmission time.
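Because total latency is the sum of several stages, it helps to time each one separately. A sketch with placeholder stages (the real preprocessing and model would be far heavier):

```python
import time

def timed(fn, *args):
    """Run fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - start

# Placeholder pipeline stages for illustration.
def preprocess(raw): return [float(v) for v in raw.split(",")]
def model(x): return sum(x)          # stand-in for real inference
def postprocess(y): return {"prediction": y}

feats, t_pre = timed(preprocess, "1.0,2.0,3.0")
pred, t_inf = timed(model, feats)
resp, t_post = timed(postprocess, pred)
total_ms = (t_pre + t_inf + t_post) * 1000  # excludes network time
```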
Throughput
The number of requests or predictions a model can process in a given time period. High throughput means the system can serve many users simultaneously.
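Throughput is measured by dividing completed requests by elapsed wall-clock time. A minimal sketch:

```python
import time

def run_batch(fn, requests):
    """Process a batch of requests and report throughput in requests/sec."""
    start = time.perf_counter()
    results = [fn(r) for r in requests]
    elapsed = time.perf_counter() - start
    return results, len(requests) / elapsed
```

Note that throughput and latency trade off: batching requests usually raises throughput while adding per-request latency.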
API
Application Programming Interface — a set of rules and protocols that allow different software applications to communicate with each other. In AI, APIs let developers integrate AI capabilities into their applications.
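Most AI APIs follow the same shape: a JSON body sent over HTTPS with an authorization header. A sketch using the standard library; the endpoint URL and field names here are hypothetical, not any particular provider's schema:

```python
import json
import urllib.request

def build_request(endpoint, prompt, api_key):
    """Assemble a JSON request for a hypothetical text-generation API."""
    body = json.dumps({"prompt": prompt, "max_tokens": 64}).encode()
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
    )

# To send (requires a live endpoint):
# with urllib.request.urlopen(build_request(...)) as resp:
#     print(json.load(resp))
```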
Deployment
The process of making a trained ML model available for use in production applications. Deployment involves packaging the model, setting up serving infrastructure, and establishing monitoring.
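The packaging step means serializing the trained artifact so the serving process can restore it exactly. A bare-bones sketch (real deployments typically use format-specific tools such as ONNX or framework checkpoints rather than raw pickle):

```python
import os
import pickle
import tempfile

# Hypothetical trained artifact: just parameters here.
model = {"weights": [0.4, 0.6], "bias": 0.1}

# Package: serialize the artifact with a versioned filename.
path = os.path.join(tempfile.mkdtemp(), "model-v1.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Deploy-time load: the serving process restores the same artifact.
with open(path, "rb") as f:
    loaded = pickle.load(f)
```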
MLOps
Machine Learning Operations — the set of practices that combine ML, DevOps, and data engineering to deploy and maintain ML models in production reliably and efficiently.