Retrieval Latency
The time it takes for a retrieval system to search through stored documents or embeddings and return relevant results. Typically measured in milliseconds, it is a critical component of overall RAG system performance.
Why It Matters
Retrieval latency directly impacts user experience in RAG applications: because retrieval must complete before generation can begin, every millisecond of retrieval delay adds to total response time. Users expect near-instant results, and even 500 ms of retrieval delay is noticeable.
Example
A vector database returning the top 10 most relevant document chunks from 50 million embeddings in 12 milliseconds using HNSW (Hierarchical Navigable Small World) indexing.
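As a rough illustration of how retrieval latency is measured, here is a minimal sketch using a brute-force numpy similarity search rather than the HNSW index from the example above; the corpus size, dimensionality, and top-k value are arbitrary stand-ins, not parameters from the source:

```python
import time
import numpy as np

# Hypothetical corpus: 100k random 128-d unit vectors standing in for real embeddings.
rng = np.random.default_rng(0)
corpus = rng.standard_normal((100_000, 128)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

def retrieve_top_k(query: np.ndarray, k: int = 10) -> np.ndarray:
    """Exact (brute-force) search: dot product equals cosine similarity on unit vectors."""
    scores = corpus @ query
    return np.argpartition(scores, -k)[-k:]  # indices of the k highest-scoring chunks

query = corpus[42]  # reuse a stored vector as the query

start = time.perf_counter()
top = retrieve_top_k(query, k=10)
latency_ms = (time.perf_counter() - start) * 1000  # retrieval latency in milliseconds
print(f"retrieved {len(top)} ids in {latency_ms:.2f} ms")
```

Production systems avoid this exact scan at scale and instead use approximate indexes (such as HNSW), which is how latencies like 12 ms over tens of millions of vectors become achievable.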
Think of it like...
Like the speed of a search engine — whether it takes 10 milliseconds or 2 seconds to return results fundamentally changes the user experience.
Related Terms
Latency
The time delay between sending a request to an AI model and receiving the response. In ML systems, latency includes data preprocessing, model inference, and network transmission time.
Vector Database
A specialized database designed to store, index, and search high-dimensional vector embeddings efficiently. It enables fast similarity searches across millions or billions of vectors.
Approximate Nearest Neighbor
An algorithm that finds vectors approximately closest to a query vector, trading perfect accuracy for dramatic speed improvements. ANN makes vector search practical at scale.
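The accuracy-for-speed trade ANN makes can be sketched with a toy IVF-style partition search in numpy: vectors are bucketed by their nearest centroid, and queries probe only a few buckets instead of scanning everything. The cell counts and probe width below are illustrative assumptions, and random sampling stands in for real k-means training:

```python
import numpy as np

rng = np.random.default_rng(1)
corpus = rng.standard_normal((20_000, 64)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

# "Train" the index: pick 64 stored vectors as cell centroids (crude k-means stand-in),
# then assign every vector to its most similar centroid's cell.
n_cells = 64
centroids = corpus[rng.choice(len(corpus), n_cells, replace=False)]
assignments = np.argmax(corpus @ centroids.T, axis=1)
cells = [np.flatnonzero(assignments == c) for c in range(n_cells)]

def ann_search(query: np.ndarray, k: int = 5, n_probe: int = 4) -> np.ndarray:
    """Approximate search: score only vectors in the n_probe cells nearest the query."""
    probe = np.argsort(centroids @ query)[-n_probe:]
    cand = np.concatenate([cells[c] for c in probe])
    scores = corpus[cand] @ query
    return cand[np.argsort(scores)[-k:]]

def exact_search(query: np.ndarray, k: int = 5) -> np.ndarray:
    """Ground truth: score every vector in the corpus."""
    return np.argsort(corpus @ query)[-k:]

query = corpus[7]
approx = set(ann_search(query).tolist())
exact = set(exact_search(query).tolist())
recall = len(approx & exact) / len(exact)  # fraction of true neighbors the ANN found
print(f"recall@5 = {recall:.2f}")
```

Probing 4 of 64 cells scores roughly 1/16th of the corpus, which is where the "dramatic speed improvement" comes from; recall below 1.0 is the accuracy given up in exchange.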
Retrieval-Augmented Generation
A technique that enhances LLM outputs by first retrieving relevant information from external knowledge sources and then using that information as context for generation. RAG combines the power of search with the fluency of language models.
Inference
The process of using a trained model to make predictions on new, previously unseen data. Inference is what happens when an AI model is deployed and actively serving results to users.