AI Glossary
The definitive dictionary for AI, Machine Learning, and Governance terminology. From Flash Attention to RAG — look up any term.
A
Adversarial Attack
An input deliberately crafted to fool an AI model into making incorrect predictions. Adversarial examples often look normal to humans but cause models to fail spectacularly.
Agent Memory
Systems that give AI agents persistent storage for facts, preferences, and conversation history across sessions. Memory enables agents to build cumulative knowledge over time.
Agentic AI
AI systems designed to operate with high autonomy — planning, executing, and adapting without constant human oversight. Agentic AI emphasizes independent action-taking to accomplish user goals.
Agentic Memory Systems
Architectures for managing different types of memory in AI agents — working memory for current tasks, episodic memory for past interactions, and semantic memory for accumulated knowledge.
Agentic RAG
An advanced RAG pattern where an AI agent dynamically decides what to retrieve, how to refine queries, and when to search again based on the quality of initial results.
Agentic Workflow
A multi-step process where an AI agent autonomously plans, executes, evaluates, and iterates on tasks, making decisions at each step rather than following a fixed pipeline.
AI Agent
An AI system that can autonomously plan, reason, and take actions to accomplish goals. Unlike simple chatbots, agents can use tools, make decisions, execute multi-step workflows, and adapt their approach based on results.
AI Alignment Tax
The performance cost of making AI models safer and more aligned with human values. Safety training sometimes reduces raw capability on certain tasks.
AI Chip
A semiconductor designed specifically for artificial intelligence workloads, optimized for the mathematical operations (matrix multiplication, convolution) that neural networks require.
AI Coding Assistant
An AI tool that helps developers write, debug, review, and refactor code through natural language interaction and code completion. Modern coding assistants use LLMs fine-tuned on code.
AI Memory
Systems that give AI models the ability to retain and recall information across conversations or sessions. Memory enables persistent context, user preferences, and accumulated knowledge.
AI Orchestration Layer
The middleware that coordinates AI model calls, tool execution, memory management, and error handling in complex AI applications. It manages the flow between components.
Approximate Nearest Neighbor
An algorithm that finds vectors approximately closest to a query vector, trading perfect accuracy for dramatic speed improvements. ANN makes vector search practical at scale.
Artificial General Intelligence
A hypothetical AI system with human-level cognitive abilities across all domains — able to reason, learn, plan, and understand any intellectual task that a human can. AGI does not yet exist.
Artificial Intelligence
The broad field of computer science focused on creating systems capable of performing tasks that typically require human intelligence. This includes learning, reasoning, problem-solving, perception, and language understanding.
Artificial Superintelligence
A theoretical AI system that vastly surpasses human intelligence across all domains including creativity, problem-solving, and social intelligence. ASI remains purely hypothetical.
ASIC
Application-Specific Integrated Circuit — a chip designed for a single specific purpose. In AI, ASICs like Google's TPUs are designed exclusively for neural network operations.
Attention Head
A single attention computation within multi-head attention. Each head independently computes attention scores, allowing different heads to specialize in different types of relationships.
Attention Map
A visualization showing which parts of the input an AI model focuses on when making predictions. Attention maps reveal the model's internal focus patterns.
Attention Mechanism
A component in neural networks that allows the model to focus on the most relevant parts of the input when producing each part of the output. It assigns different weights to different input elements based on their relevance.
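In essence, each output position is a weighted average of value vectors, with the weights derived from query-key similarity. A minimal NumPy sketch of scaled dot-product attention (toy random inputs, not any particular model):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Weight each value by how relevant its key is to the query."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # query-key relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                              # weighted sum of values

# 3 tokens, embedding dimension 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4)
```

Production implementations add masking, batching, and multiple heads, but the core computation is this weighted sum.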
Attention Score
The numerical value representing how much one token should focus on another token in the attention mechanism. Higher scores mean stronger relationships between tokens.
Attention Sink
A phenomenon in transformers where the first few tokens in a sequence receive disproportionately high attention scores regardless of their content, acting as 'sinks' for excess attention.
Attention Window
The range of tokens that an attention mechanism can attend to in a single computation. Different attention patterns (local, global, sliding) use different window sizes.
Autonomous Agent Framework
A software framework providing the infrastructure for building AI agents including planning, memory, tool integration, error handling, and multi-agent coordination.
Autonomous AI
AI systems capable of making decisions and taking actions independently without continuous human guidance. Autonomous AI can plan, execute, and adapt to changing circumstances on its own.
Autonomous Vehicle
A vehicle that can navigate and operate without human input using AI systems for perception (cameras, lidar), decision-making, and control. Self-driving technology uses computer vision, sensor fusion, and planning.
B
Backdoor Attack
A type of data poisoning where a model is trained to behave maliciously when a specific trigger pattern is present in the input, while behaving normally otherwise.
Beam Search
A search algorithm used in text generation that explores multiple possible output sequences simultaneously, keeping the top-scoring candidates at each step. It often finds higher-quality outputs than greedy decoding.
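A toy sketch of the algorithm, using a made-up probability function in place of a real model. Note how a locally tempting first token ('a') loses to a sequence that starts worse but ends better:

```python
import math

def beam_search(next_probs, start, steps, beam_width=2):
    """Keep the `beam_width` highest-scoring partial sequences at each step."""
    beams = [([start], 0.0)]                        # (sequence, log-probability)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for token, p in next_probs(seq).items():
                candidates.append((seq + [token], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]             # prune to the best few
    return beams[0][0]

# Toy 'model': 'a' is locally tempting, but 'b' leads to a better continuation.
def toy_next_probs(seq):
    if seq[-1] == "<s>":
        return {"a": 0.6, "b": 0.4}
    return {"x": 0.9, "y": 0.1} if seq[-1] == "b" else {"x": 0.5, "y": 0.5}

print(beam_search(toy_next_probs, "<s>", 2))  # ['<s>', 'b', 'x']
```

Greedy decoding would commit to 'a' (probability 0.6 × 0.5 = 0.3); the beam keeps 'b' alive long enough to find the higher-probability sequence (0.4 × 0.9 = 0.36).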
Benchmark
A standardized test or dataset used to evaluate and compare the performance of AI models. Benchmarks provide consistent metrics that allow fair comparisons between different approaches.
Benchmark Contamination
When a model's training data inadvertently includes test data from benchmarks, leading to inflated performance scores that do not reflect true capability.
BERT
Bidirectional Encoder Representations from Transformers — a language model developed by Google that reads text in both directions simultaneously. BERT excels at understanding language rather than generating it.
Black Box
A model or system whose internal workings are not visible or understandable to the user — you can see the inputs and outputs but not the reasoning in between. Most deep learning models are considered black boxes.
BM25
Best Matching 25 — a widely used ranking function for keyword-based information retrieval. BM25 scores documents based on query term frequency, document length, and corpus statistics.
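A compact sketch of the scoring function, using the common smoothed-IDF variant; real implementations precompute corpus statistics rather than rescanning per query:

```python
import math
from collections import Counter

def bm25_score(query, doc, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query (illustrative sketch)."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N            # average document length
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(1 for d in corpus if term in d)       # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
        freq = tf[term]
        # term frequency saturates (k1) and is normalized by document length (b)
        score += idf * freq * (k1 + 1) / (freq + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [["vector", "search"], ["keyword", "search", "ranking"], ["bm25", "ranking"]]
print(bm25_score(["search", "ranking"], corpus[1], corpus))
```

The document containing both query terms scores higher than one containing only one of them, as expected.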
Byte-Pair Encoding
A subword tokenization algorithm that starts with individual characters and iteratively merges the most frequent pairs to create a vocabulary of subword units. It balances vocabulary size with handling of rare words.
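The merge loop can be sketched on a tiny toy corpus; production tokenizers add byte-level handling, special tokens, and far larger vocabularies:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merge rules from a tiny corpus (toy sketch of the algorithm)."""
    # Represent each word as a tuple of symbols, starting from single characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)          # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for word, freq in vocab.items():          # apply the merge everywhere
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

merges = bpe_merges(["low", "lower", "lowest"] * 3, 2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```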
C
Capability Elicitation
Techniques for discovering and activating latent capabilities in AI models — abilities that exist but are not obvious from standard testing or usage.
Chain-of-Thought
A prompting technique where the model is encouraged to show its step-by-step reasoning process before arriving at a final answer. This improves accuracy on complex reasoning tasks.
Chatbot
An AI application designed to simulate conversation with human users through text or voice. Modern chatbots use LLMs to provide natural, contextually aware responses.
ChatGPT
OpenAI's consumer-facing AI chatbot powered by GPT models. ChatGPT brought LLMs to the mainstream when it launched in November 2022, reaching 100 million users in two months.
Chinchilla Scaling
Research by DeepMind showing that many LLMs were significantly undertrained — for a given compute budget, training a smaller model on more data yields better performance.
Chunking
The process of breaking large documents into smaller pieces (chunks) before creating embeddings for use in RAG systems. Chunk size and strategy significantly impact retrieval quality.
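One common strategy is fixed-size chunks with overlap so that context spanning a boundary is not lost. A minimal character-based sketch (real systems often chunk by tokens, sentences, or document structure; assumes overlap < chunk_size):

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping fixed-size chunks."""
    chunks = []
    step = chunk_size - overlap          # advance less than a full chunk
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "".join(chr(65 + i % 26) for i in range(500))  # toy 500-character document
chunks = chunk_text(doc, chunk_size=200, overlap=50)
print(len(chunks))  # 3
```

The last 50 characters of each chunk reappear at the start of the next, so a sentence straddling a boundary is retrievable from either side.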
CI/CD for ML
Continuous Integration and Continuous Deployment applied to machine learning — automating the testing, validation, and deployment of ML models whenever code or data changes.
Claude
Anthropic's family of AI assistants known for their focus on safety, helpfulness, and honesty. Claude models are designed with Constitutional AI principles for safer, more reliable AI interactions.
CLIP
Contrastive Language-Image Pre-training — an OpenAI model trained to understand the relationship between images and text. CLIP can match images to text descriptions without being trained on specific image categories.
Closed Source AI
AI models where the architecture, weights, and training details are proprietary and not publicly available. Users access them only through APIs or products controlled by the developer.
Code Generation
The AI capability of producing functional source code from natural language descriptions, specifications, or partial code. Modern LLMs can generate code in dozens of programming languages.
Cognitive Architecture
A framework or blueprint for building AI systems that mimics aspects of human cognition, including perception, memory, reasoning, learning, and action.
Compute
The computational resources (processing power, memory, time) required to train or run AI models. Compute is measured in FLOPs (floating-point operations) and is a primary constraint and cost in AI development.
Compute-Optimal Training
Allocating a fixed compute budget optimally between model size and training data quantity, based on scaling law research like the Chinchilla findings.
Computer Vision
A field of AI that trains computers to interpret and understand visual information from the world — images, videos, and real-time camera feeds. It enables machines to 'see' and make decisions based on what they see.
Confidence Score
A numerical value (typically 0-1) indicating how certain a model is about its prediction. Higher scores indicate greater confidence in the output.
Constrained Generation
Techniques that force LLM output to conform to specific formats, schemas, or grammars. This ensures outputs are always valid JSON, SQL, or match a defined structure.
Constraint Satisfaction
The problem of finding values for variables that satisfy a set of constraints. In AI, it is used in scheduling, planning, and configuration tasks.
Context Management
Strategies for efficiently using an LLM's limited context window, including what information to include, how to compress it, and when to summarize or truncate.
Context Window
The maximum amount of text (measured in tokens) that a language model can process in a single interaction. It includes both the input prompt and the generated output. Larger context windows allow models to handle longer documents.
Continuous Batching
A serving technique where new requests are added to an in-progress batch as existing requests complete, maximizing GPU utilization rather than waiting for an entire batch to finish.
Conversational AI
AI technology that enables natural, multi-turn conversations between humans and machines. It combines NLU, dialog management, and NLG to maintain coherent, contextual interactions.
Counterfactual Explanation
An explanation of an AI decision that describes what would need to change in the input for the model to produce a different output. It answers 'What if?' questions about predictions.
CUDA
Compute Unified Device Architecture — NVIDIA's parallel computing platform that enables GPU programming for AI workloads. CUDA is the dominant software ecosystem for AI computation.
D
DALL-E
A text-to-image AI model created by OpenAI that generates original images from text descriptions. DALL-E can create realistic images, art, and conceptual visualizations from natural language prompts.
Denoising
The process of removing noise from data to recover the underlying clean signal. In generative AI, denoising is the core mechanism of diffusion models.
Dense Retrieval
Information retrieval using learned vector embeddings to find semantically similar documents. Called 'dense' because documents are represented as compact vectors in which most values are nonzero, in contrast to the mostly-zero sparse vectors used by keyword methods.
Deployment
The process of making a trained ML model available for use in production applications. Deployment involves packaging the model, setting up serving infrastructure, and establishing monitoring.
Deterministic Output
When an AI model produces the same output every time for the same input. Achieved by setting temperature to 0 and using fixed random seeds.
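A sketch of why temperature 0 is deterministic: it collapses sampling to argmax, while any higher temperature draws from a distribution (reproducible only with a fixed seed). Toy logits, not a real model:

```python
import numpy as np

def sample_token(logits, temperature, rng):
    """Temperature 0 -> argmax (deterministic); higher -> more random."""
    logits = np.asarray(logits, dtype=float)
    if temperature == 0:
        return int(np.argmax(logits))
    probs = np.exp(logits / temperature)
    probs /= probs.sum()                 # softmax at the given temperature
    return int(rng.choice(len(logits), p=probs))

logits = [2.0, 1.0, 0.5]
rng = np.random.default_rng(42)          # fixed seed for reproducibility
print(sample_token(logits, 0, rng))      # always 0, regardless of the seed
```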
Diffusion Model
A type of generative AI model that creates data by starting with random noise and gradually removing it, step by step, until a coherent output (like an image) emerges. This process is called denoising.
Digital Twin
A virtual replica of a physical system, process, or object that uses real-time data and AI to simulate, predict, and optimize the behavior of its physical counterpart.
Document Processing
AI-powered extraction and understanding of information from documents including PDFs, images, forms, and scanned papers. It combines OCR, NLP, and computer vision.
E
Edge Inference
Running AI models directly on local devices (phones, IoT sensors, cameras) rather than sending data to the cloud. This reduces latency, preserves privacy, and works without internet connectivity.
Embedding
A numerical representation of data (text, images, etc.) as a vector of numbers in a high-dimensional space. Similar items are placed closer together in this space, enabling machines to understand semantic relationships.
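'Closer together' is usually measured with cosine similarity. A sketch with made-up 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    """Similarity between two embedding vectors: 1 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings: 'cat' and 'kitten' point in similar directions; 'car' does not.
cat = np.array([0.9, 0.8, 0.1])
kitten = np.array([0.85, 0.75, 0.2])
car = np.array([0.1, 0.2, 0.95])

print(cosine_similarity(cat, kitten) > cosine_similarity(cat, car))  # True
```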
Embedding Dimension
The number of numerical values in a vector embedding. Higher dimensions can capture more nuanced relationships but require more storage and computation.
Embedding Drift
Changes in embedding vector distributions over time as the underlying data, vocabulary, or user behavior shifts. Drift degrades retrieval quality in RAG and search systems.
Embedding Model
A specialized model designed to convert text, images, or other data into vector embeddings. Embedding models are optimized for producing meaningful numerical representations rather than generating text.
Embedding Space
The high-dimensional geometric space in which embeddings exist. In this space, the distance and direction between points encode semantic relationships between the items they represent.
Embeddings as a Service
Cloud APIs that convert text or other data into vector embeddings without requiring users to host or manage embedding models themselves.
Emergent Behavior
Capabilities that appear in large AI models that were not explicitly trained for and were not present in smaller versions. Emergent abilities seem to appear suddenly at certain scale thresholds.
Encoder-Decoder
An architecture where the encoder compresses input into a fixed representation and the decoder generates output from that representation. This structure is used in translation, summarization, and image captioning.
Evaluation
The systematic process of measuring an AI model's performance, safety, and reliability using various metrics, benchmarks, and testing methodologies.
Evaluation Framework
A structured system for measuring AI model performance across multiple dimensions including accuracy, safety, fairness, robustness, and user satisfaction.
Evaluation Harness
A standardized testing framework for running AI models through suites of benchmarks and evaluation tasks. It ensures consistent, reproducible evaluation across models.
Expert System
An early AI system that mimics human expertise in a specific domain using a knowledge base of rules and facts. Expert systems were the dominant AI approach in the 1980s.
F
Federated Inference
Running AI model inference across multiple distributed devices or locations, rather than centralizing it in one place. Each device processes its own data locally.
Few-Shot Learning
A technique where a model learns to perform a task from only a few examples provided in the prompt. Instead of training on thousands of examples, the model generalizes from just 2-5 demonstrations.
Few-Shot Prompting
A prompt engineering technique where a small number of input-output examples are provided before the actual query, demonstrating the desired format and behavior to the model.
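A sketch of the prompt format, here for a hypothetical sentiment task; the exact labels and layout are illustrative:

```python
def few_shot_prompt(examples, query):
    """Build a few-shot prompt: demonstrations first, then the real query."""
    lines = []
    for inp, out in examples:
        lines.append(f"Input: {inp}\nOutput: {out}\n")
    lines.append(f"Input: {query}\nOutput:")      # model completes from here
    return "\n".join(lines)

examples = [
    ("The movie was fantastic", "positive"),
    ("Terrible service, never again", "negative"),
]
prompt = few_shot_prompt(examples, "A solid, enjoyable read")
print(prompt)
```

The trailing "Output:" cues the model to continue in the demonstrated format.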
Fine-Tuning vs RAG
The strategic decision between customizing a model's weights (fine-tuning) or providing external knowledge at inference time (RAG). Each approach has different strengths and use cases.
Flash Attention
An optimized implementation of the attention mechanism that reduces memory usage and increases speed by tiling the computation and avoiding materializing the full attention matrix in memory.
FLOPS
Floating Point Operations Per Second — a measure of computing speed that quantifies how many mathematical calculations a processor can perform each second. Used to measure AI hardware performance.
Foundation Model
A large AI model trained on broad data at scale that can be adapted to a wide range of downstream tasks. Foundation models serve as the base upon which specialized applications are built.
Frontier Model
The most capable and advanced AI models available at any given time, typically characterized by the highest performance across multiple benchmarks. These models push the boundaries of AI capabilities.
Function Calling
A capability where an LLM can generate structured output to invoke specific functions or APIs. The model decides which function to call and what parameters to pass based on the user's request.
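A sketch of the application side: a tool schema in the JSON-schema style most provider APIs use, and a dispatch on the model's structured output. The exact wire format varies by provider; `get_weather` and the response shape here are illustrative:

```python
import json

# Hypothetical tool definition, described to the model alongside the prompt.
get_weather_tool = {
    "name": "get_weather",
    "description": "Get current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# The model returns a structured call like this; your code executes it.
model_output = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
call = json.loads(model_output)

def get_weather(city):                   # the application-side implementation
    return f"Weather for {city}"

dispatch = {"get_weather": get_weather}
result = dispatch[call["name"]](**call["arguments"])
print(result)  # Weather for Paris
```

The function result is typically fed back to the model so it can compose a final answer.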
G
Gemini
Google DeepMind's family of multimodal AI models designed to understand and generate text, code, images, audio, and video. Gemini is Google's flagship AI model series.
Generative Adversarial Network
A framework where two neural networks compete — a generator creates fake data and a discriminator tries to tell real from fake. This adversarial process drives both networks to improve, producing increasingly realistic outputs.
Generative AI
AI systems that can create new content — text, images, music, code, video — rather than just analyzing or classifying existing data. These models learn patterns from training data and generate novel outputs that resemble the original data.
GGUF
A file format for storing quantized language models designed for efficient CPU inference. GGUF is the standard format used by llama.cpp and is popular for local LLM deployment.
GPT
Generative Pre-trained Transformer — a family of large language models developed by OpenAI. GPT models are trained to predict the next token in a sequence and can generate coherent, contextually relevant text across many tasks.
GPU
Graphics Processing Unit — originally designed for rendering graphics, GPUs excel at the parallel mathematical operations needed for training and running AI models. They are the primary hardware for modern AI.
GraphRAG
A RAG approach that uses knowledge graphs rather than vector databases for retrieval. It combines graph traversal with LLM generation to answer questions requiring multi-hop reasoning.
Greedy Decoding
A simple text generation strategy where the model always selects the most probable next token at each step. It is fast but can produce repetitive or suboptimal outputs.
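The strategy reduces to an argmax loop. A sketch with a made-up next-token function standing in for a real model:

```python
import numpy as np

def greedy_decode(logits_fn, start, steps):
    """Generate by always taking the single most probable next token."""
    seq = list(start)
    for _ in range(steps):
        logits = logits_fn(seq)
        seq.append(int(np.argmax(logits)))   # pick the top token only
    return seq

# Toy 'model': token i is most likely followed by token (i + 1) % 5.
def toy_logits(seq):
    logits = np.zeros(5)
    logits[(seq[-1] + 1) % 5] = 1.0
    return logits

print(greedy_decode(toy_logits, [0], 4))  # [0, 1, 2, 3, 4]
```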
Grounding
The practice of connecting AI model outputs to verifiable sources of information, ensuring responses are based on factual data rather than the model's potentially unreliable internal knowledge.
Guardrail Model
A separate, specialized AI model that monitors the inputs and outputs of a primary LLM to detect and block harmful, off-topic, or policy-violating content.
H
Hallucination
When an AI model generates information that sounds plausible and confident but is factually incorrect, fabricated, or not grounded in its training data or provided context. The model essentially 'makes things up'.
Hallucination Detection
Methods and systems for automatically identifying when an AI model has generated false or unsupported information. Detection can compare outputs against source documents or use consistency checks.
Hallucination Rate
The frequency at which an AI model generates incorrect or fabricated information. It is typically measured as a percentage of responses containing hallucinations.
Hardware Acceleration
Using specialized hardware (GPUs, TPUs, FPGAs, ASICs) to speed up AI computation compared to general-purpose CPUs. Accelerators are optimized for the specific math operations used in neural networks.
Hugging Face
The leading open-source platform for sharing and discovering AI models, datasets, and applications. Hugging Face hosts the Transformers library and a community hub with thousands of pre-trained models.
Human Evaluation
Using human judges to assess AI model quality on subjective dimensions like helpfulness, coherence, creativity, and safety that automated metrics cannot fully capture.
Human-in-the-Loop
A system design where humans are integrated into the AI workflow to provide oversight, make decisions, correct errors, or handle edge cases that the AI cannot reliably manage alone.
Hybrid Search
A search approach that combines keyword-based (lexical) search with semantic (vector) search to get the benefits of both — exact matching for specific terms and meaning-based matching for conceptual queries.
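One common way to combine the two result lists is reciprocal rank fusion (RRF), which needs only ranks, not comparable scores. A sketch with hypothetical document IDs:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists (e.g. BM25 + vector) via RRF scores."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Documents ranked highly in any list accumulate more score.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc3", "doc1", "doc7"]   # lexical (keyword) ranking
vector_hits = ["doc1", "doc5", "doc3"]    # semantic (vector) ranking
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

"doc1" wins because it ranks well in both lists; the constant k=60 is the value commonly used in the RRF literature.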
I
Image Classification
A computer vision task that assigns a category label to an entire image. The model determines what the main subject of the image is from a predefined set of categories.
Image Segmentation
A computer vision task that assigns a label to every pixel in an image, dividing it into meaningful regions. It identifies not just what objects are present but their exact shapes and boundaries.
In-Context Learning
An LLM's ability to learn new tasks from examples or instructions provided within the prompt, without any weight updates or fine-tuning. The model adapts its behavior based on the context given.
Inference
The process of using a trained model to make predictions on new, previously unseen data. Inference is what happens when an AI model is deployed and actively serving results to users.
Inference Optimization
Techniques for making AI model inference faster, cheaper, and more efficient. This includes quantization, batching, caching, speculative decoding, and hardware optimization.
Information Extraction
The task of automatically extracting structured information (entities, relationships, events) from unstructured text documents.
Instruction Following
An LLM's ability to accurately understand and execute user instructions, including complex multi-step directives with specific constraints on format, tone, length, and content.
Instruction Hierarchy
A framework for prioritizing different levels of instructions when they conflict — system prompts typically override user prompts, which override context from retrieved documents.
Instructor Embedding
An embedding approach where you provide instructions that describe the task alongside the text, producing task-specific embeddings from a single model.
K
Knowledge Cutoff
The date after which a language model has no training data. The model cannot reliably answer questions about events that occurred after its knowledge cutoff.
KV Cache
Key-Value Cache — a mechanism that stores previously computed attention key and value vectors during autoregressive generation, avoiding redundant computation for tokens already processed.
L
LangChain
A popular open-source framework for building applications powered by language models. It provides tools for prompt management, chains, agents, memory, and integration with external tools and data sources.
Large Language Model
A type of AI model trained on massive amounts of text data that can understand and generate human-like text. LLMs use transformer architecture and typically have billions of parameters, enabling them to perform a wide range of language tasks.
Latency
The time delay between sending a request to an AI model and receiving the response. In ML systems, latency includes data preprocessing, model inference, and network transmission time.
Leaderboard
A ranking of AI models by performance on specific benchmarks. Leaderboards drive competition and provide quick comparisons but can encourage gaming and narrow optimization.
Llama
A family of open-weight large language models released by Meta. Llama models are available for download and customization, making them the most widely adopted open-source LLM family.
LLM-as-Judge
Using a large language model to evaluate the quality of another model's outputs, replacing or supplementing human evaluators. The judge LLM scores responses on various quality dimensions.
Long Context
The ability of AI models to process very large amounts of input text — typically 100K tokens or more — enabling analysis of entire books, codebases, or document collections.
M
Machine Translation
The use of AI to automatically translate text or speech from one language to another. Modern neural machine translation uses transformer models and achieves near-human quality for many language pairs.
Mistral
A French AI company and their family of efficient, high-performance open-weight language models. Mistral models are known for strong performance relative to their size.
Mixture of Agents
An architecture where multiple different AI models collaborate on a task, with each model contributing its strengths. A routing or aggregation layer combines their outputs.
Mixture of Depths
A transformer architecture where different tokens use different numbers of layers, allowing the model to spend more computation on complex tokens and less on simple ones.
Mixture of Experts
An architecture where a model consists of multiple specialized sub-networks (experts) and a gating mechanism that routes each input to only the most relevant experts. Only a fraction of the total parameters are active per input.
Mixture of Modalities
AI architectures that natively process and generate multiple data types within a single unified model, rather than using separate models connected together.
MLOps
Machine Learning Operations — the set of practices that combine ML, DevOps, and data engineering to deploy and maintain ML models in production reliably and efficiently.
Model Collapse
A phenomenon where AI models trained on AI-generated content progressively lose quality and diversity, eventually producing repetitive, low-quality outputs. Each successive generation of models degrades further.
Model Context Protocol
An open protocol that standardizes how AI models connect to external tools, data sources, and services. MCP provides a universal interface for LLMs to access context from any compatible system.
Model Drift
The gradual degradation of a model's predictive performance over time as the real-world environment changes. Model drift can be caused by data drift, concept drift, or both.
Model Evaluation Pipeline
An automated system that runs a comprehensive suite of evaluations on AI models, generating reports on accuracy, safety, bias, robustness, and other quality dimensions.
Model Hub
A platform for hosting, discovering, and sharing pre-trained AI models. Model hubs provide standardized access to thousands of models across different tasks and architectures.
Model Interpretability Tool
Software tools that help understand how ML models make predictions, including feature importance, attention visualization, counterfactual explanations, and decision path analysis.
Model Monitoring
The practice of continuously tracking an ML model's performance, predictions, and input data in production to detect degradation, drift, or anomalies after deployment.
Model Registry
A centralized repository for storing, versioning, and managing trained ML models along with their metadata (metrics, parameters, lineage). It serves as the system of record for models.
Model Serving
The infrastructure and process of deploying trained ML models to production where they can receive requests and return predictions in real time. It includes scaling, load balancing, and version management.
Model Size
The number of parameters in a model, typically expressed in millions (M) or billions (B). Model size correlates loosely with capability but also determines compute and memory requirements.
Model Weights
The collection of all learned parameter values in a neural network. Model weights are what you download when you get a pre-trained model — they encode everything the model learned.
Multi-Agent System
An architecture where multiple AI agents collaborate, each with specialized roles or capabilities, to accomplish complex tasks that no single agent could handle alone.
Multi-Head Attention
An extension of attention where multiple attention mechanisms (heads) run in parallel, each learning to focus on different types of relationships in the data. The outputs are then combined.
Multilingual AI
AI models capable of understanding and generating text in multiple languages. Modern LLMs often support 50-100+ languages, though performance varies significantly across languages.
Multimodal AI
AI systems that can process and generate multiple types of data — text, images, audio, video — within a single model. Multimodal models understand the relationships between different data types.
Multimodal Embedding
Embeddings that map different data types (text, images, audio) into the same vector space, enabling cross-modal search and comparison.
Multimodal RAG
Retrieval-augmented generation that works across multiple data types — retrieving and reasoning over text, images, tables, and charts to answer questions that require multimodal understanding.
Multimodal Search
Search systems that can query across different data types — finding images with text, videos with audio descriptions, or documents that contain specific visual elements.
N
Named Entity Recognition
The NLP task of identifying and classifying named entities in text into predefined categories such as person names, organizations, locations, dates, monetary values, and more.
Narrow AI
AI systems designed and trained for a specific task or narrow set of tasks. All current AI systems are narrow AI — they excel in their domain but cannot generalize outside it.
Natural Language Generation
The AI capability of producing human-readable text from structured data, internal representations, or prompts. NLG is the output side of language AI — turning machine understanding into human words.
Natural Language Inference
The NLP task of determining the logical relationship between two sentences — whether one entails, contradicts, or is neutral with respect to the other.
Natural Language Processing
The branch of AI that deals with the interaction between computers and human language. NLP enables machines to read, understand, generate, and make sense of human language in a useful way.
Natural Language Understanding
The ability of an AI system to comprehend the meaning, intent, and context of human language, going beyond surface-level word matching to grasp semantics, pragmatics, and implied meaning.
Neuro-Symbolic AI
Approaches that combine neural networks (pattern recognition, learning from data) with symbolic AI (logical reasoning, knowledge representation) to get the strengths of both.
O
Object Detection
A computer vision task that identifies and locates specific objects within an image or video, providing both the object class and its position (usually as a bounding box).
Observability
The ability to understand the internal state and behavior of an AI system through its external outputs, including logging, tracing, and monitoring of LLM calls and agent actions.
ONNX
Open Neural Network Exchange — an open format for representing machine learning models that enables interoperability between different ML frameworks and deployment targets.
Open Source AI
AI models and tools released with open licenses that allow anyone to use, modify, and distribute them. Open-source AI democratizes access and enables community-driven improvement.
Optical Character Recognition
Technology that converts images of text (typed, handwritten, or printed) into machine-readable digital text. Modern OCR uses deep learning for high accuracy even on difficult inputs.
Orchestration
The coordination and management of multiple AI components, tools, and services to accomplish complex workflows. Orchestration handles routing, sequencing, error handling, and resource allocation.
P
Parallel Function Calling
The ability of an LLM to invoke multiple tool calls simultaneously in a single response, rather than sequentially. This enables faster task completion for independent operations.
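The concurrency pattern behind parallel tool calls can be sketched with plain `asyncio` — here `get_weather` and `get_stock` are hypothetical stand-ins for real tool implementations, and the runner awaits all independent calls at once rather than one after another:

```python
import asyncio

# Hypothetical stand-ins for real tool implementations.
async def get_weather(city: str) -> str:
    await asyncio.sleep(0)  # stands in for a network call
    return f"weather({city})"

async def get_stock(ticker: str) -> str:
    await asyncio.sleep(0)
    return f"stock({ticker})"

async def run_tool_calls(calls):
    """Execute independent tool calls concurrently, as a model
    emitting parallel function calls would expect."""
    tasks = [tool(**args) for tool, args in calls]
    return await asyncio.gather(*tasks)

results = asyncio.run(run_tool_calls([
    (get_weather, {"city": "Paris"}),
    (get_stock, {"ticker": "ACME"}),
]))
```

Because the calls are independent, total latency is roughly that of the slowest call, not the sum of all of them.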
Planning
An AI agent's ability to break down complex goals into a sequence of steps and determine the best order of actions to accomplish a task. Planning involves reasoning about dependencies, priorities, and contingencies.
Playground
An interactive web interface where users can experiment with AI models by adjusting parameters, testing prompts, and seeing results in real time without writing code.
Positional Encoding
A technique used in transformers to inject information about the position of each token in a sequence. Since transformers process all tokens in parallel, they need explicit position information.
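The sinusoidal scheme from the original transformer paper can be computed directly — even dimensions use sine, odd dimensions cosine, with geometrically increasing wavelengths. A minimal sketch:

```python
import math

def sinusoidal_position(pos: int, d_model: int) -> list[float]:
    """Sinusoidal encoding from 'Attention Is All You Need':
    even dimensions use sin, odd dimensions use cos, with
    wavelengths growing geometrically up to 10000 * 2*pi."""
    enc = []
    for i in range(d_model):
        angle = pos / (10000 ** (2 * (i // 2) / d_model))
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc

# Position 0 encodes as alternating 0s (sin 0) and 1s (cos 0).
print(sinusoidal_position(0, 4))  # [0.0, 1.0, 0.0, 1.0]
```

Each position gets a unique, fixed vector that is simply added to the token embedding; many modern models instead learn positions or use rotary encodings.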
Prompt Attack Surface
The total set of potential vulnerabilities in an LLM application that can be exploited through prompt-based attacks, including injection, leaking, and jailbreaking vectors.
Prompt Caching
A technique that stores and reuses the processed form of frequently used prompt prefixes, avoiding redundant computation. It speeds up inference and reduces costs for repeated prompts.
Prompt Chaining
A technique where the output of one LLM call becomes the input for the next, creating a pipeline of prompts that together accomplish a complex task.
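A minimal two-step chain, with `call_llm` as a stub standing in for a real model API — the first call summarizes, and its output feeds the second call:

```python
def call_llm(prompt: str) -> str:
    # Stub: echoes a transformed prompt so the chain is runnable.
    return f"[model output for: {prompt}]"

def chain(document: str) -> str:
    """Two-step chain: summarize first, then translate the summary."""
    summary = call_llm(f"Summarize: {document}")
    translation = call_llm(f"Translate to French: {summary}")
    return translation

result = chain("Quarterly revenue grew 12%.")
```

Splitting a task this way lets each prompt stay focused and makes intermediate outputs inspectable and testable.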
Prompt Compression
Techniques for reducing the token count of prompts while preserving their essential meaning, enabling more efficient use of context windows and reducing API costs.
Prompt Engineering
The practice of designing and optimizing input prompts to get the best possible output from AI models. It involves crafting instructions, providing examples, and structuring queries to guide the model toward desired responses.
Prompt Injection
A security vulnerability where malicious input is crafted to override or manipulate an LLM's system prompt or instructions, causing it to behave in unintended ways.
Prompt Injection Defense
Techniques and strategies for protecting LLM applications from prompt injection attacks, including input sanitization, output filtering, and architectural defenses.
Prompt Leaking
When a user extracts an application's hidden system prompt through clever questioning. Prompt leaking reveals proprietary instructions, business logic, and safety configurations.
Prompt Library
A curated collection of tested, optimized prompts organized by use case. Prompt libraries accelerate development by providing proven starting points for common tasks.
Prompt Management
The practice of versioning, testing, and managing prompts used in LLM applications. It treats prompts as code that needs proper lifecycle management.
Prompt Optimization
Systematic techniques for improving prompt effectiveness, including automated prompt search, A/B testing of prompt variants, and iterative refinement based on output quality metrics.
Prompt Template
A pre-defined structure for formatting prompts to AI models, with placeholders for dynamic content. Templates ensure consistent, optimized prompt formatting across applications.
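A sketch using Python's standard `string.Template`; the template text and placeholder names here are hypothetical:

```python
from string import Template

# Hypothetical template; $customer_name and $issue are placeholders
# filled in with dynamic content at request time.
SUPPORT_TEMPLATE = Template(
    "You are a support agent. Customer $customer_name reports: $issue.\n"
    "Reply politely and propose one next step."
)

prompt = SUPPORT_TEMPLATE.substitute(
    customer_name="Ada", issue="login fails after password reset"
)
```

`substitute` raises a `KeyError` if a placeholder is missing, which catches template/data mismatches early.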
Prompt Versioning
Tracking different versions of prompts over time, including changes, performance metrics, and rollback capabilities. Essential for managing prompts in production AI applications.
Q
Quantization
The process of reducing the precision of a model's numerical weights (e.g., from 32-bit to 8-bit or 4-bit), making the model smaller and faster while accepting a small trade-off in accuracy.
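A toy sketch of symmetric 8-bit quantization with a single per-tensor scale (real quantizers add per-channel scales, calibration data, and outlier handling):

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric 8-bit quantization: map floats onto [-127, 127]
    using one per-tensor scale, rounding to the nearest integer."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [x * scale for x in q]

w = [0.4, -1.27, 0.02]
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)  # close to w, up to rounding error
```

Storing `q` as int8 plus one float scale uses roughly a quarter of the memory of 32-bit floats, at the cost of a bounded rounding error per weight.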
Question Answering
An NLP task where the model provides direct answers to questions, either from a given context passage (extractive QA) or from general knowledge (open-domain QA).
R
RAG Pipeline
The complete end-to-end system for retrieval-augmented generation, including document ingestion, chunking, embedding, indexing, retrieval, reranking, prompt construction, and generation.
Reasoning
An AI model's ability to think logically, make inferences, draw conclusions, and solve problems that require multi-step thought. Reasoning goes beyond pattern matching to genuine logical analysis.
Recommendation System
An AI system that predicts and suggests items a user might be interested in based on their behavior, preferences, and similarities to other users.
Relation Extraction
The NLP task of identifying and classifying semantic relationships between entities mentioned in text. It extracts structured facts from unstructured text.
Reranking
A second-stage ranking process that takes initial search results and reorders them using a more sophisticated model. Reranking improves precision by applying deeper analysis to a smaller candidate set.
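The two-stage idea can be sketched with toy scorers — `cheap_score` stands in for a first-stage retriever and `expensive_score` for a cross-encoder-style reranker applied only to the shortlist:

```python
def cheap_score(query: str, doc: str) -> int:
    # First stage: crude word-overlap score over the whole corpus.
    return len(set(query.split()) & set(doc.split()))

def expensive_score(query: str, doc: str) -> float:
    # Second stage stand-in: overlap weighted by document brevity.
    return cheap_score(query, doc) / (1 + len(doc.split()))

def rerank(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Shortlist the top-k candidates cheaply, then reorder
    only that small set with the expensive scorer."""
    shortlist = sorted(docs, key=lambda d: cheap_score(query, d),
                       reverse=True)[:k]
    return sorted(shortlist, key=lambda d: expensive_score(query, d),
                  reverse=True)

docs = ["fast vector search engine", "vector search", "cooking recipes"]
top = rerank("vector search", docs)
```

The expensive model sees only `k` candidates, so its cost stays constant even as the corpus grows.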
Retrieval
The process of finding and extracting relevant information from a large collection of documents or data in response to a query. In AI systems, retrieval is often the first step before generation.
Retrieval Evaluation
Methods for measuring how well a retrieval system finds relevant documents. Key metrics include recall at K, mean reciprocal rank, and normalized discounted cumulative gain.
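Two of these metrics are straightforward to compute for a single query; a sketch:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents found in the top k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0 if none appears.
    Averaging this over many queries gives mean reciprocal rank."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1 / rank
    return 0.0

hits = ["d3", "d1", "d9"]   # system output, best first
gold = {"d1", "d2"}          # human-labeled relevant docs
print(recall_at_k(hits, gold, 3))     # 0.5: one of two relevant docs found
print(reciprocal_rank(hits, gold))    # 0.5: first relevant doc at rank 2
```

Recall@K ignores ordering within the top K, while reciprocal rank rewards putting a relevant document near the top — reporting both gives a fuller picture.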
Retrieval Latency
The time it takes for a retrieval system to search through stored documents or embeddings and return relevant results. Measured in milliseconds, it is a critical component of RAG system performance.
Retrieval Quality
A measure of how relevant and accurate the documents retrieved by a search or RAG system are relative to the user's query. Poor retrieval quality is the leading cause of RAG failures.
Retrieval-Augmented Generation
A technique that enhances LLM outputs by first retrieving relevant information from external knowledge sources and then using that information as context for generation. RAG combines the power of search with the fluency of language models.
Retrieval-Augmented Reasoning
An advanced approach where an AI model interleaves retrieval with reasoning steps, fetching new information mid-reasoning rather than retrieving everything upfront.
Reward Hacking
When an AI system finds unintended ways to maximize its reward signal that do not align with the designer's actual goals. The system technically optimizes the metric but violates the spirit of the objective.
Robustness
The ability of an AI model to maintain reliable performance when faced with unexpected inputs, adversarial attacks, data distribution changes, or edge cases.
Role Prompting
A technique where the model is instructed to adopt a specific persona, expertise, or perspective in its responses. The assigned role shapes tone, depth, terminology, and reasoning approach.
S
Sampling Strategy
The method used to select the next token during text generation. Different strategies (greedy, top-k, top-p, temperature-based) produce different tradeoffs between quality and diversity.
Scaling Hypothesis
The theory that increasing model size, data, and compute will continue to improve AI capabilities predictably, and may eventually lead to artificial general intelligence.
Scaling Laws
Empirical findings showing predictable relationships between model performance and factors like model size (parameters), dataset size, and compute budget. Performance improves as a power law with these factors.
Self-Attention
A mechanism where each element in a sequence attends to all other elements to compute a representation, determining how much focus to place on each part of the input. It is the core innovation of the transformer.
Self-Consistency
A decoding strategy where the model generates multiple reasoning paths for the same question and selects the answer that appears most frequently across paths. It improves accuracy on reasoning tasks.
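A sketch of the majority-vote step, with a stub sampler standing in for temperature-sampled model calls:

```python
from collections import Counter

def self_consistent_answer(sample_fn, question: str, n: int = 5) -> str:
    """Sample n independent reasoning paths and return the final
    answer that appears most often across them."""
    answers = [sample_fn(question) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Stub sampler: imagine 3 of 5 sampled paths reach "42"
# while the other two diverge.
fake_samples = iter(["42", "41", "42", "42", "17"])
answer = self_consistent_answer(lambda q: next(fake_samples), "What is 6*7?")
```

The vote is over final answers only; two paths with different intermediate reasoning but the same conclusion count as agreeing.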
Semantic Caching
Caching LLM responses based on the semantic meaning of queries rather than exact string matching. Semantically similar questions return cached answers, reducing latency and cost.
Semantic Chunking
An intelligent chunking strategy for RAG that splits documents based on semantic meaning rather than fixed character counts, keeping coherent topics together.
Semantic Kernel
Microsoft's open-source SDK for integrating LLMs with programming languages. It provides a framework for orchestrating AI capabilities with conventional code.
Semantic Router
A system that routes user queries to appropriate handlers based on semantic meaning rather than keyword matching. It acts as the traffic director of an AI application, sending each query to the right handler.
Semantic Search
Search that understands the meaning and intent behind a query rather than just matching keywords. It uses embeddings to find results that are conceptually related even if they use different words.
Semantic Similarity
A measure of how similar in meaning two pieces of text are, regardless of the specific words used. Semantic similarity captures conceptual relatedness rather than lexical overlap.
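In practice this is usually computed as the cosine similarity between embedding vectors; a sketch with toy 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors:
    1.0 = same direction, 0.0 = orthogonal (unrelated)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0
```

Because cosine ignores vector length, it compares direction (meaning) rather than magnitude.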
Sentence Embedding
A vector representation of an entire sentence or paragraph that captures its overall meaning. Sentence embeddings enable comparing the meanings of text passages.
Sentiment Analysis
The NLP task of identifying and classifying the emotional tone or opinion expressed in text as positive, negative, or neutral. Advanced systems detect nuanced emotions like frustration, excitement, or sarcasm.
Sequence-to-Sequence
A model architecture that transforms one sequence into another, where the input and output can be different lengths. It uses an encoder to process input and a decoder to generate output.
Singularity
A hypothetical future point at which AI self-improvement becomes so rapid that it triggers an intelligence explosion, leading to changes so profound they are impossible to predict.
Sparse Attention
A variant of attention where each token only attends to a subset of other tokens rather than all of them, reducing computational cost from O(n²) to O(n√n) or O(n log n).
Sparse Model
A neural network where most parameters are zero or inactive for any given input. Sparse models achieve high capacity with lower computational cost by only using relevant parameters.
Sparse Retrieval
Information retrieval using traditional keyword matching and term frequency methods (like BM25). Called 'sparse' because document representations have mostly zero values.
Speculative Decoding
A technique that uses a small, fast model to draft multiple tokens ahead, then uses the large model to verify them in parallel. It speeds up inference without changing output quality.
Speech-to-Text
AI technology that converts spoken audio into written text (also called automatic speech recognition or ASR). Modern systems handle accents, background noise, and multiple speakers.
Stable Diffusion
An open-source text-to-image diffusion model that generates detailed images from text descriptions. It works in a compressed latent space, making it more efficient than pixel-level diffusion.
Streaming
Delivering LLM output token-by-token as it is generated rather than waiting for the complete response. Streaming dramatically improves perceived latency and user experience.
Structured Output
The ability of an LLM to generate responses in a specific format like JSON, XML, or a defined schema. Structured output makes AI responses parseable by other software systems.
Summarization
The NLP task of condensing a longer text into a shorter version while preserving the key information and main points. Summarization can be extractive (selecting key sentences) or abstractive (generating new text).
Swarm Intelligence
Collective behavior emerging from the interaction of multiple simple agents that together produce sophisticated solutions. Inspired by natural swarms like ant colonies, bee hives, and bird flocks.
Symbolic AI
An approach to AI that represents knowledge using symbols and rules, and reasons by manipulating those symbols logically. Symbolic AI dominated before the deep learning era.
Synthetic Benchmark
A benchmark composed of artificially generated or carefully curated evaluation tasks designed to test specific AI capabilities, rather than using naturally occurring data.
Synthetic Evaluation
Using AI models to evaluate other AI models, generating test cases and scoring outputs automatically. This scales evaluation far beyond what human reviewers alone can achieve.
System Prompt
Hidden instructions provided to an LLM that define its behavior, personality, constraints, and capabilities for a conversation. System prompts set the rules of engagement before the user interacts.
T
Temperature
A parameter that controls the randomness or creativity of an LLM's output. Lower temperatures (closer to 0) make outputs more deterministic and focused; higher temperatures increase randomness and creativity.
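Mechanically, temperature divides the logits before the softmax — low values sharpen the distribution toward the top token, high values flatten it. A sketch:

```python
import math

def softmax_with_temperature(logits: list[float],
                             temperature: float) -> list[float]:
    """Scale logits by 1/temperature, then apply softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
cold = softmax_with_temperature(logits, 0.5)  # sharp, near-deterministic
hot = softmax_with_temperature(logits, 2.0)   # flatter, more diverse
```

As temperature approaches 0 this converges to greedy decoding; very high temperatures approach a uniform distribution over the vocabulary.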
Test-Time Compute
Allocating additional computation during inference (not training) to improve output quality. Techniques include chain-of-thought, self-consistency, and iterative refinement.
Text Classification
The NLP task of assigning predefined categories or labels to text documents. It is one of the most common and commercially important NLP applications.
Text Mining
The process of deriving meaningful patterns, trends, and insights from large collections of text data using NLP and statistical techniques.
Text-to-Image
AI models that generate visual images from natural language text descriptions (prompts). This technology converts written descriptions into original images, illustrations, or photorealistic visuals.
Text-to-Speech
AI technology that converts written text into natural-sounding human speech. Modern TTS systems can generate voices with realistic intonation, emotion, and even clone specific voices.
Throughput
The number of requests or predictions a model can process in a given time period. High throughput means the system can serve many users simultaneously.
Token
The basic unit of text that language models process. A token can be a word, part of a word, or a punctuation mark. Text is broken into tokens before being fed into an LLM, and the model generates output one token at a time.
Token Limit
The maximum number of tokens a model can process in a single request, including both the input prompt and the generated output. Exceeding the limit results in truncated input or errors.
Tokenization
The process of breaking text into smaller units (tokens) for processing by NLP models. Tokenization can split text into words, subwords, or characters depending on the method used.
Tokenization Strategy
The approach and rules for how text is split into tokens. Different strategies (word-level, subword, character-level) make different tradeoffs between vocabulary size and sequence length.
Tokenizer
A component that converts raw text into tokens (numerical representations) that a language model can process. Different tokenizers split text differently, affecting model performance and efficiency.
Tokenizer Efficiency
How effectively a tokenizer represents text — measured by the average number of tokens needed to represent a given amount of text. More efficient tokenizers produce fewer tokens for the same content.
Tokenizer Training
The process of building a tokenizer's vocabulary from a corpus of text. The tokenizer learns which subword units to use based on frequency patterns in the training corpus.
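One training step of byte-pair encoding (BPE), a common subword method, can be sketched as counting the most frequent adjacent symbol pair in the corpus and merging it into a new token; repeating this grows the vocabulary:

```python
from collections import Counter

def most_frequent_pair(words: list[list[str]]) -> tuple[str, str]:
    """Count adjacent symbol pairs across the corpus and return
    the most frequent one (the next merge candidate)."""
    pairs = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge(words: list[list[str]],
          pair: tuple[str, str]) -> list[list[str]]:
    """Replace every occurrence of the pair with one merged symbol."""
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if symbols[i:i + 2] == list(pair):
                out.append(pair[0] + pair[1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

corpus = [list("low"), list("lower"), list("lot")]
pair = most_frequent_pair(corpus)  # ('l', 'o') occurs 3 times
corpus = merge(corpus, pair)       # e.g. ['lo', 'w'], ['lo', 'w', 'e', 'r'], ...
```

Each merge adds one entry to the vocabulary, so frequent character sequences become single tokens while rare words stay split into smaller pieces.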
Tokenizer Vocabulary
The complete set of tokens (words, subwords, characters) that a tokenizer can recognize and map to numerical IDs. Vocabulary size affects model efficiency and multilingual capability.
Tool Use
The ability of an AI model to interact with external tools, APIs, and systems to accomplish tasks beyond text generation. Tools extend the model's capabilities to include search, calculation, code execution, and more.
Top-k Sampling
A text generation method where the model only considers the k most likely next tokens at each step, ignoring all others. This limits the pool of candidates to the most probable options.
Top-p Sampling
A text generation method (also called nucleus sampling) where the model considers only the smallest set of tokens whose cumulative probability exceeds the threshold p. This balances diversity and quality.
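The nucleus-filtering step can be sketched over a toy next-token distribution — keep the highest-probability tokens until their cumulative mass reaches p, then renormalize and sample from that set:

```python
def nucleus_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest set of tokens whose cumulative probability
    reaches p, then renormalize so the kept probabilities sum to 1."""
    kept, total = {}, 0.0
    for token, prob in sorted(probs.items(),
                              key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        total += prob
        if total >= p:
            break
    return {t: pr / total for t, pr in kept.items()}

probs = {"the": 0.5, "a": 0.3, "zebra": 0.15, "qua": 0.05}
filtered = nucleus_filter(probs, 0.8)  # keeps "the" and "a" only
```

Unlike top-k, the number of kept tokens adapts to the shape of the distribution: a confident model keeps few candidates, an uncertain one keeps many.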
TPU
Tensor Processing Unit — Google's custom-designed chip specifically optimized for machine learning workloads. TPUs are designed for matrix operations that are fundamental to neural network computation.
Transformer
A neural network architecture introduced in 2017 that uses self-attention mechanisms to process sequential data in parallel rather than sequentially. Transformers are the foundation of modern LLMs like GPT, Claude, and Gemini.
Transformer Architecture
The full stack of components that make up a transformer model: multi-head self-attention, feed-forward networks, layer normalization, residual connections, and positional encodings.
Tree of Thought
A prompting framework where the model explores multiple reasoning branches, evaluates intermediate states, and can backtrack from dead ends — like a deliberate tree search through thought space.
V
Variational Autoencoder
A generative model that learns a compressed, lower-dimensional representation (latent space) of input data and can generate new data by sampling from this learned space.
Vector Search
The process of finding the most similar vectors in a vector database to a given query vector. It enables retrieving semantically similar content at scale.
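A brute-force sketch by cosine similarity (production vector databases use approximate nearest-neighbor indexes such as HNSW or IVF to scale; the document IDs and vectors here are toy data):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def vector_search(query: list[float],
                  index: dict[str, list[float]], k: int = 1) -> list[str]:
    """Score every stored vector against the query and return
    the IDs of the k most similar ones."""
    scored = sorted(index.items(),
                    key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

index = {
    "doc_cats": [0.9, 0.1],
    "doc_cars": [0.1, 0.9],
}
print(vector_search([0.8, 0.2], index))  # ['doc_cats']
```

Exact search like this is O(n) per query; approximate indexes trade a small amount of recall for orders-of-magnitude speedups at scale.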
Vision-Language Model
An AI model that can process both visual and textual inputs, understanding images and generating text about them. VLMs combine computer vision with language understanding.
Voice Cloning
AI technology that creates a synthetic replica of a specific person's voice from a small sample of their speech. Cloned voices can speak any text in the original person's vocal characteristics.
W
Weights and Biases
A popular MLOps platform for experiment tracking, model monitoring, dataset versioning, and collaboration in machine learning development.
Whisper
OpenAI's open-source automatic speech recognition model that can transcribe and translate speech in multiple languages with high accuracy.
Z
Zero-Shot Classification
Classifying text into categories that the model was never explicitly trained on, using only the category names or descriptions as guidance.
Zero-Shot Learning
A model's ability to perform a task it was never explicitly trained on or shown examples of. The model applies its general knowledge and reasoning to handle entirely new task types.
Zero-Shot Prompting
Giving an LLM a task instruction without any examples, relying entirely on the model's pre-trained knowledge and instruction-following ability to perform the task.