AI Glossary

The definitive dictionary for AI, Machine Learning, and Governance terminology. From Flash Attention to RAG — look up any term.

D

DALL-E

A text-to-image AI model created by OpenAI that generates original images from text descriptions. DALL-E can create realistic images, art, and conceptual visualizations from natural language prompts.

Artificial Intelligence

Data Annotation Pipeline

An end-to-end workflow for producing labeled training data, from task design through annotator training, quality assurance, and delivery of labeled datasets.

Data Science

Data Augmentation

Techniques for artificially expanding a training dataset by creating modified versions of existing data. This helps models generalize better, especially when training data is limited.

Data Science

Data Drift

A change in the statistical properties of the input data over time compared to the data the model was trained on. When data drifts, model predictions become less reliable.

Data Science

Data Engineering

The practice of designing, building, and maintaining the systems and infrastructure that collect, store, and prepare data for analysis and machine learning.

Data Science

Data Governance

The overall management of data availability, usability, integrity, and security in an organization. It includes policies, standards, and practices for how data is collected, stored, and used.

AI Governance

Data Labeling

The process of assigning meaningful tags or annotations to raw data so it can be used for supervised learning. Labels tell the model what the correct answer should be for each training example.

Data Science

Data Lake

A centralized repository that stores vast amounts of raw data in its native format until needed. Data lakes accept structured, semi-structured, and unstructured data at any scale.

Data Science

Data Lineage

The tracking of data's origins, transformations, and movements throughout its lifecycle. Data lineage answers the question 'Where did this data come from and what happened to it?'

Data Science

Data Mesh

A decentralized approach to data architecture where domain teams own and manage their own data as products, rather than centralizing all data in a single warehouse or lake.

Data Science

Data Parallelism

A distributed training approach where the training data is split across multiple GPUs, each holding a complete copy of the model. Gradients are averaged across GPUs after each batch.

Machine Learning

Data Pipeline

An automated workflow that extracts data from sources, transforms it through processing steps, and loads it into a destination for use. In ML, data pipelines ensure consistent data flow from raw sources to model training.

Data Science

Data Poisoning

A security attack where malicious data is injected into a training dataset to corrupt the model's behavior. Poisoned models may behave normally except on specific trigger inputs.

AI Governance

Data Preprocessing

The process of cleaning, transforming, and organizing raw data into a format suitable for machine learning. This includes handling missing values, encoding categories, scaling features, and removing outliers.

Data Science

Data Privacy

The right of individuals to control how their personal information is collected, used, stored, and shared. In AI, data privacy concerns arise from training data, user interactions, and model outputs.

AI Governance

Data Quality

The degree to which data is accurate, complete, consistent, timely, and fit for its intended use. Data quality directly impacts the reliability and performance of AI models.

Data Science

Data Warehouse

A structured, organized repository of cleaned and processed data optimized for analysis and reporting. Unlike data lakes, data warehouses store data in defined schemas.

Data Science

Decision Tree

A supervised learning algorithm that makes predictions by learning a series of if-then-else decision rules from the data. It creates a tree-like structure where each internal node tests a feature and each leaf provides a prediction.

Machine Learning

Deep Fake

AI-generated media (especially video and audio) that convincingly depicts real people saying or doing things they never actually said or did. Created using deep learning techniques.

AI Governance

Deep Learning

A specialized subset of machine learning that uses artificial neural networks with multiple layers (hence 'deep') to learn complex patterns in data. Deep learning excels at tasks like image recognition, speech processing, and natural language understanding.

Machine Learning

Denoising

The process of removing noise from data to recover the underlying clean signal. In generative AI, denoising is the core mechanism of diffusion models.

Artificial Intelligence

Dense Retrieval

Information retrieval using learned vector embeddings to find semantically similar documents. Called 'dense' because document representations are dense numerical vectors with no zero values.

Artificial Intelligence

Deployment

The process of making a trained ML model available for use in production applications. Deployment involves packaging the model, setting up serving infrastructure, and establishing monitoring.

Artificial Intelligence

Deterministic Output

When an AI model produces the same output every time for the same input. Achieved by setting temperature to 0 and using fixed random seeds.

Artificial Intelligence

Differential Privacy

A mathematical framework that provides provable privacy guarantees when analyzing or learning from data. It ensures that the output of any analysis is approximately the same whether or not any individual's data is included.

AI Governance

Diffusion Model

A type of generative AI model that creates data by starting with random noise and gradually removing it, step by step, until a coherent output (like an image) emerges. This process is called denoising.

Artificial Intelligence

Digital Twin

A virtual replica of a physical system, process, or object that uses real-time data and AI to simulate, predict, and optimize the behavior of its physical counterpart.

Artificial Intelligence

Dimensionality Reduction

Techniques that reduce the number of features (dimensions) in a dataset while preserving the most important information. This makes data easier to visualize, speeds up training, and can improve model performance.

Machine Learning

Distributed Training

Splitting model training across multiple GPUs or machines to handle larger models or datasets and reduce training time. Techniques include data parallelism and model parallelism.

Machine Learning

Document Processing

AI-powered extraction and understanding of information from documents including PDFs, images, forms, and scanned papers. It combines OCR, NLP, and computer vision.

Artificial Intelligence

DPO

Direct Preference Optimization — a simpler alternative to RLHF that directly optimizes a language model from human preference data without needing a separate reward model. It is more stable and easier to implement.

Machine Learning

Dropout

A regularization technique where random neurons are temporarily disabled (dropped out) during each training step. This forces the network to not rely too heavily on any single neuron and builds redundancy.

Machine Learning

Dual Use

Technology or research that can be applied for both beneficial and harmful purposes. Most AI capabilities are inherently dual-use, creating governance challenges.

AI Governance