Data Quality
The degree to which data is accurate, complete, consistent, timely, and fit for its intended use. Data quality directly impacts the reliability and performance of AI models.
Why It Matters
Data quality is the foundation of trustworthy AI. Models trained on smaller, carefully curated datasets often outperform models trained on larger but lower-quality ones, since a model learns the errors and gaps in its data along with the genuine patterns.
Example
An audit reveals that a customer database has 15% duplicate records, 8% missing email addresses, and inconsistent date formats, all of which must be fixed before ML training.
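An audit like this one can be approximated in a few lines of code. The sketch below is a minimal illustration, not a production audit: the sample records, the field names (`email`, `signup`), the rule that a repeated email counts as a duplicate, and the choice of YYYY-MM-DD as the expected date format are all assumptions made for the example.

```python
from datetime import datetime

# Hypothetical sample records; field names are assumptions for illustration.
records = [
    {"id": 1, "email": "ana@example.com", "signup": "2023-01-15"},
    {"id": 2, "email": "", "signup": "15/01/2023"},
    {"id": 3, "email": "ana@example.com", "signup": "2023-01-15"},
    {"id": 4, "email": "bo@example.com", "signup": "Jan 15, 2023"},
]

def is_iso_date(value):
    """Return True if value parses as a year-month-day date."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False

def audit(rows):
    """Return the share of rows with each quality problem."""
    total = len(rows)
    if total == 0:
        return {"duplicate_rate": 0.0, "missing_email_rate": 0.0,
                "non_iso_date_rate": 0.0}
    seen_emails = set()
    duplicates = missing = bad_dates = 0
    for row in rows:
        email = (row.get("email") or "").strip().lower()
        if not email:
            missing += 1            # empty or absent email
        elif email in seen_emails:
            duplicates += 1         # assumed rule: repeated email = duplicate
        else:
            seen_emails.add(email)
        if not is_iso_date(row.get("signup") or ""):
            bad_dates += 1          # date not in the expected format
    return {
        "duplicate_rate": duplicates / total,
        "missing_email_rate": missing / total,
        "non_iso_date_rate": bad_dates / total,
    }

report = audit(records)
```

For the four sample records, `report` shows one duplicate, one missing email, and two dates outside the expected format. A real audit would need a duplicate rule suited to the dataset, for example matching on name and address as well as email.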
Think of it like...
Like the ingredients in cooking — a five-star recipe with rotten ingredients produces a terrible dish. The best model architecture cannot compensate for bad data.
Related Terms
Data Preprocessing
The process of cleaning, transforming, and organizing raw data into a format suitable for machine learning. This includes handling missing values, encoding categories, scaling features, and removing outliers.
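Three of the steps named above can be sketched with the Python standard library. This is a simplified illustration under stated assumptions: mean imputation, min-max scaling, and one-hot encoding are only one common choice for each step, the helper names and sample values are hypothetical, and outlier removal is left out for brevity.

```python
from statistics import mean

def impute_mean(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def min_max_scale(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(labels):
    """Encode each label as a 0/1 vector over the sorted set of categories."""
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]

# Hypothetical raw columns.
ages = [20, None, 40, 30]
plans = ["free", "pro", "free", "team"]

ages_filled = impute_mean(ages)            # missing value filled with the mean
ages_scaled = min_max_scale(ages_filled)   # numeric feature scaled to [0, 1]
plans_encoded = one_hot(plans)             # columns ordered: free, pro, team
```

In practice these steps are usually done with a library such as pandas or scikit-learn, and the imputation and scaling statistics are computed on the training split only so that nothing leaks from the evaluation data.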
Training Data
The dataset used to teach a machine learning model. It contains examples (and often labels) that the model learns patterns from during the training process. The quality and quantity of training data directly impact model performance.
Data Labeling
The process of assigning meaningful tags or annotations to raw data so it can be used for supervised learning. Labels tell the model what the correct answer should be for each training example.
Data Pipeline
An automated workflow that extracts data from sources, transforms it through processing steps, and loads it into a destination for use. In ML, data pipelines ensure consistent data flow from raw sources to model training.
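The extract, transform, and load stages can be sketched as three small functions chained together. This is a toy illustration only: the inline CSV text stands in for a real source, the in-memory list stands in for a real destination, and every name here (`RAW_CSV`, `run_pipeline`, `training_table`) is hypothetical.

```python
import csv
import io

# Hypothetical raw source; in practice this would be a file, database, or API.
RAW_CSV = "id,email\n1, Ana@Example.com \n2,\n3,bo@example.com\n"

def extract(source):
    """Read raw rows from CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(source)))

def transform(rows):
    """Normalize emails and drop rows where the email is missing."""
    cleaned = []
    for row in rows:
        email = row["email"].strip().lower()
        if email:
            cleaned.append({"id": int(row["id"]), "email": email})
    return cleaned

def load(rows, destination):
    """Append cleaned rows to the destination and return how many were loaded."""
    destination.extend(rows)
    return len(rows)

def run_pipeline(source, destination):
    """Run extract, transform, and load in order."""
    return load(transform(extract(source)), destination)

training_table = []  # stands in for a real destination such as a feature store
loaded = run_pipeline(RAW_CSV, training_table)
```

Keeping each stage as a separate function means the same transformation runs every time the pipeline is executed, which is what gives the consistency described above.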