Synthetic Data
Artificially generated data that mimics the statistical properties and patterns of real data. It is created using algorithms, simulations, or generative models rather than collected from real-world events.
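As a minimal sketch of the "algorithms" route, the snippet below fits a normal distribution to a handful of real measurements and samples new values from it. The data, the function names (fit_gaussian, sample_synthetic), and the choice of a single Gaussian are all illustrative assumptions; practical generators model far richer structure, such as correlations between columns.

```python
import random
import statistics

def fit_gaussian(values):
    """Estimate the mean and sample standard deviation of real observations."""
    return statistics.mean(values), statistics.stdev(values)

def sample_synthetic(mean, stdev, n, seed=0):
    """Draw n synthetic values from a normal distribution with the fitted parameters."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [rng.gauss(mean, stdev) for _ in range(n)]

# Hypothetical "real" measurements (heights in cm), invented for illustration.
real_heights_cm = [162.0, 171.5, 168.2, 175.9, 159.4, 180.1, 166.7, 173.3]

mu, sigma = fit_gaussian(real_heights_cm)
synthetic_heights_cm = sample_synthetic(mu, sigma, n=1000)
```

None of the sampled values is copied from a real record, yet the synthetic set has roughly the same mean and spread as the original.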
Why It Matters
Synthetic data can ease privacy, scarcity, and bias problems, provided it is faithful enough to the real data it stands in for. It enables ML development when real data is too sensitive, expensive, or rare to use.
Example
Generating realistic but fake medical records to train healthcare AI models without exposing actual patient data, or creating simulated driving scenarios for autonomous vehicles.
Think of it like...
Using a flight simulator to train pilots: the simulated scenarios are realistic enough to build real skills without the risks or costs of actual flights.
Related Terms
Data Augmentation
Techniques for artificially expanding a training dataset by creating modified versions of existing data. This helps models generalize better, especially when training data is limited.
Generative Adversarial Network
A framework where two neural networks compete — a generator creates fake data and a discriminator tries to tell real from fake. This adversarial process drives both networks to improve, producing increasingly realistic outputs.
Training Data
The dataset used to teach a machine learning model. It contains examples (and often labels) that the model learns patterns from during the training process. The quality and quantity of training data directly impact model performance.
Differential Privacy
A mathematical framework that provides provable privacy guarantees when analyzing or learning from data. It ensures that the probability of any given output is approximately the same whether or not any single individual's data is included, with the allowed difference bounded by a privacy parameter (commonly written as epsilon).
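The differential privacy guarantee can be illustrated with the Laplace mechanism applied to a counting query. This is a sketch under stated assumptions: the records, the function names (laplace_noise, private_count), and epsilon = 1.0 are hypothetical choices, and a production system should use a vetted library rather than hand-rolled noise.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise.

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon, rng):
    """Return a noisy count of the records that match the predicate."""
    true_count = sum(1 for r in records if predicate(r))
    # Adding or removing one person's record changes a count by at most 1.
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)

# Hypothetical ages, invented for illustration.
ages = [34, 51, 29, 62, 45, 38, 70, 23]

noisy_count = private_count(ages, lambda age: age >= 40, epsilon=1.0, rng=rng)
```

A smaller epsilon means a larger noise scale: stronger privacy, but a less accurate answer.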