Data Science

Synthetic Data Generation

The process of using algorithms, rules, or generative models to create artificial datasets that statistically mirror real data. It is used when real data is scarce, sensitive, or biased.

Why It Matters

Synthetic data generation enables ML development in fields such as healthcare, finance, and defense, where real data is often too sensitive or scarce to use directly.

Example

Generating 100,000 synthetic patient records that match the statistical distribution of real hospital data, so that ML models can be developed without exposing actual patient information.
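A minimal sketch of the patient-record example, under loud assumptions: the "real" data here is itself simulated as a stand-in, the three columns (age, systolic blood pressure, cholesterol) are hypothetical, and a multivariate Gaussian fit replaces the more capable generative models that would usually be used in practice. The helper names fit_gaussian and sample_synthetic are illustrative, not part of any library.

```python
import numpy as np


def fit_gaussian(real: np.ndarray):
    """Estimate the mean vector and covariance matrix of the real data."""
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)  # columns are variables
    return mean, cov


def sample_synthetic(mean, cov, n_records: int, seed: int = 0) -> np.ndarray:
    """Draw synthetic rows from the fitted multivariate Gaussian."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov, size=n_records)


# Stand-in for real hospital data (hypothetical columns and parameters).
rng = np.random.default_rng(42)
age = rng.normal(55, 12, size=5000)
systolic = 100 + 0.5 * age + rng.normal(0, 10, size=5000)
cholesterol = 150 + 0.8 * age + rng.normal(0, 25, size=5000)
real = np.column_stack([age, systolic, cholesterol])

# Fit summary statistics on the real data, then sample 100,000 new rows.
mean, cov = fit_gaussian(real)
synthetic = sample_synthetic(mean, cov, n_records=100_000)

# The synthetic rows are new draws, but their column means and
# correlations should track those of the real data.
print(np.round(real.mean(axis=0), 1))
print(np.round(synthetic.mean(axis=0), 1))
```

A Gaussian fit only preserves means and pairwise linear relationships, and it can produce implausible values such as negative ages. It also offers no formal privacy guarantee on its own, so this should be read as an illustration of the idea rather than a production approach.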

Think of it like...

Like a flight simulator that generates realistic flying scenarios — the scenarios are not real, but they are realistic enough to train real skills.

Related Terms