Data Science

Data Labeling

The process of assigning meaningful tags or annotations to raw data so it can be used for supervised learning. Labels tell the model what the correct answer should be for each training example.

Why It Matters

Labeled data is the fuel for supervised learning. A model can only be as reliable as the labels it learns from: noisy or inconsistent labels limit the accuracy it can reach, no matter how much data is collected.
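One common way to put a number on label consistency is inter-annotator agreement. The sketch below computes Cohen's kappa, which compares how often two annotators agree against how often they would agree by chance. The annotator lists and the "spam"/"ham" label names are made-up illustration data, not taken from any real project.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Fraction of items on which the two annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators for the same six emails.
annotator_1 = ["spam", "spam", "ham", "ham", "spam", "ham"]
annotator_2 = ["spam", "ham", "ham", "ham", "spam", "ham"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance. Low values usually point to ambiguous labeling guidelines rather than careless annotators.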

Example

Human annotators review thousands of images and draw bounding boxes around pedestrians for autonomous vehicle training, or mark emails as spam or not-spam.

Think of it like...

A teacher creating an answer key for a test: students (models) need the correct answers to learn from their mistakes.

Related Terms

Training Data

The dataset used to teach a machine learning model. It contains examples (and often labels) that the model learns patterns from during the training process. The quality and quantity of training data directly impact model performance.

Supervised Learning

A type of machine learning where the model is trained on labeled data — input-output pairs where the correct answer is provided. The model learns to map inputs to outputs and can then predict outputs for new, unseen inputs.

Annotation

The process of adding labels, tags, or metadata to raw data to make it suitable for supervised machine learning. Annotation can involve labeling images, transcribing audio, or tagging text.

Active Learning

A training strategy where the model identifies the most informative unlabeled examples and requests human labels only for those. This reduces labeling effort by spending the annotation budget on the examples the model is least sure about.
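As a minimal sketch of one such strategy, least-confidence sampling ranks unlabeled examples by the model's top predicted probability and sends the lowest-ranked ones to annotators. The function name, the `budget` parameter, and the probability values are hypothetical; a real system would take the probabilities from a trained model.

```python
def least_confident(probabilities, budget):
    """Return indices of the `budget` examples whose highest predicted
    class probability is lowest (least-confidence sampling)."""
    confidence = [max(p) for p in probabilities]
    ranked = sorted(range(len(probabilities)), key=lambda i: confidence[i])
    return ranked[:budget]


# Hypothetical model outputs [P(not-spam), P(spam)] for five unlabeled emails.
predicted = [
    [0.95, 0.05],
    [0.55, 0.45],
    [0.10, 0.90],
    [0.48, 0.52],
    [0.70, 0.30],
]

# With a budget of two labels, pick the two emails the model is least sure of.
to_label = least_confident(predicted, budget=2)
print(to_label)
```

Least confidence is only one possible selection rule. Margin-based and entropy-based rules follow the same pattern with a different scoring function.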

Crowdsourcing

Using a large group of distributed workers (often through platforms like Amazon Mechanical Turk or Scale AI) to perform data annotation and labeling tasks.
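Because individual crowd workers can disagree, the same item is often sent to several workers and their answers are combined. The sketch below shows simple majority voting, one common aggregation approach. The item IDs, votes, and the choice to return no label on a tie are all assumptions made for illustration.

```python
from collections import Counter


def majority_vote(votes):
    """Return (label, agreement) for one item.

    The label is None when the top two labels are tied, so the item
    can be sent back for more votes or expert review.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes).most_common()
    top_label, top_count = counts[0]
    agreement = top_count / len(votes)
    if len(counts) > 1 and counts[1][1] == top_count:
        return None, agreement
    return top_label, agreement


# Hypothetical votes collected from crowd workers for three emails.
crowd_votes = {
    "email_1": ["spam", "spam", "not-spam"],
    "email_2": ["not-spam", "not-spam", "not-spam"],
    "email_3": ["spam", "not-spam"],
}
resolved = {item: majority_vote(votes) for item, votes in crowd_votes.items()}
for item, (label, agreement) in resolved.items():
    print(item, label, round(agreement, 2))
```

The agreement score can also serve as a filter: items with low agreement are candidates for relabeling or for a clearer guideline.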
