Machine Learning

RLHF

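Reinforcement Learning from Human Feedback: a technique used to align language models with human preferences. Human raters rank model outputs, and this feedback trains a reward model. The reward model then supplies the reward signal for fine-tuning the language model with a reinforcement learning algorithm, commonly PPO.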

Why It Matters

RLHF was a key step in training ChatGPT and its predecessor InstructGPT to be conversational and helpful. It helps bridge the gap between a model that only predicts the next piece of text and one that is useful and safer to deploy.

Example

Human raters compare two model responses to the same question and pick which is better. Over many such comparisons, a reward model learns to predict which responses humans prefer, and the language model is then trained to produce responses that score well.

Think of it like...

Like a comedian adjusting their act based on audience reactions — they keep what gets laughs and drop what falls flat, gradually getting better at pleasing the crowd.

Related Terms

Reinforcement Learning

A type of machine learning where an agent learns to make decisions by taking actions in an environment and receiving rewards or penalties. The agent aims to maximize cumulative reward over time through trial and error.
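The trial-and-error loop described above can be illustrated with a multi-armed bandit, one of the simplest reinforcement learning settings. This is a minimal sketch, not a general RL implementation: the function name `run_bandit`, the epsilon-greedy strategy, and the reward probabilities are all assumptions chosen for illustration.

```python
import random


def run_bandit(reward_probs, steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a bandit with Bernoulli rewards.

    reward_probs[i] is the (hypothetical) chance that action i pays out 1.0.
    Returns the agent's value estimates, action counts, and total reward.
    """
    rng = random.Random(seed)
    n_actions = len(reward_probs)
    counts = [0] * n_actions
    values = [0.0] * n_actions  # running average reward per action
    total_reward = 0.0

    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random action.
            action = rng.randrange(n_actions)
        else:
            # Exploit: pick the action with the best estimate so far.
            action = max(range(n_actions), key=lambda a: values[a])

        # The environment responds with a reward (1.0) or nothing (0.0).
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0

        # Incremental update of the running average for this action.
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]
        total_reward += reward

    return values, counts, total_reward
```

Calling `run_bandit([0.2, 0.8])` should show the agent shifting most of its choices toward the second action as its estimates improve.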

DPO

Direct Preference Optimization: a simpler alternative to RLHF that optimizes a language model directly on human preference data, without training a separate reward model. Because it avoids the reinforcement learning loop, it is often more stable and easier to implement.
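For readers who want to see the objective, here is a minimal sketch of the DPO loss for a single preference pair. It assumes you already have the summed log-probabilities of the preferred ("chosen") and dispreferred ("rejected") responses under both the model being trained and a frozen reference model. The function and parameter names are illustrative, and `beta=0.1` is only an example setting.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one pair: -log(sigmoid(beta * (chosen_gain - rejected_gain)))."""
    # How much more likely each response is under the policy than the reference.
    chosen_gain = policy_logp_chosen - ref_logp_chosen
    rejected_gain = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_gain - rejected_gain)

    # Numerically stable form of -log(sigmoid(logits)).
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

The loss falls as the trained model raises the probability of the chosen response relative to the rejected one, measured against the reference model. In practice this would be computed over batches with an automatic differentiation library rather than on plain floats.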

Reward Model

A model trained to predict how good a response is based on human preferences. In RLHF, the reward model scores outputs to guide the language model toward responses humans prefer.
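Reward models are commonly trained on pairwise comparisons with a Bradley-Terry style loss. The sketch below shows that loss for one comparison, assuming the reward model has already produced a scalar score for each response. The function name is illustrative, and a real setup would compute this on tensors with gradients.

```python
import math


def pairwise_preference_loss(score_chosen, score_rejected):
    """Loss for one comparison: -log(sigmoid(score_chosen - score_rejected))."""
    margin = score_chosen - score_rejected

    # Numerically stable form of -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss is small when the response the rater preferred gets the higher score, and large when the ranking is reversed, so minimizing it pushes the model's scores to agree with human choices.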

Alignment

The challenge of ensuring AI systems behave in ways that match human values, intentions, and expectations. Alignment aims to make AI helpful, honest, and harmless.

Instruction Tuning

A fine-tuning approach where a model is trained on a dataset of instruction-response pairs, teaching it to follow human instructions accurately. This transforms a text-completion model into a helpful assistant.
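As a rough illustration, the sketch below turns one instruction-response pair into a training example. Everything in it is an assumption for demonstration: the prompt template, the whitespace "tokenizer", and the choice to mask prompt tokens so that only the response contributes to the loss (a common convention, not a universal one). The value -100 matches the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index for PyTorch cross-entropy


def make_toy_tokenizer():
    """Hypothetical tokenizer: maps whitespace-separated words to integer ids."""
    vocab = {}

    def tokenize(text):
        return [vocab.setdefault(word, len(vocab)) for word in text.split()]

    return tokenize


def build_example(instruction, response, tokenize):
    """Format one pair and mask the prompt so loss covers only the response."""
    # Illustrative template; real datasets use many different formats.
    prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    prompt_ids = tokenize(prompt)
    response_ids = tokenize(response)

    input_ids = prompt_ids + response_ids
    # Prompt positions are ignored by the loss; response positions are targets.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return {"input_ids": input_ids, "labels": labels}
```

With this layout the model still reads the instruction as context, but it is only graded on how well it predicts the response.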

Constitutional AI

An alignment approach developed by Anthropic where AI models are guided by a set of principles (a 'constitution') that help them self-evaluate and improve their responses without relying solely on human feedback.
