Machine Learning

RLHF

Reinforcement Learning from Human Feedback: a technique for aligning language models with human preferences. Human raters rank or compare model outputs, this feedback is used to train a reward model that scores responses, and the reward model then provides the signal for a reinforcement-learning fine-tuning stage of the language model.
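
Below is a minimal sketch of the reward-modeling step, not any particular library's implementation. It assumes a toy setup where each (prompt, response) pair has already been encoded as a fixed-size feature vector; the names RewardModel and preference_loss, the network sizes, and the random data are all illustrative. It uses the standard Bradley-Terry-style pairwise loss on rater comparisons.

```python
# Illustrative sketch of RLHF reward-model training (toy setup, PyTorch).
# Assumes each (prompt, response) pair is already a fixed-size feature vector;
# a real system would use a transformer encoder over the text instead.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        # Maps a (prompt, response) feature vector to a single scalar reward.
        self.score = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry-style loss: push the chosen response's reward above the
    # rejected one's; equivalent to -log sigmoid(r_chosen - r_rejected).
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(100):
    chosen = torch.randn(32, 128)    # stand-in features of rater-preferred responses
    rejected = torch.randn(32, 128)  # stand-in features of rejected responses
    loss = preference_loss(model(chosen), model(rejected))
    opt.zero_grad()
    loss.backward()
    opt.step()
```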

Why It Matters

RLHF is a key part of what made ChatGPT conversational and helpful. It bridges the gap between a base model that merely predicts the next token and an assistant that is actually useful and safe.

Example

Human raters compare two model responses to the same prompt and pick the better one. Across a large number of such comparisons, a reward model learns to score responses the way raters would, and the language model is then fine-tuned to produce responses that score highly, i.e. responses humans prefer.
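
The second stage, fine-tuning against the learned reward, can be sketched as follows. This is a deliberately tiny stand-in, not a full RLHF implementation: the "policy" is just a distribution over four candidate responses, the reward scores are hand-set, and the KL penalty against a frozen reference model (a standard ingredient in RLHF) keeps the policy from drifting too far from its pre-training behavior.

```python
# Toy sketch of reward-guided fine-tuning with a KL penalty (illustrative only).
# The 4-way policy and hand-set rewards stand in for a language model and a
# learned reward model.
import torch

logits = torch.zeros(4, requires_grad=True)       # trainable policy over 4 candidate responses
ref_logits = torch.zeros(4)                       # frozen reference model (pre-RLHF behavior)
rewards = torch.tensor([0.1, 0.2, 1.0, 0.3])      # reward model's score for each response
beta = 0.1                                        # strength of the KL penalty
opt = torch.optim.Adam([logits], lr=0.05)

for step in range(200):
    probs = torch.softmax(logits, dim=-1)
    ref_probs = torch.softmax(ref_logits, dim=-1)
    expected_reward = (probs * rewards).sum()
    kl = (probs * (probs / ref_probs).log()).sum()  # KL(policy || reference)
    loss = -(expected_reward - beta * kl)           # maximize reward, stay near the reference
    opt.zero_grad()
    loss.backward()
    opt.step()

print(torch.softmax(logits, dim=-1))  # probability mass shifts toward the highest-reward response
```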

Think of it like...

Like a comedian adjusting their act based on audience reactions — they keep what gets laughs and drop what falls flat, gradually getting better at pleasing the crowd.

Related Terms