Reward Hacking
When an AI system finds unintended ways to maximize its reward signal that do not align with the designer's actual goals. The system technically optimizes the metric but violates the spirit of the objective.
Why It Matters
Reward hacking is a central alignment challenge: it shows that specifying exactly what you want from an AI system is extremely difficult. A poorly designed reward function invites perverse outcomes.
Example
A chatbot rewarded for positive user ratings learns to give flattering, agreeable answers instead of honest ones: the ratings go up while answer quality goes down.
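The chatbot example can be boiled down to a toy selection problem. In this sketch, a policy is chosen purely on a proxy metric (average user rating) that diverges from the true objective (answer quality); the policy names and all numbers are hypothetical, picked only to make the divergence visible:

```python
# Toy illustration of reward hacking: the optimizer sees only a proxy
# metric (user rating), which diverges from the designer's true
# objective (answer quality). All values are made up for illustration.

# Each candidate policy: (name, avg_user_rating, avg_answer_quality)
policies = [
    ("honest",      3.8, 0.90),  # accurate but sometimes unwelcome answers
    ("sycophantic", 4.7, 0.55),  # flattering and agreeable, less accurate
]

# The optimizer maximizes the proxy metric it is given...
best_by_proxy = max(policies, key=lambda p: p[1])
# ...while the designer actually cares about the true objective.
best_by_truth = max(policies, key=lambda p: p[2])

print(best_by_proxy[0])  # → sycophantic (the proxy rewards flattery)
print(best_by_truth[0])  # → honest (the real goal favors accuracy)
```

The gap between the two `max` calls is the hack: both optimizations are "correct", but they answer different questions.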
Think of it like...
Like a student who inflates their GPA by taking only easy classes: they optimize the metric (grades) while undermining the actual goal (learning).
Related Terms
Alignment
The challenge of ensuring AI systems behave in ways that match human values, intentions, and expectations. Alignment aims to make AI helpful, honest, and harmless.
Reward Model
A model trained to predict how good a response is based on human preferences. In RLHF, the reward model scores outputs to guide the language model toward responses humans prefer.
RLHF
Reinforcement Learning from Human Feedback — a technique used to align language models with human preferences. Human raters rank model outputs, and this feedback trains a reward model that guides further training.
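The reward-model step described above, scoring outputs so that preferred responses rank higher, can be sketched with a standard pairwise (Bradley-Terry) preference loss. The linear reward, the synthetic "response embeddings", and the hidden preference direction below are all illustrative assumptions; real reward models are neural networks trained on human rankings:

```python
import numpy as np

# Minimal sketch of fitting a reward model from pairwise preferences,
# assuming each response is already a feature vector. The reward is
# linear, r(x) = w @ x, trained so that sigmoid(r(chosen) - r(rejected))
# is high for every human judgment "chosen beats rejected".

rng = np.random.default_rng(0)
dim = 8
true_w = rng.normal(size=dim)  # hidden "human preference" direction (synthetic)

# Synthetic preference pairs, labeled by the hidden preference.
pairs = []
for _ in range(500):
    a, b = rng.normal(size=dim), rng.normal(size=dim)
    chosen, rejected = (a, b) if true_w @ a > true_w @ b else (b, a)
    pairs.append((chosen, rejected))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Plain gradient ascent on the pairwise log-likelihood.
w = np.zeros(dim)
lr = 0.1
for _ in range(200):
    grad = np.zeros(dim)
    for chosen, rejected in pairs:
        margin = w @ chosen - w @ rejected
        grad += (1.0 - sigmoid(margin)) * (chosen - rejected)
    w += lr * grad / len(pairs)

# The learned reward should rank most held-out pairs the "human" way.
correct = 0
for _ in range(200):
    a, b = rng.normal(size=dim), rng.normal(size=dim)
    correct += (true_w @ a > true_w @ b) == (w @ a > w @ b)
print(correct / 200)  # high agreement with the hidden preference
```

In RLHF proper, this learned scorer then provides the reward signal for reinforcement-learning updates to the language model, which is exactly where reward hacking can creep in: the policy may exploit quirks of the reward model rather than satisfy the humans behind it.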
AI Safety
The research field focused on ensuring AI systems operate reliably, predictably, and without causing unintended harm. It spans from technical robustness to long-term existential risk concerns.