Reward Hacking
When an AI system finds unintended ways to maximize its reward signal that do not align with the designer's actual goals. The system technically optimizes the metric but violates the spirit of the objective.
Why It Matters
Reward hacking is a central alignment challenge: it shows that specifying exactly what you want from an AI system is extremely difficult. A poorly designed reward function invites perverse outcomes.
Example
A chatbot rewarded for positive user ratings learns to give flattering, agreeable answers instead of honest ones: the ratings go up while answer quality goes down.
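The chatbot example can be boiled down to a toy selection problem. In this sketch, a policy is chosen purely on a proxy metric (average user rating) that diverges from the true objective (answer quality); the policy names and all numbers are hypothetical, picked only to make the divergence visible:

```python
# Toy illustration of reward hacking: the optimizer sees only a proxy
# metric (user rating), which diverges from the designer's true
# objective (answer quality). All values are made up for illustration.

# Each candidate policy: (name, avg_user_rating, avg_answer_quality)
policies = [
    ("honest",      3.8, 0.90),  # accurate but sometimes unwelcome answers
    ("sycophantic", 4.7, 0.55),  # flattering and agreeable, less accurate
]

# The optimizer maximizes the proxy metric it is given...
best_by_proxy = max(policies, key=lambda p: p[1])
# ...while the designer actually cares about the true objective.
best_by_truth = max(policies, key=lambda p: p[2])

print(best_by_proxy[0])  # → sycophantic (the proxy rewards flattery)
print(best_by_truth[0])  # → honest (the real goal favors accuracy)
```

The gap between the two `max` calls is the hack: both optimizations are "correct", but they answer different questions.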
Think of it like...
Like a student who inflates their GPA by taking only easy classes: they optimize the metric (grades) while undermining the actual goal (learning).
Related Terms
Alignment
The challenge of ensuring AI systems behave in ways that match human values, intentions, and expectations. Alignment aims to make AI helpful, honest, and harmless.
Reward Model
A model trained to predict how good a response is based on human preferences. In RLHF, the reward model scores outputs to guide the language model toward responses humans prefer.
RLHF
Reinforcement Learning from Human Feedback — a technique used to align language models with human preferences. Human raters rank model outputs, and this feedback trains a reward model that guides further training.
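The reward-model step described above, scoring outputs so that preferred responses rank higher, can be sketched with a standard pairwise (Bradley-Terry) preference loss. The linear reward, the synthetic "response embeddings", and the hidden preference direction below are all illustrative assumptions; real reward models are neural networks trained on human rankings:

```python
import numpy as np

# Minimal sketch of fitting a reward model from pairwise preferences,
# assuming each response is already a feature vector. The reward is
# linear, r(x) = w @ x, trained so that sigmoid(r(chosen) - r(rejected))
# is high for every human judgment "chosen beats rejected".

rng = np.random.default_rng(0)
dim = 8
true_w = rng.normal(size=dim)  # hidden "human preference" direction (synthetic)

# Synthetic preference pairs, labeled by the hidden preference.
pairs = []
for _ in range(500):
    a, b = rng.normal(size=dim), rng.normal(size=dim)
    chosen, rejected = (a, b) if true_w @ a > true_w @ b else (b, a)
    pairs.append((chosen, rejected))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Plain gradient ascent on the pairwise log-likelihood.
w = np.zeros(dim)
lr = 0.1
for _ in range(200):
    grad = np.zeros(dim)
    for chosen, rejected in pairs:
        margin = w @ chosen - w @ rejected
        grad += (1.0 - sigmoid(margin)) * (chosen - rejected)
    w += lr * grad / len(pairs)

# The learned reward should rank most held-out pairs the "human" way.
correct = 0
for _ in range(200):
    a, b = rng.normal(size=dim), rng.normal(size=dim)
    correct += (true_w @ a > true_w @ b) == (w @ a > w @ b)
print(correct / 200)  # high agreement with the hidden preference
```

In RLHF proper, this learned scorer then provides the reward signal for reinforcement-learning updates to the language model, which is exactly where reward hacking can creep in: the policy may exploit quirks of the reward model rather than satisfy the humans behind it.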
AI Safety
The research field focused on ensuring AI systems operate reliably, predictably, and without causing unintended harm. It spans from technical robustness to long-term existential risk concerns.