Artificial Intelligence

LLM-as-Judge

Using a large language model to evaluate the quality of another model's outputs, replacing or supplementing human evaluators. The judge LLM scores responses on various quality dimensions.

Why It Matters

LLM-as-judge enables scalable, consistent evaluation at a fraction of the cost of human evaluation — 100x cheaper with reasonable correlation to human judgment.

Example

Using Claude to rate 10,000 chatbot responses on a 1-5 scale for helpfulness, accuracy, and safety — achieving 85% agreement with human raters at 1% of the cost.

Think of it like...

Like an experienced teacher using a well-trained teaching assistant to grade papers — not quite as nuanced as the teacher, but much faster and reasonably accurate.

LLM-as-Judge

Why It Matters

Example

Think of it like...

Related Terms

Evaluation

Human Evaluation

Synthetic Evaluation

Benchmark

RLHF