LLM-as-Judge
Using a large language model to evaluate the quality of another model's outputs, replacing or supplementing human evaluators. The judge LLM scores responses on various quality dimensions.
Why It Matters
LLM-as-judge enables scalable, consistent evaluation at a fraction of the cost of human evaluation — 100x cheaper with reasonable correlation to human judgment.
Example
Using Claude to rate 10,000 chatbot responses on a 1-5 scale for helpfulness, accuracy, and safety — achieving 85% agreement with human raters at 1% of the cost.
Think of it like...
Like an experienced teacher using a well-trained teaching assistant to grade papers — not quite as nuanced as the teacher, but much faster and reasonably accurate.
Related Terms
Evaluation
The systematic process of measuring an AI model's performance, safety, and reliability using various metrics, benchmarks, and testing methodologies.
Human Evaluation
Using human judges to assess AI model quality on subjective dimensions like helpfulness, coherence, creativity, and safety that automated metrics cannot fully capture.
Synthetic Evaluation
Using AI models to evaluate other AI models, generating test cases and scoring outputs automatically. This scales evaluation beyond what human evaluation alone can achieve.
Benchmark
A standardized test or dataset used to evaluate and compare the performance of AI models. Benchmarks provide consistent metrics that allow fair comparisons between different approaches.
RLHF
Reinforcement Learning from Human Feedback — a technique used to align language models with human preferences. Human raters rank model outputs, and this feedback trains a reward model that guides further training.