This paper proposes using large language models (LLMs) as judges to evaluate other LLM-based chatbots. [https://arxiv.org/pdf/2306.05685]
The key claim: a strong judge model such as GPT-4 agrees with human preferences over 80% of the time, roughly the same level of agreement observed between humans, verified against crowdsourced votes.
Introduction
Modern chatbots engage in multi-turn conversations, which traditional single-turn benchmarks fail to capture. To validate the LLM-as-a-judge approach against real human preferences, the authors collect over 30,000 crowdsourced votes from human users.
Evaluation Methods: Pairwise vs. Single-Answer
1. Pairwise Comparison
- The judge sees two answers for a single question and decides: “Answer A,” “Answer B,” or “Tie.”
- This generates a win rate for each chatbot.
2. Single-Answer Grading
- The judge scores a single response on a numerical scale (e.g., 1–10).
- This is more scalable than pairwise comparison, but it relies on the judge applying a consistent grading standard across responses.
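To make the two protocols concrete, here is a minimal Python sketch. The prompt wording, the `query_judge` helper, and the verdict parsing are illustrative assumptions, not the paper's exact templates.

```python
# Sketch of the two judging protocols. `query_judge` is a hypothetical helper
# standing in for any chat-completion call to the judge model (e.g., GPT-4).

def query_judge(prompt: str) -> str:
    """Send `prompt` to the judge model and return its raw text reply."""
    raise NotImplementedError  # plug in your LLM API client here

def pairwise_judgment(question: str, answer_a: str, answer_b: str) -> str:
    """Pairwise comparison: ask the judge to pick 'a', 'b', or 'tie'."""
    prompt = (
        "You are an impartial judge. Compare the two answers to the question.\n"
        f"Question: {question}\n"
        f"Answer A: {answer_a}\n"
        f"Answer B: {answer_b}\n"
        "Reply with exactly one of: A, B, tie."
    )
    verdict = query_judge(prompt).strip().lower()
    return verdict if verdict in {"a", "b", "tie"} else "tie"

def single_answer_grade(question: str, answer: str) -> float:
    """Single-answer grading: ask the judge for a 1-10 score."""
    prompt = (
        "You are an impartial judge. Rate the answer on a scale of 1 to 10.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reply with only the number."
    )
    try:
        return float(query_judge(prompt).strip())
    except ValueError:
        return float("nan")  # unparseable reply; the score is discarded
```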
Challenges
1. Position Bias (Favoring the First Answer)
- Solution: Swap the order of the two responses and re-evaluate. If the verdict flips, position bias is present. Either average results across both orders or keep only order-consistent verdicts and treat inconsistent ones as ties (see the sketch after this list).
2. Verbosity Bias (Preferring Longer Responses)
- Solution: Apply a “repetitive list” test: pad an answer with rephrased, redundant items to check whether the judge wrongly favors the longer version. If it does, adjust the prompt (e.g., explicitly instruct that “conciseness is not a drawback”) to redirect the judge’s focus from length to substance.
3. Maintaining Consistency in Judgment: Ensure the LLM follows structured grading criteria by providing clear, step-by-step evaluation guidelines in the prompt.
4. Bias Reduction Through Aggregation: Collect multiple judgments from different prompts and combine them.
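As an illustration of the position-bias mitigation and of vote aggregation, the sketch below reuses the hypothetical `pairwise_judgment` helper from the earlier sketch: it judges both orderings, keeps only order-consistent verdicts (treating the rest as ties), and combines multiple judgments by majority vote. The tie-handling rule shown here is one reasonable choice, not necessarily the paper's only variant.

```python
from collections import Counter

def debiased_pairwise(question: str, answer_a: str, answer_b: str) -> str:
    """Judge both orderings and keep the verdict only if it is order-invariant."""
    first = pairwise_judgment(question, answer_a, answer_b)   # A shown first
    second = pairwise_judgment(question, answer_b, answer_a)  # B shown first
    # Map the second verdict back to the original labels (positions were swapped).
    swapped = {"a": "b", "b": "a", "tie": "tie"}[second]
    return first if first == swapped else "tie"  # inconsistent -> count as a tie

def aggregate_votes(verdicts: list[str]) -> str:
    """Combine several judgments (e.g., from different judge prompts) by majority vote."""
    winner, _ = Counter(verdicts).most_common(1)[0]
    return winner
```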
MT-Bench Setting
MT-Bench consists of 80 fixed multi-turn questions (two turns each) spanning eight categories: writing, roleplay, reasoning, math, coding, extraction, STEM, and humanities. GPT-4 serves as the main judge.
- Table 7: Shows GPT-4’s high win rates against GPT-3.5, Vicuna-13B, and LLaMA-13B across all eight categories, underscoring its strong performance in multi-turn dialogue.
- Figure 4: Compares how different judges (GPT-4, GPT-3.5, and humans) align on ranking models. GPT-4’s evaluations agree with human preferences at over 80% consistency.
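The agreement figure can be reproduced from raw votes by simple counting. Below is a small sketch; the vote format (a dict from question ID to verdict) is an assumption for illustration, not the paper's data schema.

```python
def agreement(judge_votes: dict[str, str], human_votes: dict[str, str]) -> float:
    """Fraction of shared question IDs where the judge and the humans pick the same winner."""
    shared = judge_votes.keys() & human_votes.keys()
    if not shared:
        return float("nan")
    matches = sum(judge_votes[q] == human_votes[q] for q in shared)
    return matches / len(shared)

# Toy usage with made-up verdicts ("a", "b", or "tie" per question ID):
gpt4_judge = {"q1": "a", "q2": "b", "q3": "tie", "q4": "a"}
human_majority = {"q1": "a", "q2": "b", "q3": "a", "q4": "a"}
print(f"agreement = {agreement(gpt4_judge, human_majority):.0%}")  # agreement = 75%
```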
Does the Judge Favor Its Own Answers?
The authors examine self-enhancement bias, the tendency of an LLM judge to rate its own outputs more favorably. Comparing GPT-4’s judgments to human votes, they find that most of its decisions still align with humans, but some bias may persist when the judge and the answer come from the same model: GPT-4 favors itself with a 10% higher win rate and Claude-v1 with a 25% higher win rate, though both also favor some other models, and GPT-3.5 does not favor itself.
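One way to quantify self-enhancement bias is to compare a model's win rate when it judges its own battles against its win rate under a different judge (e.g., humans). The record format below is an assumption for illustration, not the paper's data schema.

```python
def win_rate(records: list[dict], model: str, judge: str) -> float:
    """Win rate of `model` across the pairwise battles decided by `judge`.

    Each record is assumed to look like:
    {"judge": "gpt-4", "model_a": "gpt-4", "model_b": "vicuna-13b", "winner": "model_a"}
    """
    wins, total = 0, 0
    for r in records:
        if r["judge"] != judge or model not in (r["model_a"], r["model_b"]):
            continue
        total += 1  # ties stay in the denominator but count as non-wins
        winner_name = r.get(r["winner"])  # resolves "model_a"/"model_b" to a model name
        if winner_name == model:
            wins += 1
    return wins / total if total else float("nan")

# Self-enhancement gap: positive if the model does better under its own judging.
# gap = win_rate(records, "gpt-4", judge="gpt-4") - win_rate(records, "gpt-4", judge="human")
```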
Key Points on MT-Bench
- Fixed Benchmark: MT-Bench provides a standardized set of 80 multi-turn questions, ensuring fair model comparisons.
- Crowdsourced Validation: Over 30K human votes confirm that GPT-4’s judgments align well with human preferences.
- Limitations: The method yields relative rankings rather than absolute quality scores, and judge biases such as self-enhancement are only partially mitigated.


