Thursday, February 29, 2024

A quick study of LLM-as-a-Judge in evaluating chatbots (MT-Bench)

This paper proposes using large language models (LLMs) as judges to evaluate other LLM-based chatbots. [https://arxiv.org/pdf/2306.05685]

The key claim: a strong judge such as GPT-4 can match human preferences at over 80% agreement, roughly the same level of agreement as between two humans, verified through controlled and crowdsourced human votes.

Introduction

Modern chatbots engage in multi-turn conversations, which traditional single-turn benchmarks fail to capture. To check whether LLM judges actually reflect human preferences, the authors also collect over 30,000 crowdsourced votes from human users through their Chatbot Arena platform.

Evaluation Methods: Pairwise vs. Single-Answer

1. Pairwise Comparison

  • The judge sees two answers for a single question and decides: “Answer A,” “Answer B,” or “Tie.”
  • This generates a win rate for each chatbot.

2. Single-Answer Grading

  • The judge scores a single response on a numerical scale (e.g., 1–10).
  • While it is more scalable, it relies on the judge applying a consistent grading standard (a minimal sketch of both modes follows below).
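
To make the two modes concrete, here is a minimal sketch. The prompt wording and the call_judge helper (a stand-in for whatever API call sends a prompt to the judge model and returns its reply) are illustrative assumptions, not the paper's exact templates.

```python
# Minimal sketch of the two judging modes. `call_judge(prompt) -> str` is a
# hypothetical helper that sends a prompt to a strong judge model (e.g. via an
# API client of your choice) and returns its text reply.

PAIRWISE_PROMPT = """You are an impartial judge. Compare the two answers to the
user question below and reply with exactly one of: "A", "B", or "Tie".

[Question]
{question}

[Answer A]
{answer_a}

[Answer B]
{answer_b}
"""

SINGLE_PROMPT = """You are an impartial judge. Rate the answer to the user
question below on a scale of 1 to 10 and reply with only the number.

[Question]
{question}

[Answer]
{answer}
"""


def pairwise_verdict(call_judge, question, answer_a, answer_b):
    """Pairwise comparison: returns 'A', 'B', or 'Tie'."""
    reply = call_judge(PAIRWISE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b))
    return reply.strip()


def single_score(call_judge, question, answer):
    """Single-answer grading: returns a numeric score."""
    reply = call_judge(SINGLE_PROMPT.format(question=question, answer=answer))
    return float(reply.strip())
```

Counting pairwise verdicts over many questions gives each chatbot a win rate; averaging single-answer scores gives a numeric score that still depends on the judge keeping its grading standard stable.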

Challenges

1. Position Bias (Favoring the First Answer)

  • Solution: Swap the order of the two responses and re-evaluate. If the verdict flips with the order, the bias is detected; a conservative fix is to count a win only when it holds in both orders and to treat inconsistent verdicts as ties (see the sketch after this list).

2. Verbosity Bias (Preferring Longer Responses)

  • Solution: Use a “repetitive list” test—adding unnecessary details to see if the judge incorrectly favors the longer version. If so, prompt adjustments (e.g., explicitly instructing “conciseness is not a drawback”) are made to help redirect focus to substance over length.

3. Maintaining Consistency in Judgment: Ensure the LLM follows structured grading criteria by providing clear, step-by-step evaluation guidelines in the prompt.

4. Bias Reduction Through Aggregation: Collect multiple judgments from different prompts and combine them.
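
The swap-and-recheck idea from challenge 1 can be written in a few lines. This is a simplified sketch assuming a judge_fn(question, answer_a, answer_b) callable like the pairwise judge above; only verdicts that survive the order swap are counted as wins, which is one conservative way to neutralize position bias.

```python
def debiased_verdict(judge_fn, question, answer_a, answer_b):
    """Run the pairwise judge twice with the answer order swapped.

    `judge_fn(question, ans_1, ans_2)` returns 'A', 'B', or 'Tie' for the
    answers in the order given. A win is only counted when it is consistent
    across both orders; anything else is treated as a tie.
    """
    v1 = judge_fn(question, answer_a, answer_b)
    v2 = judge_fn(question, answer_b, answer_a)

    # In the swapped run, 'A' refers to answer_b and 'B' to answer_a.
    v2_unswapped = {"A": "B", "B": "A"}.get(v2, "Tie")

    return v1 if v1 == v2_unswapped else "Tie"
```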

MT-Bench Setting

MT-Bench consists of 80 fixed multi-turn questions spanning eight categories (writing, roleplay, reasoning, math, coding, extraction, STEM, and humanities), with GPT-4 serving as the main judge.

  • Table 7: Shows GPT-4’s high win rates against GPT-3.5, Vicuna-13B and LLaMA-13B across all 8 categories, underscoring its strong performance in multi-turn dialogue.
  • Figure 4: Compares how different judges (GPT-4, GPT-3.5, and humans) align when ranking models. GPT-4's verdicts agree with human preferences over 80% of the time, roughly the same level of agreement as between two humans (a small sketch of how such an agreement rate is computed follows below).
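
As a rough illustration of what such an agreement percentage means, here is a tiny sketch; the vote lists are made up for illustration and are not data from the paper.

```python
def agreement_rate(judge_votes, human_votes):
    """Fraction of questions where the judge and the human pick the same winner."""
    matches = sum(j == h for j, h in zip(judge_votes, human_votes))
    return matches / len(judge_votes)


# Toy example with made-up per-question verdicts ('A', 'B', or 'Tie').
judge_votes = ["A", "A", "B", "Tie", "B"]
human_votes = ["A", "B", "B", "Tie", "B"]
print(agreement_rate(judge_votes, human_votes))  # 0.8
```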


Does the Judge Favor Its Own Answers?

Yes, to some extent. The authors examine self-enhancement bias, the tendency of an LLM to rate its own outputs more favorably. By comparing GPT-4's judgments to human votes, they find that most of its decisions still align with humans, but they caution that some bias may persist when the judge and the answer come from the same model: GPT-4 favors its own answers with a roughly 10% higher win rate and Claude-v1 with a roughly 25% higher win rate, although both also favor some other models, and GPT-3.5 does not favor itself.

Key Points on MT-Bench

  • Fixed Benchmark: MT-Bench provides a standardized set of 80 multi-turn questions, ensuring fair model comparisons.
  • Crowdsourced Validation: Over 30K human votes confirm that GPT-4’s judgments align well with human preferences.
  • Limitations: The method provides relative rankings rather than absolute scores, and biases such as self-enhancement still need further study and mitigation.



Mixtral of Experts? Mixture of Experts? Trying to explain Mixtral 8x7B

 What You Think MoE Is vs. What It Really Is

What You Think MoE Is: a perfect ensemble of experts working in harmony, each contributing their unique skills to solve complex problems.
What It Really Is: a collection of specialized sub-networks, guided by a gating mechanism that decides which experts to activate for each input.

The 2017 MoE Framework

The idea of MoE is not completely new. It is similar to ensemble learning, where multiple models work together to improve results. But Google researchers took this idea further in 2017 [https://arxiv.org/abs/1701.06538]. They introduced an MoE layer that consists of:

1. Experts: Many smaller models, each trained to handle different types of input.

2. A gating network: A special part of the model that decides which experts to activate.

For example, if there are 512 experts but only 2 are activated per input, the model remains very large in total size, yet each input uses only a small fraction of it. This keeps the computing cost per input low while greatly increasing the model's capacity to handle complex tasks.
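
Here is a minimal PyTorch sketch of such a sparsely gated MoE layer, in the spirit of the 2017 design but simplified (no noise term, no load-balancing loss); the layer sizes and the 8-expert, top-2 configuration are arbitrary illustration values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoE(nn.Module):
    """Sparsely gated MoE layer: only `top_k` experts run for each token."""

    def __init__(self, d_model=64, d_hidden=128, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts)  # the gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                    # x: (tokens, d_model)
        logits = self.gate(x)                                # (tokens, num_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)   # top-k experts per token
        weights = F.softmax(weights, dim=-1)                 # renormalize the k scores

        out = torch.zeros_like(x)
        for slot in range(self.top_k):                       # readability-first loops
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e                 # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out


# Toy usage: 10 "tokens" of dimension 64.
moe = SparseMoE()
print(moe(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```

Only the selected experts are evaluated for each token, which is exactly why total parameter count and per-token compute can grow independently.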


2024 Mixtral of Experts

In early 2024, Mistral AI released an MoE model described in the paper Mixtral of Experts [https://arxiv.org/abs/2401.04088]. The name is a play on Mistral and Mixture of Experts. The model applies sparse expert routing at the scale of a modern LLM and continues the idea that bigger models do not have to be more expensive to run if only a few parts of them are used at a time.
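
As a back-of-the-envelope illustration of that point, using the approximate figures reported for Mixtral 8x7B (2 of 8 experts active per layer, roughly 47B parameters in total and roughly 13B used per token):

```python
# Approximate figures for Mixtral 8x7B; the point is the ratio, not the exact numbers.
total_params = 47e9    # all experts must be stored in memory
active_params = 13e9   # parameters actually used for any single token

print(f"Active fraction per token: {active_params / total_params:.0%}")  # ~28%
```

The memory footprint still reflects the full 47B parameters, but the per-token compute looks more like that of a 13B dense model.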


How Experts Are Chosen (Routing Methods)


A key part of MoE is choosing which experts should handle each input. There are two main ways to do this:

  1. Token-based routing:
    • The model scores all the experts and picks the best ones for each input token (word or piece of data).
    • If we have a score vector \( \alpha = G(x) \), where \( G(x) \) is the gating network's output, we choose the top \( k \) experts and calculate the final output as: \[ \text{MoE}(x) = \sum_{j=1}^{k} \alpha_{i_j} \cdot E_{i_j}(x) \] where \( E_{i_j} \) is the expert function and \( i_j \) represents the chosen experts.
    • Problem: Some experts may get too much work while others are rarely used.
  2. Expert-based routing:
    • Instead of tokens choosing experts, experts choose which tokens to process.
    • Each expert has a fixed capacity, preventing some experts from being overloaded while others sit idle (see the sketch after this list).
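
A minimal sketch of the expert-based idea (often called expert-choice routing) follows; it is a simplified illustration rather than the exact algorithm of any particular system, and the sizes and capacity are arbitrary values.

```python
import torch


def expert_choice_routing(x, gate_weight, capacity):
    """Each expert picks its top-`capacity` tokens (expert-based routing).

    x: (num_tokens, d_model), gate_weight: (d_model, num_experts).
    Returns a dict mapping each expert to (chosen token indices, routing weights).
    """
    # Token-to-expert affinities; here normalized over experts for simplicity.
    scores = torch.softmax(x @ gate_weight, dim=-1)        # (num_tokens, num_experts)
    top_scores, top_tokens = scores.topk(capacity, dim=0)  # each column = one expert's picks
    return {e: (top_tokens[:, e], top_scores[:, e])
            for e in range(gate_weight.shape[1])}


# Toy usage: 16 tokens, 4 experts, each expert handles at most 4 tokens.
x = torch.randn(16, 32)
gate_w = torch.randn(32, 4)
for e, (token_idx, _) in expert_choice_routing(x, gate_w, capacity=4).items():
    print(f"expert {e} takes tokens {token_idx.tolist()}")
```

Because every expert takes exactly its capacity of tokens, no expert is overloaded or idle, at the cost that some tokens may be picked by several experts and others by none.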

Understanding MoE with a Simple Visualization

A great way to understand MoE is to look at how tokens are assigned to experts. Picture each token being routed to a small set of experts: instead of every expert working on all tokens, the model learns to send different types of input to the most relevant experts.

This method shows why MoE is efficient. Instead of loading and computing all model parameters at once, only a fraction is used per step, saving both memory and time.

Key Points on MoE

  • Design & Performance: MoE selectively activates different experts for each input, which leads to strong benchmark results while keeping computational costs manageable during training and inference.
  • Practical Trade-offs: While effective in research settings, local deployment can be challenging because all expert weights must be kept in memory even though only a few experts run per token, so substantial GPU resources are needed.
  • Research Value: The success of MoE reveals interesting insights about language model architecture. It shows that models can perform well while using only a subset of their parameters for each input, suggesting promising directions for making future models more efficient.


s1: Simple Test-Time Scaling paper explained

  [https://arxiv.org/pdf/2501.19393] Published on February 3rd, from researchers at Stanford, University of Washington, Allen Institute f...