Friday, February 14, 2025

s1: Simple Test-Time Scaling paper explained

 

[https://arxiv.org/pdf/2501.19393]
Published on February 3rd, from researchers at Stanford, University of Washington, Allen Institute for AI, and Contextual AI.

s1 introduces a simple yet effective method to boost model reasoning by scaling test-time compute. Its key innovation is a technique called budget forcing that regulates the length of the model’s reasoning process.

Scaling Test-Time Compute: Sequential vs. Parallel

1. Sequential Scaling via Budget Forcing (proposed method)

  The Idea:

  • Budget forcing intercepts the model’s attempt to finish reasoning. When it generates an end-of-thinking token, that token is replaced (e.g., with “Wait”), forcing the model to continue until a set token limit is reached, after which the final answer is produced.

  How It Works:

  • Prevent Early Stopping: When the model tries to stop thinking before reaching its token limit, we catch the end-of-thinking token and swap it with something like “Wait” so it keeps going.

  • Final Answer Trigger: After using up the token budget, add a marker like “Final Answer:” to tell the model to stop reasoning and give its final response.
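
To make the mechanism concrete, here is a minimal, model-agnostic sketch of the decode loop described above. The `next_token` function and the special token strings are placeholders, not the paper's actual code.

```python
from typing import Callable, List

def budget_forced_decode(
    next_token: Callable[[List[str]], str],        # hypothetical one-step decoder: tokens so far -> next token
    prompt: List[str],
    max_thinking_tokens: int = 512,
    end_of_thinking: str = "<|end_of_thinking|>",  # placeholder token name
    end_of_sequence: str = "<|eos|>",              # placeholder token name
) -> List[str]:
    tokens = list(prompt)
    # Thinking phase: suppress early stopping until the budget is spent.
    for _ in range(max_thinking_tokens):
        tok = next_token(tokens)
        if tok == end_of_thinking:
            tok = "Wait"          # budget forcing: replace the stop token and keep reasoning
        tokens.append(tok)
    # Budget exhausted: trigger the answer phase.
    tokens.append("Final Answer:")
    while True:
        tok = next_token(tokens)
        tokens.append(tok)
        if tok == end_of_sequence:
            break
    return tokens
```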

2. Parallel Scaling via Majority Voting:  

  • Generates multiple candidate answers and then aggregates the results, typically via majority voting: pick the most common answer. This uses more parallel compute rather than extending any single reasoning path.

As shown in Table 4:

  • (a) Sequential Scaling with Budget Forcing: The model is prevented from stopping its reasoning. If it tries to end, we append “Wait,” effectively making it think 2×, 4×, or 6× longer. Performance improves as the model reasons more, but eventually flattens out (saturation).
  • (b) Parallel Scaling via Majority Voting: Run the base model multiple times—2, 4, 8, 16, 32, or 64 parallel samples—and use majority voting to decide on the final output. Although more samples can help, it does not always match the gains from sequential scaling.
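
For comparison, parallel scaling needs nothing more than repeated sampling and a vote. A minimal sketch, assuming a hypothetical `sample_answer` function that runs the model once with temperature > 0 and returns only the final short answer:

```python
from collections import Counter

def majority_vote(sample_answer, question, n_samples=16):
    answers = [sample_answer(question) for _ in range(n_samples)]
    best, count = Counter(answers).most_common(1)[0]  # most frequent answer wins
    return best, count / n_samples                    # answer and its vote share
```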

Fine-Tuning with s1K

Curating the s1K Dataset

1. Initial Pool (59K): The authors start with 59,000 questions spanning math, science, and logic. They discard duplicates, poorly formatted samples, and trivial problems (solvable by smaller models).

2. Filtering Criteria:

  • Quality: Remove questions with formatting errors or unclear solutions, keeping only well-structured problems.
  • Difficulty: Exclude easy questions by checking whether Qwen2.5-7B or Qwen2.5-32B could already solve them. Longer or multi-step solutions also indicate higher difficulty.
  • Diversity: Ensure coverage across various domains (e.g., geometry, physics, number theory) to create a broad, representative set of problems.
3. Final Selection (1K): From the remaining pool, they choose 1,000 examples that strike the right balance of quality, difficulty, and diversity—forming the s1K dataset.

Data Format for Supervised Fine-Tuning (SFT)

Each sample includes:

  1. Question: The prompt or problem statement.
  2. Reasoning Trace: A detailed, multi-step chain of thought generated or validated for correctness.
  3. Final Answer: The concise solution, typically a short string or numeric result.

During fine-tuning, the model sees (Question + Reasoning Trace) as input and is trained to generate the Final Answer token sequence.
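
As an illustration, a single training record might be serialized like this (field names and delimiter tokens are illustrative, not the paper's exact format):

```python
# one hypothetical s1K-style record
sample = {
    "question": "What is the remainder when 2^10 is divided by 7?",
    "reasoning": "2^3 = 8 ≡ 1 (mod 7), so 2^10 = (2^3)^3 · 2 ≡ 1 · 2 = 2 (mod 7).",
    "answer": "2",
}

def to_training_text(s, think_open="<|think|>", think_close="<|/think|>"):
    # concatenate question, reasoning trace, and final answer into one sequence for SFT
    return f"{s['question']}\n{think_open}{s['reasoning']}{think_close}\nFinal Answer: {s['answer']}"

print(to_training_text(sample))
```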

Fine-Tuning Process & Resource Efficiency

  • The base model is Qwen2.5-32B-Instruct.
  • They perform Supervised Fine-Tuning (SFT) on the s1K (question + reasoning + answer) samples.
  • This entire process takes only 26 minutes on 16 H100 GPUs, showcasing that even a small, carefully curated dataset can significantly enhance reasoning performance with minimal training time.

Key Points on s1

Single-Path Test-Time Scaling: A simple trick (preventing the model from stopping too soon) enables effective scaling of a single reasoning chain, outperforming parallel methods.

Efficient Data: The s1K dataset shows how a small, high-quality set of questions and model-generated reasoning traces can significantly boost reasoning performance.

Comment: Automated, high-quality data generation without human annotations is a promising trend. Better-quality fine-tuning data and more effective optimization objectives (e.g., GRPO) may be the way forward; RLHF alone is not what we need. It's also worth noting that the success of s1K might be partly due to using the model's own reasoning paths; when generating high-quality fine-tuning data, such model-specific factors could be key to future advances.



Tuesday, November 5, 2024

LLaVA-OneVision quickly explained

LLaVA-OneVision (GitHub Link) is a new LLaVA-based multimodal large language model, developed independently from the original LLaVA team. If you're unfamiliar with LLaVA, consider checking the previous post on its basics. In this article, we'll explore LLaVA-OneVision and how it expands vision-language models to work across single images, multiple images, and videos.

* SigLIP vs. CLIP 

  • CLIP (from OpenAI) uses a contrastive learning approach, typically with a softmax-based final layer.
  • SigLIP (from Google) follows a similar contrastive idea but replaces the softmax with a sigmoid-based method, which can improve performance in some open-set tasks.

Both SigLIP and CLIP are strong vision encoders. LLaVA-OneVision uses SigLIP to encode images (or video frames), then passes the outputs through a simple projection layer and a large language model.
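
The difference between the two objectives is easiest to see in code. A minimal PyTorch sketch of both losses (batch-level, with illustrative temperature/bias values):

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    # CLIP-style: softmax cross-entropy over the similarity matrix, in both directions
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def siglip_loss(img_emb, txt_emb, scale=10.0, bias=-10.0):
    # SigLIP-style: every image-text pair becomes an independent binary classification
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() * scale + bias
    labels = 2 * torch.eye(logits.size(0)) - 1       # +1 on the diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).mean()
```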

“Higher AnyRes” Strategy

Key Idea: handle high-resolution images and unusual aspect ratios (Fig. 2)

  • Split each image (or video frame) into multiple crops at a chosen resolution.
  • Encode each crop into visual tokens using the SigLIP encoder.
  • (New Step): If the total token count is too large (especially for high-resolution images), apply bilinear interpolation to reduce tokens per crop.

(Bilinear Interpolation: A method for resizing images by averaging the pixel values in a 2D grid. If an image is too large, this step can reduce its resolution before encoding, preventing the total token count from growing too big.)
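
A minimal sketch of this token-reduction step, assuming each crop yields a square grid of SigLIP tokens (e.g., 27×27 = 729 at 384×384); the function name and threshold are illustrative:

```python
import torch
import torch.nn.functional as F

def reduce_crop_tokens(crop_tokens: torch.Tensor, max_tokens: int) -> torch.Tensor:
    # crop_tokens: (n_tokens, dim), where n_tokens forms a square grid (e.g., 729 = 27*27)
    n, dim = crop_tokens.shape
    if n <= max_tokens:
        return crop_tokens
    side = int(n ** 0.5)
    new_side = int(max_tokens ** 0.5)
    grid = crop_tokens.t().reshape(1, dim, side, side)            # tokens back to a 2D grid
    grid = F.interpolate(grid, size=(new_side, new_side),
                         mode="bilinear", align_corners=False)    # bilinear downsampling
    return grid.reshape(dim, new_side * new_side).t()             # back to (n_tokens', dim)
```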

In Figure 2,

  • (a) shows the Higher AnyRes strategy with bilinear interpolation.
  • (b) is the original AnyRes approach (without interpolation), which can produce more tokens than desired for high-resolution images.

Scenarios: (Fig.3)

1. Single Image

  • Typically produces 729 tokens at a standard resolution (e.g., 384×384).

2. Multi-Image

  • Each image is encoded into about 729 tokens, so more images mean more tokens.

3. Video

  • Each frame is treated like a single image, but you can reduce tokens per frame if you have many frames. This keeps the total token count from getting too large and helps the model handle longer videos.

By balancing the total number of tokens across these scenarios, LLaVA-OneVision can transfer knowledge effectively between single-image, multi-image, and video tasks. All of this is done while keeping the minimalist design (SigLIP + projection + LLM) that comes from the original LLaVA approach.

Projection Layer Training Strategy

In LLaVA-OneVision, the projection layer is a simple two-layer MLP that converts visual features from the SigLIP encoder into tokens that the language model can understand. Unlike previous LLaVA versions, where the focus was primarily on single-image tasks, the projection layer here is trained with a new strategy that prepares it for a broader range of scenarios.

  Stage-1 (Language-Image Alignment):

  • The projection layer is initially trained exclusively using image-text pairs. This step ensures that visual features are properly aligned with the language model’s embedding space. The focus here is solely on achieving a robust mapping from vision to language.

  Stage-1.5 (High-Quality Knowledge Learning):

  • Next, high-quality data is injected to further refine the projection layer’s performance. This stage leverages carefully curated instruction data, allowing the layer to capture more nuanced and diverse visual representations.

  Stage-2 (Visual Instruction Tuning):

  • Finally, the projection layer, along with the rest of the model, is fine-tuned on a mixture of single-image, multi-image, and video data. This step adapts the projection layer to handle different visual scenarios, ensuring smooth task transfer across modalities.

The new strategy makes the projection layer more robust and flexible, ultimately enhancing the model’s ability to transfer knowledge across single-image, multi-image, and video tasks.

* High-Quality Data Generation in Three Steps

  • Automated Generation: Use strong pre-trained models (e.g., GPT-4) to automatically produce detailed descriptions and instructions from images.
  • Manual Curation: Experts carefully filter and refine the generated data to ensure accuracy and diversity.
  • Diverse Data Sources: Aggregate data from various sources such as OCR, chart analysis, and synthetic datasets (including for Chinese tasks) to cover a broad range of visual scenarios.

Key points on LLaVA-OneVision

  Unified Representation: 

  • Uses the “Higher AnyRes” strategy to split images and video frames into balanced visual tokens (with bilinear interpolation when needed).

  Multi-Modal Training: 

  • Employs a three-stage training pipeline—alignment, high-quality data injection, and joint fine-tuning for images, multi-image, and video tasks.

  Minimalist Design: 

  • Combines a SigLIP encoder, a simple projection layer, and a powerful language model for efficient multi-modal learning.


Friday, October 18, 2024

OPERA[CVPR 2024 Highlight] explained + a little follow-up experiment

OPERA is a decoding method for multimodal LLMs, aiming to mitigate hallucinations by discouraging the model from over-trusting certain “summary” tokens and providing a fallback mechanism if a partial over-trust pattern emerges. It requires no extra training data or model fine-tuning, yet significantly reduces hallucination.

[https://arxiv.org/pdf/2311.17911]

Definition

Anchor Patterns/Knowledge Aggregation Patterns

Modern LLMs develop "anchor patterns" in their attention mechanisms. Instead of processing all previous tokens equally, they focus on a few key "summary" tokens (often punctuation or short words). These anchors compile information to guide future outputs. However, this selective attention can cause hallucinations when important visual details are missed, leading models to invent non-existent elements like cars or trees that aren't actually in images. (Note that attention usually aggregates on period tokens (.), so be careful when interpreting attention maps.)

Positive Correlation with Hallucinations

Figure 4 shows that more anchor tokens (visible as column-like attention patterns) correlate with increased hallucinations. This suggests these aggregation patterns directly contribute to factual errors in image descriptions rather than being merely harmless computational artifacts.

OPERA: A Beam Search-Based Decoding (Proposed Method)

OPERA modifies standard Beam Search by incorporating two main tricks (if you need a quick refresher on Beam Search, check some YouTube videos):

1. Over-Trust Logit Penalty

2. Retrospection-Allocation Strategy

Together, these components discourage the model from following an anchor token’s lead and allow it to “roll back” if the partial over-trust pattern becomes too strong.

Over-Trust Logit Penalty

  1. Local Window on Self-Attention:
    • We examine attention weights (ω) for recent tokens, using a size k window to analyze how recent tokens interact with each other.
  2. Pattern Identification:
    • After eliminating forward attention (upper triangle), we apply scaling factor σ to highlight attention patterns.
    • Column-wise multiplication reveals tokens that accumulate excessive influence (potential anchors).
  3. Measuring Over-Reliance:
    • The maximum column product φ(ω<t) quantifies the model's over-dependence on individual tokens.
  4. Dynamic Correction:
    • During generation, we penalize next-token predictions by subtracting α · φ(ω<t) from logits.
    • Mathematically:
      p(x_t | x<t) = Softmax[H(h_t) - α·φ(ω<t)]_{x_t}
    • This encourages the model to consider broader context rather than fixating on single influential tokens.
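
A minimal sketch of the penalty computation under the description above (window handling and normalization details in the actual OPERA code may differ):

```python
import torch

def over_trust_penalty(attn_window: torch.Tensor, sigma: float = 50.0, alpha: float = 1.0) -> torch.Tensor:
    # attn_window: (k, k) self-attention weights over the k most recent tokens
    # (rows = querying tokens, columns = attended tokens)
    w = torch.tril(attn_window) * sigma            # drop forward attention, scale by sigma
    w = torch.where(w > 0, w, torch.ones_like(w))  # ignore masked entries in the product
    col_prod = w.prod(dim=0)                       # column-wise product per candidate anchor
    phi = col_prod.max()                           # strongest knowledge-aggregation pattern
    return alpha * phi                             # subtract this from the next-token logits
```

At decoding time, a candidate's logits would then be adjusted as `logits = logits - over_trust_penalty(attn)` before the softmax.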

Retrospection-Allocation Strategy

Even with logit penalties, all beam candidates may develop the same over-trust patterns. Our rollback approach addresses this:

  1. Pattern Recognition:
    • We monitor maximum column-wise score positions in recent tokens (window size l). When the same anchor location appears frequently (≥ r times), we identify a persistent over-trust pattern.
  2. Strategic Reselection:
    • For an anchor at position s, we revert to the sequence before position s+1 was generated.
    • We then select an alternative token from the candidates, excluding the previously chosen one.
    • This process repeats up to β times to prevent excessive rollbacks.
Through this two-step approach, OPERA reduces hallucinations by preventing premature fixation on summary tokens and strategically backtracking when necessary to break established anchor patterns.
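
A minimal sketch of the anchor-detection half of this rollback strategy (the actual beam bookkeeping in OPERA is more involved):

```python
from collections import Counter
from typing import List, Optional

def find_persistent_anchor(max_col_positions: List[int], l: int = 20, r: int = 15) -> Optional[int]:
    # max_col_positions[i]: position of the largest column-wise score when token i was generated
    recent = max_col_positions[-l:]                 # only look at the last l generated tokens
    if not recent:
        return None
    pos, count = Counter(recent).most_common(1)[0]
    return pos if count >= r else None              # anchor position s, or None if no pattern yet

# If an anchor s is found, the beam is truncated back to position s and an
# alternative candidate (excluding the previously chosen token) is selected,
# at most beta times to avoid endless rollbacks.
```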

Experiments

All methods are configured with their default settings. OPERA uses Beam Search (Nbeam = 5) enhanced with the Over-Trust Penalty and Retrospection-Allocation mechanisms.

Implementation Details

  • Scaling Factor (σ): 50, ensuring strong anchor tokens produce products > 1, while weaker attention patterns remain < 1.
  • Candidate Count (Ncan): Default 5; higher values improve exploration at increased computational cost.
  • Standard Parameters: α = 1, β = 5, and r = 15 remain consistent across all MLLM implementations.

Key points on OPERA

  1. Anchor Detection: OPERA identifies column-like attention patterns focused on punctuation or short words, preventing over-reliance on these tokens at the expense of visual information.
  2. Attention Penalty: By penalizing candidate tokens that excessively depend on anchors, OPERA reduces the continuation of hallucinated narratives once problematic patterns emerge.
  3. Strategic Backtracking: When all beams fixate on the same anchor, OPERA rolls back and chooses alternative generation paths, effectively resetting the model's focus.
---------------------------------------------------------------------

Additional Experiment (Bonus)

Setup:

  • Input: [image] ⬇️
  • Model: LLaVA-1.5-7B
  • Sample Caption: In the image, a zebra is standing in a field with dry leaves scattered around. It appears to be grazing on the leaves, possibly searching for food. Apart from the zebra, there are a few other animals in the scene, including a horse and a cow

  1. Let's examine the attention map. The coordinates represent each token in the caption. After one forward pass, we visualize how each token attends to previous tokens.
  2. We can observe the phenomena identified in OPERA. The orange boxes highlight how aggregation corresponds to hallucinated tokens, while yellow boxes show aggregation on periods. The leftmost column displays attention weights on image tokens (sum of all image token weights).

However, testing with additional examples reveals that attention aggregation occurs in both hallucination and non-hallucination cases. Even when examining attention to image tokens, there's no significant reduction in image attention during hallucination generation.

A fair conclusion is that aggregation is an inherent property of autoregressive next-token prediction when generating contextual text. While it becomes more pronounced with hallucinated tokens, the relationship is not clear-cut.

Regarding attention weights on the image tokens, the following examples labeled [index]_[token] show that weights on image tokens are generally substantial regardless of whether the model is describing an existing object or a hallucinated one. However, accurate descriptions correlate with a more precise distribution of attention over the relevant objects, which is an interesting observation. Hope you enjoy it :).

[5_z(ebra)]

[54_horse]

[57_cow]




Sunday, October 6, 2024

Strawberry🍓

* How many "r"s in the word "strawberry"?

The Strawberry Rumor

On August 8, Sam Altman, the CEO of OpenAI, posted a cryptic tweet featuring a photo of strawberry plants in his garden, along with the caption “I love summer in the garden.”

Speculation soon began swirling around the idea that this tweet might be a playful hint at a new AI model under Project Strawberry—a project rumored to be the next step toward GPT-5. In previous rumors, Project Strawberry was referred to as “Project Q*,” suggesting an experimental or transitional initiative within OpenAI’s pipeline.

OpenAI o1

On September 12, OpenAI unveiled a model known as “o1.”

The official page: https://openai.com/index/learning-to-reason-with-llms/

This new model, trained with Reinforcement Learning techniques, is designed to excel at complex reasoning tasks such as mathematics and coding. What sets o1 apart from many of its predecessors is its capacity to “think before answering”—it produces a longer hidden chain of thought internally before presenting an answer to the user.

LLM Basics: The Previous Paradigm

Traditional Large Language Model (LLM) pipelines typically follow three stages:

1. Pre-training – Train on massive text corpora (the scaling law: bigger models + more data = better performance).

2. Post-training – Often includes techniques like Reinforcement Learning from Human Feedback (RLHF) to refine the base model into a chatbot or specialized assistant.

3. Inference – The final next-token prediction process for user queries: “The sky is [ ] …”

o1’s paradigm

  • Smaller Model Size: Compared to GPT-4 or Llama 3’s largest variants (70B/405B parameters), o1 might be relatively small. 
  • Chain-of-Thought Reasoning: o1 can generate a hidden “chain of thought” (CoT) before producing an answer, indicative of its focus on multi-step reasoning.
  • Built via Reinforcement Learning: While RLHF is used in many LLMs, o1 supposedly employs additional RL techniques, potentially going beyond simple human feedback loops.
  • Task-Specific Strength: Although powerful in math and coding challenges, o1 is reportedly less adept than GPT-4o in tasks like personal writing or broad creative composition.

What’s the Secret Sauce?

Reinforcement Learning (RL) vs. RLHF

A key talking point in the community is the difference between true RL and RLHF:

1. RL Is Powerful

  • Consider AlphaGo, which was trained with real self-play and Monte Carlo Tree Search (MCTS), exploring thousands of potential moves before deciding on the next best move.
  • Go is a clear example of a game with a final win/loss outcome, analogous to reasoning tasks where a solution is either correct or incorrect.

2. RLHF Is Not

  • RLHF typically means human annotators rate or rank the model outputs, providing a reward signal.
  • While it can guide a model to produce more coherent or polite responses, it lacks the self-play and deep exploration features that characterize classic RL.
    (If you want to be super-human, human feedback is not enough.)

What if AlphaGo were trained purely with RLHF? Human evaluators would have to label or rank each move. This would be enormously labor-intensive and might miss the power of self-play. Hence the distinction: RLHF is partially “RL,” but does not harness the full potential of iterative self-improvement that true RL can offer.

Tree-of-Thought, Multi-Agent Debate, and Verifiers

In large-scale reasoning tasks—especially math or coding—a linear Chain-of-Thought might not be enough. Some hypothesize that o1 could be using:

1. Tree-of-Thought Exploration

  • Rather than generating a single chain of reasoning, the model expands multiple branches, evaluating different solution paths in parallel, akin to MCTS. 

[https://jyopari.github.io/MCTS.html]


2. Multi-Agent Debate

Multiple “agents” (or multiple copies of the model) could debate or verify each other’s answers, leading to more robust final solutions.

The “infinite monkey theorem” suggests that, given enough random attempts, a correct solution might eventually appear. But we need an effective verifier to pick the correct one out of many.

Majority voting works if the model’s error patterns are random, but in practice, errors can cluster. A specialized verifier is thus potentially more reliable.

[https://arxiv.org/abs/2407.21787]

Key Points on o1

1. Small yet Powerful: Despite its reportedly smaller size compared to GPT-4, o1 excels in math and coding tasks by generating a hidden chain of thought before providing an answer.

2. Reinforcement Learning Focus: Goes beyond standard RLHF, possibly integrating true RL methods (self-play, multi-agent debate) to refine its reasoning.


Friday, August 16, 2024

LLaVA’s Journey: From LLaVA to LLaVA-1.5 and LLaVA-NEXT (1.6)

Large Language and Vision Assistant (LLaVA) has evolved through several versions, each bringing improvements in model architecture and dataset diversity.

If you want to quickly deploy LLaVA, try the Hugging Face versions:

👉 LLaVA on Hugging Face

1. LLaVA (Initial Release)

  • GitHub: LLaVA
  • Paper: Visual Instruction Tuning (NeurIPS 2023) → arXiv
  • Description: The first version of LLaVA introduced vision-language alignment using large language models (LLMs) and an image encoder.

2. LLaVA-1.5

  • Paper: Improved Baselines with Visual Instruction Tuning (CVPR 2024) → arXiv
  • Improvements: Trained on a wider range of visual instruction datasets for better generalization.
  • Available Models:
    • llava-hf/llava-1.5-7b-hf 
    • llava-hf/llava-1.5-13b-hf 

3. LLaVA-NeXT (LLaVA-1.6)

  • Updates:
    • Higher image resolution
    • Expanded reasoning and OCR datasets
    • More architecture variations
  • Available Model Variants:
    • llava-hf/llava-v1.6-mistral-7b-hf
    • llava-hf/llava-v1.6-vicuna-7b-hf
    • llava-hf/llava-v1.6-vicuna-13b-hf
    • llava-hf/llava-v1.6-34b-hf
    • llava-hf/llama3-llava-next-8b-hf
    • llava-hf/llava-next-72b-hf
    • llava-hf/llava-next-110b-hf

---------------------------------------------------------------------

Now, if you want to continue exploring how LLaVA evolved, read on.

From LLaVA to LLaVA‑1.5

Typical LVLM Architecture: Vision Encoder + Connector + Language Decoder

LLaVA follows a common architecture for large vision-language models (LVLMs): a pre-trained visual encoder processes images, a connector aligns visual features to text embeddings, and a pre-trained language decoder generates text responses.

Visual Encoder: CLIP ViT‑L/14

  • Developed by OpenAI, CLIP (Contrastive Language–Image Pre-training) employs contrastive learning to align images and text within a shared embedding space.

  • Trained on a vast dataset of 400 million image-text pairs, enabling robust open-set recognition capabilities.

  • Processes images by dividing them into non-overlapping patches, each measuring 14×14 pixels, resulting in a total of 256 patches for a 224×224 pixel input image.

Language Decoder: Vicuna-7B/13B

  • Fine-tuned from the LLaMA base model using 70K user-shared ChatGPT conversations.
  • Reportedly achieves over 90% of ChatGPT’s quality.
  • The overall size of LLaVA is primarily determined by this language model (i.e., ~7B or ~13B parameters).
  • High-quality data for post-training is important!

Connector: A Learnable Layer Aligning Vision and Text

  • Converts the visual encoder’s feature dimension (e.g. 768) to match the language model’s token embedding dimension (e.g. 4096).
  • Initially a linear projection in LLaVA; later improved to an MLP for better alignment in LLaVA‑1.5.
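
A minimal sketch of both connector variants (dimensions follow the example above and are illustrative):

```python
import torch.nn as nn

class Connector(nn.Module):
    # maps visual features (e.g., 768-d) into the LLM embedding space (e.g., 4096-d)
    def __init__(self, vision_dim: int = 768, llm_dim: int = 4096, use_mlp: bool = True):
        super().__init__()
        if use_mlp:
            # LLaVA-1.5: two-layer MLP with a GELU in between
            self.proj = nn.Sequential(
                nn.Linear(vision_dim, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )
        else:
            # original LLaVA: a single linear projection
            self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, visual_features):
        return self.proj(visual_features)
```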

How LLaVA Is Trained

LLaVA introduced a new way to create multimodal instruction-following data—one of the key innovations for its success.

1. Data Generation with GPT‑4

  • Prompt text-only GPT‑4 to simulate “image-based” instructions.
  • Provide GPT‑4 with detailed text descriptions, bounding boxes, and sample Q&A pairs.
  • Manually add a few examples to guide GPT‑4, which then generates ~158K instruction-answer pairs.

2. Two-Stage Training (with Supervised Fine-Tuning)

  • Stage 1: Feature Alignment (Connector Pre-training)
    • Use ~595K image-text pairs (e.g., from Conceptual Captions) to train the connector so that CLIP’s outputs align properly with Vicuna’s text embeddings.
    • This step typically completes in about 4 hours on 8× A100 GPUs.
  • Stage 2: Instruction Fine-Tuning
    • Fine-tune the entire model on 158K multimodal Q&A data plus the ScienceQA dataset.
    • Instruction fine-tuning takes around 10 hours, while the ScienceQA step takes ~4 hours.

Resource Efficiency

LLaVA’s impressive results come from relatively modest resources and data sizes. This highlights the power of high-quality instruction datasets for quickly boosting model performance.

Transition to LLaVA‑1.5

Building on LLaVA’s success, LLaVA‑1.5 introduced multiple enhancements to both data and model structure:

  Data Updates

  • Increased variety in training prompts (more VQA datasets, better prompt formatting).
  • Included Optical Character Recognition (OCR) data to handle text in images.

  Resolution Scaling

  • Boosted input resolution from 224×224 up to 336×336, allowing the model to capture more visual detail.

  Connector Upgrade

  • Moved from a simple linear projection to a more expressive MLP, improving alignment between CLIP and Vicuna.

  Data Efficiency

  • A finding: randomly downsampling LLaVA’s training mixture by up to 75% does not significantly reduce performance. This again suggests that high-quality data is important for LLM/LVLM post-training.

---------------------------------------------------------------------

LLaVA‑NEXT: Key updates

  Higher Input Resolution:

  • LLaVA-NeXT increases the input image resolution to 4× as many pixels, supporting versatile aspect ratios (up to 672×672, 336×1344, and 1344×336). This enhancement allows the model to capture finer visual details.

  Enhanced Visual Reasoning & OCR:

  • With an improved visual instruction tuning data mixture, LLaVA-NeXT delivers superior reasoning and OCR capabilities—vital for more accurate and robust multimodal understanding.

  Expanded Capabilities:

  • The model exhibits better visual conversation skills, broader world knowledge, and improved logical reasoning, making it effective across a wider range of applications.

  Efficient Deployment:

  • Despite its advanced features, LLaVA-NeXT maintains the minimalist design and data efficiency of LLaVA‑1.5. For instance, the largest 34B variant completes training in about 1 day on 32 A100 GPUs, demonstrating a highly cost-effective training process.

For more details, please refer to the LLaVA-NeXT blog.

Key points on LLaVA

  Architecture:

  • A classic vision-language setup combining a CLIP-based visual encoder and a language decoder with a connector.

  High-Quality Instruction Data:

  • Uses novel GPT‑4-generated, multimodal instruction data, emphasizing the importance of quality over quantity.

  Two-Stage Training:

  • Involves connector pre-training on ~595K image-text pairs followed by supervised fine-tuning on 158K multimodal Q&A samples (plus ScienceQA).

Saturday, March 9, 2024

DeepSeekMoE: Advanced MoE explained

Let’s dive right into fine-grained expert segmentation and shared expert isolation—two key innovations from the “DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models” paper.[https://arxiv.org/pdf/2401.06066]


Fine-Grained Expert Segmentation (Section 3.1)

Instead of having a small number of large experts, DeepSeekMoE splits each expert into multiple smaller ones. See Fig. 2(b).


  Why This Design?

  • Increased Specialization: Each expert is responsible for a smaller, more specific part of the knowledge, avoiding the problem where a single expert has to handle too many diverse topics.
  • More Flexible Combinations: Since more experts are available, the model has more unique ways to route tokens, making learning more efficient.
  • Better Load Balancing: With more experts, the router can distribute tokens more evenly, avoiding scenarios where certain experts are overused while others remain underutilized.

Shared Expert Isolation (Section 3.2)

Shared experts store general knowledge, allowing routed experts to focus on more specialized tasks. See Fig. 2(c).

  Why This Design?

  • Reduces Redundancy: Without shared experts, multiple routed experts may end up learning overlapping “common knowledge,” leading to wasted parameters. By isolating this into dedicated shared experts, routed experts can specialize more effectively.
  • Ensures Stability: Since shared experts are always active, they provide a consistent base of knowledge that helps guide learning across different routed experts.
  • Prevents Knowledge Fragmentation: By keeping general knowledge centralized, it prevents situations where some routed experts lack crucial background information.

DeepSeekMoE 2B Model Example

  • 63 routed experts and 1 shared expert, which is always active for all tokens.
  • Instead of activating just 2 experts per token (as in top-2 routing used by traditional MoE), DeepSeekMoE activates 7 routed experts per token.
  • The 7 routed experts selected per token focus only on specialized knowledge, while the shared expert handles common linguistic and factual information.

By combining Fine-Grained Expert Segmentation and Shared Expert Isolation, DeepSeekMoE improves expert specialization and reduces redundancy. These changes lead to better overall efficiency while maintaining the same computation cost per token.
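
A minimal PyTorch sketch of this routing scheme, using the 2B configuration above (1 shared expert, 63 routed, 7 activated); the dispatch is written naively per token for clarity:

```python
import torch
import torch.nn as nn

class DeepSeekMoESketch(nn.Module):
    def __init__(self, dim=1024, hidden=512, n_shared=1, n_routed=63, top_k=7):
        super().__init__()
        def expert():
            return nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.shared = nn.ModuleList([expert() for _ in range(n_shared)])
        self.routed = nn.ModuleList([expert() for _ in range(n_routed)])
        self.gate = nn.Linear(dim, n_routed)
        self.top_k = top_k

    def forward(self, x):                                   # x: (num_tokens, dim)
        shared_out = sum(e(x) for e in self.shared)         # shared expert(s) see every token
        scores = self.gate(x).softmax(dim=-1)               # routing scores per token
        topv, topi = scores.topk(self.top_k, dim=-1)        # pick the top-k routed experts
        routed_out = torch.zeros_like(x)
        for t in range(x.size(0)):                          # naive per-token dispatch
            routed_out[t] = sum(v * self.routed[int(i)](x[t]) for v, i in zip(topv[t], topi[t]))
        return shared_out + routed_out
```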

Analysis on Expert Specialization (Section 4.5)

To understand the effectiveness of these innovations, the authors conducted an in-depth analysis on a 2B model (with 2.0B total parameters, 1 shared expert, and 7 activated routed experts). Here’s a simplified explanation of their findings:

  Background: 

  • GShard: GShard is an MoE architecture introduced by Google in 2020. In GShard, each token is typically assigned to 2 experts using a top-2 routing strategy; it serves as a baseline here. [https://arxiv.org/abs/2006.16668]

  • Pile Loss: Pile Loss refers to the cross-entropy loss measured on the Pile dataset—a large, diverse text corpus. A lower Pile Loss indicates that the model predicts tokens more accurately.[https://arxiv.org/abs/2101.00027]

Experiment 1: Disabling Top Routed Experts (Figure 4)

  Method:

  1. Identify the “top routed experts”—the experts with the highest routing scores for each token (i.e., those most frequently chosen).
  2. Manually disable a fraction of these top experts, then select the top‑K experts from the remaining ones.
  3. Measure the model’s loss (Pile Loss) on the Pile dataset.

  Results:

  • For DeepSeekMoE, once the most important experts are disabled, the loss immediately and significantly increases; whereas for GShard×1.5, the loss rises much more gradually.

  Conclusion:

  • DeepSeekMoE’s experts are more “irreplaceable,” meaning that each expert learns unique and essential knowledge.
  • In contrast, the experts in GShard×1.5 appear to have more overlapping or redundant knowledge, so disabling some experts has a smaller overall effect.

Experiment 2: Shared Experts Cannot Be Replaced by Routed Experts

  Method:

  • Disable the shared expert in DeepSeekMoE and, to keep the overall computation constant, activate one additional routed expert.
  • Observe the change in Pile Loss.

  Results:

  • The loss increases significantly from 1.808 to 2.414, indicating a clear performance drop.

  Conclusion:

  • The shared expert contains general, foundational knowledge that is vital to the model.
  • Routed experts cannot replace this common knowledge, underlining the importance of the shared expert.

Experiment 3: Varying the Number of Activated Experts vs. Pile Loss (Figure 5)

  Method:

  • Vary the number of activated routed experts in DeepSeekMoE from 3 to 7 and record the corresponding Pile Loss.
  • Compare these results with GShard, which typically uses top‑2 routing.

  Results:

  • DeepSeekMoE achieves a Pile Loss comparable to GShard (top‑2) when only 4 experts are activated.

  Conclusion:

  • DeepSeekMoE’s experts are of higher quality, or in other words, the knowledge they acquire is more concentrated.
  • This means that the model does not need to activate as many experts to achieve strong performance.

Scaling Up to DeepSeekMoE 16B

When scaling from the 2B model to the 16B model, several changes and improvements are introduced:

  Model Architecture Changes:

  • The 16B model uses 28 Transformer layers (compared to 9 layers in the 2B model).
  • The hidden dimension increases to 2048 with 16 attention heads.
  • In the MoE layers (all but the first layer), the design is adjusted: each MoE layer now includes 2 shared experts and 64 routed experts.
  • For each token, the model activates 2 shared experts along with 6 routed experts (compared to 1 shared expert and 7 routed experts in the 2B model).
  • The total parameters are approximately 16.4B, while the activated parameters per token are around 2.8B.

  Training Resources:

  • The 16B model is trained on a large-scale corpus with 2 trillion tokens.
  • Training is performed on ? × NVIDIA A100 or H800 nodes, each containing 8 GPUs.

  Runtime Resources:

  • The DeepSeekMoE 16B model can be deployed on a single GPU with 40GB of memory.

Key Points on DeepSeekMoE

  Fine-Grained Expert Segmentation:

  • Splits each large expert into many smaller sub-experts.
  • Allows the model to learn more focused and specialized knowledge.

  Shared Expert Isolation:

  • Dedicates specific experts to capture common or general knowledge.
  • Enables the other (routed) experts to concentrate solely on specialized tasks.


Thursday, February 29, 2024

A quick study of LLM-as-a-Judge in evaluating chatbots (MT-Bench)

This paper proposes using large language models (LLMs) as judges to evaluate other LLM-based chatbots. [https://arxiv.org/pdf/2306.05685]

The key claim: a strong model such as GPT-4 can consistently match human preferences at above 80% agreement, verified through crowdsourced votes.

Introduction

Modern chatbots engage in multi-turn conversations, making traditional single-turn benchmarks inadequate for capturing performance in this setting. To ensure fairness and validate the results, the authors introduce a crowdsourcing setup with over 30,000 votes from human users.

Evaluation Methods: Pairwise vs. Single-Answer

1. Pairwise Comparison

  • The judge sees two answers for a single question and decides: “Answer A,” “Answer B,” or “Tie.”
  • This generates a win rate for each chatbot.

2. Single-Answer Grading

  • The judge scores a single response on a numerical scale (e.g., 0–10).
  • While it’s more scalable, it relies on a consistent grading standard.

Challenges

1. Position Bias (Favoring the First Answer)

  • Solution: Swap the order of responses and re-evaluate. If the results change significantly, the bias is detected and accounted for: average the results across both orders, or discard inconsistent votes (see the sketch after this list).

2. Verbosity Bias (Preferring Longer Responses)

  • Solution: Use a “repetitive list” test—adding unnecessary details to see if the judge incorrectly favors the longer version. If so, prompt adjustments (e.g., explicitly instructing “conciseness is not a drawback”) are made to help redirect focus to substance over length.

3. Maintaining Consistency in Judgment: Ensure the LLM follows structured grading criteria by providing clear, step-by-step evaluation guidelines in the prompt.

4. Bias Reduction Through Aggregation: Collect multiple judgments from different prompts and combine them.
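
Here is a minimal sketch of the swap-and-re-evaluate idea from challenge 1 above; the `judge` callable is hypothetical and would wrap an LLM prompt that returns "A", "B", or "tie":

```python
def debiased_pairwise_judgment(judge, question, answer_a, answer_b):
    verdict_1 = judge(question, answer_a, answer_b)          # original order
    verdict_2 = judge(question, answer_b, answer_a)          # swapped order
    unswap = {"A": "B", "B": "A", "tie": "tie"}
    # keep the verdict only if it is consistent across both orders
    return verdict_1 if verdict_1 == unswap[verdict_2] else "tie"
```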

MT-Bench Setting

MT-Bench consists of 80 fixed questions in eight categories (writing, roleplay, reasoning, math, coding, extraction, STEM, and humanities), with GPT-4 serving as the main judge.

  • Table 7: Shows GPT-4’s high win rates against GPT-3.5, Vicuna-13B and LLaMA-13B across all 8 categories, underscoring its strong performance in multi-turn dialogue.
  • Figure 4: Compares how different judges (GPT-4, GPT-3.5, and humans) align on ranking models. GPT-4’s evaluations agree with human preferences at over 80% consistency.


Does the Judge Favor Its Own Answers?

Yes, the authors consider self-enhancement bias—an LLM may rate its own outputs more positively. By comparing GPT-4’s judgments to human votes, they find most of its decisions still align with humans, but caution remains that some bias might persist if the judge and the answer are from the same model. GPT-4 favors itself with a 10% higher win rate; Claude-v1 favors itself with a 25% higher win rate. However, they also favor other models and GPT-3.5 does not favor itself.

Key Points on MT-Bench

  • Fixed Benchmark: MT-Bench provides a standardized set of 80 multi-turn questions, ensuring fair model comparisons.
  • Crowdsourced Validation: Over 30K human votes confirm that GPT-4’s judgments align well with human preferences.
  • Limitations: The method provides relative rankings, not absolute scores, and self-enhancement biases still need further refinement.



Mixtral of Experts? Mixture of Experts? Try to explain Mixtral 8x7B

 What You Think MoE Is vs. What It Really Is

What You Think MoE Is: a perfect ensemble of experts working in harmony, each contributing their unique skills to solve complex problems.
What It Really Is: ... a collection of specialized sub-networks, guided by a gating mechanism that decides which experts to activate for each input.

The 2017 MoE Framework

The idea of MoE is not completely new. It is similar to ensemble learning, where multiple models work together to improve results. But Google researchers took this idea further in 2017 [https://arxiv.org/abs/1701.06538]. They introduced an MoE layer that consists of:

1. Experts: Many smaller models, each trained to handle different types of input.

2. A gating network: A special part of the model that decides which experts to activate.

For example, if there are 512 experts but only 2 are used per input, then the model remains large in total size, but each input only uses a small part of it. This keeps the computing cost low while increasing the model’s ability to handle complex tasks.


2024 Mixtral of Experts

In 2024, a new MoE model called Mixtral of Experts was introduced [https://arxiv.org/abs/2401.04088]. The name is a creative spin on Mixture of Experts. This model improves the way experts are selected and aims to make MoE work better in large AI systems. It continues the idea that bigger models do not have to be more expensive to run if only a few parts of them are used at a time.


How Experts Are Chosen (Routing Methods)


A key part of MoE is choosing which experts should handle each input. There are two main ways to do this:

  1. Token-based routing:
    • The model scores all the experts and picks the best ones for each input token (word or piece of data).
    • If we have a score vector \( \alpha = G(x) \), where \( G(x) \) is the gating network's output, we choose the top \( k \) experts and calculate the final output as: \[ \text{MoE}(x) = \sum_{j=1}^{k} \alpha_{i_j} \cdot E_{i_j}(x) \] where \( E_{i_j} \) is the expert function and \( i_j \) represents the chosen experts.
    • Problem: Some experts may get too much work while others are rarely used.
  2. Expert-based routing:
    • Instead of tokens choosing experts, experts choose which tokens to process.
    • Each expert has a fixed capacity, preventing some from being overloaded while others remain idle.
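
A minimal numpy sketch of the token-based top-k routing formula from item 1 above (the gate and experts are stand-in callables, not a real MoE implementation):

```python
import numpy as np

def moe_forward(x, experts, gate, k=2):
    # alpha = G(x): one gate score per expert; keep only the top-k and combine their outputs
    alpha = gate(x)                                    # shape: (num_experts,)
    chosen = np.argsort(alpha)[-k:]                    # indices of the k highest-scoring experts
    return sum(alpha[i] * experts[i](x) for i in chosen)

# toy usage: 4 "experts" that just scale the input, plus a fixed gate
experts = [lambda x, s=s: s * x for s in (0.5, 1.0, 1.5, 2.0)]
gate = lambda x: np.array([0.1, 0.4, 0.2, 0.3])
print(moe_forward(np.ones(3), experts, gate, k=2))     # combines experts 1 and 3
```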

Understanding MoE with a Simple Visualization

A great way to understand MoE is by looking at how tokens are assigned to experts. In the following visualization, each token is directed to different experts. Instead of every expert working on all tokens, the model learns to send different types of input to the most relevant experts.

This method shows why MoE is efficient. Instead of loading and computing all model parameters at once, only a fraction is used per step, saving both memory and time.

Key Points on MoE

  • Design & Performance: MoE takes a smart approach by selectively activating different expert models for each input. This leads to strong benchmark results while keeping computational costs manageable during training and inference.
  • Practical Trade-offs: While effective in research settings, local deployment can be challenging due to memory requirements. Running multiple expert models simultaneously needs substantial GPU resources, especially when handling the input-output flow between experts.
  • Research Value: The success of MoE reveals interesting insights about language model architecture. It shows that models can perform well using only a subset of parameters, suggesting promising directions for making future models more efficient.

