Thursday, February 29, 2024

Mixtral of Experts? Mixture of Experts? Trying to explain Mixtral 8x7B

 What You Think MoE Is vs. What It Really Is

What You Think MoE Is ... a perfect ensemble of experts working in harmony, each contributing its unique skills to solve complex problems.
What It Really Is ... a collection of specialized sub-networks, guided by a gating mechanism that decides which experts to activate for each input.

The 2017 MoE Framework

The idea of MoE is not completely new. It is similar to ensemble learning, where multiple models work together to improve results. But Google researchers took this idea further in 2017 [https://arxiv.org/abs/1701.06538]. They introduced an MoE layer that consists of:

1. Experts: Many smaller sub-networks, each of which learns to handle different types of input.

2. A gating network: A special part of the model that decides which experts to activate.

For example, if there are 512 experts but only 2 are used per input, the model remains large in total size, yet each input activates only a small part of it. This keeps the computing cost low while increasing the model's capacity to handle complex tasks.
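
To make this concrete, here is a minimal PyTorch sketch of such a layer with token-choice top-k routing. The class name, the toy sizes, and the choice of a softmax over the selected scores are illustrative assumptions, not the exact configuration of the 2017 paper or of any released model.

# Minimal sketch of a sparsely-gated MoE layer (illustrative, not a faithful reimplementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Many experts, but only a few are active per token."""
    def __init__(self, d_model, d_hidden, num_experts, top_k):
        super().__init__()
        self.top_k = top_k
        # Experts: independent feed-forward sub-networks.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )
        # Gating network: scores every expert for each token.
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x):
        # x: (num_tokens, d_model)
        scores = self.gate(x)                                    # (num_tokens, num_experts)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)  # keep only the best experts
        weights = F.softmax(topk_scores, dim=-1)                 # alpha over the chosen experts
        out = torch.zeros_like(x)
        # MoE(x) = sum_j alpha_{i_j} * E_{i_j}(x), evaluated expert by expert.
        for e, expert in enumerate(self.experts):
            token_pos, slot = (topk_idx == e).nonzero(as_tuple=True)
            if token_pos.numel() == 0:
                continue  # no token routed to this expert in this batch
            out[token_pos] += weights[token_pos, slot].unsqueeze(-1) * expert(x[token_pos])
        return out

# Toy usage: 8 experts in total, only 2 active per token.
moe = SparseMoELayer(d_model=16, d_hidden=64, num_experts=8, top_k=2)
tokens = torch.randn(5, 16)       # 5 token embeddings
print(moe(tokens).shape)          # torch.Size([5, 16])

The loop over experts is written for readability; efficient implementations instead gather all tokens routed to each expert and process them as one batch.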


2024 Mixtral of Experts

In January 2024, Mistral AI introduced a new MoE model, Mixtral 8x7B, in the paper Mixtral of Experts [https://arxiv.org/abs/2401.04088]. The name is a creative spin on Mixture of Experts. Each layer of the model contains 8 experts, of which only 2 are selected per token, and the work shows that MoE can scale to large language systems. It continues the idea that a bigger model does not have to be more expensive to run if only a few parts of it are used at a time.
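
A quick back-of-the-envelope calculation illustrates the point. The split below between shared parameters (attention, embeddings, norms) and per-expert feed-forward parameters is an assumed, rounded approximation for a Mixtral-8x7B-like model, chosen only to show how total and per-token parameter counts diverge; it is not an exact breakdown from the paper.

# Rough illustration: total parameters vs. parameters actually touched per token.
# Both numbers below are assumed approximations, not figures taken from the paper.
num_experts, top_k = 8, 2
shared_params = 2.0e9        # assumed: attention, embeddings, norms (used by every token)
params_per_expert = 5.6e9    # assumed: one expert's feed-forward weights across all layers

total_params  = shared_params + num_experts * params_per_expert   # what must be stored
active_params = shared_params + top_k * params_per_expert         # what one token uses
print(f"total:  {total_params / 1e9:.1f}B parameters")            # ~46.8B
print(f"active: {active_params / 1e9:.1f}B per token")            # ~13.2B

This is the sense in which the claim holds: inference cost tracks the active parameters, while memory still has to hold all of them.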


How Experts Are Chosen (Routing Methods)


A key part of MoE is choosing which experts should handle each input. There are two main ways to do this:

  1. Token-based routing:
    • The model scores all the experts and picks the best ones for each input token (word or piece of data).
    • If we have a score vector \( \alpha = G(x) \), where \( G(x) \) is the gating network's output, we choose the top \( k \) experts and calculate the final output as: \[ \text{MoE}(x) = \sum_{j=1}^{k} \alpha_{i_j} \cdot E_{i_j}(x) \] where \( E_{i_j} \) is the expert function and \( i_j \) represents the chosen experts.
    • Problem: the load can become unbalanced, with some experts receiving most of the tokens while others are rarely used; in practice this is usually countered with an auxiliary load-balancing loss.
  2. Expert-based routing:
    • Instead of tokens choosing experts, experts choose which tokens to process.
    • Each expert has a fixed capacity, preventing some from being overloaded while others remain idle.
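
As a rough sketch of this second option, the snippet below implements an expert-choice style of routing on top of the same kind of gate and experts as before: each expert selects a fixed number of its highest-affinity tokens, so no expert can be overloaded. The capacity value, the toy sizes, and the softmax over the expert dimension are assumptions made for illustration.

# Minimal sketch of expert-based ("experts pick tokens") routing.
import torch
import torch.nn as nn
import torch.nn.functional as F

def expert_choice_routing(x, gate, experts, capacity):
    """Each expert picks its own top-`capacity` tokens instead of tokens picking experts."""
    # x: (num_tokens, d_model); gate: nn.Linear(d_model, num_experts)
    affinity = F.softmax(gate(x), dim=-1)          # token-to-expert affinity scores
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        # Expert e keeps only its `capacity` highest-affinity tokens.
        weights, token_idx = affinity[:, e].topk(min(capacity, x.size(0)))
        out[token_idx] += weights.unsqueeze(-1) * expert(x[token_idx])
    return out

d_model, num_experts, capacity = 16, 4, 3
gate = nn.Linear(d_model, num_experts)
experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(num_experts))
tokens = torch.randn(8, d_model)                   # 8 token embeddings
print(expert_choice_routing(tokens, gate, experts, capacity).shape)   # torch.Size([8, 16])

The trade-off is visible in the code: with num_experts * capacity = 12 slots for 8 tokens, some tokens are processed by several experts while others may receive no expert at all in that layer.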

Understanding MoE with a Simple Visualization

A great way to understand MoE is by looking at how tokens are assigned to experts. In the following visualization, each token is directed to different experts. Instead of every expert working on all tokens, the model learns to send different types of input to the most relevant experts.

This view also makes clear why MoE is efficient. Instead of computing with every parameter for every token, only a fraction of the model is exercised per step, which saves compute and activation memory, even though all expert weights still have to be loaded (a point the trade-offs below return to).

Key Points on MoE

  • Design & Performance MoE takes a smart approach by selectively activating different expert models for each input. This leads to strong benchmark results while keeping computational costs manageable during training and inference.
  • Practical Trade-offs While effective in research settings, local deployment can be challenging due to memory requirements. Running multiple expert models simultaneously needs substantial GPU resources, especially when handling the input-output flow between experts.
  • Research Value The success of MoE reveals interesting insights about language model architecture. It shows that models can perform well using only a subset of parameters, suggesting promising directions for making future models more efficient.

