Let’s dive right into fine-grained expert segmentation and shared expert isolation—two key innovations from the “DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models” paper.[https://arxiv.org/pdf/2401.06066]
Fine-Grained Expert Segmentation (Section 3.1)
Instead of having a small number of large experts, DeepSeekMoE splits each expert into several smaller ones while keeping the total number of expert parameters and the per-token compute unchanged (see Figure 2(b) in the paper). A code sketch follows the design rationale below.
Why This Design?
- Increased Specialization: Each expert is responsible for a smaller, more specific part of the knowledge, avoiding the problem where a single expert has to handle too many diverse topics.
- More Flexible Combinations: Since more experts are available, the model has more unique ways to route tokens, making learning more efficient.
- Better Load Balancing: With more experts, the router can distribute tokens more evenly, avoiding scenarios where certain experts are overused while others remain underutilized.
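To make this concrete, here is a minimal PyTorch sketch of the segmentation idea, not the paper's code: the class `SmallExpert`, the segmentation factor `m`, and the GELU-based FFN are illustrative choices. It also reproduces the combination-count argument from Section 3.1 (16 experts with top-2 routing versus the same parameters split 4× into 64 experts with top-8 routing).

```python
import torch
import torch.nn as nn
from math import comb

# Combinatorial flexibility (Section 3.1 example): splitting 16 experts 4x into
# 64 smaller experts, while raising top-2 routing to top-8, vastly increases
# the number of possible expert combinations per token.
print(comb(16, 2))   # 120
print(comb(64, 8))   # 4,426,165,368

class SmallExpert(nn.Module):
    """One fine-grained expert: an FFN whose intermediate dimension is divided
    by the segmentation factor m, so that m small experts cost roughly as much
    as one original expert."""
    def __init__(self, hidden_dim: int, ffn_dim: int, m: int):
        super().__init__()
        self.up = nn.Linear(hidden_dim, ffn_dim // m)
        self.down = nn.Linear(ffn_dim // m, hidden_dim)
        self.act = nn.GELU()   # illustrative activation, not necessarily the paper's

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(self.act(self.up(x)))

# Segmentation in one line: replace N large experts with top-K routing by
# m*N small experts with top-(m*K) routing: same parameters and compute,
# but far more ways to combine experts per token.
experts = nn.ModuleList(SmallExpert(hidden_dim=1280, ffn_dim=5120, m=4) for _ in range(64))
```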
Shared Expert Isolation (Section 3.2)
Shared experts are always active and store general knowledge, allowing the routed experts to focus on more specialized knowledge (see Figure 2(c) in the paper). A code sketch follows the list below.
Why This Design?
- Reduces Redundancy: Without shared experts, multiple routed experts may end up learning overlapping “common knowledge,” leading to wasted parameters. By isolating this into dedicated shared experts, routed experts can specialize more effectively.
- Ensures Stability: Since shared experts are always active, they provide a consistent base of knowledge that helps guide learning across different routed experts.
- Prevents Knowledge Fragmentation: By keeping general knowledge centralized, it prevents situations where some routed experts lack crucial background information.
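Here is a minimal PyTorch sketch of an MoE layer with shared expert isolation, under simplifying assumptions (a dense softmax gate, a naive per-token loop, plain GELU FFNs, illustrative dimensions); the class and argument names are mine, not taken from DeepSeek's released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEWithSharedExperts(nn.Module):
    """Shared experts process every token unconditionally; routed experts are
    chosen per token by a top-K gate over token-to-expert affinity scores."""
    def __init__(self, hidden_dim, ffn_dim, num_shared, num_routed, top_k):
        super().__init__()
        def make_ffn():
            return nn.Sequential(nn.Linear(hidden_dim, ffn_dim),
                                 nn.GELU(),
                                 nn.Linear(ffn_dim, hidden_dim))
        self.shared = nn.ModuleList(make_ffn() for _ in range(num_shared))
        self.routed = nn.ModuleList(make_ffn() for _ in range(num_routed))
        self.gate = nn.Linear(hidden_dim, num_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):                                      # x: (num_tokens, hidden_dim)
        shared_out = sum(expert(x) for expert in self.shared)  # shared experts: always active
        affinity = F.softmax(self.gate(x), dim=-1)             # token-to-expert scores
        gate_vals, expert_ids = affinity.topk(self.top_k, dim=-1)
        routed_out = torch.zeros_like(x)
        for t in range(x.size(0)):                             # naive per-token loop for clarity
            for w, idx in zip(gate_vals[t], expert_ids[t]):
                routed_out[t] += w * self.routed[idx](x[t])
        return x + shared_out + routed_out                     # residual + shared + routed

layer = MoEWithSharedExperts(hidden_dim=1280, ffn_dim=640,
                             num_shared=1, num_routed=63, top_k=7)
print(layer(torch.randn(4, 1280)).shape)                       # torch.Size([4, 1280])
```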
DeepSeekMoE 2B Model Example
- 63 routed experts and 1 shared expert, which is always active for all tokens.
- Instead of activating just 2 experts per token (as in top-2 routing used by traditional MoE), DeepSeekMoE activates 7 routed experts per token.
- The 7 routed experts selected per token focus only on specialized knowledge, while the shared expert handles common linguistic and factual information (a small routing demo follows this list).
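As a toy illustration of that routing step (random weights and an illustrative hidden size, not the trained model), the gate scores all 63 routed experts and keeps the top 7, while the shared expert bypasses the gate entirely:

```python
import torch
import torch.nn.functional as F

hidden_dim, num_routed, top_k = 1280, 63, 7        # illustrative hidden size; 63 routed, top-7
token = torch.randn(hidden_dim)                    # one token's hidden state
gate_weight = torch.randn(num_routed, hidden_dim)  # stand-in for the learned gate

affinity = F.softmax(gate_weight @ token, dim=-1)  # token-to-expert affinity scores
gate_vals, expert_ids = affinity.topk(top_k)       # the 7 routed experts this token will use
print(sorted(expert_ids.tolist()))                 # the shared expert is used regardless of these
```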
By combining Fine-Grained Expert Segmentation and Shared Expert Isolation, DeepSeekMoE improves expert specialization and reduces redundancy. These changes lead to better overall efficiency while maintaining the same computation cost per token.
Analysis on Expert Specialization (Section 4.5)
To understand the effectiveness of these innovations, the authors conducted an in-depth analysis on a 2B model (with 2.0B total parameters, 1 shared expert, and 7 activated routed experts). Here’s a simplified explanation of their findings:
Background:
- GShard: GShard is an MoE architecture introduced by Google in 2020. In GShard, each token is typically assigned to 2 experts using a top-2 routing strategy; it serves as the paper's baseline.[https://arxiv.org/abs/2006.16668]
- Pile Loss: Pile Loss refers to the cross-entropy loss measured on the Pile dataset—a large, diverse text corpus. A lower Pile Loss indicates that the model predicts tokens more accurately.[https://arxiv.org/abs/2101.00027]
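In other words, Pile Loss is ordinary next-token cross-entropy computed over Pile text. A minimal sketch with random tensors and a made-up vocabulary size, just to show what quantity is being reported:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 32000)                 # (num_tokens, vocab_size) model outputs
targets = torch.randint(0, 32000, (4,))        # next-token labels from Pile text
pile_loss = F.cross_entropy(logits, targets)   # lower means better next-token prediction
print(pile_loss.item())
```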
Experiment 1: Disabling Top Routed Experts (Figure 4)
Method:
- Identify the “top routed experts” for each token, i.e., the experts that receive the highest routing scores and would normally be selected.
- Manually disable a fraction of these top experts, then select the top‑K experts from the remaining ones.
- Measure the model’s loss (Pile Loss) on the Pile dataset.
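A sketch of this ablation logic (my own illustrative code, not the authors' evaluation script; the disable ratio, tensor shapes, and function name are arbitrary):

```python
import torch
import torch.nn.functional as F

def route_with_disabled_experts(affinity: torch.Tensor, top_k: int, disable_ratio: float):
    """Mask out a fraction of the highest-affinity routed experts for each token,
    then pick the top-K routed experts from the remainder.
    affinity: (num_tokens, num_routed_experts) routing scores."""
    num_experts = affinity.size(-1)
    num_disabled = int(num_experts * disable_ratio)
    disabled = affinity.topk(num_disabled, dim=-1).indices   # the "top routed experts"
    masked = affinity.scatter(-1, disabled, float("-inf"))   # remove them from consideration
    gate_vals, expert_ids = masked.topk(top_k, dim=-1)       # re-select top-K from the rest
    return F.softmax(gate_vals, dim=-1), expert_ids

# Example: 63 routed experts, disable the top ~10%, then route each token to 7 of the rest.
weights, ids = route_with_disabled_experts(torch.randn(2, 63), top_k=7, disable_ratio=0.1)
print(ids)
```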
Results:
- For DeepSeekMoE, once the most important experts are disabled, the Pile Loss immediately and significantly increases, whereas for GShard×1.5 (a GShard baseline with 1.5× the expert parameters) the loss rises much more gradually.
Conclusion:
- DeepSeekMoE’s experts are more “irreplaceable,” meaning that each expert learns unique and essential knowledge.
- In contrast, the experts in GShard×1.5 appear to have more overlapping or redundant knowledge, so disabling some experts has a smaller overall effect.
Experiment 2: Shared Experts Cannot Be Replaced by Routed Experts
Method:
- Disable the shared expert in DeepSeekMoE and, to keep the overall computation constant, activate one additional routed expert.
- Observe the change in Pile Loss.
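Writing the two settings as illustrative config dictionaries (not the paper's code; the exact routed-pool size in the ablation is my assumption) makes the compute-matched swap explicit:

```python
# Both settings activate 8 experts per token, so per-token compute stays constant;
# only the always-active shared expert changes.
with_shared    = dict(num_shared=1, num_routed=63, top_k=7)   # Pile Loss 1.808 (reported)
without_shared = dict(num_shared=0, num_routed=63, top_k=8)   # Pile Loss 2.414 (reported)
```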
Results:
- The loss increases significantly from 1.808 to 2.414, indicating a clear performance drop.
Conclusion:
- The shared expert contains general, foundational knowledge that is vital to the model.
- Routed experts cannot replace this common knowledge, underlining the importance of the shared expert.
Experiment 3: Varying the Number of Activated Experts vs. Pile Loss (Figure 5)
Method:
- Vary the number of activated routed experts in DeepSeekMoE from 3 to 7 and record the corresponding Pile Loss.
- Compare these results with GShard, which typically uses top‑2 routing.
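A sketch of the sweep, with `build_model` and `evaluate_pile_loss` as hypothetical placeholders for training and evaluation code that is not shown here:

```python
def sweep_activated_experts(build_model, evaluate_pile_loss, ks=range(3, 8)):
    """Evaluate a DeepSeekMoE-style model configured with K activated routed
    experts (plus the always-on shared expert) and record the Pile Loss."""
    results = {}
    for k in ks:
        model = build_model(num_shared=1, num_routed=63, top_k=k)
        results[k] = evaluate_pile_loss(model)
    return results
```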
Results:
- DeepSeekMoE achieves a Pile Loss comparable to GShard (top‑2) when only 4 experts are activated.
Conclusion:
- DeepSeekMoE’s experts are of higher quality, or in other words, the knowledge they acquire is more concentrated.
- This means that the model does not need to activate as many experts to achieve strong performance.
Scaling Up to DeepSeekMoE 16B
When scaling from the 2B model to the 16B model, several changes and improvements are introduced:
Model Architecture Changes:
- The 16B model uses 28 Transformer layers (compared to 9 layers in the 2B model).
- The hidden dimension increases to 2048 with 16 attention heads.
- In the MoE layers (all but the first layer), the design is adjusted: each MoE layer now includes 2 shared experts and 64 routed experts.
- For each token, the model activates 2 shared experts along with 6 routed experts (compared to 1 shared expert and 7 routed experts in the 2B model).
- The total parameters are approximately 16.4B, while the activated parameters per token are around 2.8B.
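Collecting those numbers in one place as an illustrative config object (the field names are mine, not from the released code):

```python
from dataclasses import dataclass

@dataclass
class DeepSeekMoE16BConfig:
    num_layers: int = 28            # first layer keeps a dense FFN, the rest are MoE layers
    hidden_dim: int = 2048
    num_attention_heads: int = 16
    num_shared_experts: int = 2     # always active for every token
    num_routed_experts: int = 64
    routed_top_k: int = 6           # activated per token: 2 shared + 6 routed
    total_params: float = 16.4e9
    activated_params_per_token: float = 2.8e9

print(DeepSeekMoE16BConfig())
```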
Training Resources:
- The 16B model is trained on a large-scale corpus with 2 trillion tokens.
- Training is performed on clusters of NVIDIA A100 or H800 GPUs; each node contains 8 GPUs (the paper does not state how many nodes were used).
Runtime Resources:
- The DeepSeekMoE 16B model can be deployed on a single GPU with 40GB of memory.
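A back-of-the-envelope check (my own arithmetic, assuming bf16/fp16 weights and ignoring activations and the KV cache) shows why the weights fit:

```python
total_params = 16.4e9                   # total parameters of DeepSeekMoE 16B
weight_bytes = total_params * 2         # 2 bytes per parameter in bf16/fp16
print(f"{weight_bytes / 1e9:.1f} GB")   # ~32.8 GB of weights, under a 40 GB budget
```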
Key Points on DeepSeekMoE
Fine-Grained Expert Segmentation:
- Splits each large expert into many smaller sub-experts.
- Allows the model to learn more focused and specialized knowledge.
Shared Expert Isolation:
- Dedicates specific experts to capture common or general knowledge.
- Enables the other (routed) experts to concentrate solely on specialized tasks.