Towards a Comprehensive Scaling Law of Mixture-of-Experts
Overview
Overall Novelty Assessment
This paper contributes a comprehensive joint scaling law for Mixture-of-Experts models by jointly modeling five key factors: data size, total parameters, activated parameters, number of active experts, and shared-expert ratio. It resides in the 'Comprehensive Multi-Factor Scaling Laws' leaf, which contains only three papers total, indicating a relatively sparse research direction within the broader taxonomy. The sibling papers in this leaf similarly pursue unified scaling relationships across multiple MoE-specific dimensions, suggesting this work addresses a recognized gap in the theoretical understanding of sparse expert architectures.
The taxonomy reveals that theoretical scaling law formulation divides into comprehensive multi-factor approaches versus specialized single-dimension studies. Neighboring leaves examine parameter-FLOP tradeoffs, efficiency leverage, and granularity effects in isolation. The broader 'Empirical Characterization' branch contains work on optimal configuration and upcycling that validates scaling predictions experimentally rather than deriving formal laws. This paper's position suggests it bridges theoretical rigor with practical design considerations, sitting at the intersection of formal modeling and the empirical optimization work found in adjacent branches focused on resource allocation and hyperparameter tuning.
Among the 27 candidates examined through a limited semantic search, none clearly refute the three main contributions. For the comprehensive joint scaling law, 10 candidates were examined with no refutable overlaps; for the theoretical derivation of optimal configurations, 10 were examined with none refutable; and for the characterization of non-monotonic coupled effects, 7 were examined with none refutable. This suggests that, within the search scope, the specific combination of five factors and their coupled, non-monotonic treatment is distinct from prior work. However, the limited search scale means potentially relevant papers outside the top-K semantic matches may exist but were not examined.
Based on the examined literature, the work appears to occupy a relatively novel position by simultaneously addressing factor multiplicity, coupling relationships, and non-monotonic effects in a unified framework. The sparse population of the taxonomy leaf and absence of refuting candidates within the search scope support this impression, though the analysis acknowledges its limitation to 27 semantically similar papers rather than an exhaustive field survey.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors systematically identify five key factors affecting MoE performance and conduct 450 controlled experiments to construct a comprehensive joint scaling law. The law accounts for data size (D), total parameters (N), activated parameters (Na), number of activated experts (G), and shared-expert ratio (S), yielding more accurate predictions than existing scaling laws.
The authors derive closed-form expressions for the optimal number of activated experts (G), ratio of shared experts (S), and activated parameter ratio (Na/N). They show that the optimal G and S are independent of model size and data size, while the optimal Na/N decreases as total model size increases.
The authors identify and address three challenges unique to MoE scaling laws: multiple influencing factors, intricate coupling among those factors, and non-monotonic performance impacts. Their fine-grained investigation reveals that factors such as the activated parameter count Na and the number of activated experts G exhibit hook-shaped (non-monotonic) relationships with loss.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[9] Unified Scaling Laws for Routed Language Models
[27] Scaling Laws Across Model Architectures: A Comparative Analysis of Dense and MoE Models in Large Language Models
Contribution Analysis
Detailed comparisons for each claimed contribution
Comprehensive joint MoE scaling law with five key factors
The authors systematically identify five key factors affecting MoE performance and conduct 450 controlled experiments to construct a comprehensive joint scaling law. The law accounts for data size (D), total parameters (N), activated parameters (Na), number of activated experts (G), and shared-expert ratio (S), yielding more accurate predictions than existing scaling laws.
[1] Mixture of experts in large language models
[3] DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
[8] Tutel: Adaptive Mixture-of-Experts at Scale
[12] Joint MoE Scaling Laws: Mixture of Experts Can Be Memory Efficient
[13] Scaling laws for fine-grained mixture of experts
[14] Scaling Vision with Sparse Mixture of Experts
[18] Glam: Efficient scaling of language models with mixture-of-experts
[24] Megascale-infer: Efficient mixture-of-experts model serving with disaggregated expert parallelism
[51] Mixture-of-experts with expert choice routing
[52] From Sparse to Soft Mixtures of Experts
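The fitting procedure behind a joint power-law scaling law can be sketched as follows. This is a hedged illustration only: the functional form, coefficients, and synthetic data below are assumptions, not the paper's fitted law, and for brevity the sketch fits only the D, N, and Na terms (the paper's law additionally models G and S).

```python
import numpy as np

def joint_loss(D, N, Na, c, aD, aN, aNa):
    """Hypothetical multiplicative power-law form:
    L = c * D^-aD * N^-aN * Na^-aNa (illustrative only)."""
    return c * D**-aD * N**-aN * Na**-aNa

rng = np.random.default_rng(0)
n = 450  # the paper reports 450 controlled experiments
D = rng.uniform(1e9, 1e11, n)        # training tokens
N = rng.uniform(1e8, 1e10, n)        # total parameters
Na = N * rng.uniform(0.05, 0.5, n)   # activated parameters

true = (20.0, 0.05, 0.08, 0.04)      # assumed "ground-truth" coefficients
L = joint_loss(D, N, Na, *true)

# A multiplicative power law is linear in log space:
# log L = log c - aD*log D - aN*log N - aNa*log Na
X = np.column_stack([np.ones(n), -np.log(D), -np.log(N), -np.log(Na)])
beta, *_ = np.linalg.lstsq(X, np.log(L), rcond=None)
fitted = (float(np.exp(beta[0])),) + tuple(float(b) for b in beta[1:])
```

On noiseless synthetic data the log-linear least-squares fit recovers the assumed exponents exactly; real experimental losses would require a nonlinear fit with an irreducible-loss term.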
Theoretical derivation of optimal MoE configurations
The authors derive closed-form expressions for the optimal number of activated experts (G), ratio of shared experts (S), and activated parameter ratio (Na/N). They show that the optimal G and S are independent of model size and data size, while the optimal Na/N decreases as total model size increases.
[3] DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
[51] Mixture-of-experts with expert choice routing
[60] Understanding and leveraging the expert specialization of context faithfulness in mixture-of-experts llms
[61] A survey on inference optimization techniques for mixture of experts models
[62] A Survey on Mixture of Experts
[63] Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts
[64] Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts
[65] Harder tasks need more experts: Dynamic routing in moe models
[66] Improving Expert Specialization in Mixture of Experts
[67] ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference
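How a closed-form optimum falls out once a parametric loss is fixed can be sketched with a toy example. The loss form L(G) = a*G^-b + c*G and its coefficients are assumptions chosen for illustration, not the paper's fitted law; the sketch only shows the derivation pattern (set dL/dG = 0, solve for G).

```python
# Hedged sketch: with a hypothesized loss contribution
#   L(G) = a*G**-b + c*G
# (benefit of more active experts vs. a linear overhead), the
# first-order condition dL/dG = -a*b*G**-(b+1) + c = 0 gives
#   G* = (a*b/c)**(1/(b+1)).
def optimal_active_experts(a, b, c):
    return (a * b / c) ** (1.0 / (b + 1.0))

a, b, c = 2.0, 0.5, 0.01  # assumed coefficients, for illustration only
g_star = optimal_active_experts(a, b, c)

# Verify the stationarity condition numerically with a central difference
loss = lambda g: a * g**-b + c * g
eps = 1e-6
grad = (loss(g_star + eps) - loss(g_star - eps)) / (2 * eps)
```

Note the qualitative match with the paper's claim: if a, b, and c do not depend on model or data size, then neither does G*, which is consistent with the finding that the optimal G and S are size-independent.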
Characterization of non-monotonic and coupled factor effects in MoE
The authors identify and address three challenges unique to MoE scaling laws: multiple influencing factors, intricate coupling among those factors, and non-monotonic performance impacts. Their fine-grained investigation reveals that factors such as the activated parameter count Na and the number of activated experts G exhibit hook-shaped (non-monotonic) relationships with loss.
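What "coupled, non-monotonic" means in practice can be sketched on a toy loss surface. Everything below is an assumption for illustration: the surface, its coefficients, and in particular the g*s interaction term (which makes the optimal S depend on G) are hypothetical stand-ins, not the paper's fitted law.

```python
import numpy as np

def loss_surface(g, s, a=2.0, b=0.5, c=0.01, k=0.5, q=0.002):
    """Hypothetical loss over active experts g and shared-expert ratio s.
    a*g**-b + c*g gives a hook shape in g; k*(s-0.3)**2 gives a bowl in s;
    q*g*s couples the two factors (all coefficients are assumptions)."""
    return a * g**-b + c * g + k * (s - 0.3) ** 2 + q * g * s

G = np.arange(1, 65)              # candidate numbers of active experts
S = np.linspace(0.0, 1.0, 101)    # candidate shared-expert ratios
GG, SS = np.meshgrid(G, S, indexing="ij")
L = loss_surface(GG, SS)

# Joint minimizer over the grid
i, j = np.unravel_index(np.argmin(L), L.shape)
g_star, s_star = int(G[i]), float(S[j])

# Hook shape along G at the optimal S: loss falls, then rises again
row = L[:, j]
```

On this toy surface the coupling term shifts the optimal S below its uncoupled value of 0.3, and the slice along G at the optimal S is non-monotonic, mirroring the hook-shaped curves the paper reports for Na and G.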