Towards a Comprehensive Scaling Law of Mixture-of-Experts

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: scaling law; MoE; LLM
Abstract:

Mixture-of-Experts (MoE) models have become the consensus approach for parameter-efficient scaling and cost-effective deployment of large language models. However, existing scaling laws for dense models do not transfer to MoE models, owing to three critical challenges: the multiplicity of influencing factors, their intricate coupling relationships, and the non-monotonic nature of their performance impacts. Together, these challenges necessitate a fine-grained investigation into MoE-specific scaling laws. In this work, we perform a systematic decomposition of MoE settings, identifying five key factors that influence model performance from both the size and the structural perspective: data size (D), total model size (N), activated model size (N_a), number of active experts (G), and the ratio of shared experts (S). We design 450 controlled experiments to characterize their marginal effects, ultimately constructing a comprehensive and precise joint MoE scaling law that accounts for all essential factors. Furthermore, we derive both the theoretically optimal and the practically efficiency-aware optimal configurations for G, S, and N_a/N, with detailed analyses. Our results demonstrate that the optimal settings for G and S are independent of both model architecture and data size, and that the optimal activation ratio N_a/N becomes sparser as N scales. The proposed MoE scaling law can serve as accurate and insightful guidance for future MoE model design and training.
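The abstract does not reproduce the law's functional form. As a hedged illustration only, the kind of joint five-factor fit described above can be sketched with a made-up family (hypothetical coefficients, assumed optima G* = 4 and S* = 0.2), fit to synthetic measurements:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical five-factor loss family (an illustration, NOT the paper's
# fitted law): additive power laws in data size D, total size N, and
# activated size Na, plus hook-shaped penalties in the number of active
# experts G and the shared-expert ratio S around assumed optima G* = 4
# and S* = 0.2.
def moe_loss(X, E, A, a, B, b, C, c, cG, cS):
    D, N, Na, G, S = X
    return (E + A * D ** (-a) + B * N ** (-b) + C * Na ** (-c)
            + cG * (np.log(G) - np.log(4.0)) ** 2
            + cS * (S - 0.2) ** 2)

rng = np.random.default_rng(0)
D = rng.uniform(1e9, 1e11, 300)             # tokens
N = rng.uniform(1e8, 1e10, 300)             # total parameters
Na = N * rng.uniform(0.05, 0.5, 300)        # activated parameters
G = rng.integers(1, 17, 300).astype(float)  # active experts
S = rng.uniform(0.0, 0.5, 300)              # shared-expert ratio

true = (1.7, 400.0, 0.35, 300.0, 0.30, 150.0, 0.25, 0.02, 0.05)
y = moe_loss((D, N, Na, G, S), *true)       # noiseless synthetic "losses"

# A reasonable initialization keeps the 9-parameter nonlinear fit stable.
p0 = (1.6, 350.0, 0.33, 250.0, 0.28, 120.0, 0.23, 0.03, 0.04)
popt, _ = curve_fit(moe_loss, (D, N, Na, G, S), y, p0=p0, maxfev=50000)
pred = moe_loss((D, N, Na, G, S), *popt)
print(float(np.max(np.abs(pred - y))))      # near-zero refit error
```

In the real setting, y would come from the 450 controlled training runs rather than from the assumed family itself, and the functional form would be chosen to match the observed marginal effects.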

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

This paper contributes a comprehensive joint scaling law for Mixture-of-Experts models by systematically decomposing five key factors: data size, total parameters, activated parameters, number of active experts, and shared expert ratio. It resides in the 'Comprehensive Multi-Factor Scaling Laws' leaf, which contains only three papers total, indicating a relatively sparse research direction within the broader taxonomy. The sibling papers in this leaf similarly pursue unified scaling relationships across multiple MoE-specific dimensions, suggesting this work addresses a recognized gap in theoretical understanding of sparse expert architectures.

The taxonomy reveals that theoretical scaling law formulation divides into comprehensive multi-factor approaches versus specialized single-dimension studies. Neighboring leaves examine parameter-FLOP tradeoffs, efficiency leverage, and granularity effects in isolation. The broader 'Empirical Characterization' branch contains work on optimal configuration and upcycling that validates scaling predictions experimentally rather than deriving formal laws. This paper's position suggests it bridges theoretical rigor with practical design considerations, sitting at the intersection of formal modeling and the empirical optimization work found in adjacent branches focused on resource allocation and hyperparameter tuning.

Among the 27 candidates examined through limited semantic search, none clearly refute the three main contributions. For the comprehensive joint scaling law, 10 candidates were examined with zero refutable overlaps; for the theoretical derivation of optimal configurations, another 10 were examined with none refutable; and for the characterization of non-monotonic coupled effects, 7 were examined with none refutable. This suggests that, within the search scope, the specific combination of five factors and their coupled, non-monotonic treatment appears distinct from prior work. However, the limited search scale means potentially relevant papers outside the top-K semantic matches may exist but were not examined.

Based on the examined literature, the work appears to occupy a relatively novel position by simultaneously addressing factor multiplicity, coupling relationships, and non-monotonic effects in a unified framework. The sparse population of the taxonomy leaf and absence of refuting candidates within the search scope support this impression, though the analysis acknowledges its limitation to 27 semantically similar papers rather than an exhaustive field survey.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 27
Refutable Papers: 0

Research Landscape Overview

Core task: Scaling laws for Mixture-of-Experts language models. The field of MoE scaling research has evolved into a rich taxonomy spanning theoretical foundations, empirical optimization, architectural innovations, and practical deployment considerations. At the highest level, the taxonomy divides into branches addressing theoretical scaling law formulation (where researchers derive predictive relationships between model size, compute, and performance), empirical characterization (focused on experimental validation and hyperparameter tuning), architecture design (exploring expert specialization patterns and routing mechanisms), training and inference systems (tackling distributed computation challenges), multimodal extensions (adapting MoE principles to vision-language tasks), model compression (reducing memory and latency costs), comprehensive surveys, large-scale production case studies, and alternative paradigms for specialized applications.

Works like DeepSeekMoE[3] and OLMoE[7] exemplify architectural innovations, while systems research such as Tutel[8] and DeepSpeed MoE[15] address the engineering challenges of scaling MoE models efficiently. Within this landscape, particularly active lines of inquiry contrast dense versus sparse scaling trade-offs, optimal expert granularity, and the interplay between parameter count and FLOPs as explored in Parameters vs FLOPs[4] and Inference Optimal MoE[6].

The theoretical branch, where Comprehensive MoE Scaling[0] resides, focuses on deriving unified predictive laws that account for multiple factors (expert count, routing strategies, and activation sparsity) simultaneously. This work sits alongside Unified Routed Scaling[9] and Dense vs MoE[27], which similarly investigate how different architectural choices influence scaling behavior. Compared to empirical studies like Upcycling MoE Scaling[5] that validate scaling through experimental sweeps, Comprehensive MoE Scaling[0] emphasizes formal modeling of the relationships governing MoE efficiency, aiming to provide principled guidance for practitioners navigating the complex design space of sparse expert architectures.

Claimed Contributions

Comprehensive joint MoE scaling law with five key factors

The authors systematically identify five key factors affecting MoE performance and conduct 450 controlled experiments to construct a comprehensive joint scaling law. This law accounts for data size, total model size, activated model size, number of active experts, and ratio of shared experts, providing more accurate predictions than existing scaling laws.

10 retrieved papers

Theoretical derivation of optimal MoE configurations

The authors derive closed-form expressions for optimal values of the number of activated experts, ratio of shared experts, and activated parameter ratio. They show that optimal G and S are independent of model size and data size, while optimal Na/N decreases as total model size increases.

10 retrieved papers
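As a toy illustration of the last claim (not the paper's derivation), suppose the loss at fixed total size N trades a capacity term in the activated parameters r * N against a linear activation-cost penalty; the loss-minimizing ratio then shrinks as N grows. All constants below are invented:

```python
from scipy.optimize import minimize_scalar

# Assumed toy loss in the activation ratio r = N_a / N at fixed total size N:
# a capacity term that improves with activated parameters, (r * N)^(-c), plus
# a linear activation-cost penalty k * r. All constants are invented.
c, C, k = 0.3, 200.0, 0.5

def loss(r, N):
    return C * (r * N) ** (-c) + k * r

ratios = []
for N in (1e8, 1e9, 1e10, 1e11):
    res = minimize_scalar(loss, bounds=(1e-3, 1.0), method="bounded", args=(N,))
    ratios.append(res.x)

print([round(r, 4) for r in ratios])  # the optimal ratio shrinks as N grows
```

For this assumed family the optimum has the closed form r* = (cC / (k N^c))^(1/(1+c)), which decays with N, mirroring the reported trend toward sparser activation at larger scale.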

Characterization of non-monotonic and coupled factor effects in MoE

The authors identify and address three critical challenges unique to MoE scaling laws: multiple influencing factors, intricate coupling relationships among factors, and non-monotonic performance impacts. They provide a fine-grained investigation revealing how factors like Na and G exhibit hook-shaped relationships with loss.

7 retrieved papers
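A hook-shaped (non-monotonic) dependence with a scale-independent optimum is easy to visualize with an assumed separable form: if G enters the loss only through an additive penalty, the best G is the same at every (N, D), consistent with the independence claim above. The form and constants here are illustrative, not taken from the paper:

```python
import numpy as np

# Assumed separable loss: the dependence on the number of active experts G
# enters only through an additive hook-shaped penalty around a hypothetical
# optimum G* = 8, so the loss-minimizing G is identical at every (N, D).
def loss(N, D, G, g_star=8.0):
    return (300.0 * N ** (-0.3) + 400.0 * D ** (-0.35)
            + 0.02 * (np.log(G) - np.log(g_star)) ** 2)

grid = np.arange(1, 65)  # candidate expert counts G = 1 .. 64
best = {(N, D): int(grid[np.argmin(loss(N, D, grid))])
        for N in (1e8, 1e10) for D in (1e9, 1e11)}
print(best)  # every (N, D) pair selects the same G
```

If instead the G term were coupled multiplicatively with N or D, the argmin over G would shift with scale, which is why the separable-versus-coupled distinction matters for the fitted law.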
