Towards a Comprehensive Scaling Law of Mixture-of-Experts

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: scaling law; MoE; LLM
Abstract:

Mixture-of-Experts (MoE) models have become the consensus approach for parameter-efficient scaling and cost-effective deployment of large language models. However, existing scaling laws for dense models do not transfer to MoE models, owing to three critical challenges: the multiplicity of influencing factors, their intricate coupling relationships, and the non-monotonic nature of their performance impacts. Together, these challenges necessitate a fine-grained investigation into MoE-specific scaling laws. In this work, we perform a systematic decomposition of MoE settings, identifying five key factors that influence model performance from both the size and the structural perspective: data size (D), total model size (N), activated model size (N_a), number of active experts (G), and the ratio of shared experts (S). We design 450 controlled experiments to characterize their marginal effects, ultimately constructing a comprehensive and precise joint MoE scaling law that accounts for all essential factors. Furthermore, we derive both the theoretically optimal and the practically efficiency-aware optimal configurations for G, S, and N_a/N, with detailed analyses. Our results demonstrate that the optimal settings for G and S are independent of both model architecture and data size, and that the optimal activation ratio N_a/N becomes sparser as N scales. The proposed MoE scaling law can serve as accurate and insightful guidance for future MoE model design and training.
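The abstract does not reproduce the law's functional form. As a hedged illustration only, the kind of joint five-factor fit described above can be sketched with a made-up family (hypothetical coefficients, assumed optima G* = 4 and S* = 0.2), fit to synthetic measurements:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical five-factor loss family (an illustration, NOT the paper's
# fitted law): additive power laws in data size D, total size N, and
# activated size Na, plus hook-shaped penalties in the number of active
# experts G and the shared-expert ratio S around assumed optima G* = 4
# and S* = 0.2.
def moe_loss(X, E, A, a, B, b, C, c, cG, cS):
    D, N, Na, G, S = X
    return (E + A * D ** (-a) + B * N ** (-b) + C * Na ** (-c)
            + cG * (np.log(G) - np.log(4.0)) ** 2
            + cS * (S - 0.2) ** 2)

rng = np.random.default_rng(0)
D = rng.uniform(1e9, 1e11, 300)             # tokens
N = rng.uniform(1e8, 1e10, 300)             # total parameters
Na = N * rng.uniform(0.05, 0.5, 300)        # activated parameters
G = rng.integers(1, 17, 300).astype(float)  # active experts
S = rng.uniform(0.0, 0.5, 300)              # shared-expert ratio

true = (1.7, 400.0, 0.35, 300.0, 0.30, 150.0, 0.25, 0.02, 0.05)
y = moe_loss((D, N, Na, G, S), *true)       # noiseless synthetic "losses"

# A reasonable initialization keeps the 9-parameter nonlinear fit stable.
p0 = (1.6, 350.0, 0.33, 250.0, 0.28, 120.0, 0.23, 0.03, 0.04)
popt, _ = curve_fit(moe_loss, (D, N, Na, G, S), y, p0=p0, maxfev=50000)
pred = moe_loss((D, N, Na, G, S), *popt)
print(float(np.max(np.abs(pred - y))))      # near-zero refit error
```

In the real setting, y would come from the 450 controlled training runs rather than from the assumed family itself, and the functional form would be chosen to match the observed marginal effects.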

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

This paper contributes a comprehensive joint scaling law for Mixture-of-Experts models by systematically decomposing five key factors: data size, total parameters, activated parameters, number of active experts, and shared expert ratio. It resides in the 'Comprehensive Multi-Factor Scaling Laws' leaf, which contains only three papers total, indicating a relatively sparse research direction within the broader taxonomy. The sibling papers in this leaf similarly pursue unified scaling relationships across multiple MoE-specific dimensions, suggesting this work addresses a recognized gap in theoretical understanding of sparse expert architectures.

The taxonomy reveals that theoretical scaling law formulation divides into comprehensive multi-factor approaches versus specialized single-dimension studies. Neighboring leaves examine parameter-FLOP tradeoffs, efficiency leverage, and granularity effects in isolation. The broader 'Empirical Characterization' branch contains work on optimal configuration and upcycling that validates scaling predictions experimentally rather than deriving formal laws. This paper's position suggests it bridges theoretical rigor with practical design considerations, sitting at the intersection of formal modeling and the empirical optimization work found in adjacent branches focused on resource allocation and hyperparameter tuning.

Among the 27 candidates examined through limited semantic search, none clearly refute the three main contributions. For the comprehensive joint scaling law, 10 candidates were examined with zero refutable overlaps; for the theoretical derivation of optimal configurations, another 10 were examined with none refutable; and for the characterization of non-monotonic coupled effects, 7 were examined with none refutable. This suggests that, within the search scope, the specific combination of five factors and their coupled, non-monotonic treatment appears distinct from prior work. However, the limited search scale means potentially relevant papers outside the top-K semantic matches may exist but were not examined.

Based on the examined literature, the work appears to occupy a relatively novel position by simultaneously addressing factor multiplicity, coupling relationships, and non-monotonic effects in a unified framework. The sparse population of the taxonomy leaf and absence of refuting candidates within the search scope support this impression, though the analysis acknowledges its limitation to 27 semantically similar papers rather than an exhaustive field survey.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 27
Refutable Papers: 0

Research Landscape Overview

Core task: Scaling laws for Mixture-of-Experts language models. The field of MoE scaling research has evolved into a rich taxonomy spanning theoretical foundations, empirical optimization, architectural innovations, and practical deployment considerations. At the highest level, the taxonomy divides into branches addressing theoretical scaling law formulation (where researchers derive predictive relationships between model size, compute, and performance), empirical characterization (focused on experimental validation and hyperparameter tuning), architecture design (exploring expert specialization patterns and routing mechanisms), training and inference systems (tackling distributed computation challenges), multimodal extensions (adapting MoE principles to vision-language tasks), model compression (reducing memory and latency costs), comprehensive surveys, large-scale production case studies, and alternative paradigms for specialized applications.

Works like DeepSeekMoE[3] and OLMoE[7] exemplify architectural innovations, while systems research such as Tutel[8] and DeepSpeed MoE[15] address the engineering challenges of scaling MoE models efficiently. Within this landscape, particularly active lines of inquiry contrast dense versus sparse scaling trade-offs, optimal expert granularity, and the interplay between parameter count and FLOPs as explored in Parameters vs FLOPs[4] and Inference Optimal MoE[6].

The theoretical branch, where Comprehensive MoE Scaling[0] resides, focuses on deriving unified predictive laws that account for multiple factors (expert count, routing strategies, and activation sparsity) simultaneously. This work sits alongside Unified Routed Scaling[9] and Dense vs MoE[27], which similarly investigate how different architectural choices influence scaling behavior. Compared to empirical studies like Upcycling MoE Scaling[5] that validate scaling through experimental sweeps, Comprehensive MoE Scaling[0] emphasizes formal modeling of the relationships governing MoE efficiency, aiming to provide principled guidance for practitioners navigating the complex design space of sparse expert architectures.

Claimed Contributions

Comprehensive joint MoE scaling law with five key factors

The authors systematically identify five key factors affecting MoE performance and conduct 450 controlled experiments to construct a comprehensive joint scaling law. This law accounts for data size, total model size, activated model size, number of active experts, and ratio of shared experts, providing more accurate predictions than existing scaling laws.

10 retrieved papers

Theoretical derivation of optimal MoE configurations

The authors derive closed-form expressions for optimal values of the number of activated experts, ratio of shared experts, and activated parameter ratio. They show that optimal G and S are independent of model size and data size, while optimal Na/N decreases as total model size increases.

10 retrieved papers
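As a toy illustration of the last claim (not the paper's derivation), suppose the loss at fixed total size N trades a capacity term in the activated parameters r * N against a linear activation-cost penalty; the loss-minimizing ratio then shrinks as N grows. All constants below are invented:

```python
from scipy.optimize import minimize_scalar

# Assumed toy loss in the activation ratio r = N_a / N at fixed total size N:
# a capacity term that improves with activated parameters, (r * N)^(-c), plus
# a linear activation-cost penalty k * r. All constants are invented.
c, C, k = 0.3, 200.0, 0.5

def loss(r, N):
    return C * (r * N) ** (-c) + k * r

ratios = []
for N in (1e8, 1e9, 1e10, 1e11):
    res = minimize_scalar(loss, bounds=(1e-3, 1.0), method="bounded", args=(N,))
    ratios.append(res.x)

print([round(r, 4) for r in ratios])  # the optimal ratio shrinks as N grows
```

For this assumed family the optimum has the closed form r* = (cC / (k N^c))^(1/(1+c)), which decays with N, mirroring the reported trend toward sparser activation at larger scale.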

Characterization of non-monotonic and coupled factor effects in MoE

The authors identify and address three critical challenges unique to MoE scaling laws: multiple influencing factors, intricate coupling relationships among factors, and non-monotonic performance impacts. They provide a fine-grained investigation revealing how factors like Na and G exhibit hook-shaped relationships with loss.

7 retrieved papers
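A hook-shaped (non-monotonic) dependence with a scale-independent optimum is easy to visualize with an assumed separable form: if G enters the loss only through an additive penalty, the best G is the same at every (N, D), consistent with the independence claim above. The form and constants here are illustrative, not taken from the paper:

```python
import numpy as np

# Assumed separable loss: the dependence on the number of active experts G
# enters only through an additive hook-shaped penalty around a hypothetical
# optimum G* = 8, so the loss-minimizing G is identical at every (N, D).
def loss(N, D, G, g_star=8.0):
    return (300.0 * N ** (-0.3) + 400.0 * D ** (-0.35)
            + 0.02 * (np.log(G) - np.log(g_star)) ** 2)

grid = np.arange(1, 65)  # candidate expert counts G = 1 .. 64
best = {(N, D): int(grid[np.argmin(loss(N, D, grid))])
        for N in (1e8, 1e10) for D in (1e9, 1e11)}
print(best)  # every (N, D) pair selects the same G
```

If instead the G term were coupled multiplicatively with N or D, the argmin over G would shift with scale, which is why the separable-versus-coupled distinction matters for the fitted law.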
