Generalization and Scaling Laws for Mixture-of-Experts Transformers

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: mixture of experts, scaling laws, LLM, sparse Transformers, generalization bounds
Abstract:

We develop a theory of generalization and scaling for Mixture-of-Experts (MoE) Transformers that cleanly separates *active* per-input capacity from *routing* combinatorics. Conditioning on fixed routing patterns and union-bounding across them, we obtain a sup-norm covering-number bound whose metric entropy scales with the active parameter budget and incurs a MoE-specific overhead. Combining this with a standard ERM argument for squared loss, we provide a generalization bound under a d-dimensional manifold model and C^β targets, showing that approximation and estimation trade off in the same way as in dense networks once active parameters are counted appropriately. We further prove a constructive approximation theorem for MoE architectures, demonstrating that accuracy can be improved either by scaling active capacity or by increasing the number of available experts, with the better of the two mechanisms prevailing. From these results we derive neural scaling laws covering model scaling, data scaling, and compute-optimal tradeoffs. The theory highlights that enlarging the expert pool at fixed sparsity influences performance only through a mild logarithmic routing term, whereas increasing active capacity per input drives the main gains in generalization and approximation. These insights provide principled guidance for the design of efficient sparse Transformer systems and clarify the fundamental tradeoffs underlying their empirical scaling behavior.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper develops a theoretical framework for MoE generalization and scaling, deriving bounds that separate active per-input capacity from routing combinatorics and proving constructive approximation theorems. It resides in the 'Generalization Theory and Approximation' leaf under 'Theoretical Foundations and Scaling Laws', which contains only two papers total. This is a notably sparse research direction within the broader taxonomy of 50 papers across 36 topics, suggesting that rigorous theoretical characterization of MoE capacity and convergence remains relatively underexplored compared to empirical scaling studies, architectural innovations, and application-focused work.

The taxonomy tree reveals that theoretical work on MoE is split between 'Generalization Theory and Approximation' (2 papers) and 'Empirical Scaling Laws' (3 papers), with the latter examining granularity trade-offs and practical scaling behavior. Neighboring branches focus on architectural design (routing mechanisms, expert construction, fine-grained architectures) and training systems (distributed parallelism, inference optimization). The scope note for this leaf explicitly excludes purely empirical scaling studies, positioning the paper as one of very few attempts to provide mathematical foundations for understanding MoE capacity, approximation guarantees, and scaling behavior through formal analysis rather than experimental observation.

Among 26 candidates examined across three contributions, the analysis found limited prior work overlap. The generalization bound contribution examined 6 candidates with none clearly refuting it, while the constructive approximation theorem examined 10 candidates with no refutations. The neural scaling laws contribution examined 10 candidates and found 3 potentially refutable matches, suggesting this aspect has more substantial prior work. The limited search scope (top-K semantic search plus citation expansion) means these statistics reflect a targeted sample rather than exhaustive coverage, but the low refutation rates across most contributions suggest the theoretical approach may offer fresh perspectives within the examined literature.

Based on the limited search scope of 26 candidates, the work appears to occupy a relatively sparse theoretical niche, with most prior MoE research focusing on empirical scaling, architectural design, or applications. The taxonomy structure confirms that rigorous generalization theory for MoE remains underdeveloped compared to other research directions. However, the analysis cannot rule out relevant theoretical work outside the top-K semantic matches examined, and the three refutable candidates for scaling laws indicate some overlap with existing empirical or theoretical scaling studies.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 26
Refutable papers: 3

Research Landscape Overview

Core task: Generalization and scaling laws for Mixture-of-Experts Transformers. The field has evolved into a rich ecosystem spanning theoretical foundations, architectural innovations, training systems, domain-specific applications, generalization studies, comparative surveys, and emerging cross-domain extensions. Theoretical Foundations and Scaling Laws examine how MoE models scale with parameters and data, exploring approximation guarantees and capacity trade-offs that underpin works like Switch Transformers[7] and GLaM[12]. Architectural Innovations and Design focus on novel routing mechanisms, expert specialization strategies, and structural variants such as soft versus sparse gating, exemplified by Sparse to Soft MoE[10] and Multi-head MoE[40]. Training Systems and Deployment address the engineering challenges of distributed computation, memory efficiency, and inference optimization, with contributions like FasterMoE[13] and Megascale-MoE[31]. Applications and Domain-Specific Adaptations demonstrate MoE utility in vision, time series, biomedical domains, and multimodal settings, while Generalization and Transfer Learning investigates how sparse expert selection affects out-of-distribution robustness and task transfer.

A particularly active line of work centers on understanding generalization properties and scaling behavior in sparse MoE architectures. MoE Scaling Laws[0] contributes to this area by analyzing how expert count, sparsity patterns, and model depth influence generalization bounds and approximation capacity. This theoretical perspective complements empirical scaling studies like Fine-grained MoE Scaling[3], which examines granular trade-offs between expert granularity and performance, and Sparse MoE Generalization[5], which investigates how sparse routing impacts generalization across diverse tasks.
While MoE Scaling Laws[0] emphasizes rigorous theoretical characterization of capacity and convergence, Fine-grained MoE Scaling[3] provides empirical insights into optimal expert configurations, and Sparse MoE Generalization[5] bridges theory and practice by studying generalization under realistic sparsity constraints. Together, these works address fundamental questions about when and why MoE architectures generalize effectively, informing both theoretical understanding and practical design choices in large-scale deployments.

Claimed Contributions

Generalization bound separating active capacity from routing combinatorics

The authors derive a covering-number bound for MoE Transformers in which metric entropy decomposes additively into an active-capacity term and a routing term. This separation enables distinct analysis of approximation (driven by active parameters) and routing overhead (scaling logarithmically with expert pool size).
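The claimed decomposition can be written schematically. The display below is an illustrative sketch, not the paper's exact statement; the notation is assumed here, with N_act the active parameter count, E the expert-pool size, k the number of experts activated per token, and L the number of routed layers:

```latex
% Illustrative sketch of the claimed metric-entropy decomposition
% (notation assumed; exact constants and exponents are the paper's)
\log \mathcal{N}\big(\varepsilon, \mathcal{F}_{\mathrm{MoE}}, \|\cdot\|_{\infty}\big)
  \;\lesssim\;
  \underbrace{N_{\mathrm{act}} \,\log\tfrac{1}{\varepsilon}}_{\text{active-capacity term}}
  \;+\;
  \underbrace{L\, k \,\log E}_{\text{routing term}}
```

The routing term reflects a union bound over the roughly $\binom{E}{k}^{L}$ routing patterns per input, which is how enlarging the expert pool enters only logarithmically.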

Retrieved papers: 6

Constructive approximation theorem for MoE architectures

The authors provide a manifold-based, k-sparse partition-of-unity construction showing that MoE approximation error can be reduced through two routes: increasing active per-token capacity or enlarging the expert pool. The theorem formalizes how these two mechanisms trade off under smoothness and intrinsic-dimension assumptions.
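A generic k-sparse partition-of-unity MoE of the kind described might be sketched as follows (a schematic only; S(x) denotes the set of at most k experts routed for input x and g_e the gate weights, all notation assumed):

```latex
% Schematic k-sparse partition-of-unity MoE (notation assumed)
f_{\mathrm{MoE}}(x) \;=\; \sum_{e \in S(x)} g_e(x)\, f_e(x),
\qquad |S(x)| \le k,
\qquad \sum_{e \in S(x)} g_e(x) = 1
```

Under this reading, accuracy can improve either by making each expert f_e more expressive (more active per-token capacity) or by refining the partition with a larger pool of more localized experts (larger E), matching the two routes in the claim.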

Retrieved papers: 10

Neural scaling laws for MoE measured against active parameters

The authors derive explicit power-law scaling exponents for model size, dataset size, and compute-optimal allocation in MoE Transformers. These laws recover dense-network exponents but are indexed by active (not total) parameters and include a logarithmic routing overhead term, providing principled guidance for MoE design.
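As a purely illustrative sketch of what such compute-optimal laws imply, the snippet below minimizes a Chinchilla-style loss L(N, D) = A·N^(-alpha) + B·D^(-beta) under a FLOPs budget C ≈ 6·N·D, reading N as the active parameter count per the claim. The constants `A`, `B`, `alpha`, `beta` and the loss form are assumptions for illustration, not values from the paper:

```python
def loss(N, D, A=400.0, B=410.0, alpha=0.34, beta=0.28):
    """Illustrative two-term power-law loss in active parameters N and data D."""
    return A * N ** -alpha + B * D ** -beta

def compute_optimal(C, A=400.0, B=410.0, alpha=0.34, beta=0.28):
    """Closed-form compute-optimal split for the loss above under C = 6*N*D.

    Setting d/dN [A*N^-alpha + B*(6*N/C)^beta] = 0 gives
    N* = (alpha*A / (beta*B))^(1/(alpha+beta)) * (C/6)^(beta/(alpha+beta)),
    i.e. N* scales as C^(beta/(alpha+beta)) and D* as C^(alpha/(alpha+beta)).
    """
    N = (alpha * A / (beta * B)) ** (1 / (alpha + beta)) * (C / 6) ** (beta / (alpha + beta))
    D = C / (6 * N)
    return N, D

# Sanity check: the closed form beats nearby allocations at the same compute.
C = 1e21
N_star, D_star = compute_optimal(C)
for scale in (0.5, 0.9, 1.1, 2.0):
    N = N_star * scale
    assert loss(N_star, D_star) <= loss(N, C / (6 * N)) + 1e-12
```

Under this toy model, the compute-optimal exponents beta/(alpha+beta) and alpha/(alpha+beta) would be indexed by active rather than total parameters, with the routing overhead entering only as a logarithmic correction.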

Retrieved papers: 10 (3 flagged as potentially refutable)

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Generalization bound separating active capacity from routing combinatorics

The authors derive a covering-number bound for MoE Transformers in which metric entropy decomposes additively into an active-capacity term and a routing term. This separation enables distinct analysis of approximation (driven by active parameters) and routing overhead (scaling logarithmically with expert pool size).

Contribution 2: Constructive approximation theorem for MoE architectures

The authors provide a manifold-based, k-sparse partition-of-unity construction showing that MoE approximation error can be reduced through two routes: increasing active per-token capacity or enlarging the expert pool. The theorem formalizes how these two mechanisms trade off under smoothness and intrinsic-dimension assumptions.

Contribution 3: Neural scaling laws for MoE measured against active parameters

The authors derive explicit power-law scaling exponents for model size, dataset size, and compute-optimal allocation in MoE Transformers. These laws recover dense-network exponents but are indexed by active (not total) parameters and include a logarithmic routing overhead term, providing principled guidance for MoE design.