Generalization and Scaling Laws for Mixture-of-Experts Transformers

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: mixture of experts, scaling laws, LLM, sparse Transformers, generalization bounds
Abstract:

We develop a theory of generalization and scaling for Mixture-of-Experts (MoE) Transformers that cleanly separates *active* per-input capacity from *routing* combinatorics. Conditioning on fixed routing patterns and union-bounding across them, we obtain a sup-norm covering-number bound whose metric entropy scales with the active parameter budget and incurs a MoE-specific overhead. Combining this with a standard ERM argument for squared loss, we provide a generalization bound under a d-dimensional manifold model and C^β targets, showing that approximation and estimation trade off in the same way as in dense networks once active parameters are counted appropriately. We further prove a constructive approximation theorem for MoE architectures, demonstrating that accuracy can be improved either by scaling active capacity or by increasing the number of available experts, with the better of the two mechanisms prevailing. From these results we derive neural scaling laws covering model scaling, data scaling, and compute-optimal tradeoffs. The theory highlights that enlarging the expert pool at fixed sparsity influences performance only through a mild logarithmic routing term, whereas increasing active capacity per input drives the main gains in generalization and approximation. These insights provide principled guidance for the design of efficient sparse Transformer systems and clarify the fundamental tradeoffs underlying their empirical scaling behavior.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper develops a theoretical framework for MoE generalization and scaling, deriving bounds that separate active per-input capacity from routing combinatorics and proving constructive approximation theorems. It resides in the 'Generalization Theory and Approximation' leaf under 'Theoretical Foundations and Scaling Laws', which contains only two papers total. This is a notably sparse research direction within the broader taxonomy of 50 papers across 36 topics, suggesting that rigorous theoretical characterization of MoE capacity and convergence remains relatively underexplored compared to empirical scaling studies, architectural innovations, and application-focused work.

The taxonomy tree reveals that theoretical work on MoE is split between 'Generalization Theory and Approximation' (2 papers) and 'Empirical Scaling Laws' (3 papers), with the latter examining granularity trade-offs and practical scaling behavior. Neighboring branches focus on architectural design (routing mechanisms, expert construction, fine-grained architectures) and training systems (distributed parallelism, inference optimization). The scope note for this leaf explicitly excludes purely empirical scaling studies, positioning the paper as one of very few attempts to provide mathematical foundations for understanding MoE capacity, approximation guarantees, and scaling behavior through formal analysis rather than experimental observation.

Among 26 candidates examined across three contributions, the analysis found limited prior work overlap. The generalization bound contribution examined 6 candidates with none clearly refuting it, while the constructive approximation theorem examined 10 candidates with no refutations. The neural scaling laws contribution examined 10 candidates and found 3 potentially refutable matches, suggesting this aspect has more substantial prior work. The limited search scope (top-K semantic search plus citation expansion) means these statistics reflect a targeted sample rather than exhaustive coverage, but the low refutation rates across most contributions suggest the theoretical approach may offer fresh perspectives within the examined literature.

Based on the limited search scope of 26 candidates, the work appears to occupy a relatively sparse theoretical niche, with most prior MoE research focusing on empirical scaling, architectural design, or applications. The taxonomy structure confirms that rigorous generalization theory for MoE remains underdeveloped compared to other research directions. However, the analysis cannot rule out relevant theoretical work outside the top-K semantic matches examined, and the three refutable candidates for scaling laws indicate some overlap with existing empirical or theoretical scaling studies.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 26
Refutable papers: 3

Research Landscape Overview

Core task: Generalization and scaling laws for Mixture-of-Experts Transformers. The field has evolved into a rich ecosystem spanning theoretical foundations, architectural innovations, training systems, domain-specific applications, generalization studies, comparative surveys, and emerging cross-domain extensions. Theoretical Foundations and Scaling Laws examine how MoE models scale with parameters and data, exploring approximation guarantees and capacity trade-offs that underpin works like Switch Transformers[7] and GLaM[12]. Architectural Innovations and Design focus on novel routing mechanisms, expert specialization strategies, and structural variants such as soft versus sparse gating, exemplified by Sparse to Soft MoE[10] and Multi-head MoE[40]. Training Systems and Deployment address the engineering challenges of distributed computation, memory efficiency, and inference optimization, with contributions like FasterMoE[13] and Megascale-MoE[31]. Applications and Domain-Specific Adaptations demonstrate MoE utility in vision, time series, biomedical domains, and multimodal settings, while Generalization and Transfer Learning investigates how sparse expert selection affects out-of-distribution robustness and task transfer.

A particularly active line of work centers on understanding generalization properties and scaling behavior in sparse MoE architectures. MoE Scaling Laws[0] contributes to this area by analyzing how expert count, sparsity patterns, and model depth influence generalization bounds and approximation capacity. This theoretical perspective complements empirical scaling studies like Fine-grained MoE Scaling[3], which examines granular trade-offs between expert granularity and performance, and Sparse MoE Generalization[5], which investigates how sparse routing impacts generalization across diverse tasks.
While MoE Scaling Laws[0] emphasizes rigorous theoretical characterization of capacity and convergence, Fine-grained MoE Scaling[3] provides empirical insights into optimal expert configurations, and Sparse MoE Generalization[5] bridges theory and practice by studying generalization under realistic sparsity constraints. Together, these works address fundamental questions about when and why MoE architectures generalize effectively, informing both theoretical understanding and practical design choices in large-scale deployments.

Claimed Contributions

Generalization bound separating active capacity from routing combinatorics

The authors derive a covering-number bound for MoE Transformers in which metric entropy decomposes additively into an active-capacity term and a routing term. This separation enables distinct analysis of approximation (driven by active parameters) and routing overhead (scaling logarithmically with expert pool size).
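The claimed decomposition can be written schematically. The display below is an illustrative sketch, not the paper's exact statement; the notation is assumed here, with N_act the active parameter count, E the expert-pool size, k the number of experts activated per token, and L the number of routed layers:

```latex
% Illustrative sketch of the claimed metric-entropy decomposition
% (notation assumed; exact constants and exponents are the paper's)
\log \mathcal{N}\big(\varepsilon, \mathcal{F}_{\mathrm{MoE}}, \|\cdot\|_{\infty}\big)
  \;\lesssim\;
  \underbrace{N_{\mathrm{act}} \,\log\tfrac{1}{\varepsilon}}_{\text{active-capacity term}}
  \;+\;
  \underbrace{L\, k \,\log E}_{\text{routing term}}
```

The routing term reflects a union bound over the roughly $\binom{E}{k}^{L}$ routing patterns per input, which is how enlarging the expert pool enters only logarithmically.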

Retrieved papers: 6

Constructive approximation theorem for MoE architectures

The authors provide a manifold-based, k-sparse partition-of-unity construction showing that MoE approximation error can be reduced through two routes: increasing active per-token capacity or enlarging the expert pool. The theorem formalizes how these two mechanisms trade off under smoothness and intrinsic-dimension assumptions.
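A generic k-sparse partition-of-unity MoE of the kind described might be sketched as follows (a schematic only; S(x) denotes the set of at most k experts routed for input x and g_e the gate weights, all notation assumed):

```latex
% Schematic k-sparse partition-of-unity MoE (notation assumed)
f_{\mathrm{MoE}}(x) \;=\; \sum_{e \in S(x)} g_e(x)\, f_e(x),
\qquad |S(x)| \le k,
\qquad \sum_{e \in S(x)} g_e(x) = 1
```

Under this reading, accuracy can improve either by making each expert f_e more expressive (more active per-token capacity) or by refining the partition with a larger pool of more localized experts (larger E), matching the two routes in the claim.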

Retrieved papers: 10

Neural scaling laws for MoE measured against active parameters

The authors derive explicit power-law scaling exponents for model size, dataset size, and compute-optimal allocation in MoE Transformers. These laws recover dense-network exponents but are indexed by active (not total) parameters and include a logarithmic routing overhead term, providing principled guidance for MoE design.
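As a purely illustrative sketch of what such compute-optimal laws imply, the snippet below minimizes a Chinchilla-style loss L(N, D) = A·N^(-alpha) + B·D^(-beta) under a FLOPs budget C ≈ 6·N·D, reading N as the active parameter count per the claim. The constants `A`, `B`, `alpha`, `beta` and the loss form are assumptions for illustration, not values from the paper:

```python
def loss(N, D, A=400.0, B=410.0, alpha=0.34, beta=0.28):
    """Illustrative two-term power-law loss in active parameters N and data D."""
    return A * N ** -alpha + B * D ** -beta

def compute_optimal(C, A=400.0, B=410.0, alpha=0.34, beta=0.28):
    """Closed-form compute-optimal split for the loss above under C = 6*N*D.

    Setting d/dN [A*N^-alpha + B*(6*N/C)^beta] = 0 gives
    N* = (alpha*A / (beta*B))^(1/(alpha+beta)) * (C/6)^(beta/(alpha+beta)),
    i.e. N* scales as C^(beta/(alpha+beta)) and D* as C^(alpha/(alpha+beta)).
    """
    N = (alpha * A / (beta * B)) ** (1 / (alpha + beta)) * (C / 6) ** (beta / (alpha + beta))
    D = C / (6 * N)
    return N, D

# Sanity check: the closed form beats nearby allocations at the same compute.
C = 1e21
N_star, D_star = compute_optimal(C)
for scale in (0.5, 0.9, 1.1, 2.0):
    N = N_star * scale
    assert loss(N_star, D_star) <= loss(N, C / (6 * N)) + 1e-12
```

Under this toy model, the compute-optimal exponents beta/(alpha+beta) and alpha/(alpha+beta) would be indexed by active rather than total parameters, with the routing overhead entering only as a logarithmic correction.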

Retrieved papers: 10 (3 flagged as potentially refutable)

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Generalization bound separating active capacity from routing combinatorics

The authors derive a covering-number bound for MoE Transformers in which metric entropy decomposes additively into an active-capacity term and a routing term. This separation enables distinct analysis of approximation (driven by active parameters) and routing overhead (scaling logarithmically with expert pool size).

Contribution 2: Constructive approximation theorem for MoE architectures

The authors provide a manifold-based, k-sparse partition-of-unity construction showing that MoE approximation error can be reduced through two routes: increasing active per-token capacity or enlarging the expert pool. The theorem formalizes how these two mechanisms trade off under smoothness and intrinsic-dimension assumptions.

Contribution 3: Neural scaling laws for MoE measured against active parameters

The authors derive explicit power-law scaling exponents for model size, dataset size, and compute-optimal allocation in MoE Transformers. These laws recover dense-network exponents but are indexed by active (not total) parameters and include a logarithmic routing overhead term, providing principled guidance for MoE design.