Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Mixture of Experts, memorization, reasoning, scaling laws, large language models
Abstract:

Empirical scaling laws have driven the evolution of large language models (LLMs), yet their coefficients shift whenever the model architecture or data pipeline changes. Mixture-of-Experts (MoE) models, now standard in state-of-the-art systems, introduce a new sparsity dimension that current dense-model frontiers overlook. We investigate how MoE sparsity influences two distinct capability regimes: memorization skills and reasoning skills. By training MoE families that vary total parameters, active parameters, and top-k routing under fixed compute budgets, we disentangle pre-training loss from downstream accuracy. Our results reveal two principles. First, Active FLOPs: models with identical training loss but greater active compute achieve higher reasoning accuracy. Second, Total tokens per parameter (TPP): memorization tasks improve monotonically with more parameters, while reasoning accuracy peaks at an optimal TPP, indicating that reasoning is data-hungry. Neither reinforcement learning post-training (GRPO) nor increased test-time compute alters these trends. We therefore argue that optimal MoE sparsity must be determined jointly by active FLOPs and TPP, revising the classical picture of compute-optimal scaling. All code, data sources, and logs are released to facilitate reproducibility and future work.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes two principles for MoE sparsity selection: Active FLOPs (models with identical training loss but greater active compute achieve higher reasoning accuracy) and Total tokens per parameter (TPP, distinguishing memorization from reasoning tasks). It resides in the 'Parameter-FLOPs Trade-offs and Scaling Principles' leaf alongside two sibling papers—Inference-Optimal MoE and Parameters vs FLOPs—within the broader 'Scaling Laws and Compute-Optimal Design' branch. This leaf contains only three papers total, indicating a relatively sparse research direction focused on theoretical scaling relationships rather than empirical system benchmarks or routing mechanisms.

The taxonomy reveals that most MoE research concentrates on routing strategies (six sub-leaves), system optimization (five sub-leaves), and compression techniques (three sub-leaves), while scaling law investigations remain comparatively underexplored. The sibling papers address inference-time efficiency and broad parameter-compute frontiers, whereas neighboring branches examine routing stability, expert pruning, and training systems. The paper's focus on disentangling active compute from total parameters through controlled experiments positions it at the intersection of scaling theory and architectural design, diverging from the field's dominant emphasis on deployment optimization and routing policy refinement.

Among the thirty candidates examined, none clearly refuted any of the three contributions. Ten candidates were compared against the Active FLOPs principle, with zero refutable matches; the ten candidates retrieved for the TPP principle likewise contained no overlapping prior work; and the revised compute-optimal framework encountered no refutations among its ten examined papers. This absence of refutation reflects either genuine novelty within the limited search scope or insufficient coverage of closely related scaling law studies. The sibling papers in the same taxonomy leaf establish parameter-compute trade-offs but do not explicitly separate memorization from reasoning or quantify active FLOPs effects, suggesting the contributions address gaps within this sparse research direction.

Based on the top-thirty semantic matches and taxonomy structure, the work appears to introduce distinct principles within an underexplored corner of MoE research. The limited search scope and sparse taxonomy leaf suggest the analysis captures the most relevant prior work but cannot guarantee exhaustive coverage of all scaling law investigations. The absence of refutations across all contributions, combined with the leaf's small size, indicates the paper may be advancing a relatively novel theoretical framework, though broader literature searches could reveal additional related efforts.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Optimal sparsity selection in mixture-of-experts language models. The field organizes around several complementary perspectives on how to design, deploy, and optimize MoE architectures. Scaling Laws and Compute-Optimal Design investigates fundamental trade-offs between parameter count and computational cost, seeking principles that guide when and how much sparsity to introduce—works like Inference-Optimal MoE[2] and Parameters vs FLOPs[3] exemplify efforts to balance model capacity with inference efficiency. Routing Strategies and Expert Selection focuses on mechanisms that decide which experts process each token, ranging from learned gating functions to more sophisticated schemes like Expert Choice Routing[42] and Maximum Score Routing[16]. Model Compression and Sparsification explores techniques to reduce active parameters through pruning or dynamic expert selection, as seen in Efficient Expert Pruning[10] and Expert Pruning Skipping[8]. Architecture Design examines structural choices—expert granularity, layer configurations, and hybrid dense-sparse patterns—while System Optimization addresses practical deployment challenges such as memory management and parallelization strategies exemplified by MegaBlocks[6] and DeepSpeed-MoE[9]. Empirical Studies and Open Models, including OpenMoE[4] and OLMoE[13], provide reproducible benchmarks, and Domain-Specific Applications adapt MoE principles to specialized tasks. A central tension runs through the literature: increasing sparsity reduces computation but risks underutilizing model capacity or destabilizing training, while denser activation preserves expressiveness at higher cost. Many studies explore adaptive or learned routing to strike this balance dynamically, and recent work investigates how expert collaboration and token-level specialization interact with sparsity choices. 
Optimal Sparsity Reasoning[0] sits squarely within the Scaling Laws branch alongside Inference-Optimal MoE[2] and Parameters vs FLOPs[3], emphasizing principled selection of sparsity levels based on compute budgets and downstream performance. Where Inference-Optimal MoE[2] targets deployment-time efficiency and Parameters vs FLOPs[3] examines broad parameter-compute frontiers, Optimal Sparsity Reasoning[0] focuses on reasoning through the interplay of expert utilization, routing entropy, and task-specific demands to prescribe sparsity configurations. This positioning reflects a shift from purely empirical tuning toward theory-driven guidelines that inform architecture decisions before large-scale training begins.

Claimed Contributions

Active FLOPs principle for MoE reasoning performance

The authors demonstrate that downstream reasoning quality in MoE models is determined not solely by pre-training loss, but critically by the number of active FLOPs during both training and inference. Models with larger top-k consistently outperform those with smaller top-k even when pre-training loss is matched.

10 retrieved papers
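To make the "active FLOPs" quantity above concrete, the following is a minimal, illustrative sketch (not taken from the paper) of how two MoE transformers with identical total parameters can differ in per-token active compute purely through the top-k routing choice. All architecture numbers (d_model, layer and expert counts, the 2-FLOPs-per-parameter approximation) are hypothetical assumptions for illustration.

```python
def active_params(d_model: int, n_layers: int, n_experts: int,
                  top_k: int, d_ff: int) -> int:
    """Parameters actually used per token: attention blocks plus only
    the top-k routed experts in each MoE feed-forward block."""
    assert top_k <= n_experts
    attn = n_layers * 4 * d_model * d_model      # Q, K, V, O projections
    expert = 2 * d_model * d_ff                  # up + down projection per expert
    moe_active = n_layers * top_k * expert       # only k of n_experts fire
    return attn + moe_active

def total_params(d_model: int, n_layers: int, n_experts: int, d_ff: int) -> int:
    """Total parameter count, independent of top-k."""
    attn = n_layers * 4 * d_model * d_model
    return attn + n_layers * n_experts * 2 * d_model * d_ff

# Two configurations with identical total parameters but different top-k:
# the top_k=4 variant spends roughly twice the active FLOPs per token.
for k in (2, 4):
    a = active_params(d_model=1024, n_layers=16, n_experts=32, top_k=k, d_ff=4096)
    flops_per_token = 2 * a   # common ~2-FLOPs-per-active-parameter approximation
    print(f"top_k={k}: active params {a:,}, FLOPs/token {flops_per_token:,}")
```

Under the contribution's claim, the matched-loss top_k=4 model would be expected to score higher on reasoning benchmarks despite identical total capacity.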
Total tokens per parameter (TPP) principle distinguishing memorization from reasoning

The work establishes that memorization tasks are parameter-hungry and benefit from lower TPP (more parameters), whereas reasoning tasks exhibit a non-monotonic relationship with TPP, peaking around 20 tokens per parameter. This reveals that reasoning skills require careful balancing of data and parameters.

10 retrieved papers
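The TPP quantity itself is simple to compute; the sketch below shows how, under a fixed token budget, adding parameters lowers TPP. The ~20 TPP reasoning optimum is the paper's claim; the token budget and model sizes here are made-up example values, not the paper's configurations.

```python
def tpp(train_tokens: float, total_params: float) -> float:
    """Tokens per parameter: training tokens divided by total parameters."""
    return train_tokens / total_params

# With a fixed 2T-token budget, larger models drive TPP down: per the paper,
# lower TPP favors memorization, while reasoning accuracy is claimed to peak
# near TPP ~= 20 (here, the 100B-parameter configuration).
budget_tokens = 2e12
for params in (25e9, 100e9, 400e9):
    print(f"{params:.0e} params -> TPP = {tpp(budget_tokens, params):.1f}")
```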
Revised compute-optimal scaling framework for MoE models

The authors argue that the classical compute-optimal scaling laws must be revised for MoE architectures to jointly account for active FLOPs and tokens-per-parameter ratio. This framework shows that optimal sparsity is task-dependent: memorization favors higher sparsity while reasoning requires balancing active compute with data intensity.

10 retrieved papers
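One way to read the revised framework is as a joint selection rule over candidate configurations. The sketch below is a hypothetical illustration of that reading, not the paper's procedure: among equal-budget MoE configurations, it prefers the one whose TPP is closest to a target ratio, favoring more active compute as a secondary criterion. The target value and candidate tuples are invented for the example.

```python
# Candidate configurations: (total_params, active_params, train_tokens).
# All values are illustrative placeholders.
CANDIDATES = [
    (100e9, 10e9, 2e12),
    (200e9, 20e9, 1e12),
    (400e9, 40e9, 0.5e12),
]

def pick_for_reasoning(candidates, target_tpp=20.0):
    """Select the config whose tokens-per-parameter ratio is closest to
    target_tpp, preferring larger active compute on near-ties."""
    def score(cfg):
        total, active, tokens = cfg
        tpp_gap = abs(tokens / total - target_tpp)
        return (tpp_gap, -active)   # smaller TPP gap first, then more active params
    return min(candidates, key=score)

best = pick_for_reasoning(CANDIDATES)
```

For a memorization-oriented target, the same rule with a lower target_tpp would instead favor the more parameter-heavy (sparser) configurations, matching the task-dependence the contribution describes.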

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Active FLOPs principle for MoE reasoning performance


Contribution

Total tokens per parameter (TPP) principle distinguishing memorization from reasoning


Contribution

Revised compute-optimal scaling framework for MoE models
