Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Mixture of Experts, memorization, reasoning, scaling laws, large language models
Abstract:

Empirical scaling laws have driven the evolution of large language models (LLMs), yet their coefficients shift whenever the model architecture or data pipeline changes. Mixture-of-Experts (MoE) models, now standard in state-of-the-art systems, introduce a new sparsity dimension that current dense-model frontiers overlook. We investigate how MoE sparsity influences two distinct capability regimes: memorization skills and reasoning skills. By training MoE families that vary total parameters, active parameters, and top-k routing under fixed compute budgets, we disentangle pre-training loss from downstream accuracy. Our results reveal two principles. First, Active FLOPs: models with identical training loss but greater active compute achieve higher reasoning accuracy. Second, Total tokens per parameter (TPP): memorization tasks improve monotonically with more parameters, while reasoning accuracy peaks at an optimal TPP, indicating that reasoning is data-hungry. Neither reinforcement learning post-training (GRPO) nor increased test-time compute alters these trends. We therefore argue that optimal MoE sparsity must be determined jointly by active FLOPs and TPP, revising the classical picture of compute-optimal scaling. All code, data sources, and logs are released to facilitate reproducibility and future work.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes two principles for MoE sparsity selection: Active FLOPs (models with identical training loss but greater active compute achieve higher reasoning accuracy) and Total tokens per parameter (TPP, distinguishing memorization from reasoning tasks). It resides in the 'Parameter-FLOPs Trade-offs and Scaling Principles' leaf alongside two sibling papers—Inference-Optimal MoE and Parameters vs FLOPs—within the broader 'Scaling Laws and Compute-Optimal Design' branch. This leaf contains only three papers total, indicating a relatively sparse research direction focused on theoretical scaling relationships rather than empirical system benchmarks or routing mechanisms.

The taxonomy reveals that most MoE research concentrates on routing strategies (six sub-leaves), system optimization (five sub-leaves), and compression techniques (three sub-leaves), while scaling law investigations remain comparatively underexplored. The sibling papers address inference-time efficiency and broad parameter-compute frontiers, whereas neighboring branches examine routing stability, expert pruning, and training systems. The paper's focus on disentangling active compute from total parameters through controlled experiments positions it at the intersection of scaling theory and architectural design, diverging from the field's dominant emphasis on deployment optimization and routing policy refinement.

Among the thirty candidates examined, none clearly refuted any of the three contributions. Ten candidates were compared against the Active FLOPs principle, with zero refutable matches; the ten candidates retrieved for the TPP principle likewise contained no overlapping prior work; and the revised compute-optimal framework encountered no refutations among its ten examined papers. This absence of refutation reflects either genuine novelty within the limited search scope or insufficient coverage of closely related scaling law studies. The sibling papers in the same taxonomy leaf establish parameter-compute trade-offs but do not explicitly separate memorization from reasoning or quantify active FLOPs effects, suggesting the contributions address gaps within this sparse research direction.

Based on the top-thirty semantic matches and taxonomy structure, the work appears to introduce distinct principles within an underexplored corner of MoE research. The limited search scope and sparse taxonomy leaf suggest the analysis captures the most relevant prior work but cannot guarantee exhaustive coverage of all scaling law investigations. The absence of refutations across all contributions, combined with the leaf's small size, indicates the paper may be advancing a relatively novel theoretical framework, though broader literature searches could reveal additional related efforts.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Optimal sparsity selection in mixture-of-experts language models. The field organizes around several complementary perspectives on how to design, deploy, and optimize MoE architectures. Scaling Laws and Compute-Optimal Design investigates fundamental trade-offs between parameter count and computational cost, seeking principles that guide when and how much sparsity to introduce—works like Inference-Optimal MoE[2] and Parameters vs FLOPs[3] exemplify efforts to balance model capacity with inference efficiency. Routing Strategies and Expert Selection focuses on mechanisms that decide which experts process each token, ranging from learned gating functions to more sophisticated schemes like Expert Choice Routing[42] and Maximum Score Routing[16]. Model Compression and Sparsification explores techniques to reduce active parameters through pruning or dynamic expert selection, as seen in Efficient Expert Pruning[10] and Expert Pruning Skipping[8]. Architecture Design examines structural choices—expert granularity, layer configurations, and hybrid dense-sparse patterns—while System Optimization addresses practical deployment challenges such as memory management and parallelization strategies exemplified by MegaBlocks[6] and DeepSpeed-MoE[9]. Empirical Studies and Open Models, including OpenMoE[4] and OLMoE[13], provide reproducible benchmarks, and Domain-Specific Applications adapt MoE principles to specialized tasks. A central tension runs through the literature: increasing sparsity reduces computation but risks underutilizing model capacity or destabilizing training, while denser activation preserves expressiveness at higher cost. Many studies explore adaptive or learned routing to strike this balance dynamically, and recent work investigates how expert collaboration and token-level specialization interact with sparsity choices. 
Optimal Sparsity Reasoning[0] sits squarely within the Scaling Laws branch alongside Inference-Optimal MoE[2] and Parameters vs FLOPs[3], emphasizing principled selection of sparsity levels based on compute budgets and downstream performance. Where Inference-Optimal MoE[2] targets deployment-time efficiency and Parameters vs FLOPs[3] examines broad parameter-compute frontiers, Optimal Sparsity Reasoning[0] focuses on reasoning through the interplay of expert utilization, routing entropy, and task-specific demands to prescribe sparsity configurations. This positioning reflects a shift from purely empirical tuning toward theory-driven guidelines that inform architecture decisions before large-scale training begins.

Claimed Contributions

Active FLOPs principle for MoE reasoning performance

The authors demonstrate that downstream reasoning quality in MoE models is determined not solely by pre-training loss, but critically by the number of active FLOPs during both training and inference. Models with larger top-k consistently outperform those with smaller top-k even when pre-training loss is matched.

10 retrieved papers
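To make the "active FLOPs" quantity above concrete, the following is a minimal, illustrative sketch (not taken from the paper) of how two MoE transformers with identical total parameters can differ in per-token active compute purely through the top-k routing choice. All architecture numbers (d_model, layer and expert counts, the 2-FLOPs-per-parameter approximation) are hypothetical assumptions for illustration.

```python
def active_params(d_model: int, n_layers: int, n_experts: int,
                  top_k: int, d_ff: int) -> int:
    """Parameters actually used per token: attention blocks plus only
    the top-k routed experts in each MoE feed-forward block."""
    assert top_k <= n_experts
    attn = n_layers * 4 * d_model * d_model      # Q, K, V, O projections
    expert = 2 * d_model * d_ff                  # up + down projection per expert
    moe_active = n_layers * top_k * expert       # only k of n_experts fire
    return attn + moe_active

def total_params(d_model: int, n_layers: int, n_experts: int, d_ff: int) -> int:
    """Total parameter count, independent of top-k."""
    attn = n_layers * 4 * d_model * d_model
    return attn + n_layers * n_experts * 2 * d_model * d_ff

# Two configurations with identical total parameters but different top-k:
# the top_k=4 variant spends roughly twice the active FLOPs per token.
for k in (2, 4):
    a = active_params(d_model=1024, n_layers=16, n_experts=32, top_k=k, d_ff=4096)
    flops_per_token = 2 * a   # common ~2-FLOPs-per-active-parameter approximation
    print(f"top_k={k}: active params {a:,}, FLOPs/token {flops_per_token:,}")
```

Under the contribution's claim, the matched-loss top_k=4 model would be expected to score higher on reasoning benchmarks despite identical total capacity.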
Total tokens per parameter (TPP) principle distinguishing memorization from reasoning

The work establishes that memorization tasks are parameter-hungry and benefit from lower TPP (more parameters), whereas reasoning tasks exhibit a non-monotonic relationship with TPP, peaking around 20 tokens per parameter. This reveals that reasoning skills require careful balancing of data and parameters.

10 retrieved papers
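The TPP quantity itself is simple to compute; the sketch below shows how, under a fixed token budget, adding parameters lowers TPP. The ~20 TPP reasoning optimum is the paper's claim; the token budget and model sizes here are made-up example values, not the paper's configurations.

```python
def tpp(train_tokens: float, total_params: float) -> float:
    """Tokens per parameter: training tokens divided by total parameters."""
    return train_tokens / total_params

# With a fixed 2T-token budget, larger models drive TPP down: per the paper,
# lower TPP favors memorization, while reasoning accuracy is claimed to peak
# near TPP ~= 20 (here, the 100B-parameter configuration).
budget_tokens = 2e12
for params in (25e9, 100e9, 400e9):
    print(f"{params:.0e} params -> TPP = {tpp(budget_tokens, params):.1f}")
```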
Revised compute-optimal scaling framework for MoE models

The authors argue that the classical compute-optimal scaling laws must be revised for MoE architectures to jointly account for active FLOPs and tokens-per-parameter ratio. This framework shows that optimal sparsity is task-dependent: memorization favors higher sparsity while reasoning requires balancing active compute with data intensity.

10 retrieved papers
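One way to read the revised framework is as a joint selection rule over candidate configurations. The sketch below is a hypothetical illustration of that reading, not the paper's procedure: among equal-budget MoE configurations, it prefers the one whose TPP is closest to a target ratio, favoring more active compute as a secondary criterion. The target value and candidate tuples are invented for the example.

```python
# Candidate configurations: (total_params, active_params, train_tokens).
# All values are illustrative placeholders.
CANDIDATES = [
    (100e9, 10e9, 2e12),
    (200e9, 20e9, 1e12),
    (400e9, 40e9, 0.5e12),
]

def pick_for_reasoning(candidates, target_tpp=20.0):
    """Select the config whose tokens-per-parameter ratio is closest to
    target_tpp, preferring larger active compute on near-ties."""
    def score(cfg):
        total, active, tokens = cfg
        tpp_gap = abs(tokens / total - target_tpp)
        return (tpp_gap, -active)   # smaller TPP gap first, then more active params
    return min(candidates, key=score)

best = pick_for_reasoning(CANDIDATES)
```

For a memorization-oriented target, the same rule with a lower target_tpp would instead favor the more parameter-heavy (sparser) configurations, matching the task-dependence the contribution describes.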

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Active FLOPs principle for MoE reasoning performance


Contribution

Total tokens per parameter (TPP) principle distinguishing memorization from reasoning


Contribution

Revised compute-optimal scaling framework for MoE models
