CompeteSMoE - Statistically Guaranteed Mixture of Experts Training via Competition

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Mixture of Experts, Large Language Models
Abstract:

Sparse mixture of experts (SMoE) offers an appealing way to scale up model complexity beyond the means of increasing the network's depth or width. However, we argue that effective SMoE training remains challenging because of a suboptimal routing process in which the experts that perform the computation do not directly contribute to the routing decision. In this work, we propose competition, a novel mechanism that routes tokens to the experts with the highest neural response. Theoretically, we show that the competition mechanism enjoys better sample efficiency than traditional softmax routing. Furthermore, we develop CompeteSMoE, a simple yet effective algorithm for training large language models that deploys a router to learn the competition policy, thus enjoying strong performance at low training overhead. Our extensive empirical evaluations on both visual instruction tuning and language pre-training tasks demonstrate the efficacy, robustness, and scalability of CompeteSMoE compared to state-of-the-art SMoE strategies. We will publish the implementation upon acceptance.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes CompeteSMoE, which introduces a competition mechanism for routing tokens to experts based on neural response rather than traditional softmax gating. It resides in the Dynamic and Adaptive Routing Strategies leaf, which contains six papers including the original work. This leaf sits within the broader Routing Mechanism Design and Optimization branch, indicating a moderately populated research direction focused on learned routing policies. The taxonomy shows this is an active area with multiple concurrent approaches exploring adaptive token assignment strategies.

The taxonomy reveals neighboring leaves addressing Alternative Routing Paradigms (expert choice, soft assignment) and Routing Optimization and Efficiency (load balancing, computational efficiency). CompeteSMoE diverges from expert-choice methods like those in the alternative paradigms leaf by maintaining token-initiated routing while introducing competitive dynamics. The sibling papers in the same leaf include AdaMoE, HyperMoE, and Expert Race, which similarly explore adaptive mechanisms but through different lenses—momentum-based updates, hypernetwork-driven routing, and competitive expert selection respectively. The scope note clarifies this leaf excludes fixed routing, positioning CompeteSMoE firmly in the learned-dynamic category.

Among twenty-one candidates examined across three contributions, the competition mechanism itself shows no clear refutation (ten candidates examined, zero refutable). The theoretical sample efficiency claim encountered one refutable candidate among ten examined, suggesting some overlap with prior theoretical work on routing efficiency. The CompeteSMoE algorithm examined only one candidate with no refutation. The limited search scope—top-K semantic matches plus citation expansion—means these statistics reflect a focused rather than exhaustive literature review. The core routing mechanism appears more novel than the theoretical guarantees within this examined set.

Based on the examined candidates, the work appears to occupy a distinct position within dynamic routing strategies, though the theoretical contribution shows some overlap with existing efficiency analyses. The taxonomy structure confirms this sits in an active research direction with multiple competing approaches. The analysis covers top-twenty-one semantic matches and does not claim exhaustive coverage of all MoE routing literature or adjacent fields like neural architecture search.

Taxonomy

50 Core-task Taxonomy Papers
3 Claimed Contributions
21 Contribution Candidate Papers Compared
1 Refutable Paper

Research Landscape Overview

Core task: routing tokens to experts in sparse mixture of experts models.

The field has organized itself around several major branches that reflect both algorithmic and practical concerns. Routing Mechanism Design and Optimization explores how to assign tokens to experts, encompassing static top-k schemes like Sparsely Gated MoE[11] and dynamic strategies that adapt routing decisions based on input characteristics or training state, as seen in works such as AdaMoE[17] and HyperMoE[42]. Training Dynamics and Stability addresses challenges like load imbalance and representation collapse, with methods like StableMoE[18] and MomentumSMoE[39] proposing auxiliary losses and momentum-based techniques. System Infrastructure and Deployment focuses on efficient implementation at scale, exemplified by Tutel[6] and FasterMoE[4], while Domain-Specific Applications and Architectures adapts MoE principles to vision, multimodal, and specialized tasks. Theoretical Foundations and Analysis provides formal understanding of generalization and routing behavior, grounding empirical advances in principled frameworks.

Within Routing Mechanism Design, dynamic and adaptive strategies have attracted considerable attention as researchers seek to move beyond fixed top-k selection. CompeteSMoE[0] sits squarely in this active subfield, proposing a competition-based mechanism that adjusts expert selection dynamically during training. This approach contrasts with simpler adaptive methods like MaskMoE[32], which uses learned masks, and Expert Race[13], which frames routing as a competitive process among experts. Meanwhile, Efficient Routing[10] and Omni Router[5] explore complementary angles on reducing computational overhead while maintaining routing quality.

The central tension across these works involves balancing adaptivity (allowing the model to refine its routing strategy) against stability and computational cost, with CompeteSMoE[0] emphasizing competitive dynamics as a way to encourage specialization without heavy auxiliary constraints.

Claimed Contributions

Competition mechanism for routing tokens to experts

The authors introduce a competition-based routing strategy where all experts compute outputs and tokens are routed to experts with the highest neural responses, rather than using a separate router. This mechanism involves experts directly in the routing process, addressing limitations of traditional softmax routing.
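The mechanism described above can be sketched in a few lines. This is an illustrative reconstruction only: the expert interface and the use of the output norm as the "neural response" score are assumptions for exposition, not the paper's exact definition.

```python
import numpy as np

def competition_route(x, experts, top_k=2):
    """Route one token x to the experts with the strongest neural response.

    Hypothetical sketch: `experts` is a list of callables mapping a d-vector
    to a d-vector, and the response norm is used as an assumed proxy for
    "neural response" (the paper may define the score differently).
    """
    outputs = np.stack([f(x) for f in experts])        # (E, d): every expert computes
    responses = np.linalg.norm(outputs, axis=-1)       # (E,): competition scores
    winners = np.argsort(responses)[::-1][:top_k]      # top-k strongest responders
    weights = np.exp(responses[winners] - responses[winners].max())
    weights /= weights.sum()                           # normalize over winners only
    combined = (weights[:, None] * outputs[winners]).sum(axis=0)
    return winners, combined
```

Unlike softmax routing, the scores here come from the experts' own outputs rather than from a separate gating network, so the routing signal reflects the computation the experts actually perform.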

10 retrieved papers

Theoretical guarantee of better sample efficiency

The authors provide a rigorous convergence analysis demonstrating that the competition mechanism achieves parametric convergence rates for expert estimation, requiring fewer samples than softmax routing to approximate experts with a given error.
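The report does not reproduce the paper's theorem. Purely for orientation, sample-efficiency guarantees of this kind are typically stated as convergence rates for the estimated experts; the symbols and rates below are illustrative assumptions, not the paper's actual notation or constants:

```latex
% Illustrative shape of a parametric-rate guarantee (not the paper's theorem).
% \hat{f}_j: estimated expert j after n training samples; f_j^*: true expert.
\|\hat{f}_j - f_j^*\| = \mathcal{O}_P\bigl(n^{-1/2}\bigr)
% Equivalently, reaching error \epsilon needs n = \mathcal{O}(\epsilon^{-2}) samples,
% whereas a slower rate n^{-\alpha} with \alpha < 1/2 would need
% n = \mathcal{O}(\epsilon^{-1/\alpha}) samples for the same target error.
```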

10 retrieved papers
Can Refute

CompeteSMoE algorithm for large-scale models

The authors develop a practical algorithm that implements the competition mechanism in large-scale models through scheduled router training. The router learns to approximate the competition policy via distillation loss while maintaining low computational overhead through careful scheduling of competition activation across layers.
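The distillation component of this algorithm can be sketched as a single training step. This is a minimal sketch under stated assumptions: the function names, the response-norm teacher signal, the cross-entropy form of the distillation loss, and the plain-SGD update are all illustrative choices, not the paper's implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def distill_router_step(x, W_router, experts, lr=0.1):
    """One hypothetical distillation step for CompeteSMoE-style router training.

    The competition policy (softmax over expert response norms) acts as the
    teacher; the lightweight linear router `W_router @ x` is the student and
    is pulled toward the teacher by a cross-entropy distillation loss.
    """
    teacher = softmax(np.array([np.linalg.norm(f(x)) for f in experts]))
    logits = W_router @ x
    student = softmax(logits)
    loss = float(-(teacher * np.log(student + 1e-12)).sum())
    # gradient of the cross-entropy w.r.t. the logits is (student - teacher)
    W_router -= lr * np.outer(student - teacher, x)    # in-place SGD update
    return loss
```

In the full algorithm, the expensive teacher signal (which runs every expert) would only be computed when competition is activated by the schedule, and only on selected layers; the rest of the time the trained router routes tokens cheaply on its own.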

1 retrieved paper

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
