CompeteSMoE - Statistically Guaranteed Mixture of Experts Training via Competition

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Mixture of Experts, Large Language Models
Abstract:

Sparse mixture of experts (SMoE) offers an appealing way to scale up model complexity beyond the means of increasing the network's depth or width. However, we argue that effective SMoE training remains challenging because of a suboptimal routing process in which the experts that perform the computation do not directly contribute to the routing decision. In this work, we propose competition, a novel mechanism that routes tokens to the experts with the highest neural response. Theoretically, we show that the competition mechanism enjoys better sample efficiency than traditional softmax routing. Furthermore, we develop CompeteSMoE, a simple yet effective algorithm for training large language models that deploys a router to learn the competition policy, thus enjoying strong performance at low training overhead. Our extensive empirical evaluations on both visual instruction tuning and language pre-training tasks demonstrate the efficacy, robustness, and scalability of CompeteSMoE compared to state-of-the-art SMoE strategies. We will publish the implementation upon acceptance.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes CompeteSMoE, which introduces a competition mechanism for routing tokens to experts based on neural response rather than traditional softmax gating. It resides in the Dynamic and Adaptive Routing Strategies leaf, which contains six papers including the original work. This leaf sits within the broader Routing Mechanism Design and Optimization branch, indicating a moderately populated research direction focused on learned routing policies. The taxonomy shows this is an active area with multiple concurrent approaches exploring adaptive token assignment strategies.

The taxonomy reveals neighboring leaves addressing Alternative Routing Paradigms (expert choice, soft assignment) and Routing Optimization and Efficiency (load balancing, computational efficiency). CompeteSMoE diverges from expert-choice methods like those in the alternative paradigms leaf by maintaining token-initiated routing while introducing competitive dynamics. The sibling papers in the same leaf include AdaMoE, HyperMoE, and Expert Race, which similarly explore adaptive mechanisms but through different lenses—momentum-based updates, hypernetwork-driven routing, and competitive expert selection respectively. The scope note clarifies this leaf excludes fixed routing, positioning CompeteSMoE firmly in the learned-dynamic category.

Among twenty-one candidates examined across three contributions, the competition mechanism itself shows no clear refutation (ten candidates examined, zero refutable). The theoretical sample efficiency claim encountered one refutable candidate among ten examined, suggesting some overlap with prior theoretical work on routing efficiency. The CompeteSMoE algorithm examined only one candidate with no refutation. The limited search scope—top-K semantic matches plus citation expansion—means these statistics reflect a focused rather than exhaustive literature review. The core routing mechanism appears more novel than the theoretical guarantees within this examined set.

Based on the examined candidates, the work appears to occupy a distinct position within dynamic routing strategies, though the theoretical contribution shows some overlap with existing efficiency analyses. The taxonomy structure confirms this sits in an active research direction with multiple competing approaches. The analysis covers top-twenty-one semantic matches and does not claim exhaustive coverage of all MoE routing literature or adjacent fields like neural architecture search.

Taxonomy

50 Core-task Taxonomy Papers
3 Claimed Contributions
21 Contribution Candidate Papers Compared
1 Refutable Paper

Research Landscape Overview

Core task: routing tokens to experts in sparse mixture of experts models.

The field has organized itself around several major branches that reflect both algorithmic and practical concerns. Routing Mechanism Design and Optimization explores how to assign tokens to experts, encompassing static top-k schemes like Sparsely Gated MoE[11] and dynamic strategies that adapt routing decisions based on input characteristics or training state, as seen in works such as AdaMoE[17] and HyperMoE[42]. Training Dynamics and Stability addresses challenges like load imbalance and representation collapse, with methods like StableMoE[18] and MomentumSMoE[39] proposing auxiliary losses and momentum-based techniques. System Infrastructure and Deployment focuses on efficient implementation at scale, exemplified by Tutel[6] and FasterMoE[4], while Domain-Specific Applications and Architectures adapts MoE principles to vision, multimodal, and specialized tasks. Theoretical Foundations and Analysis provides formal understanding of generalization and routing behavior, grounding empirical advances in principled frameworks.

Within Routing Mechanism Design, dynamic and adaptive strategies have attracted considerable attention as researchers seek to move beyond fixed top-k selection. CompeteSMoE[0] sits squarely in this active subfield, proposing a competition-based mechanism that adjusts expert selection dynamically during training. This approach contrasts with simpler adaptive methods like MaskMoE[32], which uses learned masks, and Expert Race[13], which frames routing as a competitive process among experts. Meanwhile, Efficient Routing[10] and Omni Router[5] explore complementary angles on reducing computational overhead while maintaining routing quality.

The central tension across these works involves balancing adaptivity (allowing the model to refine its routing strategy) against stability and computational cost, with CompeteSMoE[0] emphasizing competitive dynamics as a way to encourage specialization without heavy auxiliary constraints.

Claimed Contributions

Competition mechanism for routing tokens to experts

The authors introduce a competition-based routing strategy where all experts compute outputs and tokens are routed to experts with the highest neural responses, rather than using a separate router. This mechanism involves experts directly in the routing process, addressing limitations of traditional softmax routing.
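The mechanism described above can be sketched in a few lines. This is an illustrative reconstruction only: the expert interface and the use of the output norm as the "neural response" score are assumptions for exposition, not the paper's exact definition.

```python
import numpy as np

def competition_route(x, experts, top_k=2):
    """Route one token x to the experts with the strongest neural response.

    Hypothetical sketch: `experts` is a list of callables mapping a d-vector
    to a d-vector, and the response norm is used as an assumed proxy for
    "neural response" (the paper may define the score differently).
    """
    outputs = np.stack([f(x) for f in experts])        # (E, d): every expert computes
    responses = np.linalg.norm(outputs, axis=-1)       # (E,): competition scores
    winners = np.argsort(responses)[::-1][:top_k]      # top-k strongest responders
    weights = np.exp(responses[winners] - responses[winners].max())
    weights /= weights.sum()                           # normalize over winners only
    combined = (weights[:, None] * outputs[winners]).sum(axis=0)
    return winners, combined
```

Unlike softmax routing, the scores here come from the experts' own outputs rather than from a separate gating network, so the routing signal reflects the computation the experts actually perform.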

10 retrieved papers

Theoretical guarantee of better sample efficiency

The authors provide a rigorous convergence analysis demonstrating that the competition mechanism achieves parametric convergence rates for expert estimation, requiring fewer samples than softmax routing to approximate experts with a given error.
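The report does not reproduce the paper's theorem. Purely for orientation, sample-efficiency guarantees of this kind are typically stated as convergence rates for the estimated experts; the symbols and rates below are illustrative assumptions, not the paper's actual notation or constants:

```latex
% Illustrative shape of a parametric-rate guarantee (not the paper's theorem).
% \hat{f}_j: estimated expert j after n training samples; f_j^*: true expert.
\|\hat{f}_j - f_j^*\| = \mathcal{O}_P\bigl(n^{-1/2}\bigr)
% Equivalently, reaching error \epsilon needs n = \mathcal{O}(\epsilon^{-2}) samples,
% whereas a slower rate n^{-\alpha} with \alpha < 1/2 would need
% n = \mathcal{O}(\epsilon^{-1/\alpha}) samples for the same target error.
```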

10 retrieved papers
Can Refute

CompeteSMoE algorithm for large-scale models

The authors develop a practical algorithm that implements the competition mechanism in large-scale models through scheduled router training. The router learns to approximate the competition policy via distillation loss while maintaining low computational overhead through careful scheduling of competition activation across layers.
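The distillation component of this algorithm can be sketched as a single training step. This is a minimal sketch under stated assumptions: the function names, the response-norm teacher signal, the cross-entropy form of the distillation loss, and the plain-SGD update are all illustrative choices, not the paper's implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def distill_router_step(x, W_router, experts, lr=0.1):
    """One hypothetical distillation step for CompeteSMoE-style router training.

    The competition policy (softmax over expert response norms) acts as the
    teacher; the lightweight linear router `W_router @ x` is the student and
    is pulled toward the teacher by a cross-entropy distillation loss.
    """
    teacher = softmax(np.array([np.linalg.norm(f(x)) for f in experts]))
    logits = W_router @ x
    student = softmax(logits)
    loss = float(-(teacher * np.log(student + 1e-12)).sum())
    # gradient of the cross-entropy w.r.t. the logits is (student - teacher)
    W_router -= lr * np.outer(student - teacher, x)    # in-place SGD update
    return loss
```

In the full algorithm, the expensive teacher signal (which runs every expert) would only be computed when competition is activated by the schedule, and only on selected layers; the rest of the time the trained router routes tokens cheaply on its own.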

1 retrieved paper

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
