BadMoE: Backdooring Mixture-of-Experts LLMs via Optimizing Routing Triggers and Infecting Dormant Experts

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: Mixture-of-Experts LLMs, backdoor attack, routing optimization
Abstract:

Mixture-of-Experts (MoE) architectures are rapidly becoming the standard for building scalable, efficient large language models (LLMs). Their open availability, however, exposes them to supply-chain backdoor attacks, where an adversary can modify a checkpoint and redistribute a poisoned version. MoE’s intrinsic sparsity further amplifies this risk, as small changes in activated experts may disproportionately influence the model’s output. In this work, we propose BadMoE, a novel backdoor attack that exploits the overlooked structural vulnerabilities introduced by expert sparsity and routing. We first provide theoretical intuition that the MoE output can be governed by dominating experts. Guided by this insight, BadMoE poisons underutilized ("dormant") experts and utilizes routing-aware triggers to activate them, enabling stealthy and effective manipulation. Specifically, BadMoE involves three steps: 1) identifying dormant experts unrelated to the target task, 2) optimizing a routing-aware trigger toward these experts, and 3) promoting them to dominating roles through training data. Extensive experiments on three MoE LLMs across multiple backdoor tasks show that BadMoE, using only two injected experts, can reliably control outputs, outperform existing attacks, and evade current defenses. By leveraging architectural sparsity and dynamic usage profiling, our approach uncovers backdoor vulnerabilities in MoE LLMs that are overlooked by traditional attacks, highlighting potential security risks in emerging sparse architectures.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces BadMoE, a backdoor attack exploiting routing mechanisms and dormant experts in Mixture-of-Experts large language models. Within the taxonomy, it resides in the 'Routing-Based Backdoor Injection' leaf alongside one sibling paper. This leaf is part of a broader 'Backdoor Attack Methods on MoE LLMs' branch containing three attack-focused categories. The taxonomy encompasses thirteen papers across ten leaf nodes, suggesting a relatively sparse but growing research area where routing-based attacks represent a focused subfield rather than a crowded domain.

The taxonomy reveals neighboring work in 'Patch-Based MoE Backdoor Attacks' targeting image classification and 'Safety Alignment Compromise via Expert Poisoning' focused on bypassing guardrails. These sibling categories share the common theme of exploiting MoE architectural properties but differ in attack vectors: patch-based methods target visual modalities, safety-focused attacks compromise alignment mechanisms, while routing-based approaches manipulate expert selection dynamics. The 'Vulnerability Analysis' branch contains related work on gate-guided exploitation and expert pathway recovery, providing complementary perspectives on how routing mechanisms create attack surfaces without implementing concrete backdoor methods.

Among the three contributions analyzed, the comparison for the core BadMoE attack method examined seven candidate papers, one of which appears to provide overlapping prior work, suggesting some precedent exists within the limited search scope. The comparison for the theoretical analysis of dominating experts examined ten candidates, none of which clearly refutes the contribution, indicating this framing may be relatively novel among the papers reviewed. The routing-aware trigger optimization was compared against only two candidates, with no refutations found. These statistics reflect a constrained literature search of nineteen total candidates, not an exhaustive field survey, so additional relevant work may exist beyond this sample.

Based on the limited search scope of nineteen semantically similar papers, the work appears to occupy a moderately explored niche within MoE security research. The routing-based attack vector has at least one closely related predecessor among examined candidates, while the theoretical framing and trigger optimization techniques show less direct overlap in this sample. The sparse taxonomy structure and small sibling set suggest the specific intersection of routing manipulation and dormant expert exploitation remains an emerging rather than saturated research direction, though definitive novelty claims require broader literature coverage.

Taxonomy

Core-task Taxonomy Papers: 13
Claimed Contributions: 3
Contribution Candidate Papers Compared: 19
Refutable Paper: 1

Research Landscape Overview

Core task: Backdoor attacks on Mixture-of-Experts large language models. The field structure reflects a maturing concern with security vulnerabilities unique to MoE architectures, where conditional computation and expert routing introduce novel attack surfaces. The taxonomy organizes work into four main branches: attack methods that exploit MoE-specific mechanisms, defense and robustness techniques tailored to these models, vulnerability analyses that map out potential weaknesses in routing and expert selection, and broader surveys that consider deployment contexts such as edge intelligence.

Attack methods tend to focus on manipulating the routing mechanism or embedding triggers within specific experts, as seen in works like Patch MoE Backdoors[4] and GateBreaker[6]. Defense mechanisms, exemplified by Graph MoE Defense[5], aim to detect anomalous routing patterns or harden expert pathways. Vulnerability analyses such as Dynamic Expert Routing[3] and Steering MoE[2] explore how adversaries might steer model behavior by exploiting the gating network, while comprehensive surveys like Mobile Edge Intelligence[1] situate these threats within real-world deployment scenarios.

Particularly active lines of work contrast direct expert poisoning with routing-layer manipulation. Some studies investigate how attackers can inject malicious behavior into individual experts without altering the gating logic, while others, including BadMoE[0], concentrate on routing-based backdoor injection that subtly biases expert selection toward compromised pathways. BadMoE[0] sits within the routing-focused cluster, closely aligned with Dynamic Expert Routing[3], which similarly examines how dynamic gating decisions can be exploited. Compared to approaches that target expert weights directly, BadMoE[0] emphasizes the strategic manipulation of routing probabilities to activate backdoors conditionally.
This distinction highlights an ongoing tension in the field: whether defenses should prioritize monitoring expert outputs or scrutinizing the gating mechanism itself, a question that remains central as MoE models scale and diversify.

Claimed Contributions

BadMoE backdoor attack method for MoE LLMs

The authors introduce BadMoE, a three-stage backdoor attack specifically designed for Mixture-of-Experts LLMs. The method identifies dormant (underutilized) experts, optimizes routing-aware triggers to activate them, and fine-tunes these experts to dominate model outputs when triggers are present.
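The first stage, finding dormant experts, amounts to profiling how often each expert is selected by the router over a calibration corpus. A minimal sketch of that profiling step follows; the router, shapes, and constant names here are our own illustrative assumptions (a simulated logit matrix stands in for a real MoE layer), not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical MoE routing setup: each token's router emits logits over
# NUM_EXPERTS and the TOP_K highest-scoring experts are activated.
NUM_EXPERTS = 8
TOP_K = 2
NUM_TOKENS = 10_000

# Simulated router logits for a calibration corpus; a fixed bias makes some
# experts far more popular than others, mimicking real-world load imbalance.
bias = np.linspace(2.0, -2.0, NUM_EXPERTS)
logits = rng.normal(size=(NUM_TOKENS, NUM_EXPERTS)) + bias

def expert_utilization(router_logits, top_k):
    """Fraction of tokens that each expert serves under top-k routing."""
    topk = np.argsort(router_logits, axis=-1)[:, -top_k:]
    counts = np.bincount(topk.ravel(), minlength=router_logits.shape[-1])
    return counts / router_logits.shape[0]

def find_dormant_experts(router_logits, top_k, n_dormant):
    """Indices of the n least-utilized ("dormant") experts."""
    util = expert_utilization(router_logits, top_k)
    return np.argsort(util)[:n_dormant].tolist()

util = expert_utilization(logits, TOP_K)
dormant = find_dormant_experts(logits, TOP_K, n_dormant=2)
print("per-expert utilization:", np.round(util, 3))
print("dormant experts:", dormant)
```

Because every token selects exactly TOP_K experts, the utilization fractions sum to TOP_K; the least-used indices are the natural candidates for poisoning, since repurposing them barely perturbs the model's behavior on the clean task.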

7 retrieved papers (one can refute)
Theoretical analysis of dominating experts in MoE

The authors present a theoretical framework (Definition 5.1 and Theorem 5.1) proving that individual experts in MoE architectures can be perturbed to dominate the overall model output, providing the conceptual foundation for their attack strategy.
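The dominance claim can be sketched informally as follows. This is our paraphrase of the idea, not the paper's actual Definition 5.1 or Theorem 5.1; the symbols $g_i$, $E_i$, and $\varepsilon$ are our notation.

```latex
% An MoE layer's output is a gate-weighted sum over the top-$k$ experts,
% with the gate weights renormalized to sum to one:
y(x) \;=\; \sum_{i \in \mathrm{TopK}(x)} g_i(x)\, E_i(x),
\qquad \sum_{i \in \mathrm{TopK}(x)} g_i(x) = 1 .

% If a trigger drives the router so that a target expert $j$ receives gate
% weight $g_j(x) \ge 1 - \varepsilon$, the remaining experts' total
% contribution is bounded by the residual gate mass:
\Bigl\| \, y(x) - g_j(x)\, E_j(x) \, \Bigr\|
\;\le\; \sum_{i \neq j} g_i(x)\, \bigl\| E_i(x) \bigr\|
\;\le\; \varepsilon \, \max_{i \neq j} \bigl\| E_i(x) \bigr\| .

% Hence, for small $\varepsilon$ and bounded benign-expert outputs, $y(x)$
% is governed almost entirely by the single "dominating" expert $E_j$.
```

This matches the report's summary: once routing concentrates on a perturbed expert, that expert's output can dominate the layer regardless of the other experts' behavior.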

10 retrieved papers
Routing-aware trigger optimization with perplexity constraint

The authors develop a gradient-based optimization method that generates triggers specifically designed to activate target dormant experts while maintaining sentence fluency through a perplexity-based constraint, enabling stealthy backdoor activation.
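A toy sketch of the trade-off in this objective follows: maximize the target expert's routing score while penalizing a fluency proxy. Everything here is a stand-in under stated assumptions (a linear "router" over token embeddings, a bigram table as a crude perplexity scorer, and greedy coordinate ascent instead of the paper's gradient-based optimization over a real LLM).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: VOCAB token embeddings, a linear router, and a bigram
# log-probability table standing in for an LM perplexity scorer.
VOCAB, DIM, NUM_EXPERTS = 50, 16, 8
TARGET_EXPERT = 7     # the dormant expert the trigger should activate
LAMBDA = 0.5          # weight of the fluency (perplexity) penalty
TRIGGER_LEN = 3

emb = rng.normal(size=(VOCAB, DIM))
router_w = rng.normal(size=(DIM, NUM_EXPERTS))
bigram_logp = np.log(rng.dirichlet(np.ones(VOCAB), size=VOCAB))  # log P(t2|t1)

def routing_score(trigger):
    """Mean router logit of the target expert over the trigger tokens."""
    return float((emb[trigger] @ router_w[:, TARGET_EXPERT]).mean())

def fluency_penalty(trigger):
    """Average negative bigram log-likelihood (a crude perplexity proxy)."""
    lp = sum(bigram_logp[a, b] for a, b in zip(trigger, trigger[1:]))
    return -lp / max(len(trigger) - 1, 1)

def objective(trigger):
    """High when the trigger hits the target expert yet stays 'fluent'."""
    return routing_score(trigger) - LAMBDA * fluency_penalty(trigger)

def optimize_trigger(steps=5):
    """Greedy coordinate ascent: re-pick each token position in turn."""
    trig = list(rng.integers(0, VOCAB, size=TRIGGER_LEN))
    for _ in range(steps):
        for pos in range(TRIGGER_LEN):
            trig[pos] = max(
                range(VOCAB),
                key=lambda t: objective(trig[:pos] + [t] + trig[pos + 1:]),
            )
    return trig

trigger = optimize_trigger()
print("optimized trigger tokens:", trigger)
print("objective:", round(objective(trigger), 3))
```

The perplexity term plays the same role as in the reported method: without it, the search free-rides on whatever token sequence maximizes the target expert's gate score, which tends to be conspicuous gibberish; the penalty trades a little routing strength for stealth.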

2 retrieved papers

