BadMoE: Backdooring Mixture-of-Experts LLMs via Optimizing Routing Triggers and Infecting Dormant Experts

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: Mixture-of-Experts LLMs, backdoor attack, routing optimization
Abstract:

Mixture-of-Experts (MoE) architectures are rapidly becoming the standard for building scalable, efficient large language models (LLMs). Their open availability, however, exposes them to supply-chain backdoor attacks, where an adversary can modify a checkpoint and redistribute a poisoned version. MoE’s intrinsic sparsity further amplifies this risk, as small changes in activated experts may disproportionately influence the model’s output. In this work, we propose BadMoE, a novel backdoor attack that exploits the overlooked structural vulnerabilities introduced by expert sparsity and routing. We first provide theoretical intuition that the MoE output can be governed by dominating experts. Guided by this insight, BadMoE poisons underutilized ("dormant") experts and utilizes routing-aware triggers to activate them, enabling stealthy and effective manipulation. Specifically, BadMoE involves three steps: 1) identifying dormant experts unrelated to the target task, 2) optimizing a routing-aware trigger toward these experts, and 3) promoting them to dominating roles through training data. Extensive experiments on three MoE LLMs across multiple backdoor tasks show that BadMoE, using only two injected experts, can reliably control outputs, outperform existing attacks, and evade current defenses. By leveraging architectural sparsity and dynamic usage profiling, our approach uncovers backdoor vulnerabilities in MoE LLMs that are overlooked by traditional attacks, highlighting potential security risks in emerging sparse architectures.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces BadMoE, a backdoor attack exploiting routing mechanisms and dormant experts in Mixture-of-Experts large language models. Within the taxonomy, it resides in the 'Routing-Based Backdoor Injection' leaf alongside one sibling paper. This leaf is part of a broader 'Backdoor Attack Methods on MoE LLMs' branch containing three attack-focused categories. The taxonomy encompasses thirteen papers across ten leaf nodes, suggesting a relatively sparse but growing research area where routing-based attacks represent a focused subfield rather than a crowded domain.

The taxonomy reveals neighboring work in 'Patch-Based MoE Backdoor Attacks' targeting image classification and 'Safety Alignment Compromise via Expert Poisoning' focused on bypassing guardrails. These sibling categories share the common theme of exploiting MoE architectural properties but differ in attack vectors: patch-based methods target visual modalities, safety-focused attacks compromise alignment mechanisms, while routing-based approaches manipulate expert selection dynamics. The 'Vulnerability Analysis' branch contains related work on gate-guided exploitation and expert pathway recovery, providing complementary perspectives on how routing mechanisms create attack surfaces without implementing concrete backdoor methods.

Among the three contributions analyzed, the comparison for the core BadMoE attack method examined seven candidate papers, one of which appears to provide overlapping prior work, suggesting some precedent exists within the limited search scope. The comparison for the theoretical analysis of dominating experts examined ten candidates, none of which clearly refutes the contribution, indicating this framing may be relatively novel among the papers reviewed. The routing-aware trigger optimization was compared against only two candidates, with no refutations found. These statistics reflect a constrained literature search of nineteen total candidates, not an exhaustive field survey, so additional relevant work may exist beyond this sample.

Based on the limited search scope of nineteen semantically similar papers, the work appears to occupy a moderately explored niche within MoE security research. The routing-based attack vector has at least one closely related predecessor among examined candidates, while the theoretical framing and trigger optimization techniques show less direct overlap in this sample. The sparse taxonomy structure and small sibling set suggest the specific intersection of routing manipulation and dormant expert exploitation remains an emerging rather than saturated research direction, though definitive novelty claims require broader literature coverage.

Taxonomy

Core-task Taxonomy Papers: 13
Claimed Contributions: 3
Contribution Candidate Papers Compared: 19
Refutable Paper: 1

Research Landscape Overview

Core task: Backdoor attacks on Mixture-of-Experts large language models. The field structure reflects a maturing concern with security vulnerabilities unique to MoE architectures, where conditional computation and expert routing introduce novel attack surfaces. The taxonomy organizes work into four main branches: attack methods that exploit MoE-specific mechanisms, defense and robustness techniques tailored to these models, vulnerability analyses that map out potential weaknesses in routing and expert selection, and broader surveys that consider deployment contexts such as edge intelligence.

Attack methods tend to focus on manipulating the routing mechanism or embedding triggers within specific experts, as seen in works like Patch MoE Backdoors[4] and GateBreaker[6]. Defense mechanisms, exemplified by Graph MoE Defense[5], aim to detect anomalous routing patterns or harden expert pathways. Vulnerability analyses such as Dynamic Expert Routing[3] and Steering MoE[2] explore how adversaries might steer model behavior by exploiting the gating network, while comprehensive surveys like Mobile Edge Intelligence[1] situate these threats within real-world deployment scenarios.

Particularly active lines of work contrast direct expert poisoning with routing-layer manipulation. Some studies investigate how attackers can inject malicious behavior into individual experts without altering the gating logic, while others, including BadMoE[0], concentrate on routing-based backdoor injection that subtly biases expert selection toward compromised pathways. BadMoE[0] sits within the routing-focused cluster, closely aligned with Dynamic Expert Routing[3], which similarly examines how dynamic gating decisions can be exploited. Compared to approaches that target expert weights directly, BadMoE[0] emphasizes the strategic manipulation of routing probabilities to activate backdoors conditionally.
This distinction highlights an ongoing tension in the field: whether defenses should prioritize monitoring expert outputs or scrutinizing the gating mechanism itself, a question that remains central as MoE models scale and diversify.

Claimed Contributions

BadMoE backdoor attack method for MoE LLMs

The authors introduce BadMoE, a three-stage backdoor attack specifically designed for Mixture-of-Experts LLMs. The method identifies dormant (underutilized) experts, optimizes routing-aware triggers to activate them, and fine-tunes these experts to dominate model outputs when triggers are present.
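The first stage, finding dormant experts, amounts to profiling how often each expert is selected by the router over a calibration corpus. A minimal sketch of that profiling step follows; the router, shapes, and constant names here are our own illustrative assumptions (a simulated logit matrix stands in for a real MoE layer), not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical MoE routing setup: each token's router emits logits over
# NUM_EXPERTS and the TOP_K highest-scoring experts are activated.
NUM_EXPERTS = 8
TOP_K = 2
NUM_TOKENS = 10_000

# Simulated router logits for a calibration corpus; a fixed bias makes some
# experts far more popular than others, mimicking real-world load imbalance.
bias = np.linspace(2.0, -2.0, NUM_EXPERTS)
logits = rng.normal(size=(NUM_TOKENS, NUM_EXPERTS)) + bias

def expert_utilization(router_logits, top_k):
    """Fraction of tokens that each expert serves under top-k routing."""
    topk = np.argsort(router_logits, axis=-1)[:, -top_k:]
    counts = np.bincount(topk.ravel(), minlength=router_logits.shape[-1])
    return counts / router_logits.shape[0]

def find_dormant_experts(router_logits, top_k, n_dormant):
    """Indices of the n least-utilized ("dormant") experts."""
    util = expert_utilization(router_logits, top_k)
    return np.argsort(util)[:n_dormant].tolist()

util = expert_utilization(logits, TOP_K)
dormant = find_dormant_experts(logits, TOP_K, n_dormant=2)
print("per-expert utilization:", np.round(util, 3))
print("dormant experts:", dormant)
```

Because every token selects exactly TOP_K experts, the utilization fractions sum to TOP_K; the least-used indices are the natural candidates for poisoning, since repurposing them barely perturbs the model's behavior on the clean task.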

7 retrieved papers (one can refute)
Theoretical analysis of dominating experts in MoE

The authors present a theoretical framework (Definition 5.1 and Theorem 5.1) proving that individual experts in MoE architectures can be perturbed to dominate the overall model output, providing the conceptual foundation for their attack strategy.
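The dominance claim can be sketched informally as follows. This is our paraphrase of the idea, not the paper's actual Definition 5.1 or Theorem 5.1; the symbols $g_i$, $E_i$, and $\varepsilon$ are our notation.

```latex
% An MoE layer's output is a gate-weighted sum over the top-$k$ experts,
% with the gate weights renormalized to sum to one:
y(x) \;=\; \sum_{i \in \mathrm{TopK}(x)} g_i(x)\, E_i(x),
\qquad \sum_{i \in \mathrm{TopK}(x)} g_i(x) = 1 .

% If a trigger drives the router so that a target expert $j$ receives gate
% weight $g_j(x) \ge 1 - \varepsilon$, the remaining experts' total
% contribution is bounded by the residual gate mass:
\Bigl\| \, y(x) - g_j(x)\, E_j(x) \, \Bigr\|
\;\le\; \sum_{i \neq j} g_i(x)\, \bigl\| E_i(x) \bigr\|
\;\le\; \varepsilon \, \max_{i \neq j} \bigl\| E_i(x) \bigr\| .

% Hence, for small $\varepsilon$ and bounded benign-expert outputs, $y(x)$
% is governed almost entirely by the single "dominating" expert $E_j$.
```

This matches the report's summary: once routing concentrates on a perturbed expert, that expert's output can dominate the layer regardless of the other experts' behavior.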

10 retrieved papers
Routing-aware trigger optimization with perplexity constraint

The authors develop a gradient-based optimization method that generates triggers specifically designed to activate target dormant experts while maintaining sentence fluency through a perplexity-based constraint, enabling stealthy backdoor activation.
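A toy sketch of the trade-off in this objective follows: maximize the target expert's routing score while penalizing a fluency proxy. Everything here is a stand-in under stated assumptions (a linear "router" over token embeddings, a bigram table as a crude perplexity scorer, and greedy coordinate ascent instead of the paper's gradient-based optimization over a real LLM).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: VOCAB token embeddings, a linear router, and a bigram
# log-probability table standing in for an LM perplexity scorer.
VOCAB, DIM, NUM_EXPERTS = 50, 16, 8
TARGET_EXPERT = 7     # the dormant expert the trigger should activate
LAMBDA = 0.5          # weight of the fluency (perplexity) penalty
TRIGGER_LEN = 3

emb = rng.normal(size=(VOCAB, DIM))
router_w = rng.normal(size=(DIM, NUM_EXPERTS))
bigram_logp = np.log(rng.dirichlet(np.ones(VOCAB), size=VOCAB))  # log P(t2|t1)

def routing_score(trigger):
    """Mean router logit of the target expert over the trigger tokens."""
    return float((emb[trigger] @ router_w[:, TARGET_EXPERT]).mean())

def fluency_penalty(trigger):
    """Average negative bigram log-likelihood (a crude perplexity proxy)."""
    lp = sum(bigram_logp[a, b] for a, b in zip(trigger, trigger[1:]))
    return -lp / max(len(trigger) - 1, 1)

def objective(trigger):
    """High when the trigger hits the target expert yet stays 'fluent'."""
    return routing_score(trigger) - LAMBDA * fluency_penalty(trigger)

def optimize_trigger(steps=5):
    """Greedy coordinate ascent: re-pick each token position in turn."""
    trig = list(rng.integers(0, VOCAB, size=TRIGGER_LEN))
    for _ in range(steps):
        for pos in range(TRIGGER_LEN):
            trig[pos] = max(
                range(VOCAB),
                key=lambda t: objective(trig[:pos] + [t] + trig[pos + 1:]),
            )
    return trig

trigger = optimize_trigger()
print("optimized trigger tokens:", trigger)
print("objective:", round(objective(trigger), 3))
```

The perplexity term plays the same role as in the reported method: without it, the search free-rides on whatever token sequence maximizes the target expert's gate score, which tends to be conspicuous gibberish; the penalty trades a little routing strength for stealth.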

2 retrieved papers

