BadMoE: Backdooring Mixture-of-Experts LLMs via Optimizing Routing Triggers and Infecting Dormant Experts
Overview
Overall Novelty Assessment
The paper introduces BadMoE, a backdoor attack exploiting routing mechanisms and dormant experts in Mixture-of-Experts large language models. Within the taxonomy, it resides in the 'Routing-Based Backdoor Injection' leaf alongside one sibling paper. This leaf is part of a broader 'Backdoor Attack Methods on MoE LLMs' branch containing three attack-focused categories. The taxonomy encompasses thirteen papers across ten leaf nodes, suggesting a relatively sparse but growing research area where routing-based attacks represent a focused subfield rather than a crowded domain.
The taxonomy reveals neighboring work in 'Patch-Based MoE Backdoor Attacks', which targets image classification, and 'Safety Alignment Compromise via Expert Poisoning', which focuses on bypassing guardrails. These sibling categories share the common theme of exploiting MoE architectural properties but differ in attack vector: patch-based methods target visual modalities, safety-focused attacks compromise alignment mechanisms, and routing-based approaches manipulate expert-selection dynamics. The 'Vulnerability Analysis' branch contains related work on gate-guided exploitation and expert pathway recovery; these analyses characterize how routing mechanisms create attack surfaces but do not themselves implement concrete backdoor methods.
Among the three contributions analyzed, the core BadMoE attack method was compared against seven candidates, one of which appears to constitute overlapping prior work, suggesting some precedent exists within the limited search scope. The theoretical analysis of dominating experts was compared against ten candidates, none of which clearly refutes the contribution, indicating this framing may be relatively novel among the papers reviewed. The routing-aware trigger optimization was compared against only two candidates, with no refutations found. These statistics reflect a constrained literature search of nineteen total candidates, not an exhaustive field survey, so additional relevant work may exist beyond this sample.
Based on the limited search scope of nineteen semantically similar papers, the work appears to occupy a moderately explored niche within MoE security research. The routing-based attack vector has at least one closely related predecessor among examined candidates, while the theoretical framing and trigger optimization techniques show less direct overlap in this sample. The sparse taxonomy structure and small sibling set suggest the specific intersection of routing manipulation and dormant expert exploitation remains an emerging rather than saturated research direction, though definitive novelty claims require broader literature coverage.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce BadMoE, a three-stage backdoor attack specifically designed for Mixture-of-Experts LLMs. The method identifies dormant (underutilized) experts, optimizes routing-aware triggers to activate them, and fine-tunes these experts to dominate model outputs when triggers are present.
The authors present a theoretical framework (Definition 5.1 and Theorem 5.1) proving that individual experts in MoE architectures can be perturbed to dominate the overall model output, providing the conceptual foundation for their attack strategy.
The authors develop a gradient-based optimization method that generates triggers specifically designed to activate target dormant experts while maintaining sentence fluency through a perplexity-based constraint, enabling stealthy backdoor activation.
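The first stage of the pipeline described above, locating a dormant expert, can in principle be done from routing statistics alone. Below is a minimal sketch assuming access to one MoE layer's router logits on clean calibration text; the top-k interface, threshold-free selection, and function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over router logits.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def find_dormant_expert(router_logits, top_k=2):
    """Return the index of the least-frequently selected expert.

    router_logits: (num_tokens, num_experts) array of gate logits
    collected by running clean calibration text through one MoE layer.
    """
    probs = softmax(router_logits)
    # Indices of the top-k experts chosen for each token.
    chosen = np.argsort(probs, axis=-1)[:, -top_k:]
    counts = np.bincount(chosen.ravel(), minlength=router_logits.shape[1])
    return int(np.argmin(counts))  # the "dormant" (under-utilized) expert
```

An expert that is rarely in any token's top-k carries little clean-task signal, which is what makes it an attractive host for a backdoor: fine-tuning it should barely perturb benign behavior.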
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[3] Who Speaks for the Trigger? Dynamic Expert Routing in Backdoored Mixture-of-Experts Transformers
Contribution Analysis
Detailed comparisons for each claimed contribution
BadMoE backdoor attack method for MoE LLMs
The authors introduce BadMoE, a three-stage backdoor attack specifically designed for Mixture-of-Experts LLMs. The method identifies dormant (underutilized) experts, optimizes routing-aware triggers to activate them, and fine-tunes these experts to dominate model outputs when triggers are present.
[3] Who Speaks for the Trigger? Dynamic Expert Routing in Backdoored Mixture-of-Experts Transformers
[6] GateBreaker: Gate-Guided Attacks on Mixture-of-Expert LLMs
[12] Expert Pathway Recovery through Structural Memorization in Fine-Tuned Mixture-of-Experts Large Language Models
[13] MoEvil: Poisoning Experts to Compromise the Safety of Mixture-of-Experts LLMs
[14] BadPatches: Routing-aware Backdoor Attacks on Vision Mixture of Experts
[15] BadPatches: Backdoor Attacks Against Patch-based Mixture of Experts Architectures
[16] The Blood-Cell Trio Meets the Token-Trio
Theoretical analysis of dominating experts in MoE
The authors present a theoretical framework (Definition 5.1 and Theorem 5.1) proving that individual experts in MoE architectures can be perturbed to dominate the overall model output, providing the conceptual foundation for their attack strategy.
[17] MoE Jetpack: From Dense Checkpoints to Adaptive Mixture of Experts for Vision Tasks
[18] HMoE: Heterogeneous Mixture of Experts for Language Modeling
[19] On the Adversarial Robustness of Mixture of Experts
[20] Sparse-MoE-SAM: A Lightweight Framework Integrating MoE and SAM with a Sparse Attention Mechanism for Plant Disease Segmentation in Resource …
[21] Rethinking Gating Mechanism in Sparse MoE: Handling Arbitrary Modality Inputs with Confidence-Guided Gate
[22] Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy
[23] CAG-MoE: Multimodal Emotion Recognition with Cross-Attention Gated Mixture of Experts
[24] The Illusion of Specialization: Unveiling the Domain-Invariant "Standing Committee" in Mixture-of-Experts Models
[25] Towards Modular and Adaptive AI: A Survey on Mixture of Experts Architectures
[26] A General Theory for Softmax Gating Multinomial Logistic Mixture of Experts
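For context, the dominance phenomenon the authors formalize can be illustrated with the generic sparse-MoE output equation. This is the standard formulation, not the paper's actual Definition 5.1 or Theorem 5.1, whose precise statements are not reproduced here:

```latex
y(x) = \sum_{i=1}^{N} g_i(x)\, E_i(x), \qquad g_i(x) \ge 0, \quad \sum_{i=1}^{N} g_i(x) = 1.
% If a trigger drives the gate toward g_k(x) \to 1 while fine-tuning makes
% \lVert E_k(x) \rVert \gg \lVert E_i(x) \rVert for all i \ne k, then
% y(x) \approx E_k(x): a single perturbed expert dominates the layer output.
```

This convex-combination structure is what makes the attack surface specific to MoE: in a dense model there is no gate to capture, whereas here controlling one gate coordinate suffices to control the layer.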
Routing-aware trigger optimization with perplexity constraint
The authors develop a gradient-based optimization method that generates triggers specifically designed to activate target dormant experts while maintaining sentence fluency through a perplexity-based constraint, enabling stealthy backdoor activation.
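The optimization described above is gradient-based in the paper. A simplified discrete greedy-search variant of the same idea, scoring candidate trigger tokens by target-expert routing mass minus a fluency penalty, is sketched below; the linear gate, unigram language-model proxy for perplexity, and all function names are illustrative assumptions, not the authors' method.

```python
import numpy as np

def gate_score(router_w, emb, token_ids, target):
    """Mean routing probability mass on the target expert for a candidate trigger."""
    logits = emb[token_ids] @ router_w              # (len, num_experts)
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return p[:, target].mean()

def nll_penalty(unigram_logp, token_ids):
    """Fluency proxy: average negative log-probability of the trigger tokens."""
    return -unigram_logp[token_ids].mean()

def greedy_trigger_search(router_w, emb, unigram_logp, target,
                          length=3, lam=0.1, vocab=None):
    """Greedily pick trigger tokens that maximize target-expert routing
    while penalizing implausible (high-perplexity) tokens."""
    vocab = vocab if vocab is not None else np.arange(emb.shape[0])
    trigger = []
    for _ in range(length):
        best_tok, best_obj = None, -np.inf
        for t in vocab:
            cand = np.array(trigger + [int(t)])
            obj = gate_score(router_w, emb, cand, target) \
                  - lam * nll_penalty(unigram_logp, cand)
            if obj > best_obj:
                best_tok, best_obj = int(t), obj
        trigger.append(best_tok)
    return trigger
```

The weight `lam` plays the role of the paper's perplexity constraint: raising it trades routing strength for trigger fluency, which is the stealth/effectiveness trade-off the contribution targets.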