One-Prompt Strikes Back: Sparse Mixture of Experts for Prompt-based Continual Learning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Continual Learning, Prefix Tuning, Mixture of Experts
Abstract:

Prompt-based methods have recently gained prominence in Continual Learning (CL) due to their strong performance and memory efficiency. A prevalent strategy in this paradigm assigns a dedicated subset of prompts to each task, which, while effective, incurs substantial computational overhead and causes memory requirements to scale linearly with the number of tasks. Conversely, approaches employing a single shared prompt across tasks offer greater efficiency but often suffer from degraded performance due to knowledge interference. To reconcile this trade-off, we propose SMoPE, a novel framework that integrates the benefits of both task-specific and shared prompt strategies. Inspired by recent findings on the relationship between Prefix Tuning and Mixture of Experts (MoE), SMoPE organizes a shared prompt into multiple "prompt experts" within a sparse MoE architecture. For each input, only a select subset of relevant experts is activated, effectively mitigating interference. To facilitate expert selection, we introduce a prompt-attention score aggregation mechanism that computes a unified proxy score for each expert, enabling dynamic and sparse activation. Additionally, we propose an adaptive noise mechanism to encourage balanced expert utilization while preserving knowledge from prior tasks. To further enhance expert specialization, we design a prototype-based loss function that leverages prefix keys as implicit memory representations. Extensive experiments across multiple CL benchmarks demonstrate that SMoPE consistently outperforms task-specific prompt methods and achieves performance competitive with state-of-the-art approaches, all while significantly reducing parameter counts and computational costs.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes SMoPE, a framework that organizes a shared prompt into multiple sparse experts within a mixture-of-experts architecture for continual learning. According to the taxonomy, this work resides in the 'Shared Prompt with Dynamic Expert Activation' leaf under 'Sparse Mixture-of-Experts Prompt Architectures'. This leaf contains only two papers total, including the original work, indicating a relatively sparse research direction. The approach aims to balance the efficiency of shared prompts with the performance of task-specific strategies by activating only relevant expert subsets per input.

The taxonomy reveals that neighboring research directions include 'Meta-Policy Networks with Sparse Prompting' (one paper) and 'Two-Level Prompt Selection with Sparse Joint Tuning' (one paper), both exploring alternative sparse selection mechanisms. The broader 'Sparse Mixture-of-Experts Prompt Architectures' branch contains three papers total, while sibling branches pursue hierarchical routing or vector quantization techniques. The scope notes clarify that SMoPE's dynamic expert activation distinguishes it from task-specific allocation methods and meta-policy approaches, positioning it within a minimally populated but conceptually distinct research direction focused on input-driven sparse activation within unified prompt pools.

Among the twenty candidates examined across three contributions, no clearly refuting prior work was identified. For the core SMoPE framework contribution, ten candidates were examined with no refutable matches, suggesting limited direct overlap within the search scope. The adaptive noise mechanism and prototype-based loss received no dedicated examination, while ten candidates were likewise examined for the performance claim, again without refutation. These statistics reflect a constrained literature search rather than exhaustive coverage: within the top twenty semantic matches and citation expansions, no prior work appears to substantially anticipate the specific combination of shared prompts, a sparse MoE architecture, and prompt-attention score aggregation proposed here.

Given the limited search scope of twenty candidates and the sparse taxonomy leaf containing only two papers, the work appears to occupy a relatively unexplored niche within prompt-based continual learning. The absence of refuting candidates among examined papers suggests novelty in the specific architectural integration, though the small candidate pool and narrow taxonomy branch indicate that broader literature may contain related ideas not captured by this analysis. The framework's positioning between fully shared and fully task-specific prompt strategies represents a design choice that, based on available signals, has received limited prior exploration.

Taxonomy

Core-task Taxonomy Papers: 6
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Papers: 0

Research Landscape Overview

Core task: prompt-based continual learning with sparse expert selection. The field addresses how to learn sequential tasks without catastrophic forgetting by attaching learnable prompts to frozen pretrained models, often employing sparse mixture-of-experts mechanisms to activate only relevant subsets of prompts per task. The taxonomy reveals four main branches: Sparse Mixture-of-Experts Prompt Architectures focus on designing modular prompt pools with selective activation; Multi-Level and Adaptive Prompt Selection Strategies explore hierarchical or dynamic routing mechanisms; Consistency-Preserving Prompt Generation emphasizes maintaining stable representations across tasks; and Comprehensive Surveys and Cross-Domain Applications provide broader perspectives and extensions beyond vision or language domains.

Representative works such as Continual Task Allocation[1] and MoE Meets Prompting[6] illustrate how sparse gating and expert selection can be integrated into prompt-based frameworks, while Consistent MoE Prompt[3] highlights the importance of preserving consistency during expert updates. Recent efforts reveal a tension between architectural simplicity and adaptive capacity. Some lines of work pursue elaborate multi-level routing or vector quantization schemes, as seen in TIPS[2] and Vector Quantization Prompting[5], aiming to capture fine-grained task structure. Others advocate for streamlined designs that share a single prompt pool with dynamic expert activation, reducing overhead while maintaining competitive performance.

One-Prompt Strikes Back[0] falls into this latter category, emphasizing efficient sparse selection within a shared prompt architecture. Compared to MoE Meets Prompting[6], which also explores mixture-of-experts integration, One-Prompt Strikes Back[0] prioritizes simplicity and scalability, demonstrating that a unified prompt pool with selective activation can rival more complex hierarchical strategies. This positioning suggests that the field is actively exploring whether added architectural complexity consistently translates into better continual learning outcomes or whether simpler, well-tuned sparse mechanisms suffice.

Claimed Contributions

SMoPE framework integrating sparse MoE with prefix tuning

The authors introduce SMoPE, a framework that organizes a single shared prompt into multiple prompt experts within a sparse Mixture of Experts architecture. This design enables selective activation of relevant experts per input, reducing interference while maintaining parameter efficiency through a prompt-attention score aggregation mechanism.
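As a rough illustration of the described mechanism (not the paper's exact formulation; the mean-aggregation rule, tensor shapes, and all names here are assumptions of this report), sparse expert activation via aggregated prompt-attention scores could look like:

```python
import numpy as np

def select_prompt_experts(query, prefix_keys, k=2):
    """Sketch: pick a sparse subset of prompt experts for one input.

    query:       (d,)      pooled input representation
    prefix_keys: (E, p, d) prefix key vectors for E prompt experts,
                           each holding p prefix tokens
    Returns the indices of the top-k activated experts and their gate weights.
    """
    # Attention scores between the query and every prefix token -> (E, p)
    scores = prefix_keys @ query
    # Aggregate token-level scores into one proxy score per expert -> (E,)
    expert_scores = scores.mean(axis=1)
    # Sparse activation: keep only the top-k experts
    top_k = np.argsort(expert_scores)[-k:]
    # Softmax over the selected experts gives gate weights summing to 1
    logits = expert_scores[top_k]
    gates = np.exp(logits - logits.max())
    gates /= gates.sum()
    return top_k, gates
```

Only the `k` selected experts' prefixes would then be attached to the frozen backbone, which is what keeps per-input cost independent of the total number of experts.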

10 retrieved papers
Adaptive noise mechanism and prototype-based loss function

The authors propose two complementary techniques: an adaptive noise mechanism that promotes balanced use of underutilized experts while preserving important knowledge, and a prototype-based loss function that treats prefix keys from earlier tasks as implicit memory representations to prevent catastrophic forgetting.
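A minimal sketch of the two techniques as described above, under assumed forms (the inverse-usage noise scaling and squared-distance prototype loss below are illustrative choices by this report, not the paper's exact definitions):

```python
import numpy as np

def noisy_expert_scores(expert_scores, usage_counts, noise_scale=1.0, rng=None):
    """Adaptive-noise sketch: under-utilized experts receive larger noise,
    nudging the router to explore them while barely perturbing well-used ones."""
    rng = rng or np.random.default_rng(0)
    usage = usage_counts / usage_counts.sum()
    # Inverse-usage scaling: rarely selected experts get more exploration noise
    sigma = noise_scale * (1.0 - usage)
    return expert_scores + rng.normal(0.0, 1.0, expert_scores.shape) * sigma

def prototype_loss(features, prototypes, labels):
    """Prototype-loss sketch: pull each feature toward its class prototype
    (here, a stored prefix key acting as an implicit memory representation)."""
    diffs = features - prototypes[labels]
    return float((diffs ** 2).sum(axis=1).mean())
```

The intent is that the noise term rebalances expert utilization without overwriting the routing preferences learned on earlier tasks, while the prototype term anchors new features to prefix keys retained from those tasks.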

0 retrieved papers
State-of-the-art performance with reduced parameters and computation

The authors demonstrate that SMoPE achieves competitive or superior performance on continual learning benchmarks compared to existing methods, while requiring substantially fewer learnable parameters and reducing computational cost by up to 50% through efficient expert selection without full model forward passes.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

SMoPE framework integrating sparse MoE with prefix tuning

The authors introduce SMoPE, a framework that organizes a single shared prompt into multiple prompt experts within a sparse Mixture of Experts architecture. This design enables selective activation of relevant experts per input, reducing interference while maintaining parameter efficiency through a prompt-attention score aggregation mechanism.

Contribution

Adaptive noise mechanism and prototype-based loss function

The authors propose two complementary techniques: an adaptive noise mechanism that promotes balanced use of underutilized experts while preserving important knowledge, and a prototype-based loss function that treats prefix keys from earlier tasks as implicit memory representations to prevent catastrophic forgetting.

Contribution

State-of-the-art performance with reduced parameters and computation

The authors demonstrate that SMoPE achieves competitive or superior performance on continual learning benchmarks compared to existing methods, while requiring substantially fewer learnable parameters and reducing computational cost by up to 50% through efficient expert selection without full model forward passes.