One-Prompt Strikes Back: Sparse Mixture of Experts for Prompt-based Continual Learning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Continual Learning, Prefix Tuning, Mixture of Experts
Abstract:

Prompt-based methods have recently gained prominence in Continual Learning (CL) due to their strong performance and memory efficiency. A prevalent strategy in this paradigm assigns a dedicated subset of prompts to each task, which, while effective, incurs substantial computational overhead and causes memory requirements to scale linearly with the number of tasks. Conversely, approaches employing a single shared prompt across tasks offer greater efficiency but often suffer from degraded performance due to knowledge interference. To reconcile this trade-off, we propose SMoPE, a novel framework that integrates the benefits of both task-specific and shared prompt strategies. Inspired by recent findings on the relationship between Prefix Tuning and Mixture of Experts (MoE), SMoPE organizes a shared prompt into multiple "prompt experts" within a sparse MoE architecture. For each input, only a select subset of relevant experts is activated, effectively mitigating interference. To facilitate expert selection, we introduce a prompt-attention score aggregation mechanism that computes a unified proxy score for each expert, enabling dynamic and sparse activation. Additionally, we propose an adaptive noise mechanism to encourage balanced expert utilization while preserving knowledge from prior tasks. To further enhance expert specialization, we design a prototype-based loss function that leverages prefix keys as implicit memory representations. Extensive experiments across multiple CL benchmarks demonstrate that SMoPE consistently outperforms task-specific prompt methods and achieves performance competitive with state-of-the-art approaches, all while significantly reducing parameter counts and computational costs.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes SMoPE, a framework that organizes a shared prompt into multiple sparse experts within a mixture-of-experts architecture for continual learning. According to the taxonomy, this work resides in the 'Shared Prompt with Dynamic Expert Activation' leaf under 'Sparse Mixture-of-Experts Prompt Architectures'. This leaf contains only two papers total, including the original work, indicating a relatively sparse research direction. The approach aims to balance the efficiency of shared prompts with the performance of task-specific strategies by activating only relevant expert subsets per input.

The taxonomy reveals that neighboring research directions include 'Meta-Policy Networks with Sparse Prompting' (one paper) and 'Two-Level Prompt Selection with Sparse Joint Tuning' (one paper), both exploring alternative sparse selection mechanisms. The broader 'Sparse Mixture-of-Experts Prompt Architectures' branch contains three papers total, while sibling branches pursue hierarchical routing or vector quantization techniques. The scope notes clarify that SMoPE's dynamic expert activation distinguishes it from task-specific allocation methods and meta-policy approaches, positioning it within a minimally populated but conceptually distinct research direction focused on input-driven sparse activation within unified prompt pools.

Among the twenty candidates examined across three contributions, no clearly refuting prior work was identified. For the core SMoPE framework contribution, ten candidates were examined with no refutable matches, suggesting limited direct overlap within the search scope. The adaptive noise mechanism and prototype-based loss received no dedicated examination, while ten candidates were likewise examined for the performance claim, again without refutation. These statistics reflect a constrained literature search rather than exhaustive coverage: within the top twenty semantic matches and citation expansions, no prior work appears to substantially anticipate the specific combination of shared prompts, a sparse MoE architecture, and prompt-attention score aggregation proposed here.

Given the limited search scope of twenty candidates and the sparse taxonomy leaf containing only two papers, the work appears to occupy a relatively unexplored niche within prompt-based continual learning. The absence of refuting candidates among examined papers suggests novelty in the specific architectural integration, though the small candidate pool and narrow taxonomy branch indicate that broader literature may contain related ideas not captured by this analysis. The framework's positioning between fully shared and fully task-specific prompt strategies represents a design choice that, based on available signals, has received limited prior exploration.

Taxonomy

Core-task Taxonomy Papers: 6
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Papers: 0

Research Landscape Overview

Core task: prompt-based continual learning with sparse expert selection. The field addresses how to learn sequential tasks without catastrophic forgetting by attaching learnable prompts to frozen pretrained models, often employing sparse mixture-of-experts mechanisms to activate only relevant subsets of prompts per task. The taxonomy reveals four main branches: Sparse Mixture-of-Experts Prompt Architectures focus on designing modular prompt pools with selective activation; Multi-Level and Adaptive Prompt Selection Strategies explore hierarchical or dynamic routing mechanisms; Consistency-Preserving Prompt Generation emphasizes maintaining stable representations across tasks; and Comprehensive Surveys and Cross-Domain Applications provide broader perspectives and extensions beyond vision or language domains.

Representative works such as Continual Task Allocation[1] and MoE Meets Prompting[6] illustrate how sparse gating and expert selection can be integrated into prompt-based frameworks, while Consistent MoE Prompt[3] highlights the importance of preserving consistency during expert updates. Recent efforts reveal a tension between architectural simplicity and adaptive capacity. Some lines of work pursue elaborate multi-level routing or vector quantization schemes, as seen in TIPS[2] and Vector Quantization Prompting[5], aiming to capture fine-grained task structure. Others advocate for streamlined designs that share a single prompt pool with dynamic expert activation, reducing overhead while maintaining competitive performance.

One-Prompt Strikes Back[0] falls into this latter category, emphasizing efficient sparse selection within a shared prompt architecture. Compared to MoE Meets Prompting[6], which also explores mixture-of-experts integration, One-Prompt Strikes Back[0] prioritizes simplicity and scalability, demonstrating that a unified prompt pool with selective activation can rival more complex hierarchical strategies. This positioning suggests that the field is actively exploring whether added architectural complexity consistently translates into better continual learning outcomes or whether simpler, well-tuned sparse mechanisms suffice.

Claimed Contributions

SMoPE framework integrating sparse MoE with prefix tuning

The authors introduce SMoPE, a framework that organizes a single shared prompt into multiple prompt experts within a sparse Mixture of Experts architecture. This design enables selective activation of relevant experts per input, reducing interference while maintaining parameter efficiency through a prompt-attention score aggregation mechanism.
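As a rough illustration of the described mechanism (not the paper's exact formulation; the mean-aggregation rule, tensor shapes, and all names here are assumptions of this report), sparse expert activation via aggregated prompt-attention scores could look like:

```python
import numpy as np

def select_prompt_experts(query, prefix_keys, k=2):
    """Sketch: pick a sparse subset of prompt experts for one input.

    query:       (d,)      pooled input representation
    prefix_keys: (E, p, d) prefix key vectors for E prompt experts,
                           each holding p prefix tokens
    Returns the indices of the top-k activated experts and their gate weights.
    """
    # Attention scores between the query and every prefix token -> (E, p)
    scores = prefix_keys @ query
    # Aggregate token-level scores into one proxy score per expert -> (E,)
    expert_scores = scores.mean(axis=1)
    # Sparse activation: keep only the top-k experts
    top_k = np.argsort(expert_scores)[-k:]
    # Softmax over the selected experts gives gate weights summing to 1
    logits = expert_scores[top_k]
    gates = np.exp(logits - logits.max())
    gates /= gates.sum()
    return top_k, gates
```

Only the `k` selected experts' prefixes would then be attached to the frozen backbone, which is what keeps per-input cost independent of the total number of experts.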

10 retrieved papers
Adaptive noise mechanism and prototype-based loss function

The authors propose two complementary techniques: an adaptive noise mechanism that promotes balanced use of underutilized experts while preserving important knowledge, and a prototype-based loss function that treats prefix keys from earlier tasks as implicit memory representations to prevent catastrophic forgetting.
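A minimal sketch of the two techniques as described above, under assumed forms (the inverse-usage noise scaling and squared-distance prototype loss below are illustrative choices by this report, not the paper's exact definitions):

```python
import numpy as np

def noisy_expert_scores(expert_scores, usage_counts, noise_scale=1.0, rng=None):
    """Adaptive-noise sketch: under-utilized experts receive larger noise,
    nudging the router to explore them while barely perturbing well-used ones."""
    rng = rng or np.random.default_rng(0)
    usage = usage_counts / usage_counts.sum()
    # Inverse-usage scaling: rarely selected experts get more exploration noise
    sigma = noise_scale * (1.0 - usage)
    return expert_scores + rng.normal(0.0, 1.0, expert_scores.shape) * sigma

def prototype_loss(features, prototypes, labels):
    """Prototype-loss sketch: pull each feature toward its class prototype
    (here, a stored prefix key acting as an implicit memory representation)."""
    diffs = features - prototypes[labels]
    return float((diffs ** 2).sum(axis=1).mean())
```

The intent is that the noise term rebalances expert utilization without overwriting the routing preferences learned on earlier tasks, while the prototype term anchors new features to prefix keys retained from those tasks.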

0 retrieved papers
State-of-the-art performance with reduced parameters and computation

The authors demonstrate that SMoPE achieves competitive or superior performance on continual learning benchmarks compared to existing methods, while requiring substantially fewer learnable parameters and reducing computational cost by up to 50% through efficient expert selection without full model forward passes.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

SMoPE framework integrating sparse MoE with prefix tuning

The authors introduce SMoPE, a framework that organizes a single shared prompt into multiple prompt experts within a sparse Mixture of Experts architecture. This design enables selective activation of relevant experts per input, reducing interference while maintaining parameter efficiency through a prompt-attention score aggregation mechanism.

Contribution

Adaptive noise mechanism and prototype-based loss function

The authors propose two complementary techniques: an adaptive noise mechanism that promotes balanced use of underutilized experts while preserving important knowledge, and a prototype-based loss function that treats prefix keys from earlier tasks as implicit memory representations to prevent catastrophic forgetting.

Contribution

State-of-the-art performance with reduced parameters and computation

The authors demonstrate that SMoPE achieves competitive or superior performance on continual learning benchmarks compared to existing methods, while requiring substantially fewer learnable parameters and reducing computational cost by up to 50% through efficient expert selection without full model forward passes.