RLAP-CLIP: Continual Multimodal Learning with Prototype Adaptation and Difficulty-Aware Routing

ICLR 2026 Conference Submission
Anonymous Authors
Continual Multimodal Learning; Prototype Optimization; Mixture-of-Experts
Abstract:

Vision-language models such as CLIP achieve strong zero-shot performance through contrastive pre-training but face significant challenges in class-incremental image classification. When learning new tasks sequentially, current methods suffer from degraded prototype quality due to passive averaging and underutilize their visual adaptation capabilities. We propose RLAP-CLIP, which addresses these limitations through three components. First, Reinforcement Learning-based Prototype Optimization (RLPO) formulates prototype construction as a reinforcement learning problem, actively optimizing class separability rather than relying on simple averaging. Second, difficulty-aware cross-modal fusion uses a mixture-of-experts to route samples through specialized processing pathways based on sample complexity. Third, dual-modal prompting balances visual and textual adaptation. Experiments on eight image classification benchmarks demonstrate consistent improvements, with RLAP-CLIP achieving average accuracy gains of 3.72-4.46 points and final accuracy improvements of 0.49-4.48 points over prior methods, establishing state-of-the-art performance. Our source code is available at RLAP-CLIP.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes RLAP-CLIP, which addresses class-incremental learning in vision-language models through reinforcement learning-based prototype optimization, difficulty-aware cross-modal fusion, and dual-modal prompting. According to the taxonomy, this work resides in the 'Optimized Prototype Construction' leaf under 'Prototype and Classifier Construction'. Notably, this leaf contains only the original paper itself with no sibling papers, indicating a relatively sparse research direction. The broader parent category 'Prototype and Classifier Construction' contains just three papers total across two leaves, suggesting this is an emerging rather than crowded area within the field.

The taxonomy reveals that neighboring research directions focus on different aspects of the continual learning challenge. The sibling leaf 'Multimodal Prototype Learning' contains two papers emphasizing fusion of visual and textual modalities for prototype construction, while the broader 'Adaptation Mechanisms' branch (containing 15 papers across three leaves) explores prompt-based and parameter-efficient approaches. The 'Knowledge Preservation and Forgetting Mitigation' branch (7 papers) addresses complementary concerns about retaining old knowledge. RLAP-CLIP's reinforcement learning approach to prototype optimization appears distinct from these neighboring directions, which primarily rely on distillation, fusion, or prompt tuning strategies.

Among 26 candidates examined through limited semantic search, the contribution-level analysis reveals mixed novelty signals. For the core RLPO contribution, 6 candidates were examined and none provided a clear refutation, suggesting relative novelty in applying reinforcement learning to prototype construction. For the difficulty-aware cross-modal fusion, 10 candidates were examined without refutation. For the dual-modal prompting contribution, however, 10 candidates were examined and 1 refutable match was found, indicating some overlap with existing prompt-based adaptation work. Because the search scope was limited (26 papers, not exhaustive), these findings reflect top semantic matches rather than comprehensive prior-work coverage.

Given the sparse taxonomy position and limited search scope, the work appears to occupy a relatively unexplored niche within prototype construction for continual learning. The reinforcement learning formulation for prototype optimization shows no clear precedent among examined candidates, while the dual-modal prompting component has more established prior work. The analysis is constrained by examining only top-26 semantic matches, leaving open the possibility of additional relevant work in the broader literature beyond this search radius.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 26
Refutable Paper: 1

Research Landscape Overview

Core task: class-incremental learning in vision-language models. The field addresses how to extend pretrained vision-language models (e.g., CLIP) to recognize new classes over time without catastrophic forgetting. The taxonomy reveals several major branches: Adaptation Mechanisms for Vision-Language Models explores prompt tuning and parameter-efficient fine-tuning strategies (e.g., Conditional Prompt Learning[10], C-CLIP[16]); Knowledge Preservation and Forgetting Mitigation focuses on distillation and regularization techniques to retain old knowledge (e.g., Learning without forgetting[1], CLAP4CLIP[9]); Prototype and Classifier Construction investigates how to build robust class representations from limited data; Few-Shot Class-Incremental Learning tackles scenarios with minimal examples per new class (e.g., Few Shot VLM[17], Multimodal Few-Shot[36]); Task and Modality Heterogeneity examines cross-domain and multimodal challenges (e.g., MLLM-CL[2], Continual LLaVA[8]); and Benchmarks, Surveys, and Frameworks provide evaluation standards and overviews (e.g., CLIMB Benchmark[18], VLM Continual Survey[14]). These branches collectively address the tension between plasticity for new classes and stability for previously learned ones.

Within Prototype and Classifier Construction, a particularly active line of work centers on optimizing how class prototypes are formed and refined. RLAP-CLIP[0] falls squarely in this cluster, emphasizing reinforcement learning to adaptively construct prototypes that balance discriminability and generalization. This contrasts with approaches like Dynamic Multimodal Prototype[30], which leverages multimodal fusion for prototype updates, and Category-instance Distillation[33], which distills knowledge at both category and instance levels.

Meanwhile, works such as Preventing zero-shot degradation[5] and Zero-shot incremental detection[6] highlight the challenge of maintaining zero-shot capabilities while incrementally learning, a concern that intersects with prototype quality. The central tension across these directions is how to construct prototypes that are both stable under distribution shift and expressive enough to separate fine-grained classes, with RLAP-CLIP[0] proposing a reinforcement-driven optimization strategy as one promising avenue.

Claimed Contributions

Reinforcement Learning-based Prototype Optimization (RLPO)

The authors introduce RLPO, which transforms prototype construction from passive averaging into an active optimization framework. A policy network learns to assign importance weights to exemplar samples based on their contribution to class separability, addressing prototype quality degradation in continual learning scenarios.
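The paper does not expose implementation details here, but the described mechanism (a policy that assigns importance weights to exemplars, rewarded by class separability) can be sketched with a simple REINFORCE-style update. All names below (`rlpo_step`, `separability_reward`) and the mean pairwise-distance reward are illustrative assumptions, not the authors' actual objective or architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def separability_reward(prototypes):
    """Toy reward: mean pairwise distance between class prototypes
    (a stand-in for the paper's class-separability objective)."""
    K = len(prototypes)
    d = [np.linalg.norm(prototypes[i] - prototypes[j])
         for i in range(K) for j in range(i + 1, K)]
    return float(np.mean(d))

def rlpo_step(exemplars, scores, lr=0.5, n_samples=8):
    """One REINFORCE-style update of per-exemplar importance logits.

    exemplars: list of (n_c, d) arrays, one per class
    scores:    list of (n_c,) arrays of learnable logits
    Returns updated logits; softmax(logits) gives the exemplar weights
    used to form each class prototype as a weighted average.
    """
    # Baseline: reward of the current soft-weighted prototypes.
    baseline_protos = [softmax(s) @ X for X, s in zip(exemplars, scores)]
    baseline = separability_reward(baseline_protos)

    grads = [np.zeros_like(s) for s in scores]
    for _ in range(n_samples):
        protos, logp_grads = [], []
        for c, (X, s) in enumerate(zip(exemplars, scores)):
            p = softmax(s)
            i = rng.choice(len(p), p=p)        # sample an exemplar
            protos.append(X[i])
            g = -p.copy()
            g[i] += 1.0                        # grad of log p_i w.r.t. logits
            logp_grads.append((c, g))
        advantage = separability_reward(protos) - baseline
        for c, g in logp_grads:
            grads[c] += advantage * g / n_samples
    return [s + lr * g for s, g in zip(scores, grads)]
```

In this sketch, exemplars whose selection tends to increase inter-class prototype distance receive larger logits, so the final weighted prototype moves away from the plain (uniform) average that the paper identifies as the source of quality degradation.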

6 retrieved papers
Difficulty-aware cross-modal fusion with mixture-of-experts routing

The authors propose a mixture-of-experts mechanism that dynamically routes samples through specialized processing pathways (lightweight expert for easy samples, deep expert for hard samples) based on sample difficulty. This enables adaptive processing that provides enhanced capacity for challenging boundary cases while efficiently handling straightforward examples.
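A minimal sketch of this routing idea follows, assuming (since the report gives no details) that sample difficulty is proxied by the margin between the top-2 prototype similarities: a small margin means the sample lies near a class boundary and is sent to the deeper expert. The function names, the margin threshold, and the linear/MLP expert forms are all hypothetical.

```python
import numpy as np

def light_expert(x, W):
    # Lightweight pathway: a single linear projection.
    return x @ W

def deep_expert(x, W1, W2):
    # Deeper pathway: a two-layer MLP with ReLU, for hard samples.
    h = np.maximum(x @ W1, 0.0)
    return h @ W2

def route(x, proto_sims, threshold=0.2, **params):
    """Difficulty-aware routing (sketch).

    proto_sims: similarities of x to each class prototype.
    The difficulty proxy is the top-2 similarity margin: a small
    margin suggests a boundary case that needs the deep expert.
    """
    top2 = np.sort(proto_sims)[-2:]
    margin = top2[1] - top2[0]
    if margin >= threshold:
        return light_expert(x, params["W"]), "light"
    return deep_expert(x, params["W1"], params["W2"]), "deep"
```

A learned gating network would normally replace the fixed threshold; the hard if/else here just makes the easy-vs-hard dispatch explicit.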

10 retrieved papers
Enhanced dual-modal prompting for balanced visual and textual adaptation

The authors introduce dual-modal prompting with task-specific projections that jointly adapt both visual and textual features, addressing the asymmetric multimodal exploitation in existing methods. This approach ensures the model captures discriminative patterns across modalities without over-relying on either, maintaining flexibility for new tasks while preserving learned representations.
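The structure described (learnable prompts plus task-specific projections applied symmetrically to both modalities) can be sketched as below. The class name, initialization scale, and identity-initialized projections are illustrative assumptions; the actual RLAP-CLIP parameterization is not specified in this report.

```python
import numpy as np

class DualModalPrompt:
    """Sketch of task-specific dual-modal prompting.

    Maintains one set of learnable prompt tokens and one projection
    per task, for each modality, so visual and textual features are
    adapted jointly rather than asymmetrically.
    """
    def __init__(self, d, n_prompt, n_tasks, seed=0):
        rng = np.random.default_rng(seed)
        self.visual_prompts = rng.normal(0.0, 0.02, (n_tasks, n_prompt, d))
        self.text_prompts = rng.normal(0.0, 0.02, (n_tasks, n_prompt, d))
        # Task-specific projections, identity-initialized so a new
        # task starts from the unmodified pretrained features.
        self.proj_v = [np.eye(d) for _ in range(n_tasks)]
        self.proj_t = [np.eye(d) for _ in range(n_tasks)]

    def apply(self, visual_tokens, text_tokens, task_id):
        """Prepend the task's prompts and project both token streams."""
        v = np.concatenate(
            [self.visual_prompts[task_id], visual_tokens @ self.proj_v[task_id]],
            axis=0)
        t = np.concatenate(
            [self.text_prompts[task_id], text_tokens @ self.proj_t[task_id]],
            axis=0)
        return v, t
```

Because prompts and projections are indexed by task, earlier tasks' parameters stay frozen when a new task arrives, which matches the report's claim of preserving learned representations while remaining flexible for new classes.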

10 retrieved papers
Can Refute: 1 paper

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the retrieved top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Reinforcement Learning-based Prototype Optimization (RLPO)

The authors introduce RLPO, which transforms prototype construction from passive averaging into an active optimization framework. A policy network learns to assign importance weights to exemplar samples based on their contribution to class separability, addressing prototype quality degradation in continual learning scenarios.

Contribution

Difficulty-aware cross-modal fusion with mixture-of-experts routing

The authors propose a mixture-of-experts mechanism that dynamically routes samples through specialized processing pathways (lightweight expert for easy samples, deep expert for hard samples) based on sample difficulty. This enables adaptive processing that provides enhanced capacity for challenging boundary cases while efficiently handling straightforward examples.

Contribution

Enhanced dual-modal prompting for balanced visual and textual adaptation

The authors introduce dual-modal prompting with task-specific projections that jointly adapt both visual and textual features, addressing the asymmetric multimodal exploitation in existing methods. This approach ensures the model captures discriminative patterns across modalities without over-relying on either, maintaining flexibility for new tasks while preserving learned representations.