RLAP-CLIP: Continual Multimodal Learning with Prototype Adaptation and Difficulty-Aware Routing

ICLR 2026 Conference Submission
Anonymous Authors
Continual Multimodal Learning; Prototype Optimization; Mixture-of-Experts
Abstract:

Vision-language models such as CLIP achieve strong zero-shot performance through contrastive pre-training but face significant challenges in class-incremental image classification. When learning new tasks sequentially, current methods suffer from degraded prototype quality due to passive averaging and underutilize their visual adaptation capabilities. We propose RLAP-CLIP, which addresses these limitations through three components. First, Reinforcement Learning-based Prototype Optimization (RLPO) formulates prototype construction as a reinforcement learning problem, actively optimizing class separability rather than relying on simple averaging. Second, difficulty-aware cross-modal fusion uses a mixture-of-experts to route samples through specialized processing pathways based on sample complexity. Third, dual-modal prompting balances visual and textual adaptation. Experiments on eight image classification benchmarks demonstrate consistent improvements, with RLAP-CLIP achieving average accuracy gains of 3.72-4.46 points and final accuracy improvements of 0.49-4.48 points over prior methods, establishing state-of-the-art performance. Our source code is available at RLAP-CLIP.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes RLAP-CLIP, which addresses class-incremental learning in vision-language models through reinforcement learning-based prototype optimization, difficulty-aware cross-modal fusion, and dual-modal prompting. According to the taxonomy, this work resides in the 'Optimized Prototype Construction' leaf under 'Prototype and Classifier Construction'. Notably, this leaf contains only the original paper itself with no sibling papers, indicating a relatively sparse research direction. The broader parent category 'Prototype and Classifier Construction' contains just three papers total across two leaves, suggesting this is an emerging rather than crowded area within the field.

The taxonomy reveals that neighboring research directions focus on different aspects of the continual learning challenge. The sibling leaf 'Multimodal Prototype Learning' contains two papers emphasizing fusion of visual and textual modalities for prototype construction, while the broader 'Adaptation Mechanisms' branch (containing 15 papers across three leaves) explores prompt-based and parameter-efficient approaches. The 'Knowledge Preservation and Forgetting Mitigation' branch (7 papers) addresses complementary concerns about retaining old knowledge. RLAP-CLIP's reinforcement learning approach to prototype optimization appears distinct from these neighboring directions, which primarily rely on distillation, fusion, or prompt tuning strategies.

Among 26 candidates examined through limited semantic search, the contribution-level analysis reveals mixed novelty signals. For the core RLPO contribution, 6 candidates were examined and none provided a clear refutation, suggesting relative novelty in applying reinforcement learning to prototype construction. For the difficulty-aware cross-modal fusion, 10 candidates were examined without refutation. For the dual-modal prompting contribution, however, 10 candidates were examined and 1 refutable match was found, indicating some overlap with existing prompt-based adaptation work. Because the search scope was limited (26 papers, not exhaustive), these findings reflect top semantic matches rather than comprehensive prior-work coverage.

Given the sparse taxonomy position and limited search scope, the work appears to occupy a relatively unexplored niche within prototype construction for continual learning. The reinforcement learning formulation for prototype optimization shows no clear precedent among examined candidates, while the dual-modal prompting component has more established prior work. The analysis is constrained by examining only top-26 semantic matches, leaving open the possibility of additional relevant work in the broader literature beyond this search radius.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 26
Refutable Paper: 1

Research Landscape Overview

Core task: class-incremental learning in vision-language models. The field addresses how to extend pretrained vision-language models (e.g., CLIP) to recognize new classes over time without catastrophic forgetting. The taxonomy reveals several major branches: Adaptation Mechanisms for Vision-Language Models explores prompt tuning and parameter-efficient fine-tuning strategies (e.g., Conditional Prompt Learning[10], C-CLIP[16]); Knowledge Preservation and Forgetting Mitigation focuses on distillation and regularization techniques to retain old knowledge (e.g., Learning without forgetting[1], CLAP4CLIP[9]); Prototype and Classifier Construction investigates how to build robust class representations from limited data; Few-Shot Class-Incremental Learning tackles scenarios with minimal examples per new class (e.g., Few Shot VLM[17], Multimodal Few-Shot[36]); Task and Modality Heterogeneity examines cross-domain and multimodal challenges (e.g., MLLM-CL[2], Continual LLaVA[8]); and Benchmarks, Surveys, and Frameworks provide evaluation standards and overviews (e.g., CLIMB Benchmark[18], VLM Continual Survey[14]). These branches collectively address the tension between plasticity for new classes and stability for previously learned ones.

Within Prototype and Classifier Construction, a particularly active line of work centers on optimizing how class prototypes are formed and refined. RLAP-CLIP[0] falls squarely in this cluster, emphasizing reinforcement learning to adaptively construct prototypes that balance discriminability and generalization. This contrasts with approaches like Dynamic Multimodal Prototype[30], which leverages multimodal fusion for prototype updates, and Category-instance Distillation[33], which distills knowledge at both category and instance levels.

Meanwhile, works such as Preventing zero-shot degradation[5] and Zero-shot incremental detection[6] highlight the challenge of maintaining zero-shot capabilities while incrementally learning, a concern that intersects with prototype quality. The central tension across these directions is how to construct prototypes that are both stable under distribution shift and expressive enough to separate fine-grained classes, with RLAP-CLIP[0] proposing a reinforcement-driven optimization strategy as one promising avenue.

Claimed Contributions

Reinforcement Learning-based Prototype Optimization (RLPO)

The authors introduce RLPO, which transforms prototype construction from passive averaging into an active optimization framework. A policy network learns to assign importance weights to exemplar samples based on their contribution to class separability, addressing prototype quality degradation in continual learning scenarios.
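The paper does not expose implementation details here, but the described mechanism (a policy that assigns importance weights to exemplars, rewarded by class separability) can be sketched with a simple REINFORCE-style update. All names below (`rlpo_step`, `separability_reward`) and the mean pairwise-distance reward are illustrative assumptions, not the authors' actual objective or architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def separability_reward(prototypes):
    """Toy reward: mean pairwise distance between class prototypes
    (a stand-in for the paper's class-separability objective)."""
    K = len(prototypes)
    d = [np.linalg.norm(prototypes[i] - prototypes[j])
         for i in range(K) for j in range(i + 1, K)]
    return float(np.mean(d))

def rlpo_step(exemplars, scores, lr=0.5, n_samples=8):
    """One REINFORCE-style update of per-exemplar importance logits.

    exemplars: list of (n_c, d) arrays, one per class
    scores:    list of (n_c,) arrays of learnable logits
    Returns updated logits; softmax(logits) gives the exemplar weights
    used to form each class prototype as a weighted average.
    """
    # Baseline: reward of the current soft-weighted prototypes.
    baseline_protos = [softmax(s) @ X for X, s in zip(exemplars, scores)]
    baseline = separability_reward(baseline_protos)

    grads = [np.zeros_like(s) for s in scores]
    for _ in range(n_samples):
        protos, logp_grads = [], []
        for c, (X, s) in enumerate(zip(exemplars, scores)):
            p = softmax(s)
            i = rng.choice(len(p), p=p)        # sample an exemplar
            protos.append(X[i])
            g = -p.copy()
            g[i] += 1.0                        # grad of log p_i w.r.t. logits
            logp_grads.append((c, g))
        advantage = separability_reward(protos) - baseline
        for c, g in logp_grads:
            grads[c] += advantage * g / n_samples
    return [s + lr * g for s, g in zip(scores, grads)]
```

In this sketch, exemplars whose selection tends to increase inter-class prototype distance receive larger logits, so the final weighted prototype moves away from the plain (uniform) average that the paper identifies as the source of quality degradation.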

6 retrieved papers
Difficulty-aware cross-modal fusion with mixture-of-experts routing

The authors propose a mixture-of-experts mechanism that dynamically routes samples through specialized processing pathways (lightweight expert for easy samples, deep expert for hard samples) based on sample difficulty. This enables adaptive processing that provides enhanced capacity for challenging boundary cases while efficiently handling straightforward examples.
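A minimal sketch of this routing idea follows, assuming (since the report gives no details) that sample difficulty is proxied by the margin between the top-2 prototype similarities: a small margin means the sample lies near a class boundary and is sent to the deeper expert. The function names, the margin threshold, and the linear/MLP expert forms are all hypothetical.

```python
import numpy as np

def light_expert(x, W):
    # Lightweight pathway: a single linear projection.
    return x @ W

def deep_expert(x, W1, W2):
    # Deeper pathway: a two-layer MLP with ReLU, for hard samples.
    h = np.maximum(x @ W1, 0.0)
    return h @ W2

def route(x, proto_sims, threshold=0.2, **params):
    """Difficulty-aware routing (sketch).

    proto_sims: similarities of x to each class prototype.
    The difficulty proxy is the top-2 similarity margin: a small
    margin suggests a boundary case that needs the deep expert.
    """
    top2 = np.sort(proto_sims)[-2:]
    margin = top2[1] - top2[0]
    if margin >= threshold:
        return light_expert(x, params["W"]), "light"
    return deep_expert(x, params["W1"], params["W2"]), "deep"
```

A learned gating network would normally replace the fixed threshold; the hard if/else here just makes the easy-vs-hard dispatch explicit.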

10 retrieved papers
Enhanced dual-modal prompting for balanced visual and textual adaptation

The authors introduce dual-modal prompting with task-specific projections that jointly adapt both visual and textual features, addressing the asymmetric multimodal exploitation in existing methods. This approach ensures the model captures discriminative patterns across modalities without over-relying on either, maintaining flexibility for new tasks while preserving learned representations.
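The structure described (learnable prompts plus task-specific projections applied symmetrically to both modalities) can be sketched as below. The class name, initialization scale, and identity-initialized projections are illustrative assumptions; the actual RLAP-CLIP parameterization is not specified in this report.

```python
import numpy as np

class DualModalPrompt:
    """Sketch of task-specific dual-modal prompting.

    Maintains one set of learnable prompt tokens and one projection
    per task, for each modality, so visual and textual features are
    adapted jointly rather than asymmetrically.
    """
    def __init__(self, d, n_prompt, n_tasks, seed=0):
        rng = np.random.default_rng(seed)
        self.visual_prompts = rng.normal(0.0, 0.02, (n_tasks, n_prompt, d))
        self.text_prompts = rng.normal(0.0, 0.02, (n_tasks, n_prompt, d))
        # Task-specific projections, identity-initialized so a new
        # task starts from the unmodified pretrained features.
        self.proj_v = [np.eye(d) for _ in range(n_tasks)]
        self.proj_t = [np.eye(d) for _ in range(n_tasks)]

    def apply(self, visual_tokens, text_tokens, task_id):
        """Prepend the task's prompts and project both token streams."""
        v = np.concatenate(
            [self.visual_prompts[task_id], visual_tokens @ self.proj_v[task_id]],
            axis=0)
        t = np.concatenate(
            [self.text_prompts[task_id], text_tokens @ self.proj_t[task_id]],
            axis=0)
        return v, t
```

Because prompts and projections are indexed by task, earlier tasks' parameters stay frozen when a new task arrives, which matches the report's claim of preserving learned representations while remaining flexible for new classes.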

10 retrieved papers
Can Refute: 1 paper

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the retrieved top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Reinforcement Learning-based Prototype Optimization (RLPO)

The authors introduce RLPO, which transforms prototype construction from passive averaging into an active optimization framework. A policy network learns to assign importance weights to exemplar samples based on their contribution to class separability, addressing prototype quality degradation in continual learning scenarios.

Contribution

Difficulty-aware cross-modal fusion with mixture-of-experts routing

The authors propose a mixture-of-experts mechanism that dynamically routes samples through specialized processing pathways (lightweight expert for easy samples, deep expert for hard samples) based on sample difficulty. This enables adaptive processing that provides enhanced capacity for challenging boundary cases while efficiently handling straightforward examples.

Contribution

Enhanced dual-modal prompting for balanced visual and textual adaptation

The authors introduce dual-modal prompting with task-specific projections that jointly adapt both visual and textual features, addressing the asymmetric multimodal exploitation in existing methods. This approach ensures the model captures discriminative patterns across modalities without over-relying on either, maintaining flexibility for new tasks while preserving learned representations.