Reversible Primitive–Composition Alignment for Continual Vision–Language Learning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: continual learning, vision-language models, catastrophic forgetting
Abstract:

Vision-language (VL) models are increasingly deployed in non-stationary settings, yet under sequential adaptation they often preserve primitive recognition while losing compositional structure, especially with tight rehearsal budgets and no task IDs. We address this gap by asking how a continual VL system can maintain structurally dependable behaviour while safeguarding zero-shot performance. We introduce Compo-ReAlign, a structure-first recipe built around three components: a reversible composer that maps primitive embeddings to compositions by design, a multi-positive InfoNCE that jointly aligns textual and composed views of the same target, and a spectral trust region that clips updates when alignment sensitivity inflates. Across compositional DIL and multi-domain MTIL retrieval, Compo-ReAlign sets a new state of the art, improves over the strongest prior by +2.4 R@1, and reduces forgetting by 40%. We provide a compact, reversible alignment head with geometry-aware training for compositionally robust VL continual learning.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes a paper's claimed tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Compo-ReAlign, a structure-first framework for continual vision-language learning that preserves compositional structure while maintaining zero-shot performance. It resides in the 'Cross-Modal Alignment Preservation in Continual Pre-training' leaf, which contains four papers total including the original work. This leaf sits within the broader 'Vision-Language Continual Pre-training and Foundation Models' branch, indicating a moderately populated research direction focused on maintaining cross-modal alignment during incremental pre-training rather than task-specific adaptation.

The taxonomy reveals neighboring leaves addressing related but distinct challenges: 'Modality-Incremental Vision-Language Pre-training' focuses on integrating new data streams, while 'Prompt-Based Continual Adaptation' emphasizes memory-efficient tuning of frozen backbones. The sibling papers in the same leaf—Compatible Momentum Contrast, Continual Multimodal Pretraining Guide, and one other—primarily address contrastive consistency and broad pre-training strategies. Compo-ReAlign diverges by emphasizing reversible compositional mechanisms and geometry-aware training, positioning it at the intersection of alignment preservation and compositional structure learning.

Among nine candidates examined across three contributions, the analysis found limited prior work overlap. The core Compo-ReAlign recipe examined four candidates with zero refutations, suggesting relative novelty in its structure-first approach. Phenomenon identification examined one candidate with no refutations. Text-centric micro-buffers examined four candidates with one refutation, indicating some overlap in rehearsal strategies. The search scope—nine papers from semantic matching—is modest, meaning these findings reflect top-ranked neighbors rather than exhaustive coverage of compositional continual learning literature.

Based on the limited search scope, the work appears to occupy a distinct position combining reversible composition with spectral trust regions for alignment-sensitive updates. The taxonomy structure shows this sits in a moderately explored area of continual pre-training, with clearer differentiation from prompt-based and modality-incremental neighbors. The contribution-level statistics suggest the structural recipe and diagnostic metrics are less anticipated by top-ranked prior work, though the micro-buffer strategy shows some precedent.

Taxonomy

- Core-task Taxonomy Papers: 25
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 9
- Refutable Paper: 1

Research Landscape Overview

Core task: Continual vision-language learning with compositional structure preservation. The field addresses how multimodal systems can learn sequentially without forgetting previously acquired knowledge while maintaining the ability to compose visual and linguistic concepts.

The taxonomy reveals several major branches. Vision-Language Continual Pre-training and Foundation Models focuses on large-scale pre-training strategies that preserve cross-modal alignment across tasks, exemplified by works like Compatible Momentum Contrast[3] and Continual Multimodal Pretraining Guide[15]. Compositional Structure Learning and Generalization emphasizes how systems can recombine primitives, as seen in Lifelong Compositional Structures[1] and Grounded Compositional Phrases[6]. Task-Specific Continual Multimodal Learning targets concrete applications such as visual question answering (VQACL Setting[8], Questions-Only Replay[7]) and audio-visual scenarios (Audio-Visual Class-Incremental[13]). Modular and Compositional Architectures explores parameter-efficient designs (Rehearsal-Free Modular[14], Dual-Purpose Mixture-of-Experts[25]), while other branches investigate encoding strategies (Byte-Pair Visual Encoding[5]), bio-inspired approaches (Bio-Inspired Robotics[11]), and domain-specific applications in robotics (Compositional Foundation Models[23]) and education (Multimodal Composition Technology[24]).

A central tension across these branches involves balancing plasticity for new tasks against stability of existing knowledge, particularly when compositional structures must be preserved. Many studies explore prompt-based methods (CP-Prompt[10], Hierarchical Prompt Composition[21]) and memory-efficient replay strategies (Questions-Only Memory[12], MemEIC[16]) to mitigate catastrophic forgetting.

Reversible Primitive Composition[0] sits within the Vision-Language Continual Pre-training branch, specifically addressing cross-modal alignment preservation. Compared to Compatible Momentum Contrast[3], which maintains contrastive consistency across incremental data, and Continual Multimodal Pretraining Guide[15], which surveys broader pre-training strategies, the original work emphasizes reversible mechanisms for composing and decomposing visual-linguistic primitives. This positions it as a structural approach to continual learning, contrasting with purely alignment-focused or survey-oriented neighbors, while sharing the branch's overarching goal of sustaining foundational multimodal representations over time.

Claimed Contributions

COMPO-REALIGN: a structure-first recipe for continual vision–language learning

The authors propose COMPO-REALIGN, a minimal training framework for continual VL systems consisting of three core components: (1) a reversible composer using orthogonal transformations to map primitives to compositions, (2) a multi-positive InfoNCE objective treating textual and composed embeddings as joint positives, and (3) a spectral trust region that constrains gradient updates based on Jacobian sensitivity.

4 retrieved papers
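As a sketch of how the first two components could fit together, the snippet below pairs an orthogonal composer, whose exact inverse is its transpose, with a multi-positive InfoNCE loss in which several keys (e.g. the textual view and the composed view of the same target) share one softmax. All names, shapes, and the summation-based composition rule are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_orthogonal(d, rng):
    """Draw a random orthogonal matrix via QR decomposition."""
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

d = 8
Q = random_orthogonal(d, rng)

def compose(primitives, Q):
    """Map a bag of primitive embeddings to one composition embedding."""
    return Q @ primitives.sum(axis=0)

def decompose(z, Q):
    """Exact inverse of the composer: Q is orthogonal, so Q^{-1} = Q^T."""
    return Q.T @ z

def multi_positive_infonce(query, keys, pos_mask, tau=0.07):
    """InfoNCE with multiple positives sharing a single softmax:
    loss = -log( sum_{positives} exp(sim/tau) / sum_{all} exp(sim/tau) )."""
    logits = keys @ query / tau
    logits -= logits.max()                     # numerical stability
    p = np.exp(logits)
    return -np.log(p[pos_mask].sum() / p.sum())

# Round trip: composing then decomposing recovers the primitive sum exactly,
# up to floating-point error, because the map is an isometry.
prims = rng.standard_normal((3, d))            # e.g. "red", "cube", "matte"
z = compose(prims, Q)
assert np.allclose(decompose(z, Q), prims.sum(axis=0))
```

The orthogonality constraint is what makes the composer reversible by construction rather than by a learned (approximate) decoder; in practice such a constraint would need to be maintained during training, e.g. by parameterizing the map through a QR or Cayley transform.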
Phenomenon identification and diagnostic metrics for compositional degradation

The authors identify a specific failure mode where continual VL models preserve primitive recognition while losing compositional structure, and introduce diagnostic metrics including compositional retention ratios, cycle consistency error, and Jacobian spectral indicators to measure and predict this degradation.

1 retrieved paper
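A minimal sketch of what such diagnostics might look like, assuming the alignment head is (near-)orthogonal; the function names, the trust-region rule, and the threshold value are hypothetical stand-ins, not taken from the paper:

```python
import numpy as np

def compositional_retention_ratio(acc_after, acc_before):
    """Share of compositional accuracy kept after sequential adaptation."""
    return acc_after / acc_before

def cycle_consistency_error(x, compose, decompose):
    """||x - decompose(compose(x))||; zero for an exactly reversible head."""
    return float(np.linalg.norm(x - decompose(compose(x))))

def top_singular_value(J):
    """Jacobian spectral indicator: largest singular value of the
    alignment head's local Jacobian."""
    return float(np.linalg.svd(J, compute_uv=False)[0])

def spectral_trust_region(update, J, threshold=1.5):
    """Shrink an update when the spectral indicator inflates past a bound."""
    s = top_singular_value(J)
    return update if s <= threshold else update * (threshold / s)

# For an orthogonal (hence isometric) map, the cycle error is ~0 and the
# spectral indicator is exactly 1.
Q, _ = np.linalg.qr(np.random.default_rng(1).standard_normal((6, 6)))
x = np.ones(6)
err = cycle_consistency_error(x, lambda v: Q @ v, lambda v: Q.T @ v)
```

Under this reading, a rising `top_singular_value` would flag the "alignment sensitivity inflation" that the spectral trust region is described as guarding against.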
Demonstration of text-centric micro-buffers as structural anchors

The authors demonstrate that using small text-centric buffers as symbolic scaffolds, combined with their alignment scheme, achieves superior compositional retention and reduced forgetting compared to image-centric replay and other baselines under identical memory constraints.

4 retrieved papers (one can refute)
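One plausible reading of a text-centric micro-buffer is a fixed-budget caption store filled by reservoir sampling, so every caption seen so far has equal probability of being retained under the memory constraint. The class below is an illustrative assumption; the paper's actual buffer policy is not specified here:

```python
import random

class TextMicroBuffer:
    """Fixed-budget buffer of text anchors, filled by reservoir sampling."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, caption):
        """Insert a caption; once full, replace uniformly at random so each
        of the `seen` captions is kept with probability capacity / seen."""
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(caption)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = caption

    def sample(self, k):
        """Draw up to k buffered captions for rehearsal."""
        return self.rng.sample(self.items, min(k, len(self.items)))
```

Because captions are orders of magnitude smaller than images, such a buffer can anchor many more compositions than image replay under the same byte budget, which is one way to motivate the reported advantage at identical memory constraints.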

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
