Reversible Primitive–Composition Alignment for Continual Vision–Language Learning
Overview
Overall Novelty Assessment
The paper introduces Compo-ReAlign, a structure-first framework for continual vision-language learning that preserves compositional structure while maintaining zero-shot performance. It resides in the 'Cross-Modal Alignment Preservation in Continual Pre-training' leaf, which contains four papers in total, including the work under review. This leaf sits within the broader 'Vision-Language Continual Pre-training and Foundation Models' branch, indicating a moderately populated research direction focused on maintaining cross-modal alignment during incremental pre-training rather than on task-specific adaptation.
The taxonomy reveals neighboring leaves addressing related but distinct challenges: 'Modality-Incremental Vision-Language Pre-training' focuses on integrating new data streams, while 'Prompt-Based Continual Adaptation' emphasizes memory-efficient tuning of frozen backbones. The sibling papers in the same leaf—Compatible Momentum Contrast, Continual Multimodal Pretraining Guide, and one other—primarily address contrastive consistency and broad pre-training strategies. Compo-ReAlign diverges by emphasizing reversible compositional mechanisms and geometry-aware training, positioning it at the intersection of alignment preservation and compositional structure learning.
Among nine candidates examined across three contributions, the analysis found limited prior work overlap. The core Compo-ReAlign recipe examined four candidates with zero refutations, suggesting relative novelty in its structure-first approach. Phenomenon identification examined one candidate with no refutations. Text-centric micro-buffers examined four candidates with one refutation, indicating some overlap in rehearsal strategies. The search scope—nine papers from semantic matching—is modest, meaning these findings reflect top-ranked neighbors rather than exhaustive coverage of compositional continual learning literature.
Within this limited search scope, the work appears to occupy a distinct position, combining reversible composition with spectral trust regions for alignment-sensitive updates. The taxonomy places it in a moderately explored area of continual pre-training, clearly differentiated from its prompt-based and modality-incremental neighbors. The contribution-level statistics suggest the structural recipe and diagnostic metrics are less anticipated by top-ranked prior work, while the micro-buffer strategy has some precedent.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose COMPO-REALIGN, a minimal training framework for continual VL systems consisting of three core components: (1) a reversible composer using orthogonal transformations to map primitives to compositions, (2) a multi-positive InfoNCE objective treating textual and composed embeddings as joint positives, and (3) a spectral trust region that constrains gradient updates based on Jacobian sensitivity.
The authors identify a specific failure mode where continual VL models preserve primitive recognition while losing compositional structure, and introduce diagnostic metrics including compositional retention ratios, cycle consistency error, and Jacobian spectral indicators to measure and predict this degradation.
The authors demonstrate that using small text-centric buffers as symbolic scaffolds, combined with their alignment scheme, achieves superior compositional retention and reduced forgetting compared to image-centric replay and other baselines under identical memory constraints.
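To make the second component concrete, the sketch below shows one plausible form of a multi-positive InfoNCE in which each image's textual and composed embeddings are treated as joint positives and all other samples' embeddings serve as negatives. This is an illustrative NumPy reconstruction under assumed names and shapes, not the authors' implementation:

```python
import numpy as np

def multi_positive_info_nce(image_emb, text_emb, composed_emb, temperature=0.07):
    """Multi-positive InfoNCE: each image's textual and composed embeddings
    are joint positives; other samples' embeddings are negatives.
    (Hypothetical sketch; names, shapes, and details are assumptions.)"""
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    img = normalize(image_emb)                                            # (B, D)
    pos = np.stack([normalize(text_emb), normalize(composed_emb)], axis=1)  # (B, 2, D)
    B = img.shape[0]
    sim = np.einsum('bd,npd->bnp', img, pos) / temperature                # (B, B, 2)
    logits = sim.reshape(B, -1)                                           # (B, 2B)
    logits -= logits.max(axis=-1, keepdims=True)                          # numeric stability
    prob = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    prob = prob.reshape(B, B, 2)
    # Probability mass each image assigns to its own two positives.
    own = prob[np.arange(B), np.arange(B)].sum(axis=-1)
    return float(-np.log(own).mean())
```

With perfectly aligned embeddings the loss approaches zero; with unrelated embeddings it approaches log of the number of candidates, which is why pulling both positives in jointly rewards retention of composed as well as raw textual structure.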
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[3] CTP: Towards Vision-Language Continual Pretraining via Compatible Momentum Contrast and Topology Preservation
[4] Continual learning for VLMs: A survey and taxonomy beyond forgetting
[15] A Practitioner's Guide to Real-World Continual Multimodal Pretraining
Contribution Analysis
Detailed comparisons for each claimed contribution
COMPO-REALIGN: a structure-first recipe for continual vision–language learning
The authors propose COMPO-REALIGN, a minimal training framework for continual VL systems consisting of three core components: (1) a reversible composer using orthogonal transformations to map primitives to compositions, (2) a multi-positive InfoNCE objective treating textual and composed embeddings as joint positives, and (3) a spectral trust region that constrains gradient updates based on Jacobian sensitivity.
[28] RoentGen: Vision-language foundation model for chest x-ray generation
[29] CLAP4CLIP: Continual learning with probabilistic finetuning for vision-language models
[30] Model developmental safety: A safety-centric method and applications in vision-language models
[31] Self-Controlled Dynamic Expansion Model for Continual Learning
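The third component, the spectral trust region, can be approximated by capping the spectral norm (largest singular value) of each weight update so that no single direction in the representation is perturbed beyond a threshold. The sketch below uses an exact singular-value cap as a simple stand-in; the paper's Jacobian-sensitivity criterion may differ in both the quantity measured and how the constraint is enforced:

```python
import numpy as np

def spectral_trust_region_step(weight, grad, lr=0.1, tau=0.05):
    """One gradient step with its spectral norm capped at tau.
    (Illustrative stand-in for the paper's Jacobian-sensitivity trust
    region; the authors' exact constraint is not reproduced here.)"""
    step = lr * grad
    sigma = np.linalg.norm(step, 2)      # largest singular value of the step
    if sigma > tau:
        step = step * (tau / sigma)      # project onto the trust-region boundary
    return weight - step
```

Small updates pass through unchanged; large, alignment-threatening updates are shrunk onto the trust-region boundary rather than clipped elementwise, which preserves the update's direction in parameter space.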
Phenomenon identification and diagnostic metrics for compositional degradation
The authors identify a specific failure mode where continual VL models preserve primitive recognition while losing compositional structure, and introduce diagnostic metrics including compositional retention ratios, cycle consistency error, and Jacobian spectral indicators to measure and predict this degradation.
[27] Continual Vision-Language Representation Learning with Off-Diagonal Information
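The diagnostics named above can be sketched as follows: a cycle consistency error that vanishes exactly when the composer is reversible on the probed primitives, and a retention ratio that compares compositional-accuracy retention against primitive retention. The definitions here are illustrative reconstructions, not the authors' exact formulas:

```python
import numpy as np

def cycle_consistency_error(compose, invert, primitives):
    """Mean reconstruction error after composing primitives and inverting
    back; zero exactly when the composer is reversible on these inputs."""
    recon = invert(compose(primitives))
    return float(np.linalg.norm(recon - primitives, axis=-1).mean())

def compositional_retention_ratio(comp_before, comp_after, prim_before, prim_after):
    """Compositional-accuracy retention normalized by primitive retention.
    Values well below 1 flag the failure mode above: primitives survive
    continual updates while compositional structure decays.
    (Illustrative definition, not the authors' exact metric.)"""
    return (comp_after / comp_before) / (prim_after / prim_before)
```

For an orthogonal composer W, the inverse is simply the transpose (z W followed by z W^T recovers z), so the cycle error is zero up to floating point; drift away from orthogonality during continual updates shows up directly in this metric.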
Demonstration of text-centric micro-buffers as structural anchors
The authors demonstrate that using small text-centric buffers as symbolic scaffolds, combined with their alignment scheme, achieves superior compositional retention and reduced forgetting compared to image-centric replay and other baselines under identical memory constraints.
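A text-centric micro-buffer of the kind described can be kept at a fixed byte budget with standard reservoir sampling, since captions are far cheaper to store than images and the same budget therefore holds many more anchors. The class below is a minimal stdlib sketch; the paper's actual sampling and replay policy may differ:

```python
import random

class TextMicroBuffer:
    """Fixed-capacity reservoir of captions used as symbolic scaffolds for
    rehearsal. (Illustrative sketch; the authors' buffer policy may differ.)"""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, caption):
        # Reservoir sampling (Algorithm R): after n additions, every caption
        # seen so far is retained with equal probability capacity / n.
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(caption)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = caption

    def sample(self, k):
        # Draw a rehearsal mini-batch without replacement.
        return self.rng.sample(self.items, min(k, len(self.items)))
```

Under an identical memory constraint, such a buffer rehearses text-side structure only, which matches the paper's framing of the buffer as a symbolic scaffold rather than a store of raw image exemplars.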