Abstract:

Continual post-training (CPT) is a popular and effective technique for adapting foundation models such as multimodal large language models to specific, ever-evolving downstream tasks. While existing research has primarily concentrated on methods like data replay, model expansion, or parameter regularization, the fundamental role of the learning paradigm within CPT remains largely unexplored. This paper presents a comparative analysis of two core post-training paradigms: supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), investigating their respective impacts on knowledge retention during CPT. Our experiments are conducted on a benchmark comprising seven diverse multimodal tasks, using Qwen2.5-VL-7B-Instruct as the base model for continual post-training. The investigation yields two significant findings: (1) When learning downstream tasks sequentially, SFT causes catastrophic forgetting of previously learned tasks, whereas RFT inherently preserves prior knowledge and achieves performance comparable to multi-task training. (2) RFT protects and even enhances the model's general knowledge on standard benchmarks (e.g., MMMU and MMLU-Pro), whereas SFT severely degrades general capabilities. Further analysis reveals that this stability is not primarily due to explicit mechanisms such as the KL penalty or chain-of-thought reasoning. Instead, we identify an implicit regularization mechanism inherent to RFT as a key contributing factor. Our theoretical analysis suggests that RFT's gradient updates are naturally scaled by the reward variance, acting as a data-dependent regularizer that inherently protects previously acquired knowledge. Finally, we propose a rollout-based instance filtering algorithm to enhance the stability and efficiency of RFT. Our comprehensive study demonstrates the superiority of RFT as a robust paradigm for continual post-training.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper investigates supervised fine-tuning versus reinforcement fine-tuning for continual post-training of multimodal large language models, specifically examining catastrophic forgetting across seven diverse tasks using Qwen2.5-VL-7B-Instruct. It resides in the 'Post-Training Paradigm Comparison and Optimization' leaf, which contains only two papers total. This represents a relatively sparse research direction within the broader taxonomy of fifty papers, suggesting the paradigm-level comparison between SFT and RFT for continual multimodal learning remains underexplored despite growing interest in parameter-efficient and modular approaches elsewhere in the field.

The taxonomy reveals neighboring work in parameter-efficient continual learning (six papers using LoRA and adapters) and mixture-of-experts architectures (four papers employing dynamic expert selection). The paper's focus on training paradigms rather than architectural modifications distinguishes it from these adjacent clusters. Nearby branches address forgetting mitigation through replay and distillation (Vision-Language Model Forgetting Prevention, Multimodal LLM Knowledge Preservation), while the paper emphasizes inherent regularization properties of reinforcement learning itself. This positions the work at the intersection of training methodology and knowledge retention, bridging paradigm-level questions with practical forgetting concerns.

Among the thirty candidates examined, ten were compared against each contribution. For the comparative analysis of SFT versus RFT (Contribution 1) and the identification of implicit regularization in RFT (Contribution 2), no refutable prior work was found, suggesting these angles are relatively unexplored within the limited search scope. For the rollout-based instance filtering algorithm (Contribution 3), one potentially overlapping prior work was found, indicating some existing exploration of filtering mechanisms in reinforcement-based continual learning. These statistics suggest the paradigm comparison itself is more novel than the specific algorithmic technique, though the search remains constrained to the top thirty semantic matches.

Based on the limited literature search covering thirty candidates from semantic retrieval, the work appears to occupy a sparsely populated niche comparing fundamental training paradigms for multimodal continual learning. The taxonomy structure confirms that while parameter-efficient methods and architectural solutions dominate the field, paradigm-level investigations remain less common. However, the analysis does not cover exhaustive citation networks or domain-specific venues, leaving open the possibility of relevant work outside the examined scope.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: continual post-training of multimodal large language models. The field addresses how to incrementally update multimodal LLMs (models that process both vision and language) without catastrophic forgetting of previously learned capabilities. The taxonomy reveals several main branches:

- Continual Learning Paradigms and Training Strategies explores diverse training recipes and optimization techniques, including parameter-efficient methods like LoRA-based approaches and mixture-of-experts architectures (e.g., MoE Adapters VLM[6], CL-MoE[28]).
- Forgetting Mitigation and Knowledge Retention focuses on replay mechanisms, regularization, and knowledge distillation to preserve old skills.
- Benchmarks, Evaluation, and Taxonomies provides systematic assessments and surveys (e.g., VLM Continual Learning Survey[20], Continual Learning MLLM Survey[4]).
- Domain-Specific Continual Learning Applications targets specialized areas such as biomedical imaging (Continual Retinal Vision-Language[32]) and whole-slide pathology (Lifelong Whole Slide[19]).
- Foundational and Theoretical Perspectives examines underlying principles.
- Related Multimodal and LLM Research connects to broader vision-language and foundation model work (Cambrian-1[18], Foundation Language Models Survey[35]).

A particularly active line of work contrasts supervised fine-tuning with reinforcement-based post-training, exploring trade-offs between task-specific adaptation and retention of general capabilities. Reinforcement Fine-Tuning Mitigates Forgetting[0] sits within the Post-Training Paradigm Comparison and Optimization cluster, emphasizing how reinforcement learning can reduce forgetting compared to standard supervised approaches; this theme is echoed by Unsupervised Post-Training GRPO[14], which also investigates policy-gradient techniques for continual adaptation. Meanwhile, many studies adopt parameter-efficient strategies (Instance-Aware Prompting[10], Dynamic Curriculum LoRA[36]) or modular architectures (Lifelong Knowledge Editing MoE[5]) to isolate new knowledge from old. Open questions remain around balancing plasticity and stability, scaling to longer task sequences, and ensuring zero-shot generalization is not degraded (Preventing Zero-Shot Degradation[31]). The original paper's focus on reinforcement-based fine-tuning places it alongside emerging efforts that view post-training as a policy optimization problem, contrasting with earlier replay-heavy or distillation-centric methods (Learning Without Forgetting VLM[2], C-CLIP[3]).

Claimed Contributions

Comparative analysis of SFT and RFT for continual post-training

The authors conduct the first systematic comparison between supervised fine-tuning and reinforcement fine-tuning paradigms in continual post-training settings for multimodal large language models. They demonstrate that RFT inherently mitigates catastrophic forgetting on both downstream tasks and general benchmarks, while SFT suffers severe degradation.

10 retrieved papers
Identification of implicit regularization mechanism in RFT

The authors identify and theoretically analyze an implicit regularization mechanism inherent to RFT, showing that gradient updates are naturally scaled by reward variance. This data-dependent regularization protects previously acquired knowledge more effectively than explicit mechanisms like KL penalty or chain-of-thought reasoning.

10 retrieved papers
Rollout-based instance filtering algorithm (RIF-RFT)

The authors introduce RIF-RFT, a method that filters training instances based on rollout-generated reward signals before RFT training. This algorithm improves sample efficiency and training stability by removing uninformative instances whose rollouts produce zero reward variance, while preserving the anti-forgetting properties of RFT.

10 retrieved papers
Can Refute: 1 paper

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Comparative analysis of SFT and RFT for continual post-training

The authors conduct the first systematic comparison between supervised fine-tuning and reinforcement fine-tuning paradigms in continual post-training settings for multimodal large language models. They demonstrate that RFT inherently mitigates catastrophic forgetting on both downstream tasks and general benchmarks, while SFT suffers severe degradation.

Contribution

Identification of implicit regularization mechanism in RFT

The authors identify and theoretically analyze an implicit regularization mechanism inherent to RFT, showing that gradient updates are naturally scaled by reward variance. This data-dependent regularization protects previously acquired knowledge more effectively than explicit mechanisms like KL penalty or chain-of-thought reasoning.
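The variance-scaling argument can be illustrated with a minimal sketch, assuming a GRPO-style group-relative advantage (the paper's exact estimator may differ, and the function name here is illustrative): each rollout's policy-gradient weight is its reward centered and normalized within the group, so when all rollouts for an instance receive the same reward, the advantages vanish and the instance contributes essentially no update.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: reward minus the group mean, divided by
    the group standard deviation. Because the policy gradient for each
    rollout is weighted by its advantage, the update magnitude for an
    instance is tied to the spread (variance) of its rollout rewards."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Mixed outcomes: non-zero advantages, informative gradient.
mixed = group_advantages([1.0, 0.0, 1.0, 0.0])

# Uniform outcomes (e.g., every rollout correct): advantages collapse
# toward zero, so the instance barely moves the policy. This is the
# data-dependent damping the report describes as implicit regularization.
uniform = group_advantages([1.0, 1.0, 1.0, 1.0])
```

On tasks the model already masters (rewards uniformly high), updates are automatically damped, which is one way to read the claim that RFT protects previously acquired knowledge.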

Contribution

Rollout-based instance filtering algorithm (RIF-RFT)

The authors introduce RIF-RFT, a method that filters training instances based on rollout-generated reward signals before RFT training. This algorithm improves sample efficiency and training stability by removing uninformative instances whose rollouts produce zero reward variance, while preserving the anti-forgetting properties of RFT.
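A minimal sketch of the filtering idea described above (function and parameter names are hypothetical; the paper's actual RIF-RFT procedure may differ in detail): sample several rollouts per instance, score them, and drop instances whose rewards show no variance, since under a variance-scaled advantage those instances contribute no policy gradient.

```python
import statistics

def rif_filter(instances, rollout_fn, reward_fn, n_rollouts=8, eps=1e-6):
    """Keep only instances whose rollouts yield non-zero reward variance.
    Instances the model answers uniformly (all correct or all wrong)
    produce zero-variance rewards and hence near-zero gradients, so
    dropping them before RFT saves compute without changing the update."""
    kept = []
    for inst in instances:
        rewards = [reward_fn(inst, rollout_fn(inst)) for _ in range(n_rollouts)]
        if statistics.pstdev(rewards) > eps:
            kept.append(inst)
    return kept

# Toy usage with a deterministic stand-in for sampling from the model:
# "easy" is always answered correctly, "hard" alternates right/wrong.
_calls = {"n": 0}
def rollout_fn(inst):
    _calls["n"] += 1
    return _calls["n"] % 2 if inst == "hard" else 1

def reward_fn(inst, output):
    return float(output)

kept = rif_filter(["easy", "hard"], rollout_fn, reward_fn, n_rollouts=4)
# "easy" yields identical rewards (zero variance) and is filtered out;
# "hard" yields mixed rewards and is kept for training.
```

The filter is purely a pre-training pass over the data, so it leaves the RFT objective itself untouched, which is consistent with the claim that the anti-forgetting properties are preserved.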