Reinforcement Fine-Tuning Naturally Mitigates Forgetting in Continual Post-Training
Overview
Overall Novelty Assessment
The paper investigates supervised fine-tuning versus reinforcement fine-tuning for continual post-training of multimodal large language models, specifically examining catastrophic forgetting across seven diverse tasks using Qwen2.5-VL-7B-Instruct. It resides in the 'Post-Training Paradigm Comparison and Optimization' leaf, which contains only two papers total. This represents a relatively sparse research direction within the broader taxonomy of fifty papers, suggesting the paradigm-level comparison between SFT and RFT for continual multimodal learning remains underexplored despite growing interest in parameter-efficient and modular approaches elsewhere in the field.
The taxonomy reveals neighboring work in parameter-efficient continual learning (six papers using LoRA and adapters) and mixture-of-experts architectures (four papers employing dynamic expert selection). The paper's focus on training paradigms rather than architectural modifications distinguishes it from these adjacent clusters. Nearby branches address forgetting mitigation through replay and distillation (Vision-Language Model Forgetting Prevention, Multimodal LLM Knowledge Preservation), while the paper emphasizes inherent regularization properties of reinforcement learning itself. This positions the work at the intersection of training methodology and knowledge retention, bridging paradigm-level questions with practical forgetting concerns.
The novelty search examined thirty candidates in total, ten per contribution. For the comparative analysis of SFT versus RFT (Contribution 1) and the identification of implicit regularization in RFT (Contribution 2), the search found zero refutable prior work, suggesting these angles are relatively unexplored within the search scope. For the rollout-based instance filtering algorithm (Contribution 3), it surfaced one potentially overlapping prior work, indicating some existing exploration of filtering mechanisms in reinforcement-based continual learning. These statistics suggest the paradigm comparison itself is more novel than the specific algorithmic technique, though the search remains constrained to the top thirty semantic matches.
Based on the limited literature search covering thirty candidates from semantic retrieval, the work appears to occupy a sparsely populated niche comparing fundamental training paradigms for multimodal continual learning. The taxonomy structure confirms that while parameter-efficient methods and architectural solutions dominate the field, paradigm-level investigations remain less common. However, the analysis does not cover exhaustive citation networks or domain-specific venues, leaving open the possibility of relevant work outside the examined scope.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors conduct the first systematic comparison between supervised fine-tuning and reinforcement fine-tuning paradigms in continual post-training settings for multimodal large language models. They demonstrate that RFT inherently mitigates catastrophic forgetting on both downstream tasks and general benchmarks, while SFT suffers severe degradation.
The authors identify and theoretically analyze an implicit regularization mechanism inherent to RFT, showing that gradient updates are naturally scaled by reward variance. This data-dependent regularization protects previously acquired knowledge more effectively than explicit mechanisms such as a KL penalty or chain-of-thought reasoning.
The authors introduce RIF-RFT, a method that filters training instances using rollout-generated reward signals before RFT training begins. By discarding incompetent samples, i.e., those whose rollouts all earn the same reward and thus produce zero reward variance, the algorithm improves sample efficiency and training stability while preserving the anti-forgetting properties of RFT.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[14] Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO PDF
Contribution Analysis
Detailed comparisons for each claimed contribution
Comparative analysis of SFT and RFT for continual post-training
The authors conduct the first systematic comparison between supervised fine-tuning and reinforcement fine-tuning paradigms in continual post-training settings for multimodal large language models. They demonstrate that RFT inherently mitigates catastrophic forgetting on both downstream tasks and general benchmarks, while SFT suffers severe degradation.
[1] Mllm-cl: Continual learning for multimodal large language models PDF
[13] Merge then Realign: Simple and Effective Modality-Incremental Continual Learning for Multimodal LLMs PDF
[35] Recent advances of foundation language models-based continual learning: A survey PDF
[61] Model tailor: Mitigating catastrophic forgetting in multi-modal large language models PDF
[62] An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-Tuning PDF
[63] Investigating the Catastrophic Forgetting in Multimodal Large Language Models PDF
[64] Keeping yourself is important in downstream tuning multimodal large language model PDF
[65] Modality-inconsistent continual learning of multimodal large language models PDF
[66] Analyzing and reducing catastrophic forgetting in parameter efficient tuning PDF
[67] Lorasculpt: Sculpting lora for harmonizing general and specialized knowledge in multimodal large language models PDF
Identification of implicit regularization mechanism in RFT
The authors identify and theoretically analyze an implicit regularization mechanism inherent to RFT, showing that gradient updates are naturally scaled by reward variance. This data-dependent regularization protects previously acquired knowledge more effectively than explicit mechanisms such as a KL penalty or chain-of-thought reasoning.
[51] Training language models to self-correct via reinforcement learning PDF
[52] Offline regularised reinforcement learning for large language models alignment PDF
[53] CPGD: Toward Stable Rule-based Reinforcement Learning for Language Models PDF
[54] Cream: Consistency regularized self-rewarding language models PDF
[55] Adaptive Divergence Regularized Policy Optimization for Fine-tuning Generative Models PDF
[56] Language model adaption for reinforcement learning with natural language action space PDF
[57] Docthinker: Explainable multimodal large language models with rule-based reinforcement learning for document understanding PDF
[58] GRPO-LEAD: A Difficulty-Aware Reinforcement Learning Approach for Concise Mathematical Reasoning in Language Models PDF
[59] Logarithmic Regret for Online KL-Regularized Reinforcement Learning PDF
[60] Implicit regularization of sharpness-aware minimization for scale-invariant problems PDF
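The variance-scaling claim above can be illustrated numerically. The sketch below is not the paper's derivation; it assumes a GRPO-style group-normalized advantage (advantage = reward minus group mean, divided by group standard deviation plus a small epsilon), a common formulation in rule-based RFT. Under that assumption, a prompt whose rollouts all receive the same reward yields zero advantages and therefore contributes nothing to the policy gradient, which is the data-dependent protection mechanism described here.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantages for one prompt's group of rollouts:
    A_i = (r_i - mean(r)) / (std(r) + eps).
    When every rollout earns the same reward, std(r) == 0 and every
    numerator is 0, so all advantages -- and the resulting policy
    gradient for this prompt -- vanish."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Behaviour the model already produces consistently: zero reward variance,
# zero advantages, no update -- previously acquired knowledge is untouched.
print(group_advantages([1.0, 1.0, 1.0, 1.0]))  # ~[0. 0. 0. 0.]

# Mixed outcomes: nonzero variance, so the update concentrates on prompts
# that actually carry a learning signal.
print(group_advantages([1.0, 0.0, 1.0, 0.0]))  # ~[ 1. -1.  1. -1.]
```

In contrast, the SFT cross-entropy gradient does not carry this factor: it pushes the model toward the target on every sample regardless of how the model currently behaves, which is consistent with the severe degradation reported for SFT.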
Rollout-based instance filtering algorithm (RIF-RFT)
The authors introduce RIF-RFT, a method that filters training instances using rollout-generated reward signals before RFT training begins. By discarding incompetent samples, i.e., those whose rollouts all earn the same reward and thus produce zero reward variance, the algorithm improves sample efficiency and training stability while preserving the anti-forgetting properties of RFT.
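The filtering step can be sketched as follows. This is a minimal illustration, not the authors' implementation: `sample_rewards` is a hypothetical hook standing in for running the policy's rollouts and scoring them with the task reward, and the zero-variance test is applied symmetrically (it drops both all-fail and all-succeed groups, since neither yields a gradient under group-normalized advantages; the text above names only the incompetent, all-fail case).

```python
import numpy as np

def filter_instances(dataset, sample_rewards):
    """Rollout-based instance filtering (sketch).

    `sample_rewards(instance)` is an assumed hook that performs the
    rollouts for one training instance and returns one scalar reward per
    rollout. Instances whose rewards have zero variance produce all-zero
    group-normalized advantages, hence no learning signal, so they are
    removed before RFT training to save compute and stabilize updates.
    """
    kept = []
    for inst in dataset:
        rewards = np.asarray(sample_rewards(inst), dtype=float)
        if rewards.std() > 0.0:  # mixed outcomes -> informative instance
            kept.append(inst)
    return kept

# Toy demo with precomputed rollout rewards standing in for real rollouts.
toy_rewards = {
    "q1": [0, 0, 0, 0],  # incompetent: every rollout fails
    "q2": [1, 0, 1, 1],  # informative: mixed outcomes, nonzero variance
    "q3": [1, 1, 1, 1],  # already mastered: no signal either
}
print(filter_instances(["q1", "q2", "q3"], toy_rewards.__getitem__))
# -> ['q2']
```

The design point is that filtering happens once, before training, using the same reward signal RFT optimizes, so it changes which instances are seen rather than how the RFT update itself works, which is why the anti-forgetting behaviour of RFT is preserved.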