RL's Razor: Why Online Reinforcement Learning Forgets Less

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: Reinforcement Learning, Large Language Models, Catastrophic Forgetting
Abstract:

Comparing models fine-tuned with reinforcement learning (RL) and supervised fine-tuning (SFT) reveals that, despite similar performance on a new task, RL preserves prior knowledge and capabilities significantly better. We find that the degree of forgetting is determined by the distributional shift, measured as the KL divergence between the fine-tuned and base policy evaluated on the new task. Our analysis reveals that on-policy RL is implicitly biased toward KL-minimal solutions among the many that solve the new task, whereas SFT can converge to distributions arbitrarily far from the base model. We validate these findings through experiments with large language models and robotic foundation models, and further provide theoretical justification for why on-policy RL updates lead to a smaller KL change. We term this principle RL's Razor: among all ways to solve a new task, RL prefers those closest in KL to the original model.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces RL's Razor, a principle explaining why reinforcement learning fine-tuning preserves prior knowledge better than supervised fine-tuning by implicitly minimizing KL divergence from the base model. It resides in the 'Scaling Laws and Distributional Analysis' leaf alongside two sibling papers examining forgetting through scaling and distributional lenses. This leaf sits within the broader 'Analysis and Characterization of Forgetting Phenomena' branch, which contains four leaves spanning scaling laws, multimodal forgetting, task interference, and feature preservation. The analytical focus distinguishes this work from the field's dominant mitigation-oriented branches.

The taxonomy reveals substantial activity in mitigation strategies, with three major branches dedicated to regularization, parameter-efficient fine-tuning, and rehearsal methods. The paper's analytical positioning connects it to neighboring leaves examining task interference mechanisms and feature preservation dynamics, yet diverges by focusing specifically on distributional shift quantification rather than task-level or representation-level analysis. The 'Domain-Specific and Continual Learning Applications' branch, containing five leaves, suggests active translation of forgetting insights to specialized settings, while the paper maintains a domain-agnostic theoretical stance grounded in KL-divergence characterization.

Among the twenty-nine candidates examined, the theoretical justification for on-policy RL's KL-minimal convergence encountered two potentially refutable prior works, while the empirical forgetting law and RL's Razor principle showed no clear refutation across ten candidates each. The limited search scope means these statistics reflect top-semantic-match coverage rather than an exhaustive field review. The empirical law linking KL divergence to forgetting and the RL's Razor principle appear more novel within this candidate set, whereas the theoretical convergence claims face more substantial prior-work overlap, suggesting this contribution may build incrementally on existing RL theory.

Based on the twenty-nine-candidate search, the work appears to occupy a moderately explored analytical niche, with the empirical and conceptual contributions showing stronger novelty signals than the theoretical justification. The taxonomy structure indicates this is a growing but not yet saturated research direction, with only three papers in the immediate leaf. However, the analysis cannot assess novelty against the full literature landscape or against specialized RL theory venues not captured in this semantic search.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 2

Research Landscape Overview

Core task: catastrophic forgetting in foundation model fine-tuning. The field addresses how large pretrained models lose previously acquired capabilities when adapted to new tasks or domains. The taxonomy organizes research into several major branches: mitigation strategies via regularization and constraint (e.g., preserving important parameters or enforcing stability), parameter-efficient fine-tuning methods (such as LoRA variants that update only small subsets of weights), rehearsal and data selection approaches (replaying or curating examples from original distributions), selective parameter updates (identifying which layers or modules to modify), analysis and characterization of forgetting phenomena (understanding when and why forgetting occurs), domain-specific and continual learning applications (applying these ideas to specialized settings), surveys and benchmarks (providing broader perspectives), and alternative perspectives including unlearning. Works like Catastrophic Forgetting Multimodal [1] and Half Fine Tuning [4] illustrate how different branches tackle the problem from complementary angles, whether through architectural choices or training protocols.

A particularly active line of inquiry focuses on understanding the fundamental mechanisms and scaling behaviors of forgetting. Studies such as Scaling Laws Forgetting [3] and Scale Effect Forgetting [15] examine how model size, dataset characteristics, and training dynamics influence the severity of catastrophic forgetting, revealing that larger models or different data distributions can exhibit distinct forgetting patterns. RL's Razor [0] sits within this analytical branch, emphasizing distributional and scaling perspectives to characterize forgetting phenomena.

Compared to neighboring works like Scaling Laws Forgetting [3], which may focus on empirical scaling trends, and Scale Effect Forgetting [15], which explores the interplay between model scale and retention, RL's Razor [0] appears to offer a complementary lens on how reinforcement learning or related optimization pressures interact with forgetting dynamics. This analytical focus contrasts with the mitigation-oriented branches, highlighting ongoing questions about whether forgetting is an inevitable trade-off or a phenomenon that can be fundamentally reshaped through better understanding of model behavior and data geometry.

Claimed Contributions

Empirical forgetting law linking KL divergence to catastrophic forgetting

The authors discover that the degree of catastrophic forgetting during fine-tuning can be reliably predicted by measuring the KL divergence between the fine-tuned and base policy on the new task distribution, independent of training algorithm or hyperparameters.

10 retrieved papers
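To make the quantity this law relies on concrete, here is a minimal NumPy sketch (my own illustration, not the authors' code) that estimates the forward KL divergence KL(π_ft ∥ π_base) from next-token logits on a batch of new-task contexts. The logits are random stand-ins for real model outputs, and the small additive noise stands in for the distributional shift induced by fine-tuning.

```python
import numpy as np

def kl_divergence(p_logits, q_logits):
    """Forward KL D(p || q) between two categoricals given unnormalized logits."""
    p = np.exp(p_logits - p_logits.max()); p /= p.sum()
    q = np.exp(q_logits - q_logits.max()); q /= q.sum()
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Random stand-ins for next-token logits of the base and fine-tuned policies
# on a batch of new-task contexts (3 contexts, vocabulary of 5 tokens).
rng = np.random.default_rng(0)
base_logits = rng.normal(size=(3, 5))
ft_logits = base_logits + 0.1 * rng.normal(size=(3, 5))  # small distributional shift

# The claimed predictor of forgetting: mean KL(fine-tuned || base) on the new task.
shift = float(np.mean([kl_divergence(f, b) for f, b in zip(ft_logits, base_logits)]))
print(f"mean KL(fine-tuned || base): {shift:.4f}")
```

In the paper's framing, larger values of this quantity predict more forgetting of prior capabilities, regardless of which training algorithm or hyperparameters produced the shift.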
RL's Razor principle explaining RL's implicit KL minimization

The authors introduce RL's Razor, a principle stating that on-policy reinforcement learning methods are inherently biased toward solutions that minimize KL divergence from the base model among all high-reward solutions, unlike supervised fine-tuning which can converge to arbitrarily distant distributions.

10 retrieved papers
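The principle can be illustrated with a toy construction (mine, not the paper's): several distributions over actions all achieve near-maximal reward on the new task, but they differ in how far they sit from the base policy; RL's Razor says on-policy RL is biased toward the KL-closest one.

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) for strictly positive categoricals."""
    return float(np.sum(p * np.log(p / q)))

base = np.array([0.25, 0.25, 0.25, 0.25])  # base policy over 4 actions
# Suppose actions 0 and 1 both solve the new task; actions 2 and 3 do not.
candidates = [
    np.array([0.97, 0.01, 0.01, 0.01]),  # nearly all mass on action 0
    np.array([0.01, 0.97, 0.01, 0.01]),  # nearly all mass on action 1
    np.array([0.49, 0.49, 0.01, 0.01]),  # mass spread over both solutions
]
rewards = [p[0] + p[1] for p in candidates]  # all three ~0.98: equally good
kls = [kl(p, base) for p in candidates]      # but not equally close to base
razor_choice = int(np.argmin(kls))           # the spread solution (index 2)
print(rewards, [round(k, 3) for k in kls], razor_choice)
```

All three candidates earn the same reward, yet the spread solution has roughly half the KL of the concentrated ones. The claim is that on-policy sampling naturally steers updates toward such a solution, while SFT on a dataset demonstrating only one of the solutions can land on a concentrated, KL-distant policy.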
Theoretical justification for on-policy methods converging to KL-minimal solutions

The authors provide theoretical analysis (Theorem 5.2) showing that policy gradient methods converge to KL-minimal optimal policies within the representable family, formalizing why on-policy training naturally produces smaller distributional shifts than offline methods.

9 retrieved papers (can refute)
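A toy simulation (my own construction, under strong simplifying assumptions: a single-state bandit with a softmax policy over three actions, and the exact expected policy gradient rather than sampled updates) illustrates the claimed mechanism: on-policy policy gradient leaves the relative preference between equally rewarded actions unchanged and only suppresses the unrewarded one, while cross-entropy SFT on demonstrations that happen to use a single solution drags the policy much farther from the base.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

base_theta = np.log(np.array([0.45, 0.45, 0.10]))  # base policy over 3 actions
base = softmax(base_theta)
reward = np.array([1.0, 1.0, 0.0])  # actions 0 and 1 both solve the new task
lr, steps = 0.1, 5000

# On-policy policy gradient (exact expectation over the current policy):
# d/d(theta_i) of E_{a~pi}[r(a)] is p_i * (r_i - average reward).
theta = base_theta.copy()
for _ in range(steps):
    p = softmax(theta)
    theta += lr * p * (reward - p @ reward)
rl_policy = softmax(theta)

# SFT with cross-entropy on demonstrations that always use action 1.
theta = base_theta.copy()
for _ in range(steps):
    p = softmax(theta)
    grad = -p
    grad[1] += 1.0  # gradient of log-likelihood of the demonstrated action
    theta += lr * grad
sft_policy = softmax(theta)

kl_rl, kl_sft = kl(rl_policy, base), kl(sft_policy, base)
print(f"RL policy  {np.round(rl_policy, 3)}  KL from base: {kl_rl:.3f}")
print(f"SFT policy {np.round(sft_policy, 3)}  KL from base: {kl_sft:.3f}")
```

Both runs end with near-maximal reward on the new task, but the RL policy keeps the base model's even split between the two valid solutions, while the SFT policy collapses onto the demonstrated action and ends several times farther away in KL. This mirrors, in miniature, the behavior the paper's Theorem 5.2 is said to formalize; it is an illustration of the claim, not a substitute for the theorem or its proof.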

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Empirical forgetting law linking KL divergence to catastrophic forgetting


Contribution

RL's Razor principle explaining RL's implicit KL minimization


Contribution

Theoretical justification for on-policy methods converging to KL-minimal solutions
