RL's Razor: Why Online Reinforcement Learning Forgets Less

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: Reinforcement Learning, Large Language Models, Catastrophic Forgetting
Abstract:

Comparing models fine-tuned with reinforcement learning (RL) and supervised fine-tuning (SFT) reveals that, despite similar performance on a new task, RL preserves prior knowledge and capabilities significantly better. We find that the degree of forgetting is determined by the distributional shift, measured as the KL divergence between the fine-tuned and base policy evaluated on the new task. Our analysis reveals that on-policy RL is implicitly biased toward KL-minimal solutions among the many that solve the new task, whereas SFT can converge to distributions arbitrarily far from the base model. We validate these findings through experiments with large language models and robotic foundation models, and further provide theoretical justification for why on-policy RL updates lead to a smaller KL change. We term this principle RL's Razor: among all ways to solve a new task, RL prefers those closest in KL to the original model.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces RL's Razor, a principle explaining why reinforcement learning fine-tuning preserves prior knowledge better than supervised fine-tuning by implicitly minimizing KL divergence from the base model. It resides in the 'Scaling Laws and Distributional Analysis' leaf alongside two sibling papers examining forgetting through scaling and distributional lenses. This leaf sits within the broader 'Analysis and Characterization of Forgetting Phenomena' branch, which contains four leaves spanning scaling laws, multimodal forgetting, task interference, and feature preservation. The analytical focus distinguishes this work from the field's dominant mitigation-oriented branches.

The taxonomy reveals substantial activity in mitigation strategies, with three major branches dedicated to regularization, parameter-efficient fine-tuning, and rehearsal methods. The paper's analytical positioning connects it to neighboring leaves examining task interference mechanisms and feature preservation dynamics, yet diverges by focusing specifically on distributional shift quantification rather than task-level or representation-level analysis. The 'Domain-Specific and Continual Learning Applications' branch, containing five leaves, suggests active translation of forgetting insights to specialized settings, while the paper maintains a domain-agnostic theoretical stance grounded in KL-divergence characterization.

Among the twenty-nine candidates examined, the theoretical justification for on-policy RL's KL-minimal convergence encountered two potentially refutable prior works, while the empirical forgetting law and RL's Razor principle showed no clear refutation across ten candidates each. The limited search scope means these statistics reflect top-semantic-match coverage rather than an exhaustive field review. The empirical law linking KL divergence to forgetting and the RL's Razor principle appear more novel within this candidate set, whereas the theoretical convergence claims face more substantial prior-work overlap, suggesting this contribution may build incrementally on existing RL theory.

Based on the twenty-nine-candidate search, the work appears to occupy a moderately explored analytical niche, with the empirical and conceptual contributions showing stronger novelty signals than the theoretical justification. The taxonomy structure indicates this is a growing but not yet saturated research direction, with only three papers in the immediate leaf. However, the analysis cannot assess novelty against the full literature landscape or against specialized RL theory venues not captured in this semantic search.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 2

Research Landscape Overview

Core task: catastrophic forgetting in foundation model fine-tuning. The field addresses how large pretrained models lose previously acquired capabilities when adapted to new tasks or domains. The taxonomy organizes research into several major branches: mitigation strategies via regularization and constraint (e.g., preserving important parameters or enforcing stability), parameter-efficient fine-tuning methods (such as LoRA variants that update only small subsets of weights), rehearsal and data selection approaches (replaying or curating examples from original distributions), selective parameter updates (identifying which layers or modules to modify), analysis and characterization of forgetting phenomena (understanding when and why forgetting occurs), domain-specific and continual learning applications (applying these ideas to specialized settings), surveys and benchmarks (providing broader perspectives), and alternative perspectives including unlearning. Works like Catastrophic Forgetting Multimodal [1] and Half Fine Tuning [4] illustrate how different branches tackle the problem from complementary angles, whether through architectural choices or training protocols.

A particularly active line of inquiry focuses on understanding the fundamental mechanisms and scaling behaviors of forgetting. Studies such as Scaling Laws Forgetting [3] and Scale Effect Forgetting [15] examine how model size, dataset characteristics, and training dynamics influence the severity of catastrophic forgetting, revealing that larger models or different data distributions can exhibit distinct forgetting patterns. RL's Razor [0] sits within this analytical branch, emphasizing distributional and scaling perspectives to characterize forgetting phenomena.

Compared to neighboring works like Scaling Laws Forgetting [3], which may focus on empirical scaling trends, and Scale Effect Forgetting [15], which explores the interplay between model scale and retention, RL's Razor [0] appears to offer a complementary lens on how reinforcement learning or related optimization pressures interact with forgetting dynamics. This analytical focus contrasts with the mitigation-oriented branches, highlighting ongoing questions about whether forgetting is an inevitable trade-off or a phenomenon that can be fundamentally reshaped through better understanding of model behavior and data geometry.

Claimed Contributions

Empirical forgetting law linking KL divergence to catastrophic forgetting

The authors discover that the degree of catastrophic forgetting during fine-tuning can be reliably predicted by measuring the KL divergence between the fine-tuned and base policy on the new task distribution, independent of training algorithm or hyperparameters.

10 retrieved papers
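To make the quantity this law relies on concrete, here is a minimal NumPy sketch (my own illustration, not the authors' code) that estimates the forward KL divergence KL(π_ft ∥ π_base) from next-token logits on a batch of new-task contexts. The logits are random stand-ins for real model outputs, and the small additive noise stands in for the distributional shift induced by fine-tuning.

```python
import numpy as np

def kl_divergence(p_logits, q_logits):
    """Forward KL D(p || q) between two categoricals given unnormalized logits."""
    p = np.exp(p_logits - p_logits.max()); p /= p.sum()
    q = np.exp(q_logits - q_logits.max()); q /= q.sum()
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Random stand-ins for next-token logits of the base and fine-tuned policies
# on a batch of new-task contexts (3 contexts, vocabulary of 5 tokens).
rng = np.random.default_rng(0)
base_logits = rng.normal(size=(3, 5))
ft_logits = base_logits + 0.1 * rng.normal(size=(3, 5))  # small distributional shift

# The claimed predictor of forgetting: mean KL(fine-tuned || base) on the new task.
shift = float(np.mean([kl_divergence(f, b) for f, b in zip(ft_logits, base_logits)]))
print(f"mean KL(fine-tuned || base): {shift:.4f}")
```

In the paper's framing, larger values of this quantity predict more forgetting of prior capabilities, regardless of which training algorithm or hyperparameters produced the shift.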
RL's Razor principle explaining RL's implicit KL minimization

The authors introduce RL's Razor, a principle stating that on-policy reinforcement learning methods are inherently biased toward solutions that minimize KL divergence from the base model among all high-reward solutions, unlike supervised fine-tuning which can converge to arbitrarily distant distributions.

10 retrieved papers
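The principle can be illustrated with a toy construction (mine, not the paper's): several distributions over actions all achieve near-maximal reward on the new task, but they differ in how far they sit from the base policy; RL's Razor says on-policy RL is biased toward the KL-closest one.

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) for strictly positive categoricals."""
    return float(np.sum(p * np.log(p / q)))

base = np.array([0.25, 0.25, 0.25, 0.25])  # base policy over 4 actions
# Suppose actions 0 and 1 both solve the new task; actions 2 and 3 do not.
candidates = [
    np.array([0.97, 0.01, 0.01, 0.01]),  # nearly all mass on action 0
    np.array([0.01, 0.97, 0.01, 0.01]),  # nearly all mass on action 1
    np.array([0.49, 0.49, 0.01, 0.01]),  # mass spread over both solutions
]
rewards = [p[0] + p[1] for p in candidates]  # all three ~0.98: equally good
kls = [kl(p, base) for p in candidates]      # but not equally close to base
razor_choice = int(np.argmin(kls))           # the spread solution (index 2)
print(rewards, [round(k, 3) for k in kls], razor_choice)
```

All three candidates earn the same reward, yet the spread solution has roughly half the KL of the concentrated ones. The claim is that on-policy sampling naturally steers updates toward such a solution, while SFT on a dataset demonstrating only one of the solutions can land on a concentrated, KL-distant policy.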
Theoretical justification for on-policy methods converging to KL-minimal solutions

The authors provide theoretical analysis (Theorem 5.2) showing that policy gradient methods converge to KL-minimal optimal policies within the representable family, formalizing why on-policy training naturally produces smaller distributional shifts than offline methods.

9 retrieved papers (can refute)
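A toy simulation (my own construction, under strong simplifying assumptions: a single-state bandit with a softmax policy over three actions, and the exact expected policy gradient rather than sampled updates) illustrates the claimed mechanism: on-policy policy gradient leaves the relative preference between equally rewarded actions unchanged and only suppresses the unrewarded one, while cross-entropy SFT on demonstrations that happen to use a single solution drags the policy much farther from the base.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

base_theta = np.log(np.array([0.45, 0.45, 0.10]))  # base policy over 3 actions
base = softmax(base_theta)
reward = np.array([1.0, 1.0, 0.0])  # actions 0 and 1 both solve the new task
lr, steps = 0.1, 5000

# On-policy policy gradient (exact expectation over the current policy):
# d/d(theta_i) of E_{a~pi}[r(a)] is p_i * (r_i - average reward).
theta = base_theta.copy()
for _ in range(steps):
    p = softmax(theta)
    theta += lr * p * (reward - p @ reward)
rl_policy = softmax(theta)

# SFT with cross-entropy on demonstrations that always use action 1.
theta = base_theta.copy()
for _ in range(steps):
    p = softmax(theta)
    grad = -p
    grad[1] += 1.0  # gradient of log-likelihood of the demonstrated action
    theta += lr * grad
sft_policy = softmax(theta)

kl_rl, kl_sft = kl(rl_policy, base), kl(sft_policy, base)
print(f"RL policy  {np.round(rl_policy, 3)}  KL from base: {kl_rl:.3f}")
print(f"SFT policy {np.round(sft_policy, 3)}  KL from base: {kl_sft:.3f}")
```

Both runs end with near-maximal reward on the new task, but the RL policy keeps the base model's even split between the two valid solutions, while the SFT policy collapses onto the demonstrated action and ends several times farther away in KL. This mirrors, in miniature, the behavior the paper's Theorem 5.2 is said to formalize; it is an illustration of the claim, not a substitute for the theorem or its proof.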

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Empirical forgetting law linking KL divergence to catastrophic forgetting


Contribution

RL's Razor principle explaining RL's implicit KL minimization


Contribution

Theoretical justification for on-policy methods converging to KL-minimal solutions
