RL's Razor: Why Online Reinforcement Learning Forgets Less
Overview
Overall Novelty Assessment
The paper introduces RL's Razor, a principle explaining why reinforcement learning fine-tuning preserves prior knowledge better than supervised fine-tuning: on-policy RL is implicitly biased toward solutions with minimal KL divergence from the base model. It resides in the 'Scaling Laws and Distributional Analysis' leaf alongside two sibling papers examining forgetting through scaling and distributional lenses. This leaf sits within the broader 'Analysis and Characterization of Forgetting Phenomena' branch, which contains four leaves spanning scaling laws, multimodal forgetting, task interference, and feature preservation. The analytical focus distinguishes this work from the field's dominant mitigation-oriented branches.
The taxonomy reveals substantial activity in mitigation strategies, with three major branches dedicated to regularization, parameter-efficient fine-tuning, and rehearsal methods. The paper's analytical positioning connects it to neighboring leaves examining task interference mechanisms and feature preservation dynamics, yet the paper diverges by focusing specifically on quantifying distributional shift rather than on task-level or representation-level analysis. The 'Domain-Specific and Continual Learning Applications' branch, containing five leaves, suggests active translation of forgetting insights to specialized settings, while the paper maintains a domain-agnostic theoretical stance grounded in KL-divergence characterization.
Among the thirty candidates examined, the theoretical justification for on-policy RL's KL-minimal convergence faced two prior works that could potentially refute it, while the empirical forgetting law and the RL's Razor principle showed no clear refutation across ten candidates each. Because the search was limited in scope, these statistics reflect top-semantic-match coverage rather than an exhaustive field review. The empirical law linking KL divergence to forgetting and the RL's Razor principle therefore appear more novel within this candidate set, whereas the theoretical convergence claim overlaps more substantially with prior work, suggesting that this contribution may build incrementally on existing RL theory.
Based on the thirty-candidate search, the work appears to occupy a moderately explored analytical niche, with the empirical and conceptual contributions showing stronger novelty signals than the theoretical justification. The taxonomy structure indicates this is a growing but not yet saturated research direction, with only three papers in the immediate leaf. However, the analysis cannot assess novelty against the full literature landscape or specialized RL theory venues not captured in this semantic search.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors discover that the degree of catastrophic forgetting during fine-tuning can be reliably predicted by measuring the KL divergence between the fine-tuned and base policy on the new task distribution, independent of training algorithm or hyperparameters.
The authors introduce RL's Razor, a principle stating that on-policy reinforcement learning methods are inherently biased toward solutions that minimize KL divergence from the base model among all high-reward solutions, unlike supervised fine-tuning which can converge to arbitrarily distant distributions.
The authors provide theoretical analysis (Theorem 5.2) showing that policy gradient methods converge to KL-minimal optimal policies within the representable family, formalizing why on-policy training naturally produces smaller distributional shifts than offline methods.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
Contribution Analysis
Detailed comparisons for each claimed contribution
Empirical forgetting law linking KL divergence to catastrophic forgetting
The authors discover that the degree of catastrophic forgetting during fine-tuning can be reliably predicted by measuring the KL divergence between the fine-tuned and base policy on the new task distribution, independent of training algorithm or hyperparameters.
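This empirical law can be illustrated with a small hedged sketch. The tabular "policies" and numbers below are invented for illustration; the paper evaluates full language-model policies on new-task prompts, and the exact KL estimator and direction follow the paper's definition. The point shown is only that two fine-tuned policies can be ranked by their measured KL from the base policy, independent of which algorithm produced them.

```python
import math

def kl_divergence(p_ft, p_base):
    # KL(p_ft || p_base) for small categorical "policies" over answers.
    # For real LLM policies this expectation would instead be estimated by
    # sampling completions on new-task prompts and averaging log-prob gaps.
    return sum(p * math.log(p / p_base[tok]) for tok, p in p_ft.items() if p > 0)

# Hypothetical base policy and two fine-tuned policies on one new-task prompt.
base   = {"a": 0.50, "b": 0.30, "c": 0.20}
ft_rl  = {"a": 0.60, "b": 0.25, "c": 0.15}   # small distributional shift
ft_sft = {"a": 0.98, "b": 0.01, "c": 0.01}   # large distributional shift

# The claimed law: the run with larger KL from the base policy is
# predicted to forget more, regardless of the training algorithm.
print(kl_divergence(ft_rl, base) < kl_divergence(ft_sft, base))  # prints True
```

The comparison, not the absolute KL values, is what the claimed law uses: KL measured on the new-task distribution acts as a single predictor of forgetting.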
[60] Solving the catastrophic forgetting problem in generalized category discovery
[61] Overcoming catastrophic forgetting by Bayesian generative regularization
[62] Context-Free Synthetic Data Mitigates Forgetting
[63] Reducing catastrophic forgetting in neural networks via Gaussian mixture approximation
[64] Continual lifelong learning in neural systems: overcoming catastrophic forgetting and transferring knowledge for future learning
[65] MoL for LLMs: Dual-Loss Optimization to Enhance Domain Expertise While Preserving General Capabilities
[66] Distance-based weight transfer for fine-tuning from near-field to far-field speaker verification
[67] Using personalized speech synthesis and neural language generator for rapid speaker adaptation
[68] CPR: Classifier-Projection Regularization for Continual Learning
[69] Continual learning: overcoming catastrophic forgetting in neural networks - a survey
RL's Razor principle explaining RL's implicit KL minimization
The authors introduce RL's Razor, a principle stating that on-policy reinforcement learning methods are inherently biased toward solutions that minimize KL divergence from the base model among all high-reward solutions, unlike supervised fine-tuning which can converge to arbitrarily distant distributions.
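In symbols, the principle can be sketched as follows. This is a hedged paraphrase: the notation $\pi_0$ (base model), $J$ (expected reward on the new task), and $\Pi^{*}$ (the set of high-reward policies) is ours, and the KL direction shown is an assumption; the paper's formal statement may differ.

```latex
% Among all policies that solve the new task, on-policy RL is biased
% toward the one closest in KL to the base model pi_0:
\pi_{\mathrm{RL}} \;\approx\; \operatorname*{arg\,min}_{\pi \in \Pi^{*}}
  \mathrm{KL}\!\left(\pi \,\|\, \pi_{0}\right),
\qquad
\Pi^{*} \;=\; \Bigl\{\pi : J(\pi) = \max_{\pi'} J(\pi')\Bigr\}.
% SFT, by contrast, imitates whatever target distribution it is given,
% so it carries no such bias and can land arbitrarily far from pi_0.
```

The "razor" framing comes from this selection effect: reward ties are broken in favor of the KL-nearest solution.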
[70] Categorical distributional reinforcement learning with Kullback-Leibler divergence: convergence and asymptotics
[71] Efficient Deep Reinforcement Learning With Imitative Expert Priors for Autonomous Driving
[72] The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
[73] A survey on constraining policy updates using the KL divergence
[74] BOND: Aligning LLMs with best-of-n distillation
[75] Beyond Reverse KL: Generalizing Direct Preference Optimization with Diverse Divergence Constraints
[76] Robust Offline Reinforcement Learning with Linearly Structured f-Divergence Regularization
[77] Aligning language models with preferences through f-divergence minimization
[78] DiffPPO: Reinforcement Learning Fine-Tuning of Diffusion Models for Text-to-Image Generation
[79] Iterative preference learning from human feedback: bridging theory and practice for RLHF under KL-constraint
Theoretical justification for on-policy methods converging to KL-minimal solutions
The authors provide theoretical analysis (Theorem 5.2) showing that policy gradient methods converge to KL-minimal optimal policies within the representable family, formalizing why on-policy training naturally produces smaller distributional shifts than offline methods.
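A hedged tabular sketch of the intuition behind this convergence claim (the four-answer toy below is invented for illustration and is not the paper's proof of Theorem 5.2): with binary rewards, the exact on-policy fixed point in the tabular case is the base policy conditioned on the rewarded answers, and that conditional distribution is precisely the KL-minimal policy among all policies supported on the rewarded set.

```python
import math

def kl(p, q):
    # KL(p || q) for categorical distributions given as lists.
    return sum(pi * math.log(pi / q[i]) for i, pi in enumerate(p) if pi > 0)

base = [0.4, 0.3, 0.2, 0.1]   # base policy over four candidate answers
rewarded = {1, 2}             # answers that earn reward 1 on the new task

# Tabular on-policy fixed point: zero out unrewarded mass, renormalize.
# This is base conditioned on the rewarded set.
mass = sum(base[i] for i in rewarded)
pi_rl = [base[i] / mass if i in rewarded else 0.0 for i in range(len(base))]

# Any other optimal (reward-1) policy sits at least as far from base in KL,
# so pi_rl is the KL-minimal point of the optimal set.
alternatives = [
    [0.0, 0.9, 0.1, 0.0],
    [0.0, 0.2, 0.8, 0.0],
    [0.0, 1.0, 0.0, 0.0],
]
print(all(kl(pi_rl, base) <= kl(alt, base) for alt in alternatives))  # prints True
```

An SFT target, by contrast, is free to place all its mass on one rewarded answer (the third alternative above), which solves the task equally well but lands much farther from the base policy.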