The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Large Language Model, Reinforcement Learning with Verifiable Reward, f-divergence
Abstract:

A central paradox in fine-tuning Large Language Models (LLMs) with Reinforcement Learning with Verifiable Reward (RLVR) is the frequent degradation of multi-attempt performance (Pass@k) despite improvements in single-attempt accuracy (Pass@1). This is often accompanied by catastrophic forgetting, where models lose previously acquired skills. Despite numerous proposed methods, the community's focus on the standard reverse KL-divergence has led to a surprising oversight: the potential of alternative f-divergences as a proactive solution has been largely unexamined. We argue that standard RLVR objectives—both those using the mode-seeking reverse KL-divergence and those forgoing a divergence term entirely—lack a crucial mechanism for knowledge retention. The reverse-KL actively accelerates this decay by narrowing the policy, while its absence provides no safeguard against the model drifting from its diverse knowledge base. We propose a fundamental shift in perspective: using the divergence term itself as the solution. Our framework, Diversity-Preserving Hybrid RL (DPH-RL), leverages mass-covering f-divergences (like forward-KL and JS-divergence) to function as a 'rehearsal mechanism'. By continuously referencing the initial policy, this approach forces the model to maintain broad solution coverage. Math and SQL generation experiments show that DPH-RL both improves in-domain Pass@1 and Pass@k scores and effectively prevents catastrophic forgetting on out-of-domain tasks. Additionally, DPH-RL is more training-efficient because it computes f-divergence using generator functions, requiring only sampling from the initial policy and no online reference model. Our work highlights a crucial, overlooked axis for improving RLVR, demonstrating that the proper selection of a divergence measure is a powerful tool for building more general and diverse reasoning models.
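The abstract's claim that f-divergences can be computed from generator functions with samples drawn only from the initial policy can be illustrated with a toy sketch (the policies, sample count, and function names below are invented for illustration, not taken from the paper): writing D_f(pi || pi_ref) = E_{x~pi_ref}[f(pi(x)/pi_ref(x))], the generator f(u) = u log u recovers the reverse KL and f(u) = -log u recovers the forward KL, with no online reference model in the loop.

```python
import math
import random

random.seed(0)

# Toy categorical "policies" over four candidate solutions.
pi_ref = [0.25, 0.25, 0.25, 0.25]   # diverse initial policy
pi     = [0.70, 0.10, 0.10, 0.10]   # narrowed post-RL policy

# Generators for D_f(pi || pi_ref) = E_{x~pi_ref}[f(pi(x)/pi_ref(x))].
f_reverse_kl = lambda u: u * math.log(u)   # yields KL(pi || pi_ref)
f_forward_kl = lambda u: -math.log(u)      # yields KL(pi_ref || pi)

def f_divergence(f, n_samples=200_000):
    """Monte Carlo estimate using samples drawn only from the initial policy."""
    xs = random.choices(range(len(pi_ref)), weights=pi_ref, k=n_samples)
    return sum(f(pi[x] / pi_ref[x]) for x in xs) / n_samples

print("reverse KL estimate:", round(f_divergence(f_reverse_kl), 3))
print("forward KL estimate:", round(f_divergence(f_forward_kl), 3))
```

Because the expectation is taken under pi_ref, the reference policy's samples can be generated once before training, which is the efficiency argument the abstract makes.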

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes DPH-RL, a framework that replaces reverse-KL divergence with mass-covering f-divergences (forward-KL, JS-divergence) to mitigate diversity collapse in RLVR fine-tuning. It resides in the Alternative Divergence Measures leaf, which contains only two papers, including this one. This is a relatively sparse research direction within the broader taxonomy of 36 papers across multiple branches. The positioning suggests the paper addresses a fundamental design choice—which divergence measure to use—that has received limited direct attention despite being central to the diversity collapse problem.

The taxonomy reveals that most related work clusters in adjacent branches: Reverse-KL Analysis examines mode-seeking behavior theoretically, Entropy-Based Exploration focuses on decoding-time interventions, and Risk-Based Objectives explores distributional criteria. The Alternative Divergence Measures leaf explicitly excludes methods retaining reverse-KL or removing divergence entirely, positioning this work as exploring a distinct solution space. Neighboring leaves like Uncertainty-Aware Exploration and Process-Level Supervision address diversity through complementary mechanisms (exploration bonuses, intermediate rewards) rather than core objective redesign, suggesting the paper occupies a relatively underexplored intervention point in the causal chain leading to collapse.

Among 30 candidates examined, the systematic KL analysis contribution shows overlap with 4 papers, while the DPH-RL framework itself has 2 refutable candidates. The empirical validation contribution appears more distinctive with no clear refutations found. This pattern suggests the theoretical analysis of reverse-KL's role in collapse builds on established understanding, while the specific proposal to use forward-KL and JS-divergence as 'rehearsal mechanisms' represents a less-explored application. The limited search scope means these statistics reflect top-semantic-match overlap rather than exhaustive prior work coverage, and the sparse leaf population suggests this divergence-choice framing may be genuinely underexamined.

Given the limited 30-candidate search and the paper's position in a two-paper leaf, the work appears to address a recognized gap in how the field has approached diversity preservation. The taxonomy structure shows most solutions target downstream symptoms (exploration, credit assignment) rather than the divergence term itself, lending credibility to the paper's claim of an 'overlooked' solution space. However, the refutable overlaps indicate the core insights about reverse-KL's limitations and mass-covering alternatives have precedent, even if their systematic application to RLVR diversity collapse is less developed.

Taxonomy

Core-task Taxonomy Papers: 36
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 6

Research Landscape Overview

Core task: Mitigating diversity collapse in reinforcement learning with verifiable reward. The field addresses a fundamental challenge in RL-based fine-tuning of large models: when optimizing against verifiable rewards (such as correctness signals or automated evaluators), policies often collapse to narrow, repetitive solution modes rather than maintaining diverse high-quality outputs.

The taxonomy organizes research into several complementary directions. Divergence Regularization and Objective Design explores how different divergence measures and penalty terms can preserve exploration breadth while still improving reward. Exploration Mechanisms and Sampling Strategies focuses on decoding-time interventions and stochastic sampling to encourage variety. Credit Assignment and Reward Shaping investigates how to design reward signals that incentivize both correctness and diversity. Data Selection and Curriculum Learning examines how training data composition influences collapse tendencies. Policy Optimization Frameworks and Training Dynamics studies algorithmic modifications to standard RL procedures, while Expert Guidance and Imitation Integration considers how to blend supervised signals with RL objectives. Domain-Specific Applications and Extensions, Self-Supervised and Label-Free Methods, and Cross-Domain and Auxiliary Applications address specialized settings, and Diversity Collapse Analysis and Measurement provides diagnostic tools to quantify the phenomenon.

Several active lines of work reveal key trade-offs. One cluster emphasizes alternative divergence formulations: Reverse-KL RL[34] and related approaches explore how switching from forward-KL to reverse-KL or other measures affects mode-seeking versus mode-covering behavior, with works like KL-Regularized Mode Collapse[22] analyzing the theoretical implications.
Another thread investigates entropy-based and exploration-driven methods, such as Entropy-eliciting Explore[14] and Diversity-incentivized Exploration[6], which directly inject diversity bonuses into the objective. A third direction focuses on reward signal quality and noise handling, examining how noisy or sparse verifiers contribute to collapse. Choice of Divergence[0] sits within the Alternative Divergence Measures branch, closely aligned with Reverse-KL RL[34] in exploring how divergence choice shapes the diversity-reward trade-off. Compared to works like Low-probability Tokens[1] or Assessing Diversity Collapse[2], which emphasize measurement and diagnosis, Choice of Divergence[0] focuses on the design decision of which divergence metric to regularize against, offering a complementary lens on preventing collapse through objective engineering rather than post-hoc analysis.
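The mode-seeking versus mass-covering contrast at the center of this landscape can be seen numerically on a toy discrete example (the distributions below are invented for illustration): reverse KL assigns only a modest penalty to a policy that abandons one of two solution modes, while forward KL penalizes the same collapse far more heavily.

```python
import math

def kl(p, q):
    """KL(p || q) over a shared discrete support; terms with p_i = 0 vanish."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# A diverse reference policy spreading mass over two solution modes,
# and a collapsed policy keeping essentially only the first mode.
diverse   = [0.49, 0.49, 0.01, 0.01]
collapsed = [0.97, 0.01, 0.01, 0.01]

reverse_kl = kl(collapsed, diverse)   # mode-seeking direction
forward_kl = kl(diverse, collapsed)   # mass-covering direction

print(f"reverse KL = {reverse_kl:.3f}")   # small: dropping a mode is cheap
print(f"forward KL = {forward_kl:.3f}")   # large: the dropped mode is punished
```

Here the forward direction is more than twice the reverse, which is why a mass-covering regularizer keeps pressure on the policy to retain every mode the reference covers.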

Claimed Contributions

Systematic analysis of diversity collapse in RLVR via KL divergence

The authors analyze how the standard reverse-KL divergence causes diversity collapse in Reinforcement Learning with Verifiable Reward (RLVR). They demonstrate that its mode-seeking nature suppresses Pass@k performance, exacerbates catastrophic forgetting, and leads to poor out-of-domain generalization.

10 retrieved papers
Can Refute
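The Pass@k suppression described in this contribution is usually measured with the standard unbiased estimator, 1 - C(n-c, k)/C(n, k), from n sampled generations of which c are correct. The sketch below uses invented numbers purely to show the mechanism: a problem with a few distinct correct samples has modest Pass@1 but high Pass@k, and that headroom is exactly what a collapsed policy loses.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: probability that at least one of k
    completions drawn without replacement from n generations is correct,
    given that c of the n generations are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: with 3 correct generations out of 16, single-attempt
# accuracy is modest, but multiple attempts recover most problems.
print(pass_at_k(16, 3, 1))   # 0.1875
print(pass_at_k(16, 3, 8))   # 0.9
```

A mode-seeking policy pushes c toward 0 or n on each problem, so Pass@1 can rise while the gap between Pass@1 and Pass@k, the benefit of diversity, shrinks.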
DPH-RL framework using mass-covering f-divergences

The authors propose DPH-RL (Diversity-Preserving Hybrid RL), a framework that reframes the divergence term as an active diversity-preserving mechanism rather than just a policy constraint. It employs mass-covering f-divergences such as Forward-KL and JS-divergence to function as a rehearsal mechanism that maintains broad solution coverage.

10 retrieved papers
Can Refute
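A minimal sketch of how such a rehearsal term might be estimated (the function name, arguments, and toy numbers are hypothetical, not the paper's implementation): given log-probabilities of samples drawn from the frozen initial policy pi_ref under both policies, the likelihood ratio u = pi_theta/pi_ref feeds a generator function, f(u) = -log u for forward KL or f(u) = (u/2)log u - ((u+1)/2)log((u+1)/2) for JS divergence, so no online reference model is needed.

```python
import math

def rehearsal_penalty(logp_cur, logp_ref, kind="forward_kl"):
    """Monte Carlo f-divergence from samples of the *initial* policy.

    logp_cur / logp_ref: log-probs of those samples under the current
    policy pi_theta and the frozen initial policy pi_ref.
    """
    us = [math.exp(lc - lr) for lc, lr in zip(logp_cur, logp_ref)]
    if kind == "forward_kl":   # f(u) = -log u  ->  KL(pi_ref || pi_theta)
        vals = [-math.log(u) for u in us]
    elif kind == "js":         # f(u) = (u/2)log u - ((u+1)/2)log((u+1)/2)
        vals = [0.5 * u * math.log(u) - 0.5 * (u + 1) * math.log((u + 1) / 2)
                for u in us]
    else:
        raise ValueError(f"unknown divergence: {kind}")
    return sum(vals) / len(vals)

# Toy example: uniform pi_ref over four solutions, narrowed pi_theta.
logp_ref = [math.log(0.25)] * 4
logp_cur = [math.log(p) for p in (0.70, 0.10, 0.10, 0.10)]
print(round(rehearsal_penalty(logp_cur, logp_ref, "forward_kl"), 4))
print(round(rehearsal_penalty(logp_cur, logp_ref, "js"), 4))
```

Added to the reward objective with some weight, a penalty of this shape grows whenever pi_theta starves a region that pi_ref covers, which is the "rehearsal" pressure the contribution describes.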
Extensive empirical validation across models and reasoning tasks

The authors conduct extensive experiments across multiple model sizes (7B to 32B parameters) and reasoning domains (mathematics and SQL) to validate DPH-RL. They demonstrate consistent improvements in both in-domain and out-of-domain benchmarks while mitigating the trade-off between greedy performance and solution diversity.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Systematic analysis of diversity collapse in RLVR via KL divergence


Contribution

DPH-RL framework using mass-covering f-divergences


Contribution

Extensive empirical validation across models and reasoning tasks
