Escaping Policy Contraction: Contraction-Aware PPO (CaPPO) for Stable Language Model Fine-Tuning

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Policy Contraction · Proximal Policy Optimization · Large Language Models
Abstract:

Reinforcement learning from human feedback (RLHF) with proximal policy optimization (PPO) is widely used but often yields less diverse outputs than supervised fine-tuning (SFT), suggesting that the policy's support contracts during on-policy optimization. We formalize this "policy contraction" with the Support Retention Ratio (SRR), the share of SFT completions that retain non-negligible probability under the RL policy, and additionally track token entropy, Kullback–Leibler (KL) divergence to the reference policy, and repetition. We propose Contraction-Aware PPO (CaPPO), a minimum-norm multi-gradient update that co-optimizes reward, entropy, and KL, paired with a controller that steers exploration toward a target token entropy. On HH-RLHF, Summarize-from-Feedback, and UltraFeedback with Qwen2-7B, Qwen2.5-14B, Mistral-7B-Instruct, and Llama-3-8B-Instruct, CaPPO increases win rate by 2 to 4 points over PPO and improves diversity, raising SRR by 0.2 to 0.3. The gains persist under decoding sweeps and are robust to reward scaling and critic variance. By treating reward, diversity, and stability as first-class objectives, CaPPO mitigates contraction without sacrificing alignment performance.

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and claimed contributions against retrieved prior work. While the system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. The results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 2

Research Landscape Overview

Core task: Mitigating policy contraction in reinforcement learning from human feedback. The field addresses a fundamental challenge in RLHF: as policies are optimized against learned reward models, they often exhibit pathological behaviors including mode collapse, reduced output diversity, and degraded capabilities. The taxonomy organizes research into twelve major branches. Policy Stability and Collapse Prevention focuses on entropy regularization and diversity maintenance techniques to prevent the policy from collapsing to narrow distributions. Reward Model Robustness and Reliability examines how to build more trustworthy reward signals that resist exploitation. Reward Over-Optimization Control studies mechanisms to prevent policies from gaming imperfect reward models, while Alignment Tax and Capability Preservation investigates trade-offs between alignment objectives and model performance. Additional branches cover preference learning quality, training efficiency, safety constraints, theoretical foundations, alternative optimization frameworks beyond standard RLHF, domain-specific applications, calibration methods, and practical implementation considerations. Representative works such as Policy Collapse[1] and RLHF Open Problems[2] have documented these failure modes, while methods such as Auxiliary Network Stability[4] and Preference Collapse[5] propose targeted interventions.

Several active research directions reveal key tensions in the field. One line emphasizes explicit entropy management and diversity preservation, recognizing that standard KL-penalty approaches may be insufficient when policies contract toward high-reward but narrow behaviors, as documented in Diversity Collapse[7]. Another direction explores reward model uncertainty and robustness, with works like RLHF Semantic Vulnerabilities[3] showing how policies exploit model weaknesses.
CaPPO[0] sits within the policy stability branch alongside entropy-focused methods, proposing contraction-aware mechanisms to maintain policy expressiveness during optimization. Compared to neighboring approaches like M-GRPO[42], which modifies the optimization objective itself, CaPPO[0] emphasizes direct intervention in the policy update process to preserve distributional breadth. The broader challenge remains balancing alignment quality against the risk of over-optimization, with ongoing debate about whether solutions should modify rewards, constrain policy updates, or fundamentally rethink the RLHF training loop.

Claimed Contributions

Support Retention Ratio (SRR) metric

The authors formalize policy contraction by introducing the Support Retention Ratio (SRR), which measures the fraction of supervised fine-tuning completions that retain non-negligible probability under the RL policy. This metric is independent of decoding and comparable across prompts, providing a direct way to quantify support loss during on-policy optimization.
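The metric as described reduces to a simple count. The sketch below is an illustrative reconstruction, not the paper's exact definition: it scores each SFT completion under the RL policy and counts those whose mean per-token log-probability clears a negligibility threshold `tau` (the threshold value and the use of a per-token mean are assumptions here).

```python
def support_retention_ratio(sft_logprobs_rl, tau=-5.0):
    """Fraction of SFT completions that remain 'non-negligible' under the RL policy.

    sft_logprobs_rl: list of token log-probability lists, one per SFT
    completion, scored under the RL policy. A completion is counted as
    retained if its mean per-token log-prob exceeds `tau` (illustrative
    threshold; the paper's exact criterion may differ).
    """
    retained = sum(
        1 for token_logprobs in sft_logprobs_rl
        if sum(token_logprobs) / len(token_logprobs) > tau
    )
    return retained / len(sft_logprobs_rl)
```

Because SRR is computed from log-probabilities of fixed completions, it does not depend on the decoding strategy, which is what makes it comparable across prompts and checkpoints.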

2 retrieved papers
Contraction-Aware PPO (CaPPO) algorithm

The authors propose CaPPO, a minimum-norm multi-gradient update method that treats reward, entropy, and KL divergence as peer objectives rather than using fixed scalarization. It computes parameter updates that approximate Pareto-improving steps, avoiding brittle trade-offs and ensuring progress on reward does not collapse entropy or cause uncontrolled KL drift.
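A minimum-norm multi-gradient step of this kind is typically computed MGDA-style: find the minimum-norm point in the convex hull of the per-objective gradients and use it as the shared update direction. The sketch below approximates that point with Frank–Wolfe over the simplex; the gradient shapes, iteration count, and solver choice are assumptions for illustration, not CaPPO's exact procedure.

```python
import numpy as np

def min_norm_update(grads, iters=50):
    """Approximate the minimum-norm point in the convex hull of the
    objective gradients (e.g., reward, entropy, KL) via Frank-Wolfe.

    grads: list of 1-D numpy arrays, one flattened gradient per objective.
    Returns (direction, weights): the combined update direction and the
    simplex weights that produce it.
    """
    G = np.stack(grads)                    # (k, d): one gradient per row
    w = np.ones(len(grads)) / len(grads)   # start at the simplex center
    for t in range(iters):
        g = G.T @ w                        # current convex combination
        scores = G @ g                     # <g_i, g> for each objective i
        i = int(np.argmin(scores))         # vertex that most reduces the norm
        gamma = 2.0 / (t + 2.0)            # standard Frank-Wolfe step size
        e = np.zeros_like(w)
        e[i] = 1.0
        w = (1.0 - gamma) * w + gamma * e  # move toward that vertex
    return G.T @ w, w
```

When the gradients conflict, the returned direction shrinks toward the region where all objectives still improve, which is what prevents reward progress from collapsing entropy or drifting the KL term.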

10 retrieved papers
Entropy-scheduling controller

The authors develop an adaptive controller that tracks per-token entropy and dynamically adjusts the entropy coefficient during training. This controller steers exploration toward a target token entropy, complementing the multi-objective update by stabilizing entropy and preventing policy contraction without manual tuning.
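A controller of this shape can be sketched as a simple proportional loop on the entropy coefficient: raise the coefficient when measured token entropy falls below the target, lower it when entropy overshoots. The multiplicative update, gains, and clipping bounds below are assumptions for illustration, not the paper's controller.

```python
import math

class EntropyCoefController:
    """Proportional controller that nudges the entropy-bonus coefficient
    so measured per-token entropy tracks a target (illustrative sketch)."""

    def __init__(self, target_entropy, coef=0.01, lr=0.05,
                 coef_min=1e-4, coef_max=1.0):
        self.target = target_entropy
        self.coef = coef
        self.lr = lr
        self.coef_min, self.coef_max = coef_min, coef_max

    def update(self, measured_entropy):
        # Entropy below target -> positive error -> increase the coefficient
        # (more exploration); above target -> decrease it. The multiplicative
        # form keeps the coefficient strictly positive.
        error = self.target - measured_entropy
        self.coef *= math.exp(self.lr * error)
        self.coef = min(max(self.coef, self.coef_min), self.coef_max)
        return self.coef
```

Called once per training step with the batch's mean token entropy, this closes the loop without any manually scheduled coefficient decay.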

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Support Retention Ratio (SRR) metric

The authors formalize policy contraction by introducing the Support Retention Ratio (SRR), which measures the fraction of supervised fine-tuning completions that retain non-negligible probability under the RL policy. This metric is independent of decoding and comparable across prompts, providing a direct way to quantify support loss during on-policy optimization.

Contribution

Contraction-Aware PPO (CaPPO) algorithm

The authors propose CaPPO, a minimum-norm multi-gradient update method that treats reward, entropy, and KL divergence as peer objectives rather than using fixed scalarization. It computes parameter updates that approximate Pareto-improving steps, avoiding brittle trade-offs and ensuring progress on reward does not collapse entropy or cause uncontrolled KL drift.

Contribution

Entropy-scheduling controller

The authors develop an adaptive controller that tracks per-token entropy and dynamically adjusts the entropy coefficient during training. This controller steers exploration toward a target token entropy, complementing the multi-objective update by stabilizing entropy and preventing policy contraction without manual tuning.