Escaping Policy Contraction: Contraction-Aware PPO (CaPPO) for Stable Language Model Fine-Tuning

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Policy Contraction · Proximal Policy Optimization · Large Language Models
Abstract:

Reinforcement learning from human feedback (RLHF) with proximal policy optimization (PPO) is widely used but often yields less diverse outputs than supervised fine-tuning (SFT), suggesting that the policy's support contracts during on-policy optimization. We formalize this "policy contraction" with the Support Retention Ratio (SRR), the share of SFT completions that retain non-negligible probability under the RL policy, and additionally track token entropy, Kullback–Leibler (KL) divergence to the reference policy, and repetition. We propose Contraction-Aware PPO (CaPPO), a minimum-norm multi-gradient update that co-optimizes reward, entropy, and KL, paired with a controller that steers exploration toward a target token entropy. On HH-RLHF, Summarize-from-Feedback, and UltraFeedback with Qwen2-7B, Qwen2.5-14B, Mistral-7B-Instruct, and Llama-3-8B-Instruct, CaPPO increases win rate by 2 to 4 points over PPO and improves diversity, raising SRR by 0.2 to 0.3. The gains persist under decoding sweeps and are robust to reward scaling and critic variance. By treating reward, diversity, and stability as first-class objectives, CaPPO mitigates contraction without sacrificing alignment performance.

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and claimed contributions against retrieved prior work. While the system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. The results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 2

Research Landscape Overview

Core task: Mitigating policy contraction in reinforcement learning from human feedback. The field addresses a fundamental challenge in RLHF: as policies are optimized against learned reward models, they often exhibit pathological behaviors including mode collapse, reduced output diversity, and degraded capabilities. The taxonomy organizes research into twelve major branches. Policy Stability and Collapse Prevention focuses on entropy regularization and diversity maintenance techniques to prevent the policy from collapsing to narrow distributions. Reward Model Robustness and Reliability examines how to build more trustworthy reward signals that resist exploitation. Reward Over-Optimization Control studies mechanisms to prevent policies from gaming imperfect reward models, while Alignment Tax and Capability Preservation investigates trade-offs between alignment objectives and model performance. Additional branches cover preference learning quality, training efficiency, safety constraints, theoretical foundations, alternative optimization frameworks beyond standard RLHF, domain-specific applications, calibration methods, and practical implementation considerations. Representative works such as Policy Collapse[1] and RLHF Open Problems[2] have documented these failure modes, while methods such as Auxiliary Network Stability[4] and Preference Collapse[5] propose targeted interventions.

Several active research directions reveal key tensions in the field. One line emphasizes explicit entropy management and diversity preservation, recognizing that standard KL-penalty approaches may be insufficient when policies contract toward high-reward but narrow behaviors, as documented in Diversity Collapse[7]. Another direction explores reward model uncertainty and robustness, with works like RLHF Semantic Vulnerabilities[3] showing how policies exploit model weaknesses.
CaPPO[0] sits within the policy stability branch alongside entropy-focused methods, proposing contraction-aware mechanisms to maintain policy expressiveness during optimization. Compared to neighboring approaches like M-GRPO[42], which modifies the optimization objective itself, CaPPO[0] emphasizes direct intervention in the policy update process to preserve distributional breadth. The broader challenge remains balancing alignment quality against the risk of over-optimization, with ongoing debate about whether solutions should modify rewards, constrain policy updates, or fundamentally rethink the RLHF training loop.

Claimed Contributions

Support Retention Ratio (SRR) metric

The authors formalize policy contraction by introducing the Support Retention Ratio (SRR), which measures the fraction of supervised fine-tuning completions that retain non-negligible probability under the RL policy. This metric is independent of decoding and comparable across prompts, providing a direct way to quantify support loss during on-policy optimization.
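The metric as described reduces to a simple count. The sketch below is an illustrative reconstruction, not the paper's exact definition: it scores each SFT completion under the RL policy and counts those whose mean per-token log-probability clears a negligibility threshold `tau` (the threshold value and the use of a per-token mean are assumptions here).

```python
def support_retention_ratio(sft_logprobs_rl, tau=-5.0):
    """Fraction of SFT completions that remain 'non-negligible' under the RL policy.

    sft_logprobs_rl: list of token log-probability lists, one per SFT
    completion, scored under the RL policy. A completion is counted as
    retained if its mean per-token log-prob exceeds `tau` (illustrative
    threshold; the paper's exact criterion may differ).
    """
    retained = sum(
        1 for token_logprobs in sft_logprobs_rl
        if sum(token_logprobs) / len(token_logprobs) > tau
    )
    return retained / len(sft_logprobs_rl)
```

Because SRR is computed from log-probabilities of fixed completions, it does not depend on the decoding strategy, which is what makes it comparable across prompts and checkpoints.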

2 retrieved papers
Contraction-Aware PPO (CaPPO) algorithm

The authors propose CaPPO, a minimum-norm multi-gradient update method that treats reward, entropy, and KL divergence as peer objectives rather than using fixed scalarization. It computes parameter updates that approximate Pareto-improving steps, avoiding brittle trade-offs and ensuring progress on reward does not collapse entropy or cause uncontrolled KL drift.
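A minimum-norm multi-gradient step of this kind is typically computed MGDA-style: find the minimum-norm point in the convex hull of the per-objective gradients and use it as the shared update direction. The sketch below approximates that point with Frank–Wolfe over the simplex; the gradient shapes, iteration count, and solver choice are assumptions for illustration, not CaPPO's exact procedure.

```python
import numpy as np

def min_norm_update(grads, iters=50):
    """Approximate the minimum-norm point in the convex hull of the
    objective gradients (e.g., reward, entropy, KL) via Frank-Wolfe.

    grads: list of 1-D numpy arrays, one flattened gradient per objective.
    Returns (direction, weights): the combined update direction and the
    simplex weights that produce it.
    """
    G = np.stack(grads)                    # (k, d): one gradient per row
    w = np.ones(len(grads)) / len(grads)   # start at the simplex center
    for t in range(iters):
        g = G.T @ w                        # current convex combination
        scores = G @ g                     # <g_i, g> for each objective i
        i = int(np.argmin(scores))         # vertex that most reduces the norm
        gamma = 2.0 / (t + 2.0)            # standard Frank-Wolfe step size
        e = np.zeros_like(w)
        e[i] = 1.0
        w = (1.0 - gamma) * w + gamma * e  # move toward that vertex
    return G.T @ w, w
```

When the gradients conflict, the returned direction shrinks toward the region where all objectives still improve, which is what prevents reward progress from collapsing entropy or drifting the KL term.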

10 retrieved papers
Entropy-scheduling controller

The authors develop an adaptive controller that tracks per-token entropy and dynamically adjusts the entropy coefficient during training. This controller steers exploration toward a target token entropy, complementing the multi-objective update by stabilizing entropy and preventing policy contraction without manual tuning.
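A controller of this shape can be sketched as a simple proportional loop on the entropy coefficient: raise the coefficient when measured token entropy falls below the target, lower it when entropy overshoots. The multiplicative update, gains, and clipping bounds below are assumptions for illustration, not the paper's controller.

```python
import math

class EntropyCoefController:
    """Proportional controller that nudges the entropy-bonus coefficient
    so measured per-token entropy tracks a target (illustrative sketch)."""

    def __init__(self, target_entropy, coef=0.01, lr=0.05,
                 coef_min=1e-4, coef_max=1.0):
        self.target = target_entropy
        self.coef = coef
        self.lr = lr
        self.coef_min, self.coef_max = coef_min, coef_max

    def update(self, measured_entropy):
        # Entropy below target -> positive error -> increase the coefficient
        # (more exploration); above target -> decrease it. The multiplicative
        # form keeps the coefficient strictly positive.
        error = self.target - measured_entropy
        self.coef *= math.exp(self.lr * error)
        self.coef = min(max(self.coef, self.coef_min), self.coef_max)
        return self.coef
```

Called once per training step with the batch's mean token entropy, this closes the loop without any manually scheduled coefficient decay.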

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Support Retention Ratio (SRR) metric

The authors formalize policy contraction by introducing the Support Retention Ratio (SRR), which measures the fraction of supervised fine-tuning completions that retain non-negligible probability under the RL policy. This metric is independent of decoding and comparable across prompts, providing a direct way to quantify support loss during on-policy optimization.

Contribution

Contraction-Aware PPO (CaPPO) algorithm

The authors propose CaPPO, a minimum-norm multi-gradient update method that treats reward, entropy, and KL divergence as peer objectives rather than using fixed scalarization. It computes parameter updates that approximate Pareto-improving steps, avoiding brittle trade-offs and ensuring progress on reward does not collapse entropy or cause uncontrolled KL drift.

Contribution

Entropy-scheduling controller

The authors develop an adaptive controller that tracks per-token entropy and dynamically adjusts the entropy coefficient during training. This controller steers exploration toward a target token entropy, complementing the multi-objective update by stabilizing entropy and preventing policy contraction without manual tuning.