Escaping Policy Contraction: Contraction-Aware PPO (CaPPO) for Stable Language Model Fine-Tuning
Overview
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors formalize policy contraction by introducing the Support Retention Ratio (SRR), which measures the fraction of supervised fine-tuning completions that retain non-negligible probability under the RL policy. This metric is independent of decoding and comparable across prompts, providing a direct way to quantify support loss during on-policy optimization.
The authors propose CaPPO, a minimum-norm multi-gradient update method that treats reward, entropy, and KL divergence as peer objectives rather than combining them with fixed scalarization weights. It computes parameter updates that approximate Pareto-improving steps, avoiding brittle trade-offs and ensuring that progress on reward neither collapses entropy nor causes uncontrolled KL drift.
The authors develop an adaptive controller that tracks per-token entropy and dynamically adjusts the entropy coefficient during training. This controller steers exploration toward a target token entropy, complementing the multi-objective update by stabilizing entropy and preventing policy contraction without manual tuning.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[7] The choice of divergence: A neglected key to mitigating diversity collapse in reinforcement learning with verifiable reward
[42] M-GRPO: Stabilizing Self-Supervised Reinforcement Learning for Large Language Models with Momentum-Anchored Policy Optimization
Contribution Analysis
Detailed comparisons for each claimed contribution
Support Retention Ratio (SRR) metric
The authors formalize policy contraction by introducing the Support Retention Ratio (SRR), which measures the fraction of supervised fine-tuning completions that retain non-negligible probability under the RL policy. This metric is independent of decoding and comparable across prompts, providing a direct way to quantify support loss during on-policy optimization.
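The summary states SRR's definition only informally. Under the assumption that "non-negligible probability" means an SFT completion's average per-token log-probability under the RL policy stays above a threshold tau (the threshold value and the per-token averaging are assumptions, not taken from the paper), a minimal sketch:

```python
def support_retention_ratio(sft_completions, rl_logprob, tau=-10.0):
    """Hypothetical SRR sketch: fraction of SFT completions whose
    average per-token log-probability under the RL policy exceeds
    a threshold tau (threshold and averaging are assumptions)."""
    retained = 0
    for tokens in sft_completions:
        avg_lp = rl_logprob(tokens) / max(len(tokens), 1)
        if avg_lp > tau:
            retained += 1
    return retained / len(sft_completions)
```

Because the metric depends only on the policy's log-probabilities of fixed reference completions, it is independent of the decoding strategy and can be averaged across prompts, as the summary describes.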
Contraction-Aware PPO (CaPPO) algorithm
The authors propose CaPPO, a minimum-norm multi-gradient update method that treats reward, entropy, and KL divergence as peer objectives rather than combining them with fixed scalarization weights. It computes parameter updates that approximate Pareto-improving steps, avoiding brittle trade-offs and ensuring that progress on reward neither collapses entropy nor causes uncontrolled KL drift.
[63] Fast global convergence of natural policy gradient methods with entropy regularization
[64] Fast policy extragradient methods for competitive games with entropy regularization
[65] Independent natural policy gradient methods for potential games: Finite-time global convergence with entropy regularization
[66] Fast policy learning for linear quadratic control with entropy regularization
[67] The entropy mechanism of reinforcement learning for reasoning language models
[68] Flow density control: Generative optimization beyond entropy-regularized fine-tuning
[69] Controlled decoding from language models
[70] Uncertainty-aware multi-objective reinforcement learning-guided diffusion models for 3D de novo molecular design
[71] EnTRPO: trust region policy optimization method with entropy regularization
[72] Rethinking KL regularization in RLHF: From value estimation to gradient optimization
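The summary does not spell out how the minimum-norm multi-gradient step is computed. A common instantiation (MGDA-style; the function name, iteration count, and Frank-Wolfe solver below are illustrative assumptions, not the authors' implementation) finds the minimum-norm point in the convex hull of the per-objective gradients; when that point is nonzero, the resulting direction improves all objectives simultaneously:

```python
import numpy as np

def min_norm_update(grads, iters=100):
    """Frank-Wolfe sketch of the minimum-norm element in the convex
    hull of objective gradients (MGDA-style). Returns the simplex
    weights and the combined update direction."""
    G = np.stack(grads)            # (k, d): one gradient per objective
    M = G @ G.T                    # Gram matrix of pairwise inner products
    k = len(grads)
    w = np.ones(k) / k             # start at the simplex barycenter
    for _ in range(iters):
        t = np.argmin(M @ w)       # vertex minimizing the linearized objective
        e = np.zeros(k); e[t] = 1.0
        d = e - w
        denom = d @ M @ d
        if denom <= 1e-12:         # already at the minimum-norm point
            break
        gamma = np.clip(-(w @ M @ d) / denom, 0.0, 1.0)  # exact line search
        w = w + gamma * d
    return w, w @ G
```

With conflicting gradients (e.g. reward pulling against KL), the combined direction shrinks toward zero rather than letting one objective dominate, which matches the summary's claim that reward progress cannot silently collapse entropy or blow up KL.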
Entropy-scheduling controller
The authors develop an adaptive controller that tracks per-token entropy and dynamically adjusts the entropy coefficient during training. This controller steers exploration toward a target token entropy, complementing the multi-objective update by stabilizing entropy and preventing policy contraction without manual tuning.
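The controller's exact form is not given in this summary. A plausible sketch is a proportional controller on the gap between measured and target per-token entropy, with a multiplicative update so the coefficient stays positive; the gain and clipping bounds below are assumptions for illustration:

```python
import math

def update_entropy_coef(coef, measured_entropy, target_entropy,
                        gain=0.01, coef_min=1e-4, coef_max=1.0):
    """Proportional-controller sketch (gain and bounds are assumptions):
    raise the entropy bonus when measured per-token entropy falls below
    the target, lower it when entropy overshoots."""
    error = target_entropy - measured_entropy
    new_coef = coef * math.exp(gain * error)  # multiplicative step keeps coef > 0
    return min(max(new_coef, coef_min), coef_max)
```

Called once per training step with the batch's average per-token entropy, this kind of feedback loop removes the need to hand-tune a fixed entropy coefficient, matching the "without manual tuning" claim above.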