TROLL: Trust Regions Improve Reinforcement Learning for Large Language Models
Overview
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce TROLL, a method that replaces PPO-style clipping with a fully differentiable trust region projection. This projection enforces per-token KL divergence constraints between successive policies by solving a convex optimization problem, providing a more principled alternative to heuristic clipping.
The authors develop a sparsification approach that retains only the most probable tokens (typically 5-10 tokens capturing over 99.999% probability mass), making the trust region projection computationally feasible for modern LLMs with vocabularies exceeding 100,000 entries while maintaining theoretical guarantees.
The authors provide comprehensive experimental evidence showing that TROLL consistently outperforms PPO-style clipping across multiple advantage estimation methods (GRPO, Dr.GRPO, GSPO, REINFORCE++), model families (Qwen, LLaMA, SmolLM, Apertus), and tasks (mathematical reasoning and code generation), achieving 3-10 percentage point improvements in success rates.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[5] ReMax: A simple, effective, and efficient reinforcement learning method for aligning large language models
[8] Secrets of RLHF in large language models part I: PPO
[37] Nested-ReFT: Efficient Reinforcement Learning for Large Language Model Fine-Tuning via Off-Policy Rollouts
Contribution Analysis
Detailed comparisons for each claimed contribution
TROLL: differentiable trust region projection for discrete distributions
The authors introduce TROLL, a method that replaces PPO-style clipping with a fully differentiable trust region projection. This projection enforces per-token KL divergence constraints between successive policies by solving a convex optimization problem, providing a more principled alternative to heuristic clipping.
[52] Ares: Multimodal adaptive reasoning via difficulty-aware token-level entropy shaping
[53] Self-alignment of large video language models with refined regularized preference optimization
[55] Reinforcement Learning based Hovering Control of a Buoyancy Driven Unmanned Underwater Vehicle with Discrete Inputs
[56] Adaptive Cruise Control Based on Safe Deep Reinforcement Learning
[59] Multi-Agent Constrained Policy Optimization for Conflict-Free Management of Connected Autonomous Vehicles at Unsignalized Intersections
[60] PPO, GAE, and KL Control for RLHF in Large Language Models: A Mathematical Reference
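To make the claimed contribution concrete, the per-token projection described above can be sketched in a few lines. This is a hypothetical illustration, not the authors' implementation: it assumes the projected distribution takes the standard geometric-interpolation form p_proj ∝ p_new^(1-η) · p_old^η (equivalently, linear interpolation in logit space), with the multiplier η found by bisection so the KL constraint is met. The function names, the bisection solver, and the stop-gradient on η are all assumptions of this sketch.

```python
import torch

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for batched categorical distributions (last dim = vocab)."""
    return (p * ((p + eps).log() - (q + eps).log())).sum(-1)

def trust_region_project(logits_new, logits_old, bound=0.05, iters=30):
    """Project each token's new distribution onto a KL ball around the old one.

    Hypothetical sketch: softmax((1 - eta) * z_new + eta * z_old) is the
    geometric interpolation p_new^(1-eta) * p_old^eta, and we bisect on
    eta in [0, 1] until KL(projected || p_old) <= bound.  For a fixed eta
    the projection stays differentiable in logits_new, which is the point
    of replacing hard clipping.
    """
    p_new = logits_new.softmax(-1)
    p_old = logits_old.softmax(-1)
    kl = kl_divergence(p_new, p_old)
    needs_proj = kl > bound          # tokens already inside the region pass through
    lo = torch.zeros_like(kl)        # eta = 0 -> pure new policy
    hi = torch.ones_like(kl)         # eta = 1 -> pure old policy (KL = 0)
    for _ in range(iters):
        eta = 0.5 * (lo + hi)
        mixed = (1 - eta).unsqueeze(-1) * logits_new + eta.unsqueeze(-1) * logits_old
        too_far = kl_divergence(mixed.softmax(-1), p_old) > bound
        lo = torch.where(too_far, eta, lo)
        hi = torch.where(too_far, hi, eta)
    eta = hi.detach()                # gradient flows through logits, not the multiplier
    proj = ((1 - eta).unsqueeze(-1) * logits_new
            + eta.unsqueeze(-1) * logits_old).softmax(-1)
    return torch.where(needs_proj.unsqueeze(-1), proj, p_new)
```

The returned distribution can then be used in place of the clipped importance ratio in the surrogate loss; unlike clipping, every token keeps a nonzero gradient toward the feasible set.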
Sparsification scheme for scaling to large vocabularies
The authors develop a sparsification approach that retains only the most probable tokens (typically 5-10 tokens capturing over 99.999% probability mass), making the trust region projection computationally feasible for modern LLMs with vocabularies exceeding 100,000 entries while maintaining theoretical guarantees.
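The sparsification step can be illustrated with a short sketch. This is an assumption-laden illustration rather than the paper's exact scheme: it keeps the smallest set of top-probability tokens whose cumulative mass exceeds a threshold and lumps the residual mass into a single tail bucket, so the reduced vector is still a valid distribution over k + 1 outcomes. The function name, the `max_k` cap, and the tail-bucket treatment are this sketch's choices.

```python
import torch

def sparsify(probs, mass=0.99999, max_k=64):
    """Reduce a full-vocabulary distribution to its dominant support.

    Sketch (not the authors' code): sort probabilities, keep the fewest
    top tokens whose cumulative mass reaches `mass` (per the paper,
    typically 5-10 tokens), and collect the leftover probability into one
    tail entry so the result still sums to 1.  The trust-region projection
    can then be solved over k + 1 entries instead of a 100k+ vocabulary.
    """
    sorted_p, idx = probs.sort(-1, descending=True)
    cum = sorted_p.cumsum(-1)
    # smallest k (shared across the batch here) reaching the target mass
    k = min(int((cum < mass).sum(-1).max().item()) + 1, max_k)
    top_p, top_idx = sorted_p[..., :k], idx[..., :k]
    tail = (1.0 - top_p.sum(-1, keepdim=True)).clamp_min(0.0)
    reduced = torch.cat([top_p, tail], dim=-1)   # shape (..., k + 1)
    return reduced, top_idx
```

Note that projecting a pair of distributions requires them to share a support; in practice one would build the kept-token set from the union of both policies' top tokens before reducing, a detail omitted from this sketch.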
Empirical validation across methods, models, and tasks
The authors provide comprehensive experimental evidence showing that TROLL consistently outperforms PPO-style clipping across multiple advantage estimation methods (GRPO, Dr.GRPO, GSPO, REINFORCE++), model families (Qwen, LLaMA, SmolLM, Apertus), and tasks (mathematical reasoning and code generation), achieving 3-10 percentage point improvements in success rates.