TROLL: Trust Regions Improve Reinforcement Learning for Large Language Models

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: RL from verifiable rewards, Finetuning LLMs, Trust Regions
Abstract:

Reinforcement Learning (RL) with PPO-like clip objectives has become the standard choice for reward-based fine-tuning of large language models (LLMs). Although recent work has explored improved estimators of advantages and normalization, the clipping mechanism itself has remained untouched. Originally introduced as a proxy for principled KL-based trust regions, clipping is a crude approximation that often causes unstable updates and suboptimal performance. We replace the clip objective with a novel discrete differentiable trust region projection, which provides principled token-level KL constraints. The projection operates on a sparse subset of the model’s most important token logits to balance computational cost and projection effectiveness. Our approach, Trust Region Optimization for Large Language Models (TROLL), serves as a direct replacement for PPO-like clipping during training and does not alter the model’s inference behavior. Across mathematical reasoning and code generation tasks, model families, and advantage-estimation methods, TROLL consistently outperforms PPO-like clipping in terms of training speed, stability, and final success rates.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 14
Refutable Papers: 0

Research Landscape Overview

Core task: reinforcement learning for large language model fine-tuning. The field has organized itself into several major branches that reflect both algorithmic innovation and practical deployment concerns. Core RL Algorithms and Optimization Methods focus on policy optimization techniques—including trust region methods like those explored in Secrets PPO[8] and Remax[5]—that ensure stable updates when fine-tuning large models. Reward and Preference Learning addresses how to elicit and model human preferences, with works ranging from direct preference optimization (Direct Preference[18]) to active querying strategies (Active Preference[13]) and robust reward modeling (Secrets Reward[16]). Application Domains and Task-Specific Adaptations demonstrate the breadth of deployment scenarios, from conversational assistants (OpenAssistant Conversations[15]) to specialized domains like finance (Fin-R1[42]) and code security (Code Security[34]). Training Infrastructure and Efficiency tackle the computational challenges of scaling RL to billion-parameter models, while Theoretical Foundations and Surveys (Technical Survey[2], Enhanced Survey[12]) provide conceptual grounding, and Auxiliary Techniques explore complementary methods such as prompt optimization and self-adaptation.

Within the policy optimization landscape, a central tension emerges between sample efficiency and stability: some methods prioritize tight trust regions to prevent catastrophic forgetting, while others explore offline regularization (Offline Regularised[4]) or efficient prior incorporation (Efficient Priors[3]) to reduce the need for extensive online rollouts. TROLL[0] situates itself squarely in this trust region tradition, emphasizing controlled policy updates akin to the principles underlying Remax[5] and the practical insights from Secrets PPO[8].
Compared to Nested-ReFT[37], which explores nested representations for parameter-efficient tuning, TROLL[0] focuses more directly on the optimization dynamics that govern how far a policy can safely deviate from its initialization. This positioning reflects an ongoing debate in the community: whether the key to effective LLM fine-tuning lies in algorithmic safeguards during optimization or in architectural choices that constrain the hypothesis space from the outset.

Claimed Contributions

TROLL: differentiable trust region projection for discrete distributions

The authors introduce TROLL, a method that replaces PPO-style clipping with a fully differentiable trust region projection. This projection enforces per-token KL divergence constraints between successive policies by solving a convex optimization problem, providing a more principled alternative to heuristic clipping.

6 retrieved papers
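The report only sketches the projection at a high level. As an illustrative approximation of a per-token KL trust region (not the paper's actual convex solver, whose exact formulation is not given here), one can interpolate between the old and new logits and bisect for the largest step that still satisfies the KL bound; all function names and the interpolation scheme below are assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    # KL(p || q) for dense probability vectors over the same support
    return float(np.sum(p * (np.log(p) - np.log(q))))

def project_into_trust_region(new_logits, old_logits, eps, iters=50):
    """Illustrative per-token projection: find the largest alpha in [0, 1]
    such that KL(softmax(mix) || old) <= eps, where
    mix = (1 - alpha) * old_logits + alpha * new_logits."""
    old = softmax(old_logits)
    if kl(softmax(new_logits), old) <= eps:
        return softmax(new_logits)  # update already inside the trust region
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(softmax((1 - mid) * old_logits + mid * new_logits), old) <= eps:
            lo = mid  # feasible: try a larger step
        else:
            hi = mid  # infeasible: shrink the step
    return softmax((1 - lo) * old_logits + lo * new_logits)
```

Unlike PPO-style clipping, which zeroes gradients per token once the ratio leaves the clip range, a projection of this kind keeps the update inside the KL ball while remaining differentiable in the logits.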
Sparsification scheme for scaling to large vocabularies

The authors develop a sparsification approach that retains only the most probable tokens (typically 5-10 tokens capturing over 99.999% probability mass), making the trust region projection computationally feasible for modern LLMs with vocabularies exceeding 100,000 entries while maintaining theoretical guarantees.

2 retrieved papers
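A top-mass truncation of the kind described above can be sketched as follows; the function name, threshold, and renormalization step are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def sparsify_top_mass(probs, mass=0.99999):
    """Keep the smallest set of highest-probability tokens whose
    cumulative mass reaches `mass`, then renormalize on that support.
    Illustrative sketch only."""
    order = np.argsort(probs)[::-1]           # token ids, most probable first
    csum = np.cumsum(probs[order])
    k = int(np.searchsorted(csum, mass)) + 1  # smallest k covering `mass`
    idx = order[:k]
    kept = probs[idx] / probs[idx].sum()      # renormalized sparse distribution
    return idx, kept
```

On a peaked LLM next-token distribution this typically keeps only a handful of ids out of a 100,000+ entry vocabulary, so a KL projection then only has to operate on those few retained entries.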
Empirical validation across methods, models, and tasks

The authors provide comprehensive experimental evidence showing that TROLL consistently outperforms PPO-style clipping across multiple advantage estimation methods (GRPO, Dr.GRPO, GSPO, REINFORCE++), model families (Qwen, LLaMA, SmolLM, Apertus), and tasks (mathematical reasoning and code generation), achieving 3-10 percentage point improvements in success rates.

6 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

TROLL: differentiable trust region projection for discrete distributions

The authors introduce TROLL, a method that replaces PPO-style clipping with a fully differentiable trust region projection. This projection enforces per-token KL divergence constraints between successive policies by solving a convex optimization problem, providing a more principled alternative to heuristic clipping.

Contribution

Sparsification scheme for scaling to large vocabularies

The authors develop a sparsification approach that retains only the most probable tokens (typically 5-10 tokens capturing over 99.999% probability mass), making the trust region projection computationally feasible for modern LLMs with vocabularies exceeding 100,000 entries while maintaining theoretical guarantees.

Contribution

Empirical validation across methods, models, and tasks

The authors provide comprehensive experimental evidence showing that TROLL consistently outperforms PPO-style clipping across multiple advantage estimation methods (GRPO, Dr.GRPO, GSPO, REINFORCE++), model families (Qwen, LLaMA, SmolLM, Apertus), and tasks (mathematical reasoning and code generation), achieving 3-10 percentage point improvements in success rates.