Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: Reinforcement Learning for LLMs, LLM Reasoning, Efficient Reasoning, Policy Optimization
Abstract:

Large language models trained with reinforcement learning on verifiable rewards often inflate response length—trading brevity for accuracy. While longer reasoning can help on hard problems, many extra tokens are filler: verbose text making little progress. We introduce GFPO (Group Filtered Policy Optimization), which curbs this length explosion by sampling larger groups per problem and only training on responses filtered by (1) length and (2) token efficiency (reward per token). By sampling more during training time, GFPO teaches models to think less at inference time. On Phi-4-reasoning, GFPO cuts GRPO's length inflation by up to 85% across STEM and coding benchmarks (AIME 24/25, GPQA, Omni-MATH, LiveCodeBench) while preserving accuracy. We find that GFPO also outperforms Dr. GRPO in both accuracy and length reduction and generalizes across model sizes and families. We further propose Adaptive Difficulty GFPO, which allocates more training exploration to harder problems, yielding better efficiency-accuracy trade-offs on challenging questions. With only a 7% increase in training time, GFPO reduces end-to-end latency by ~30%, cutting response time on hard queries by 90 seconds. GFPO trades modest training-time increases for lasting gains in inference—an effective recipe for efficient reasoning.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's claimed tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces GFPO (Group Filtered Policy Optimization), which addresses response length inflation in RL-trained language models by filtering rollouts based on length and token efficiency before policy updates. It resides in the 'Token Efficiency and Self-Aligned Rewards' leaf under 'Length-Regularized Reward Design', sharing this leaf with only one sibling paper (Self Aligned Reward). This is a relatively sparse research direction within a broader taxonomy of 50 papers across 36 topics, suggesting the specific combination of token efficiency metrics and rollout filtering represents an emerging rather than saturated approach.

The taxonomy reveals substantial activity in neighboring areas. The parent branch 'Length-Regularized Reward Design' contains two other leaves: 'Fixed Length Penalty Approaches' (4 papers) and 'Adaptive Length Penalty Approaches' (4 papers), which modify reward functions directly rather than filtering rollouts. Adjacent branches like 'Rollout Filtering and Recomposition' (1 paper) and 'Difficulty-Aware Advantage Estimation' (1 paper) explore related filtering and reweighting strategies. The paper's dual focus on token efficiency and group-based filtering positions it at the intersection of reward design and data selection, diverging from pure penalty-based methods while sharing conceptual ground with rollout management approaches.

Among 30 candidates examined, the contribution-level analysis shows varied novelty signals. The core GFPO mechanism (10 candidates examined, 0 refutable) and token efficiency metric (10 candidates examined, 0 refutable) appear to lack direct prior overlap within the search scope. However, Adaptive Difficulty GFPO (10 candidates examined, 1 refutable) shows clearer precedent, likely overlapping with existing difficulty-aware advantage estimation work. The limited search scale means these findings reflect top-K semantic matches rather than exhaustive coverage, and the single sibling paper in the taxonomy leaf suggests the specific filtering approach may be less explored than adjacent reward-shaping methods.

Given the sparse taxonomy leaf and limited refutation signals across most contributions, the work appears to occupy a relatively novel position within the examined literature. The combination of group sampling, dual filtering criteria, and token efficiency metrics distinguishes it from both fixed penalty approaches and pure rollout recomposition methods. However, the analysis covers only 30 candidates from semantic search, leaving open the possibility of relevant work outside this scope, particularly in adjacent branches like adaptive reasoning depth or process-level reward modeling.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 30
Refutable papers: 1

Research Landscape Overview

Core task: Reducing response length inflation in reinforcement learning for language model reasoning. The field addresses a pervasive challenge in RL-tuned language models: the tendency to produce unnecessarily verbose outputs that correlate with higher rewards but not necessarily better reasoning quality.

The taxonomy organizes research into several major branches. Length-Regularized Reward Design explores explicit penalties or token-efficiency metrics to discourage verbosity, as seen in works like Controlling Reasoning Length[1] and Self Aligned Reward[29]. Difficulty-Aware Advantage Estimation and Process-Level Reward Modeling focus on finer-grained credit assignment to distinguish genuinely useful reasoning steps from filler tokens. Data Filtering and Selection for RL, along with Rollout Filtering and Recomposition, curate training signals by removing low-quality or excessively long trajectories. Fast-Slow and Adaptive Reasoning Depth methods, such as Fast Slow Thinking[5], dynamically allocate computational effort based on problem complexity. Self-Correction and Rewriting Mechanisms enable models to refine their outputs post-generation, while Multi-Stage Training Frameworks and Algorithmic Innovations in RL for Reasoning introduce novel optimization strategies. Additional branches cover inference-time interventions, compressed representations, domain-specific applications, and efficiency analysis.

A particularly active line of work examines how to balance accuracy and conciseness through reward shaping and filtering. Verbosity Compensation Behavior[2] and Disentangling Length Bias[9] analyze the root causes of length inflation, revealing spurious correlations between token count and perceived quality. Meanwhile, GRPO Lead[3] and LearnAlign[4] propose group-based or learnable reward adjustments to mitigate these biases. Group Filtered Policy[0] sits within the Length-Regularized Reward Design branch, emphasizing token efficiency and self-aligned rewards.

Compared to Self Aligned Reward[29], which learns to internalize efficiency preferences, Group Filtered Policy[0] applies explicit filtering criteria to candidate rollouts, aiming to prune verbose trajectories before they influence the policy update. This approach contrasts with adaptive-depth methods like Fast Slow Thinking[5], which modulate reasoning effort dynamically rather than filtering post hoc. The central tension across these branches remains how to preserve reasoning quality while curbing unnecessary verbosity, a trade-off that continues to drive methodological innovation.

Claimed Contributions

Group Filtered Policy Optimization (GFPO)

GFPO is a reinforcement learning method that extends GRPO by sampling more candidate responses per problem and training only on a filtered subset selected by target metrics such as length or token efficiency. This filtering acts as implicit reward shaping to reduce response verbosity while preserving accuracy.

10 retrieved papers
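The group-filtering step described above can be sketched as follows. This is a minimal illustration under assumptions: `gfpo_filter` and its arguments are hypothetical names, not the paper's implementation, and the retained indices would feed a GRPO-style update in which discarded responses contribute no gradient.

```python
def gfpo_filter(responses, rewards, k, metric="length"):
    """Keep the indices of the k responses ranked best by the target metric.

    responses: list of token sequences (e.g., lists of token ids)
    rewards:   per-response scalar rewards from the verifier
    k:         number of responses retained for the policy update
    """
    if metric == "length":
        # Shorter responses rank higher.
        scores = [-len(r) for r in responses]
    else:  # "token_efficiency": reward per token, higher is better
        scores = [rw / max(len(r), 1) for r, rw in zip(responses, rewards)]
    ranked = sorted(range(len(responses)), key=lambda i: scores[i], reverse=True)
    # Only the retained subset enters the GRPO-style advantage computation;
    # the remaining responses are simply dropped from the update.
    return ranked[:k]
```

Because the filter operates on sampled groups rather than on the reward function itself, it acts as implicit reward shaping: the policy only ever sees gradients from responses that satisfy the target metric.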
Token Efficiency metric for response filtering

The authors define token efficiency as reward divided by response length, which serves as a filtering metric in GFPO. This metric permits longer reasoning chains only when they achieve proportionately higher rewards, enabling more effective length control than filtering by length alone.

10 retrieved papers
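The metric itself is a simple ratio, but its ranking behavior is what matters: a worked example (with made-up reward and length values) shows that a correct long answer outranks an incorrect short one, while a correct short answer outranks both.

```python
def token_efficiency(reward, length):
    """Reward earned per token; higher is better."""
    return reward / max(length, 1)

short_wrong = token_efficiency(0.0, 300)    # concise but incorrect
long_right  = token_efficiency(1.0, 2000)   # verbose but correct
short_right = token_efficiency(1.0, 800)    # concise and correct

# The metric tolerates extra length only when it pays off in reward:
assert short_right > long_right > short_wrong
```

This is why the authors report more effective length control than filtering by length alone: a pure length filter would rank the 300-token incorrect answer above both correct ones.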
Adaptive Difficulty GFPO

This variant dynamically adjusts the number of retained responses based on estimated question difficulty, allocating more training signal to harder problems. It uses streaming difficulty estimates to assign different retention rates across difficulty buckets, improving the efficiency-accuracy trade-off on challenging questions.

10 retrieved papers
Can Refute
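The adaptive-retention idea can be sketched as below. The bucket boundaries, retention fractions, and helper names here are illustrative assumptions, not the paper's reported values; the point is only the mechanism of a streaming pass-rate estimate mapping to a per-question retention count.

```python
def update_difficulty(stats, question_id, solved):
    """Streaming per-question pass-rate estimate: (attempts, successes)."""
    n, s = stats.get(question_id, (0, 0))
    stats[question_id] = (n + 1, s + int(solved))

def retention_k(stats, question_id, group_size):
    """Map estimated difficulty to how many responses to retain."""
    n, s = stats.get(question_id, (0, 0))
    pass_rate = s / n if n else 0.5          # neutral prior for unseen questions
    # Harder questions (low pass rate) keep more responses, preserving
    # exploration and training signal where it is most needed.
    if pass_rate < 0.25:
        return group_size // 2               # hard bucket: retain half
    elif pass_rate < 0.75:
        return group_size // 4               # medium bucket
    return max(group_size // 8, 1)           # easy bucket: retain few
```

The design choice is that filtering pressure is strongest on easy questions, where concise correct answers are plentiful, and weakest on hard ones, improving the efficiency-accuracy trade-off exactly where accuracy is fragile.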

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Group Filtered Policy Optimization (GFPO)

Contribution

Token Efficiency metric for response filtering

Contribution

Adaptive Difficulty GFPO