Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: Reinforcement Learning for LLMs, LLM Reasoning, Efficient Reasoning, Policy Optimization
Abstract:

Large language models trained with reinforcement learning on verifiable rewards often inflate response length—trading brevity for accuracy. While longer reasoning can help on hard problems, many extra tokens are filler: verbose text making little progress. We introduce GFPO (Group Filtered Policy Optimization), which curbs this length explosion by sampling larger groups per problem and only training on responses filtered by (1) length and (2) token efficiency (reward per token). By sampling more during training time, GFPO teaches models to think less at inference time. On Phi-4-reasoning, GFPO cuts GRPO's length inflation by up to 85% across STEM and coding benchmarks (AIME 24/25, GPQA, Omni-MATH, LiveCodeBench) while preserving accuracy. We find that GFPO also outperforms Dr. GRPO in both accuracy and length reduction and generalizes across model sizes and families. We further propose Adaptive Difficulty GFPO, which allocates more training exploration to harder problems, yielding better efficiency-accuracy trade-offs on challenging questions. With only a 7% increase in training time, GFPO reduces end-to-end latency by ~30%, cutting response time on hard queries by 90 seconds. GFPO trades modest training-time increases for lasting gains in inference—an effective recipe for efficient reasoning.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's claimed tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces GFPO (Group Filtered Policy Optimization), which addresses response length inflation in RL-trained language models by filtering rollouts based on length and token efficiency before policy updates. It resides in the 'Token Efficiency and Self-Aligned Rewards' leaf under 'Length-Regularized Reward Design', sharing this leaf with only one sibling paper (Self Aligned Reward). This is a relatively sparse research direction within a broader taxonomy of 50 papers across 36 topics, suggesting the specific combination of token efficiency metrics and rollout filtering represents an emerging rather than saturated approach.

The taxonomy reveals substantial activity in neighboring areas. The parent branch 'Length-Regularized Reward Design' contains two other leaves: 'Fixed Length Penalty Approaches' (4 papers) and 'Adaptive Length Penalty Approaches' (4 papers), which modify reward functions directly rather than filtering rollouts. Adjacent branches like 'Rollout Filtering and Recomposition' (1 paper) and 'Difficulty-Aware Advantage Estimation' (1 paper) explore related filtering and reweighting strategies. The paper's dual focus on token efficiency and group-based filtering positions it at the intersection of reward design and data selection, diverging from pure penalty-based methods while sharing conceptual ground with rollout management approaches.

Among 30 candidates examined, the contribution-level analysis shows varied novelty signals. The core GFPO mechanism (10 candidates examined, 0 refutable) and token efficiency metric (10 candidates examined, 0 refutable) appear to lack direct prior overlap within the search scope. However, Adaptive Difficulty GFPO (10 candidates examined, 1 refutable) shows clearer precedent, likely overlapping with existing difficulty-aware advantage estimation work. The limited search scale means these findings reflect top-K semantic matches rather than exhaustive coverage, and the single sibling paper in the taxonomy leaf suggests the specific filtering approach may be less explored than adjacent reward-shaping methods.

Given the sparse taxonomy leaf and limited refutation signals across most contributions, the work appears to occupy a relatively novel position within the examined literature. The combination of group sampling, dual filtering criteria, and token efficiency metrics distinguishes it from both fixed penalty approaches and pure rollout recomposition methods. However, the analysis covers only 30 candidates from semantic search, leaving open the possibility of relevant work outside this scope, particularly in adjacent branches like adaptive reasoning depth or process-level reward modeling.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 30
Refutable papers: 1

Research Landscape Overview

Core task: Reducing response length inflation in reinforcement learning for language model reasoning. The field addresses a pervasive challenge in RL-tuned language models: the tendency to produce unnecessarily verbose outputs that correlate with higher rewards but not necessarily better reasoning quality.

The taxonomy organizes research into several major branches. Length-Regularized Reward Design explores explicit penalties or token-efficiency metrics to discourage verbosity, as seen in works like Controlling Reasoning Length[1] and Self Aligned Reward[29]. Difficulty-Aware Advantage Estimation and Process-Level Reward Modeling focus on finer-grained credit assignment to distinguish genuinely useful reasoning steps from filler tokens. Data Filtering and Selection for RL, along with Rollout Filtering and Recomposition, curate training signals by removing low-quality or excessively long trajectories. Fast-Slow and Adaptive Reasoning Depth methods, such as Fast Slow Thinking[5], dynamically allocate computational effort based on problem complexity. Self-Correction and Rewriting Mechanisms enable models to refine their outputs post-generation, while Multi-Stage Training Frameworks and Algorithmic Innovations in RL for Reasoning introduce novel optimization strategies. Additional branches cover inference-time interventions, compressed representations, domain-specific applications, and efficiency analysis.

A particularly active line of work examines how to balance accuracy and conciseness through reward shaping and filtering. Verbosity Compensation Behavior[2] and Disentangling Length Bias[9] analyze the root causes of length inflation, revealing spurious correlations between token count and perceived quality. Meanwhile, GRPO Lead[3] and LearnAlign[4] propose group-based or learnable reward adjustments to mitigate these biases. Group Filtered Policy[0] sits within the Length-Regularized Reward Design branch, emphasizing token efficiency and self-aligned rewards.

Compared to Self Aligned Reward[29], which learns to internalize efficiency preferences, Group Filtered Policy[0] applies explicit filtering criteria to candidate rollouts, aiming to prune verbose trajectories before they influence the policy update. This approach contrasts with adaptive-depth methods like Fast Slow Thinking[5], which modulate reasoning effort dynamically rather than filtering post hoc. The central tension across these branches remains how to preserve reasoning quality while curbing unnecessary verbosity, a trade-off that continues to drive methodological innovation.

Claimed Contributions

Group Filtered Policy Optimization (GFPO)

GFPO is a reinforcement learning method that extends GRPO by sampling more candidate responses per problem and training only on a filtered subset selected by target metrics such as length or token efficiency. This filtering acts as implicit reward shaping to reduce response verbosity while preserving accuracy.

10 retrieved papers
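The group-filtering step described above can be sketched as follows. This is a minimal illustration under assumptions: `gfpo_filter` and its arguments are hypothetical names, not the paper's implementation, and the retained indices would feed a GRPO-style update in which discarded responses contribute no gradient.

```python
def gfpo_filter(responses, rewards, k, metric="length"):
    """Keep the indices of the k responses ranked best by the target metric.

    responses: list of token sequences (e.g., lists of token ids)
    rewards:   per-response scalar rewards from the verifier
    k:         number of responses retained for the policy update
    """
    if metric == "length":
        # Shorter responses rank higher.
        scores = [-len(r) for r in responses]
    else:  # "token_efficiency": reward per token, higher is better
        scores = [rw / max(len(r), 1) for r, rw in zip(responses, rewards)]
    ranked = sorted(range(len(responses)), key=lambda i: scores[i], reverse=True)
    # Only the retained subset enters the GRPO-style advantage computation;
    # the remaining responses are simply dropped from the update.
    return ranked[:k]
```

Because the filter operates on sampled groups rather than on the reward function itself, it acts as implicit reward shaping: the policy only ever sees gradients from responses that satisfy the target metric.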
Token Efficiency metric for response filtering

The authors define token efficiency as reward divided by response length, which serves as a filtering metric in GFPO. This metric permits longer reasoning chains only when they achieve proportionately higher rewards, enabling more effective length control than filtering by length alone.

10 retrieved papers
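The metric itself is a simple ratio, but its ranking behavior is what matters: a worked example (with made-up reward and length values) shows that a correct long answer outranks an incorrect short one, while a correct short answer outranks both.

```python
def token_efficiency(reward, length):
    """Reward earned per token; higher is better."""
    return reward / max(length, 1)

short_wrong = token_efficiency(0.0, 300)    # concise but incorrect
long_right  = token_efficiency(1.0, 2000)   # verbose but correct
short_right = token_efficiency(1.0, 800)    # concise and correct

# The metric tolerates extra length only when it pays off in reward:
assert short_right > long_right > short_wrong
```

This is why the authors report more effective length control than filtering by length alone: a pure length filter would rank the 300-token incorrect answer above both correct ones.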
Adaptive Difficulty GFPO

This variant dynamically adjusts the number of retained responses based on estimated question difficulty, allocating more training signal to harder problems. It uses streaming difficulty estimates to assign different retention rates across difficulty buckets, improving the efficiency-accuracy trade-off on challenging questions.

10 retrieved papers
Can Refute
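The adaptive-retention idea can be sketched as below. The bucket boundaries, retention fractions, and helper names here are illustrative assumptions, not the paper's reported values; the point is only the mechanism of a streaming pass-rate estimate mapping to a per-question retention count.

```python
def update_difficulty(stats, question_id, solved):
    """Streaming per-question pass-rate estimate: (attempts, successes)."""
    n, s = stats.get(question_id, (0, 0))
    stats[question_id] = (n + 1, s + int(solved))

def retention_k(stats, question_id, group_size):
    """Map estimated difficulty to how many responses to retain."""
    n, s = stats.get(question_id, (0, 0))
    pass_rate = s / n if n else 0.5          # neutral prior for unseen questions
    # Harder questions (low pass rate) keep more responses, preserving
    # exploration and training signal where it is most needed.
    if pass_rate < 0.25:
        return group_size // 2               # hard bucket: retain half
    elif pass_rate < 0.75:
        return group_size // 4               # medium bucket
    return max(group_size // 8, 1)           # easy bucket: retain few
```

The design choice is that filtering pressure is strongest on easy questions, where concise correct answers are plentiful, and weakest on hard ones, improving the efficiency-accuracy trade-off exactly where accuracy is fragile.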

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Group Filtered Policy Optimization (GFPO)

Contribution

Token Efficiency metric for response filtering

Contribution

Adaptive Difficulty GFPO