Overthinking Reduction with Decoupled Rewards and Curriculum Data Scheduling

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: efficient reasoning; curriculum sampling with decoupled reward
Abstract:

While large reasoning models trained with critic-free reinforcement learning and verifiable rewards (RLVR) represent the state of the art, their practical utility is hampered by "overthinking": models generate excessively long reasoning paths without any performance benefit. Existing solutions that penalize length often fail, degrading performance due to a fundamental misalignment between trajectory-level rewards and token-level optimization. In this work, we introduce DECS, a framework built on our identification of two previously unaddressed flaws in current length rewards: (1) the erroneous penalization of essential exploratory tokens and (2) the inadvertent rewarding of partial redundancy. The framework comprises (i) a decoupled token-level reward mechanism that distinguishes and penalizes redundant tokens, and (ii) a curriculum batch scheduling strategy that maintains the efficiency-efficacy equilibrium. Experimental results show that DECS reduces reasoning tokens by over 50% across seven benchmarks while maintaining or even improving performance, demonstrating that substantial gains in reasoning efficiency can be achieved without compromising a model's underlying reasoning power.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces DECS, a framework addressing overthinking in large reasoning models through decoupled token-level rewards and curriculum batch scheduling. It resides in the 'Reward Engineering for Efficiency' leaf under 'Training-Based Overthinking Mitigation', alongside three sibling papers (REA-RL Reflection-Aware, Reasoning Shaping, and one other). This leaf represents a moderately populated research direction within a taxonomy of 50 papers across approximately 36 topics, indicating focused but not overcrowded attention to reward-based training solutions for reasoning efficiency.

The taxonomy reveals that reward engineering sits within a broader training-based mitigation branch, distinct from inference-time methods (e.g., early exit mechanisms, reasoning compression) and adaptive control approaches (e.g., difficulty-adaptive allocation). Neighboring leaves include 'Data-Centric Training Strategies' and 'Reasoning Pattern Guidance', which address efficiency through curated datasets and modular reasoning supervision respectively. DECS diverges from these by focusing on token-level reward decomposition rather than data curation or pattern-level guidance, positioning it as a training objective innovation rather than an architectural or data-driven solution.

Among 25 candidates examined across three contributions, no clearly refuting prior work was identified. The theoretical analysis of length-based reward misalignment was compared against 10 candidates with zero refutations, suggesting this specific framing may be novel within the limited search scope. The DECS framework and curriculum scheduling contributions were checked against 5 and 10 candidates, respectively, also without refutation. These statistics indicate that within the top-K semantic matches explored, the paper's specific combination of token-level reward decoupling and curriculum strategies appears distinct from existing reward engineering approaches.

Based on the limited literature search of 25 candidates, the work appears to offer a fresh perspective within reward engineering for reasoning efficiency. However, the analysis does not cover prior work exhaustively beyond top-K semantic matches and citation expansion. The taxonomy context suggests the paper contributes to an active but not saturated research direction, with its token-level reward decomposition distinguishing it from sibling works that may employ trajectory-level or step-level reward mechanisms.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Papers: 0

Research Landscape Overview

Core task: reducing overthinking in large reasoning models. The field has organized itself around a diverse set of strategies that span detection, control, training, and inference-time optimization. At the highest level, one branch focuses on Overthinking Detection and Analysis, identifying when models expend unnecessary computation, while Adaptive Reasoning Control and Training-Based Overthinking Mitigation develop mechanisms to modulate reasoning depth either during training or via reward engineering. Inference-Time Optimization and System-Level Inference Optimization address runtime efficiency through techniques like early exit and dynamic batching, and Efficiency Enhancement via Model Architecture explores structural changes such as layer pruning. Additional branches cover Context and Input Optimization, Domain-Specific and Application-Oriented Efficiency, Prompting and In-Context Learning for Efficiency, and cross-cutting themes in Comprehensive Surveys and Frameworks, Emerging and Cross-Cutting Approaches, Safety and Robustness Considerations, and Auxiliary System Components. Representative works illustrate these directions: Stop Overthinking Survey[5] and Reasoning Economy Survey[24] provide broad overviews, while Difficulty-Adaptive Slow-Thinking[3] and Dynamic Early Exit[6] exemplify adaptive control and inference-time methods.

Within this landscape, a particularly active line of work centers on training-based reward engineering, where models learn to balance reasoning depth against computational cost. Overthinking Reduction[0] sits squarely in this cluster, emphasizing reward signals that discourage excessive deliberation during training. Nearby, REA-RL Reflection-Aware[19] and Reasoning Shaping[32] also leverage reinforcement learning to guide reasoning efficiency, while SmartThinker Step-Level Control[45] introduces finer-grained step-level interventions.
These approaches contrast with inference-time methods like Dynamic Early Exit[6] or training-free techniques such as ThinkLess Training-Free[49], which avoid modifying the training objective. The central trade-off across these branches is whether to bake efficiency into the model's learned behavior or to impose it dynamically at test time. Overthinking Reduction[0] aligns closely with the former philosophy, sharing the reward-engineering emphasis of works like REA-RL[19] but differing in the specific signals used to penalize overthinking, thus contributing a distinct perspective on how to shape reasoning economy during the learning phase.

Claimed Contributions

Theoretical analysis of misalignment in length-based rewards

The authors provide a theoretical analysis revealing two critical flaws in existing length reward mechanisms: the erroneous penalization of essential exploratory high-entropy tokens and the inadvertent rewarding of partial redundancy. This misalignment between trajectory-level rewards and token-level optimization is shown to degrade reasoning performance and limit efficiency gains.
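
The misalignment described above can be made concrete with a small numeric sketch. The reward form and values below are illustrative assumptions, not the paper's exact formulation:

```python
# Hypothetical illustration of why a trajectory-level length penalty misaligns
# with token-level credit assignment in critic-free RLVR. The penalty
# coefficient and lengths are assumed values for demonstration only.

def trajectory_reward(correct: bool, length: int, lam: float = 0.01) -> float:
    """A typical length-penalized verifiable reward: R = 1[correct] - lam * length."""
    return (1.0 if correct else 0.0) - lam * length

# In critic-free RLVR, every token of a trajectory is reinforced with the same
# scalar, so each reward below applies uniformly to all tokens of its trajectory.
long_with_exploration = trajectory_reward(correct=True, length=60)  # ~0.4
short_but_redundant = trajectory_reward(correct=True, length=40)    # ~0.6

# Flaw 1: the longer trajectory's exploratory tokens are penalized exactly as
# much as redundant ones would be; a single scalar cannot tell the two apart.
# Flaw 2: trimming only part of the redundancy still raises the reward on every
# remaining token, redundant ones included.
assert short_but_redundant > long_with_exploration
```

The key point is that the scalar reward is broadcast over all tokens, so the optimizer cannot direct the penalty at redundancy specifically.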

Retrieved papers compared: 10

DECS framework with decoupled token-level rewards

The authors propose DECS, a framework featuring a first-of-its-kind decoupled token-level reward mechanism that surgically distinguishes and penalizes redundant tokens generated after the necessary reasoning prefix, while preserving rewards for essential reasoning steps. This addresses the identified misalignment by operating at the token level rather than sequence level.
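
A minimal sketch of such a decoupled token-level assignment, assuming a known boundary after the necessary reasoning prefix (how DECS locates that boundary and the exact reward values are not specified here and are assumptions for illustration):

```python
# Hypothetical token-level reward decoupling: tokens up to the end of the
# necessary reasoning prefix keep the correctness reward, while tokens after
# it receive a redundancy penalty. `prefix_end` and `penalty` are illustrative.
from typing import List

def decoupled_token_rewards(n_tokens: int, prefix_end: int,
                            correct: bool, penalty: float = 0.1) -> List[float]:
    """Per-token rewards: the correctness signal for necessary tokens,
    a negative signal for redundant tokens generated after the prefix."""
    base = 1.0 if correct else 0.0
    return [base if t < prefix_end else -penalty for t in range(n_tokens)]

# A correct 8-token trajectory whose last 3 tokens are redundant:
rewards = decoupled_token_rewards(n_tokens=8, prefix_end=5, correct=True)
```

Unlike a trajectory-level scalar, this assignment lets the optimizer suppress only the trailing redundant tokens while leaving the reward on essential reasoning steps untouched.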

Retrieved papers compared: 5

Curriculum batch scheduling strategy

The authors introduce a dynamic batching strategy that adaptively adjusts the proportion of easy prompts in training batches based on the current NRP ratio. This curriculum approach mitigates over-penalization of exploratory behavior and maintains the balance between reasoning efficiency and model capability throughout training.
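
A hypothetical sketch of such a scheduler follows. "NRP ratio" is the metric named in the report but not expanded here; the update rule (shifting the easy-prompt fraction linearly with the gap to a target ratio) and all constants are assumptions for illustration:

```python
# Hypothetical curriculum batch scheduler: the fraction of easy prompts in a
# batch tracks the current NRP ratio. Target, bounds, and the linear rule are
# assumed values, not the paper's definitions.
import random

def easy_fraction(nrp_ratio: float, target: float = 0.5,
                  min_frac: float = 0.1, max_frac: float = 0.9) -> float:
    """Include more easy prompts when the NRP ratio exceeds the target
    (to curb over-penalization of exploration), fewer when it falls below."""
    frac = 0.5 + (nrp_ratio - target)
    return max(min_frac, min(max_frac, frac))

def build_batch(easy_pool, hard_pool, batch_size, nrp_ratio, rng=random):
    """Sample a training batch whose easy/hard mix follows the schedule."""
    n_easy = round(easy_fraction(nrp_ratio) * batch_size)
    return (rng.sample(easy_pool, n_easy)
            + rng.sample(hard_pool, batch_size - n_easy))
```

The bounds keep some hard prompts in every batch, so the model is never trained on easy prompts alone even when the ratio drifts far from the target.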

Retrieved papers compared: 10

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
