Overthinking Reduction with Decoupled Rewards and Curriculum Data Scheduling
Overview
Overall Novelty Assessment
The paper introduces DECS, a framework addressing overthinking in large reasoning models through decoupled token-level rewards and curriculum batch scheduling. It resides in the 'Reward Engineering for Efficiency' leaf under 'Training-Based Overthinking Mitigation', alongside three sibling papers (REA-RL, Reasoning Shaping, and SmartThinker). This leaf represents a moderately populated research direction within a taxonomy of 50 papers across approximately 36 topics, indicating focused but not overcrowded attention to reward-based training solutions for reasoning efficiency.
The taxonomy reveals that reward engineering sits within a broader training-based mitigation branch, distinct from inference-time methods (e.g., early exit mechanisms, reasoning compression) and adaptive control approaches (e.g., difficulty-adaptive allocation). Neighboring leaves include 'Data-Centric Training Strategies' and 'Reasoning Pattern Guidance', which address efficiency through curated datasets and modular reasoning supervision respectively. DECS diverges from these by focusing on token-level reward decomposition rather than data curation or pattern-level guidance, positioning it as a training objective innovation rather than an architectural or data-driven solution.
Among 25 candidates examined across the three contributions, no clearly refuting prior work was identified. The theoretical analysis of length-based reward misalignment was checked against 10 candidates with zero refutations, suggesting this specific framing may be novel within the limited search scope. The DECS framework and the curriculum scheduling contributions were checked against 5 and 10 candidates, respectively, also without refutation. These statistics indicate that, within the top-K semantic matches explored, the paper's specific combination of token-level reward decoupling and curriculum strategies appears distinct from existing reward engineering approaches.
Based on the limited literature search of 25 candidates, the work appears to offer a fresh perspective within reward engineering for reasoning efficiency. However, the analysis does not cover exhaustive prior work beyond top-K semantic matches and citation expansion. The taxonomy context suggests the paper contributes to an active but not saturated research direction, with its token-level reward decomposition distinguishing it from sibling works that may employ trajectory-level or step-level reward mechanisms.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors provide a theoretical analysis revealing two critical flaws in existing length reward mechanisms: the erroneous penalization of essential exploratory high-entropy tokens and the inadvertent rewarding of partial redundancy. This misalignment between trajectory-level rewards and token-level optimization is shown to degrade reasoning performance and limit efficiency gains.
The authors propose DECS, a framework featuring a first-of-its-kind decoupled token-level reward mechanism that surgically distinguishes and penalizes redundant tokens generated after the necessary reasoning prefix, while preserving rewards for essential reasoning steps. This addresses the identified misalignment by operating at the token level rather than sequence level.
The authors introduce a dynamic batching strategy that adaptively adjusts the proportion of easy prompts in training batches based on the current NRP ratio. This curriculum approach mitigates over-penalization of exploratory behavior and maintains the balance between reasoning efficiency and model capability throughout training.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[19] REA-RL: Reflection-Aware Online Reinforcement Learning for Efficient Large Reasoning Models PDF
[32] Mitigating Overthinking through Reasoning Shaping PDF
[45] SmartThinker: Learning to Compress and Preserve Reasoning by Step-Level Length Control PDF
Contribution Analysis
Detailed comparisons for each claimed contribution
Theoretical analysis of misalignment in length-based rewards
The authors provide a theoretical analysis revealing two critical flaws in existing length reward mechanisms: the erroneous penalization of essential exploratory high-entropy tokens and the inadvertent rewarding of partial redundancy. This misalignment between trajectory-level rewards and token-level optimization is shown to degrade reasoning performance and limit efficiency gains.
[51] L1: Controlling how long a reasoning model thinks with reinforcement learning PDF
[52] Scaling laws for reward model overoptimization in direct alignment algorithms PDF
[53] Towards Flash Thinking via Decoupled Advantage Policy Optimization PDF
[54] LAPO: Internalizing Reasoning Efficiency via Length-Adaptive Policy Optimization PDF
[55] Hierarchical Budget Policy Optimization for Adaptive Reasoning PDF
[56] GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy PDF
[57] From Uniform to Heterogeneous: Tailoring Policy Optimization to Every Token's Nature PDF
[58] Soft Adaptive Policy Optimization PDF
[59] Tlcr: Token-level continuous reward for fine-grained reinforcement learning from human feedback PDF
[60] Delve into PPO: Implementation matters for stable RLHF PDF
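The misalignment described above can be illustrated with a minimal sketch (an assumed toy model, not taken from the paper): when a sequence-level length penalty is broadcast uniformly over a trajectory's tokens, every token receives identical credit, so the optimizer cannot distinguish an essential exploratory token from a redundant one.

```python
# Toy illustration (not the paper's implementation): a trajectory-level
# length penalty yields one scalar per sequence, which common policy-gradient
# setups broadcast to every token in the trajectory.

def trajectory_level_token_rewards(tokens, correct, alpha=0.01):
    """Uniform per-token credit from a sequence-level correctness + length reward."""
    r = (1.0 if correct else 0.0) - alpha * len(tokens)
    return [r / len(tokens)] * len(tokens)  # identical share for every token

essential = ["explore_a", "explore_b", "answer"]
padded = essential + ["recheck"] * 5  # same answer, plus redundant tokens

r_essential = trajectory_level_token_rewards(essential, correct=True)
r_padded = trajectory_level_token_rewards(padded, correct=True)

# All tokens in a trajectory receive the same credit, so an essential
# exploratory token and a redundant re-check token are indistinguishable:
assert len(set(r_padded)) == 1
```

Under this broadcast, shortening pressure penalizes the essential exploratory tokens exactly as hard as the redundant suffix, matching the two flaws the authors identify.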
DECS framework with decoupled token-level rewards
The authors propose DECS, a framework featuring a first-of-its-kind decoupled token-level reward mechanism that surgically distinguishes and penalizes redundant tokens generated after the necessary reasoning prefix, while preserving rewards for essential reasoning steps. This addresses the identified misalignment by operating at the token level rather than sequence level.
[71] StepSearch: Igniting LLMs Search Ability via Step-Wise Proximal Policy Optimization PDF
[72] On the Current Landscape of Language Model Reward Modeling for Alignment PDF
[73] DRPO: Efficient Reasoning via Decoupled Reward Policy Optimization PDF
[74] UpSkill: Mutual Information Skill Learning for Structured Response Diversity in LLMs PDF
[75] Arithmetic accuracy and stability as a function of token reinforcement PDF
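The decoupling idea can be sketched as follows. This is a hypothetical simplification: it assumes the boundary of the necessary reasoning prefix (`prefix_len`) is already identified, and the task reward and penalty values are illustrative, not the paper's.

```python
# Hypothetical sketch in the spirit of a decoupled token-level reward:
# tokens inside the necessary reasoning prefix share the task reward,
# while tokens generated after it receive a redundancy penalty.
# `prefix_len`, `task_reward`, and `penalty` are illustrative assumptions.

def decoupled_token_rewards(tokens, prefix_len, task_reward=1.0, penalty=-0.1):
    rewards = []
    for i, _ in enumerate(tokens):
        if i < prefix_len:
            rewards.append(task_reward / prefix_len)  # preserve essential steps
        else:
            rewards.append(penalty)  # penalize post-prefix redundancy only
    return rewards

tokens = ["step_1", "step_2", "answer", "recheck", "recheck"]
r = decoupled_token_rewards(tokens, prefix_len=3)
assert all(x > 0 for x in r[:3]) and all(x < 0 for x in r[3:])
```

Unlike the uniform broadcast above, the sign of each token's reward now depends on its position relative to the prefix boundary, which is the misalignment fix the contribution claims.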
Curriculum batch scheduling strategy
The authors introduce a dynamic batching strategy that adaptively adjusts the proportion of easy prompts in training batches based on the current NRP ratio. This curriculum approach mitigates over-penalization of exploratory behavior and maintains the balance between reasoning efficiency and model capability throughout training.
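A minimal sketch of such a schedule is given below. The update rule (linear feedback toward a target ratio, then clamping) and all parameter values are assumptions for illustration; the paper's exact schedule, and the precise definition of the NRP ratio it monitors, are not given in this summary.

```python
# Hypothetical sketch of NRP-driven curriculum batching: when the monitored
# NRP ratio rises above a target, the batch shifts toward easier prompts.
# The linear feedback rule and all constants are illustrative assumptions.

def easy_fraction(nrp_ratio, target=0.5, base=0.3, gain=0.4):
    """Fraction of easy prompts in a batch, increasing with the NRP ratio."""
    frac = base + gain * (nrp_ratio - target)
    return min(1.0, max(0.0, frac))  # clamp to a valid proportion

def build_batch(easy_prompts, hard_prompts, batch_size, nrp_ratio):
    """Compose a training batch from easy/hard pools per the current schedule."""
    n_easy = round(easy_fraction(nrp_ratio) * batch_size)
    return easy_prompts[:n_easy] + hard_prompts[:batch_size - n_easy]

# Higher NRP ratio -> larger share of easy prompts in the next batch:
assert easy_fraction(0.9) > easy_fraction(0.1)
```

The design intuition matches the claim: feeding in more easy prompts when the monitored ratio drifts relaxes pressure on exploratory behavior, balancing efficiency against capability over training.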