When More is Less: Understanding Chain-of-Thought Length in LLMs

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Chain-of-Thought reasoning, simplicity bias, test-time scaling, reasoning length calibration
Abstract:

Large Language Models (LLMs) increasingly rely on Chain-of-Thought (CoT) reasoning to solve complex problems. Contrary to the common belief that longer CoTs always improve performance, we demonstrate that longer is not always better. Across both real-world LLMs and theoretical models, task accuracy follows an inverted U-shaped curve with respect to CoT length: performance rises initially but declines once reasoning chains become too long. Through controlled experiments, we uncover scaling behaviors of the optimal CoT length: it increases with task difficulty but decreases with model capability. This exposes a significant mismatch with current practice, where supervised training often reuses the same CoT data across models and tasks without adaptivity. We further show that Reinforcement Learning (RL) can mitigate this gap by dynamically calibrating CoT length, thereby improving accuracy and offering a new perspective on differences between supervised fine-tuning and RL training. To explain these phenomena, we introduce an error-accumulation analysis that characterizes how reasoning errors propagate across steps and derives the scaling behaviors of CoT length observed empirically. Building on these insights, we show that training with optimally sized CoTs and applying length-aware filtering during inference yields substantial improvements in performance. Taken together, these findings establish a principled explanation of the "overthinking" effect and yield practical guidelines for calibrating CoT length in accordance with task complexity and model capability.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's task and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper investigates the relationship between Chain-of-Thought length and task accuracy, demonstrating an inverted U-shaped curve where performance initially improves but declines with excessive reasoning steps. It resides in the 'Length-Performance Relationship Analysis' leaf alongside two sibling papers (Reasoning Step Length and Thinking Optimal Scaling), forming a small cluster within the broader 'Empirical Studies of CoT Length Effects' branch. This leaf represents a moderately active research direction, with three papers examining how CoT length correlates with model performance across varying task complexities.

The taxonomy reveals neighboring work in adjacent leaves: 'Overthinking and Underthinking Phenomena' explores redundancy and misalignment in reasoning depth, while 'CoT Length Optimization and Control Methods' develops active interventions like RL-based adaptation and compression techniques. The paper bridges empirical observation (its home leaf) with theoretical explanation, connecting to the 'Error Accumulation and Scaling Behavior Models' leaf through its analytical framework. Unlike optimization-focused neighbors, this work prioritizes characterizing the length-performance relationship rather than proposing control mechanisms, though it acknowledges RL's potential to mitigate observed mismatches.

Among thirty candidates examined, the first contribution (inverted U-shape) encountered two potentially overlapping papers from ten reviewed, while the second contribution (scaling laws) found one refutable candidate among ten. The third contribution (error-accumulation framework) showed no clear refutation across ten candidates, suggesting relative novelty in its theoretical approach. The limited search scope means these statistics reflect top-semantic-match overlap rather than exhaustive prior work coverage. Contributions one and two appear to build incrementally on existing empirical observations, whereas the theoretical framework may offer more distinctive analytical machinery.

Based on the thirty-candidate search, the work appears to synthesize and extend existing empirical findings about CoT length effects while introducing a theoretical lens less represented in prior literature. The analysis does not cover the full corpus of CoT research, particularly work outside top-semantic-match retrieval or recent preprints. The inverted U-shape and scaling behaviors align with emerging themes in the field, though the error-accumulation formalism may provide novel explanatory power within this moderately crowded empirical research direction.

Taxonomy

Core-task Taxonomy Papers: 33
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: optimal Chain-of-Thought length in large language model reasoning.

The field has organized around several complementary perspectives on how reasoning chain length affects model performance. One major branch focuses on CoT Length Optimization and Control Methods, developing techniques to dynamically adjust or compress reasoning steps during inference. A second branch comprises Empirical Studies of CoT Length Effects, systematically measuring how varying chain lengths influence accuracy, efficiency, and generalization across diverse tasks. Theoretical Foundations and Mechanistic Understanding seeks to explain why certain lengths work better, probing the internal representations and computational limits of reasoning. Training Strategies for CoT Length Calibration explores how to teach models appropriate reasoning depth during the learning phase, while Domain-Specific and Multimodal CoT Applications examines length considerations in specialized contexts such as medical reasoning or audio-visual tasks.

Recent empirical work has revealed nuanced trade-offs between reasoning depth and performance. Studies like Reasoning Step Length[2] and Thinking Optimal Scaling[8] investigate how different problem complexities demand different chain lengths, while works such as ShorterBetter[4] and Don't Overthink[6] challenge the assumption that longer reasoning always improves outcomes. More is Less[0] contributes to this active debate by examining length-performance relationships, positioning itself alongside neighbors like Reasoning Step Length[2] and Thinking Optimal Scaling[8] within the empirical analysis cluster. Where some studies emphasize the benefits of extended reasoning for hard problems, More is Less[0] explores conditions under which excessive chain length may introduce noise or inefficiency, complementing findings from CoT Mirage[7] and Beyond Surface Reasoning[3] that question whether all reasoning tokens contribute equally to final answer quality.

Claimed Contributions

Demonstration of inverted U-shaped relationship between CoT length and accuracy

The authors show empirically and theoretically that reasoning performance does not monotonically improve with longer Chain-of-Thought sequences. Instead, accuracy peaks at an optimal length and declines when chains become excessively long, challenging the prevailing assumption that longer reasoning is always better.

10 retrieved papers
Can Refute
Scaling laws for optimal CoT length with task difficulty and model capability

Through controlled experiments on synthetic tasks and real-world LLMs, the authors systematically characterize how the optimal CoT length scales: harder tasks require longer chains, while more capable models achieve peak performance with shorter chains. This reveals a mismatch with current uniform training practices.

10 retrieved papers
Can Refute
Error-accumulation theoretical framework explaining CoT length phenomena

The authors develop a theoretical model based on error accumulation across reasoning steps that formally explains the inverted U-shaped performance curve, derives the existence of an optimal CoT length, and recovers the empirically observed scaling laws relating optimal length to task difficulty and model capability.

10 retrieved papers
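The three claims above can be illustrated with a minimal toy sketch of error accumulation. Assume a task of difficulty D is decomposed into T steps by a model of capability c: the chance that T steps suffice rises softly once c·T exceeds D, while each step independently introduces an error with probability eps. The functional form, the symbols D, c, eps, a, and the function names below are all illustrative assumptions for this report, not the paper's actual formalism.

```python
import math

def toy_accuracy(T, D, c, eps=0.05, a=1.0):
    """Toy error-accumulation model (illustrative, not the paper's formalism).

    coverage: soft-threshold probability that T steps of a model with
              capability c suffice for a task of difficulty D (peaks near T = D/c).
    survival: probability that none of the T steps introduces an error.
    """
    coverage = 1.0 / (1.0 + math.exp(-a * (c * T - D)))
    survival = (1.0 - eps) ** T
    return coverage * survival

def optimal_length(D, c, max_T=100):
    """CoT length that maximizes toy accuracy."""
    return max(range(1, max_T + 1), key=lambda T: toy_accuracy(T, D, c))

if __name__ == "__main__":
    # Inverted U: accuracy rises, peaks at an interior length, then decays.
    print("optimal length, D=10, c=2:", optimal_length(10, 2))
    print("harder task,    D=20, c=2:", optimal_length(20, 2))   # longer chain
    print("stronger model, D=10, c=4:", optimal_length(10, 4))   # shorter chain
```

Under these assumed dynamics the optimum lands near D/c plus a small margin, so it grows with task difficulty and shrinks with model capability, and accuracy decays at both extremes, reproducing the inverted U-shape qualitatively.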

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Demonstration of inverted U-shaped relationship between CoT length and accuracy

The authors show empirically and theoretically that reasoning performance does not monotonically improve with longer Chain-of-Thought sequences. Instead, accuracy peaks at an optimal length and declines when chains become excessively long, challenging the prevailing assumption that longer reasoning is always better.

Contribution

Scaling laws for optimal CoT length with task difficulty and model capability

Through controlled experiments on synthetic tasks and real-world LLMs, the authors systematically characterize how the optimal CoT length scales: harder tasks require longer chains, while more capable models achieve peak performance with shorter chains. This reveals a mismatch with current uniform training practices.

Contribution

Error-accumulation theoretical framework explaining CoT length phenomena

The authors develop a theoretical model based on error accumulation across reasoning steps that formally explains the inverted U-shaped performance curve, derives the existence of an optimal CoT length, and recovers the empirically observed scaling laws relating optimal length to task difficulty and model capability.