When More is Less: Understanding Chain-of-Thought Length in LLMs

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Chain-of-Thought reasoning, simplicity bias, test-time scaling, reasoning length calibration
Abstract:

Large Language Models (LLMs) increasingly rely on Chain-of-Thought (CoT) reasoning to solve complex problems. Contrary to the common belief that longer CoTs always improve performance, we demonstrate that longer is not always better. Across both real-world LLMs and theoretical models, task accuracy follows an inverted U-shaped curve with respect to CoT length: performance rises initially but declines once reasoning chains become too long. Through controlled experiments, we uncover scaling behaviors of the optimal CoT length: it increases with task difficulty but decreases with model capability. This exposes a significant mismatch with current practice, where supervised training often reuses the same CoT data across models and tasks without adaptivity. We further show that Reinforcement Learning (RL) can mitigate this gap by dynamically calibrating CoT length, thereby improving accuracy and offering a new perspective on differences between supervised fine-tuning and RL training. To explain these phenomena, we introduce an error-accumulation analysis that characterizes how reasoning errors propagate across steps and derives the scaling behaviors of CoT length observed empirically. Building on these insights, we show that training with optimally sized CoTs and applying length-aware filtering during inference yields substantial improvements in performance. Taken together, these findings establish a principled explanation of the "overthinking" effect and yield practical guidelines for calibrating CoT length in accordance with task complexity and model capability.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's task and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper investigates the relationship between Chain-of-Thought length and task accuracy, demonstrating an inverted U-shaped curve where performance initially improves but declines with excessive reasoning steps. It resides in the 'Length-Performance Relationship Analysis' leaf alongside two sibling papers (Reasoning Step Length and Thinking Optimal Scaling), forming a small cluster within the broader 'Empirical Studies of CoT Length Effects' branch. This leaf represents a moderately active research direction, with three papers examining how CoT length correlates with model performance across varying task complexities.

The taxonomy reveals neighboring work in adjacent leaves: 'Overthinking and Underthinking Phenomena' explores redundancy and misalignment in reasoning depth, while 'CoT Length Optimization and Control Methods' develops active interventions like RL-based adaptation and compression techniques. The paper bridges empirical observation (its home leaf) with theoretical explanation, connecting to the 'Error Accumulation and Scaling Behavior Models' leaf through its analytical framework. Unlike optimization-focused neighbors, this work prioritizes characterizing the length-performance relationship rather than proposing control mechanisms, though it acknowledges RL's potential to mitigate observed mismatches.

Among thirty candidates examined, the first contribution (inverted U-shape) encountered two potentially overlapping papers from ten reviewed, while the second contribution (scaling laws) found one refutable candidate among ten. The third contribution (error-accumulation framework) showed no clear refutation across ten candidates, suggesting relative novelty in its theoretical approach. The limited search scope means these statistics reflect top-semantic-match overlap rather than exhaustive prior work coverage. Contributions one and two appear to build incrementally on existing empirical observations, whereas the theoretical framework may offer more distinctive analytical machinery.

Based on the thirty-candidate search, the work appears to synthesize and extend existing empirical findings about CoT length effects while introducing a theoretical lens less represented in prior literature. The analysis does not cover the full corpus of CoT research, particularly work outside top-semantic-match retrieval or recent preprints. The inverted U-shape and scaling behaviors align with emerging themes in the field, though the error-accumulation formalism may provide novel explanatory power within this moderately crowded empirical research direction.

Taxonomy

Core-task Taxonomy Papers: 33
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: optimal Chain-of-Thought length in large language model reasoning.

The field has organized around several complementary perspectives on how reasoning chain length affects model performance. One major branch focuses on CoT Length Optimization and Control Methods, developing techniques to dynamically adjust or compress reasoning steps during inference. A second branch comprises Empirical Studies of CoT Length Effects, systematically measuring how varying chain lengths influence accuracy, efficiency, and generalization across diverse tasks. Theoretical Foundations and Mechanistic Understanding seeks to explain why certain lengths work better, probing the internal representations and computational limits of reasoning. Training Strategies for CoT Length Calibration explores how to teach models appropriate reasoning depth during the learning phase, while Domain-Specific and Multimodal CoT Applications examines length considerations in specialized contexts such as medical reasoning or audio-visual tasks.

Recent empirical work has revealed nuanced trade-offs between reasoning depth and performance. Studies like Reasoning Step Length[2] and Thinking Optimal Scaling[8] investigate how different problem complexities demand different chain lengths, while works such as ShorterBetter[4] and Don't Overthink[6] challenge the assumption that longer reasoning always improves outcomes. More is Less[0] contributes to this active debate by examining length-performance relationships, positioning itself alongside neighbors like Reasoning Step Length[2] and Thinking Optimal Scaling[8] within the empirical analysis cluster. Where some studies emphasize the benefits of extended reasoning for hard problems, More is Less[0] explores conditions under which excessive chain length may introduce noise or inefficiency, complementing findings from CoT Mirage[7] and Beyond Surface Reasoning[3] that question whether all reasoning tokens contribute equally to final answer quality.

Claimed Contributions

Demonstration of inverted U-shaped relationship between CoT length and accuracy

The authors show empirically and theoretically that reasoning performance does not monotonically improve with longer Chain-of-Thought sequences. Instead, accuracy peaks at an optimal length and declines when chains become excessively long, challenging the prevailing assumption that longer reasoning is always better.

10 retrieved papers
Can Refute
Scaling laws for optimal CoT length with task difficulty and model capability

Through controlled experiments on synthetic tasks and real-world LLMs, the authors systematically characterize how the optimal CoT length scales: harder tasks require longer chains, while more capable models achieve peak performance with shorter chains. This reveals a mismatch with current uniform training practices.

10 retrieved papers
Can Refute
Error-accumulation theoretical framework explaining CoT length phenomena

The authors develop a theoretical model based on error accumulation across reasoning steps that formally explains the inverted U-shaped performance curve, derives the existence of an optimal CoT length, and recovers the empirically observed scaling laws relating optimal length to task difficulty and model capability.

10 retrieved papers
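The three claims above can be illustrated with a minimal toy sketch of error accumulation. Assume a task of difficulty D is decomposed into T steps by a model of capability c: the chance that T steps suffice rises softly once c·T exceeds D, while each step independently introduces an error with probability eps. The functional form, the symbols D, c, eps, a, and the function names below are all illustrative assumptions for this report, not the paper's actual formalism.

```python
import math

def toy_accuracy(T, D, c, eps=0.05, a=1.0):
    """Toy error-accumulation model (illustrative, not the paper's formalism).

    coverage: soft-threshold probability that T steps of a model with
              capability c suffice for a task of difficulty D (peaks near T = D/c).
    survival: probability that none of the T steps introduces an error.
    """
    coverage = 1.0 / (1.0 + math.exp(-a * (c * T - D)))
    survival = (1.0 - eps) ** T
    return coverage * survival

def optimal_length(D, c, max_T=100):
    """CoT length that maximizes toy accuracy."""
    return max(range(1, max_T + 1), key=lambda T: toy_accuracy(T, D, c))

if __name__ == "__main__":
    # Inverted U: accuracy rises, peaks at an interior length, then decays.
    print("optimal length, D=10, c=2:", optimal_length(10, 2))
    print("harder task,    D=20, c=2:", optimal_length(20, 2))   # longer chain
    print("stronger model, D=10, c=4:", optimal_length(10, 4))   # shorter chain
```

Under these assumed dynamics the optimum lands near D/c plus a small margin, so it grows with task difficulty and shrinks with model capability, and accuracy decays at both extremes, reproducing the inverted U-shape qualitatively.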

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Demonstration of inverted U-shaped relationship between CoT length and accuracy

The authors show empirically and theoretically that reasoning performance does not monotonically improve with longer Chain-of-Thought sequences. Instead, accuracy peaks at an optimal length and declines when chains become excessively long, challenging the prevailing assumption that longer reasoning is always better.

Contribution

Scaling laws for optimal CoT length with task difficulty and model capability

Through controlled experiments on synthetic tasks and real-world LLMs, the authors systematically characterize how the optimal CoT length scales: harder tasks require longer chains, while more capable models achieve peak performance with shorter chains. This reveals a mismatch with current uniform training practices.

Contribution

Error-accumulation theoretical framework explaining CoT length phenomena

The authors develop a theoretical model based on error accumulation across reasoning steps that formally explains the inverted U-shaped performance curve, derives the existence of an optimal CoT length, and recovers the empirically observed scaling laws relating optimal length to task difficulty and model capability.