When More is Less: Understanding Chain-of-Thought Length in LLMs
Overview
Overall Novelty Assessment
The paper investigates the relationship between Chain-of-Thought length and task accuracy, demonstrating an inverted U-shaped curve where performance initially improves but declines with excessive reasoning steps. It resides in the 'Length-Performance Relationship Analysis' leaf alongside two sibling papers (Reasoning Step Length and Thinking Optimal Scaling), forming a small cluster within the broader 'Empirical Studies of CoT Length Effects' branch. This leaf represents a moderately active research direction, with three papers examining how CoT length correlates with model performance across varying task complexities.
The taxonomy reveals neighboring work in adjacent leaves: 'Overthinking and Underthinking Phenomena' explores redundancy and misalignment in reasoning depth, while 'CoT Length Optimization and Control Methods' develops active interventions like RL-based adaptation and compression techniques. The paper bridges empirical observation (its home leaf) with theoretical explanation, connecting to the 'Error Accumulation and Scaling Behavior Models' leaf through its analytical framework. Unlike optimization-focused neighbors, this work prioritizes characterizing the length-performance relationship rather than proposing control mechanisms, though it acknowledges RL's potential to mitigate observed mismatches.
Of the thirty candidates examined (ten per contribution), the first contribution (the inverted U-shape) had two potentially overlapping papers, the second (scaling laws) had one candidate that could plausibly refute it, and the third (the error-accumulation framework) had no clear refutation, suggesting relative novelty in its theoretical approach. Because the search was limited to top semantic matches, these statistics reflect overlap within that retrieval set rather than exhaustive coverage of prior work. Contributions one and two appear to build incrementally on existing empirical observations, whereas the theoretical framework may offer more distinctive analytical machinery.
Based on the thirty-candidate search, the work appears to synthesize and extend existing empirical findings about CoT length effects while introducing a theoretical lens less represented in prior literature. The analysis does not cover the full corpus of CoT research, particularly work outside top-semantic-match retrieval or recent preprints. The inverted U-shape and scaling behaviors align with emerging themes in the field, though the error-accumulation formalism may provide novel explanatory power within this moderately crowded empirical research direction.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors show empirically and theoretically that reasoning performance does not monotonically improve with longer Chain-of-Thought sequences. Instead, accuracy peaks at an optimal length and declines when chains become excessively long, challenging the prevailing assumption that longer reasoning is always better.
Through controlled experiments on synthetic tasks and real-world LLMs, the authors systematically characterize how the optimal CoT length scales: harder tasks require longer chains, while more capable models achieve peak performance with shorter chains. This reveals a mismatch with current training practices, which apply a uniform chain length regardless of task difficulty or model capability.
The authors develop a theoretical model based on error accumulation across reasoning steps that formally explains the inverted U-shaped performance curve, derives the existence of an optimal CoT length, and recovers the empirically observed scaling laws relating optimal length to task difficulty and model capability.
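The error-accumulation intuition behind these three claims can be illustrated with a toy simulation. This is an assumption made for illustration, not the paper's exact model: suppose each reasoning step fails either because its sub-task is too large (shorter chains force harder steps) or because of a constant per-step slip probability, and the final answer is correct only if every step succeeds.

```python
import numpy as np

def accuracy(T, difficulty, capability, slip=0.01):
    """Toy error-accumulation model (hypothetical parameterization).

    Each of the T reasoning steps fails with probability
    slip + difficulty / (capability * T): shorter chains force harder
    sub-tasks per step, while every extra step risks a fresh slip.
    Final-answer accuracy is the chance that all T steps succeed.
    """
    per_step_error = np.clip(slip + difficulty / (capability * T), 0.0, 1.0)
    return (1.0 - per_step_error) ** T

lengths = np.arange(1, 101)

# Inverted U: accuracy peaks at an intermediate chain length.
acc = accuracy(lengths, difficulty=1.0, capability=1.0)
optimal_T = lengths[np.argmax(acc)]

# Scaling: harder tasks push the peak right; stronger models pull it left.
harder_T = lengths[np.argmax(accuracy(lengths, difficulty=2.0, capability=1.0))]
stronger_T = lengths[np.argmax(accuracy(lengths, difficulty=1.0, capability=2.0))]
```

Under these assumed functional forms the peak sits at a finite length, moves to longer chains as difficulty grows, and to shorter chains as capability grows, mirroring the paper's qualitative claims.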
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[2] The impact of reasoning step length on large language models PDF
[8] Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning PDF
Contribution Analysis
Detailed comparisons for each claimed contribution
Demonstration of inverted U-shaped relationship between CoT length and accuracy
The authors show empirically and theoretically that reasoning performance does not monotonically improve with longer Chain-of-Thought sequences. Instead, accuracy peaks at an optimal length and declines when chains become excessively long, challenging the prevailing assumption that longer reasoning is always better.
[12] Between Underthinking and Overthinking: An Empirical Study of Reasoning Length and Correctness in LLMs PDF
[39] Critical Thinking: Which Kinds of Complexity Govern Optimal Reasoning Length? PDF
[19] An Empirical Study of Reasoning Steps in Thinking Code LLMs PDF
[34] Thinking Fast and Right: Balancing Accuracy and Reasoning Length with Adaptive Rewards PDF
[35] ReCUT: Balancing Reasoning Length and Accuracy in LLMs via Stepwise Trails and Preference Optimization PDF
[36] AALC: Large Language Model Efficient Reasoning via Adaptive Accuracy-Length Control PDF
[37] The Relationship Between Reasoning and Performance in Large Language Models - o3 (mini) Thinks Harder, Not Longer PDF
[38] Using the tools of cognitive science to understand large language models at different levels of analysis PDF
[40] Long Is More Important Than Difficult for Training Reasoning Models PDF
[41] LongReasonArena: A Long Reasoning Benchmark for Large Language Models PDF
Scaling laws for optimal CoT length with task difficulty and model capability
Through controlled experiments on synthetic tasks and real-world LLMs, the authors systematically characterize how the optimal CoT length scales: harder tasks require longer chains, while more capable models achieve peak performance with shorter chains. This reveals a mismatch with current training practices, which apply a uniform chain length regardless of task difficulty or model capability.
[2] The impact of reasoning step length on large language models PDF
[19] An Empirical Study of Reasoning Steps in Thinking Code LLMs PDF
[40] Long Is More Important Than Difficult for Training Reasoning Models PDF
[42] Understanding transformer reasoning capabilities via graph algorithms PDF
[43] Observational Scaling Laws and the Predictability of Language Model Performance PDF
[44] Inverse scaling in test-time compute PDF
[45] Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration PDF
[46] Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning PDF
[47] Limits of deep learning: Sequence modeling through the lens of complexity theory PDF
[48] Reasoning in Large Language Models: From Chain-of-Thought to Massively Decomposed Agentic Processes PDF
Error-accumulation theoretical framework explaining CoT length phenomena
The authors develop a theoretical model based on error accumulation across reasoning steps that formally explains the inverted U-shaped performance curve, derives the existence of an optimal CoT length, and recovers the empirically observed scaling laws relating optimal length to task difficulty and model capability.
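One way such a framework can yield both the inverted U and the scaling laws is sketched below, under assumed functional forms rather than the authors' exact derivation: take per-step error to be a constant slip $q$ plus a sub-task term that shrinks as difficulty $d$ is spread over more steps (with $\alpha$ inversely related to model capability).

```latex
% Illustrative error-accumulation sketch (assumed forms, not the paper's model).
\[
  A(T) = \bigl(1 - \varepsilon(T)\bigr)^{T},
  \qquad
  \varepsilon(T) = q + \frac{\alpha d}{T}.
\]
% A second-order expansion of \log A(T) = T \log\!\bigl(1 - \varepsilon(T)\bigr)
% for small q and small \alpha d / T gives
\[
  \log A(T) \approx -qT \;-\; \alpha d \;-\; \frac{(\alpha d)^{2}}{2T}
  \;+\; \text{const},
\]
% which rises and then falls in T (the inverted U), with the maximum at
\[
  T^{*} \approx \frac{\alpha d}{\sqrt{2q}},
\]
% so the optimal length grows with task difficulty d and shrinks with
% model capability (smaller \alpha and q).
```

In this sketch the linear $-qT$ term captures accumulating per-step slips and the $1/T$ term captures over-compressed reasoning, which is enough to recover the qualitative shape and scaling directions described above.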