The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs

ICLR 2026 Conference Submission · Anonymous Authors
Large Language Models · Long Horizon · Agents
Abstract:

Does continued scaling of large language models (LLMs) yield diminishing returns? In this work, we show that short-task benchmarks may give an illusion of slowing progress, as even marginal gains in single-step accuracy can compound into exponential improvements in the length of tasks a model can successfully complete. We then argue that the failures of LLMs on simple tasks that are made longer arise from mistakes in execution, rather than an inability to reason. We therefore propose isolating execution capability by explicitly providing the knowledge and plan needed to solve a long-horizon task. First, we find that larger models can correctly execute significantly more turns even when small models have near-perfect single-turn accuracy. We then observe that the per-step accuracy of models degrades as the number of steps increases. This is not just due to long-context limitations: curiously, we observe a self-conditioning effect, in which models become more likely to make mistakes when the context contains their errors from prior turns. Self-conditioning is not mitigated by simply scaling model size. However, we find that thinking mitigates self-conditioning, and also enables execution of much longer tasks in a single turn. We conclude by benchmarking frontier thinking models on the length of tasks they can execute in a single turn. Overall, by focusing on the ability to execute, we hope to reconcile debates on how LLMs can solve complex reasoning problems yet fail at simple tasks when made longer, and to highlight the massive benefits of scaling model size and sequential test-time compute for long-horizon tasks.

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper contributes a framework for isolating execution capability in long-horizon tasks, a mathematical analysis linking per-step accuracy to task horizon, and the discovery of a self-conditioning effect where models degrade when exposed to their own prior errors. It occupies a unique position in the taxonomy: the 'Execution Capability and Self-Conditioning Effects' leaf contains only this paper, making it the sole representative of this specific research direction. This isolation suggests the work addresses a relatively unexplored niche within the broader Execution and Error Analysis branch, which itself contains only three leaves across adaptive execution, self-conditioning, and reinforcement learning approaches.

The taxonomy reveals that neighboring work primarily focuses on adaptive feedback integration (six papers on iterative refinement and environmental feedback) and reinforcement learning for agents (two papers on sequential decision-making). The paper's emphasis on isolating execution from planning distinguishes it from these directions: adaptive execution studies like AgentGym and ReAct examine how agents refine behavior through interaction, while this work explicitly provides knowledge and plans to measure execution in isolation. The scope notes clarify that self-conditioning effects fall outside adaptive feedback mechanisms, positioning the paper at a boundary between execution analysis and reasoning evaluation branches, though it does not directly engage with chain-of-thought or reasoning optimization subtopics.

Among the twenty-three candidates examined across the three contributions, none was identified as clearly refuting the work. For the execution-isolation framework, ten candidates were examined with zero refutable matches; for the self-conditioning discovery, ten with zero refutations; and for the mathematical analysis, three with zero refutations. This suggests that, within the limited search scope (primarily top-K semantic matches and citation expansion), no prior work directly anticipates the combination of execution isolation, self-conditioning characterization, and mathematical modeling of the horizon-accuracy relationship. The framework and self-conditioning contributions appear particularly novel given the breadth of candidates examined, though the search does not claim exhaustive coverage of all relevant execution-focused literature.

Based on the limited search of twenty-three candidates, the work appears to introduce a distinct perspective on long-horizon execution that existing literature does not directly address. The taxonomy structure confirms this impression: the paper occupies a singleton leaf, and nearby work emphasizes different mechanisms (feedback loops, reinforcement learning) rather than isolating execution capability. However, the analysis reflects only top-K semantic retrieval and does not guarantee that no relevant work exists in adjacent areas such as error propagation in sequential reasoning or context-dependent performance degradation, which may not have surfaced in the candidate set.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 23
Refutable Papers: 0

Research Landscape Overview

Core task: measuring long-horizon execution capabilities in large language models. The field has organized itself around five major branches that collectively address how LLMs handle extended, multi-step problems. Task Planning and Decomposition focuses on breaking down complex goals into manageable subgoals, with works like AdaPlanner[6] and Planning Abilities Investigation[5] exploring hierarchical strategies and plan generation. Execution and Error Analysis examines how models actually carry out plans and where they fail, including studies on self-conditioning effects and execution traces. Reasoning Evaluation and Mechanisms investigates the underlying cognitive processes, from chain-of-thought methods (Demystifying Chain Thought[11]) to broader reasoning frameworks (Reasoning Era Survey[15]). Multi-Turn and Tool-Use Interaction captures the dynamic aspects of agent behavior, including conversational memory (Conversational Memory[48]) and tool integration across multiple turns (TurnBench MS[38]). Finally, Benchmarking and Context Handling addresses evaluation infrastructure and the challenge of managing long contexts (Long Context Survey[18], HelloBench[4]), ensuring that assessments reflect realistic task horizons.

Several active lines of work reveal key trade-offs in this landscape. One tension involves whether to emphasize upfront planning versus iterative execution with feedback, as seen in contrasts between structured decomposition approaches and more reactive agent designs like AgentGym RL[28]. Another theme concerns the balance between general reasoning capabilities and task-specific execution skills, with some efforts targeting domain-grounded benchmarks (EmbodiedBench[24]) while others pursue broad cognitive evaluations.

Illusion Diminishing Returns[0] sits within the Execution and Error Analysis branch, specifically examining execution capability and self-conditioning effects. Its focus on how repeated attempts or self-generated context may yield diminishing improvements aligns closely with concerns about long-context struggle (Long Context Struggle[25]) and contrasts with works like Successive Attempts Efficiency[50], which explore whether iterative refinement genuinely enhances performance or merely creates an illusion of progress in extended task sequences.

Claimed Contributions

Framework for isolating and measuring long-horizon execution in LLMs

The authors introduce a controlled experimental framework that decouples execution from planning and knowledge by providing models with explicit plans (as keys in a key-value dictionary) and required knowledge in context. This allows systematic measurement of how many steps models can reliably execute without confounding factors from reasoning or planning failures.

10 retrieved papers
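This setup can be sketched concretely. The code below is an illustrative toy, not the authors' implementation: the function names, the running-sum task, and the scoring rule are our assumptions. The knowledge is a key-value dictionary, the plan is an explicit list of keys, and the model's only job is to retrieve each key's value in order while maintaining a running sum, so any failure is an execution error rather than a planning or knowledge failure.

```python
# Illustrative sketch of an execution-isolation task: the plan and knowledge
# are both given explicitly in context, so failures measure execution alone.
# All names here are assumptions for illustration, not the paper's code.

def make_task(knowledge: dict, plan: list) -> list:
    """Ground-truth running sums, one per execution step."""
    total, expected = 0, []
    for key in plan:
        total += knowledge[key]
        expected.append(total)
    return expected

def executed_horizon(model_outputs: list, expected: list) -> int:
    """Number of consecutive correct steps before the first mistake."""
    horizon = 0
    for got, want in zip(model_outputs, expected):
        if got != want:
            break
        horizon += 1
    return horizon

knowledge = {"apple": 3, "pear": 5, "plum": 2}
plan = ["apple", "plum", "pear", "apple"]   # explicit plan: keys to execute in order
expected = make_task(knowledge, plan)       # [3, 5, 10, 13]
print(executed_horizon([3, 5, 10, 12], expected))  # prints 3: fails at step 4
```

The measured quantity is the horizon, the number of steps executed before the first error, which is the natural statistic to track as tasks are made longer.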
Discovery and characterization of the self-conditioning effect

The authors identify a novel failure mode where LLMs condition on their own previous errors, leading to increased likelihood of future mistakes. Through counterfactual experiments manipulating chat history error rates, they demonstrate this effect is distinct from long-context degradation and is not mitigated by scaling model size alone, though thinking models can overcome it.

10 retrieved papers
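The counterfactual manipulation can be sketched as follows; this is a toy illustration in which the history format, the off-by-one corruption rule, and the helper names are our assumptions. The idea is to build chat histories that are identical except for a controlled rate of injected errors in the model's past answers, so that next-step accuracy can be compared across error rates while context length is held fixed.

```python
import random

# Toy sketch of a counterfactual-history experiment: corrupt past "assistant"
# answers at a controlled rate, keeping history length constant, to separate
# self-conditioning on errors from plain long-context degradation.
# Format and names are illustrative assumptions, not the paper's code.

def build_history(expected, error_rate, rng):
    """Chat history whose past answers are wrong with probability error_rate."""
    history = []
    for step, answer in enumerate(expected):
        corrupted = rng.random() < error_rate
        shown = answer + 1 if corrupted else answer  # simple off-by-one corruption
        history.append({"role": "assistant", "step": step, "answer": shown})
    return history

rng = random.Random(0)
ground_truth = [3, 5, 10, 13]
clean = build_history(ground_truth, error_rate=0.0, rng=rng)
noisy = build_history(ground_truth, error_rate=1.0, rng=rng)
count_errors = lambda hist: sum(h["answer"] != t for h, t in zip(hist, ground_truth))
print(count_errors(clean), count_errors(noisy))  # prints: 0 4
```

Both histories have the same length and the same questions; only the injected error rate differs, which is what lets the design attribute any accuracy drop to conditioning on past errors.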
Mathematical analysis relating step accuracy to horizon length

The authors provide a mathematical formulation (Proposition 1) showing that horizon length grows hyperbolically with step accuracy. This demonstrates how diminishing returns on single-step performance can translate into exponential gains in task length beyond certain accuracy thresholds, reconciling apparent contradictions between benchmark saturation and continued scaling benefits.

3 retrieved papers
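The relationship can be illustrated numerically under a standard independence assumption (our simplification; the exact constants and form in the paper's Proposition 1 may differ). If each step succeeds independently with probability p, a length-n task succeeds with probability p^n, so the longest task completed with probability at least s is H(p) = ln(s) / ln(p).

```python
import math

# Horizon length as a function of per-step accuracy, assuming independent
# per-step success (an illustrative simplification). With success threshold
# s, the achievable horizon is H(p) = ln(s) / ln(p); near p -> 1 this is
# approximately ln(1/s) / (1 - p), i.e. hyperbolic in the error rate.

def horizon(p: float, s: float = 0.5) -> float:
    """Longest task length completed with probability >= s at step accuracy p."""
    return math.log(s) / math.log(p)

for p in (0.90, 0.99, 0.999):
    print(p, round(horizon(p), 1))  # 6.6, then 69.0, then 692.8 steps
```

A tenfold reduction in per-step error (0.99 to 0.999) buys roughly a tenfold longer horizon, which is why small, seemingly saturating gains on single-step benchmarks can still translate into large gains in completable task length.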

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Framework for isolating and measuring long-horizon execution in LLMs

Contribution

Discovery and characterization of the self-conditioning effect

Contribution

Mathematical analysis relating step accuracy to horizon length
