R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?

ICLR 2026 Conference Submission | Anonymous Authors
Keywords: Large Reasoning Models, Long-Horizon Reasoning
Abstract:

Recent trends in test-time scaling for reasoning models (e.g., OpenAI o1, DeepSeek-R1) have led to remarkable improvements through long Chain-of-Thought (CoT). However, existing benchmarks mainly focus on immediate, single-horizon tasks, failing to adequately evaluate models’ ability to understand and respond to complex, long-horizon scenarios. To address this incomplete evaluation of Large Reasoning Models (LRMs), we propose R-HORIZON, a method designed to stimulate long-horizon reasoning behaviors in LRMs through query composition. Based on R-HORIZON, we construct a long-horizon reasoning benchmark, comprising complex multi-step reasoning tasks with interdependent problems that span long reasoning horizons. Through comprehensive evaluation of LRMs using the R-HORIZON benchmark, we find that even the most advanced LRMs suffer significant performance degradation. Our analysis reveals that LRMs exhibit limited effective reasoning length and struggle to allocate thinking budget across multiple problems appropriately. Recognizing these limitations, we use R-HORIZON to construct long-horizon reasoning data for reinforcement learning with verified rewards (RLVR). Compared to training with single-horizon data, RLVR with R-HORIZON not only substantially improves performance on the multi-horizon reasoning tasks, but also promotes accuracy on standard reasoning tasks (+7.5 on AIME2024). These results position R-HORIZON as a scalable, controllable, and low-cost paradigm for enhancing and evaluating the long-horizon reasoning capabilities of LRMs.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces R-HORIZON, a method and benchmark for evaluating long-horizon reasoning in large reasoning models through query composition. It resides in the Evaluation and Benchmarking leaf, which contains five papers total, making this a relatively sparse research direction within the broader taxonomy of fifty papers. The work addresses a gap in existing benchmarks that primarily assess single-horizon tasks, proposing instead to test models on interdependent multi-step problems that require sustained reasoning across extended horizons.

The taxonomy reveals that R-HORIZON sits adjacent to several related but distinct research directions. The Chain-of-Thought Mechanisms branch explores how models generate extended reasoning traces, while Application Domains encompasses task-specific reasoning in areas like web navigation and mathematical problem-solving. R-HORIZON bridges these areas by providing evaluation infrastructure rather than proposing new training methods or domain-specific architectures. Its sibling papers in Evaluation and Benchmarking include HeroBench and WebAgent Long Context, which focus on web-agent capabilities, and BABILong, which emphasizes memory and retrieval—suggesting R-HORIZON occupies a broader, less domain-constrained evaluation niche.

Among thirty candidates examined across three contributions, none were found to clearly refute the proposed work. For each of the three contributions (the query-composition method, the benchmark construction, and the training-data generation for reinforcement learning), ten candidates were examined and none was a refutable match. This suggests that within the limited search scope, the specific combination of query composition for long-horizon evaluation, the resulting benchmark design, and the application to verified-reward reinforcement learning appears relatively unexplored. However, the analysis explicitly covers only top-K semantic matches and does not represent an exhaustive literature review.

Based on the limited search scope of thirty semantically similar papers, R-HORIZON appears to occupy a distinct position within the sparse Evaluation and Benchmarking leaf. The absence of refutable prior work among examined candidates suggests novelty in its specific approach to long-horizon assessment, though this conclusion is constrained by the search methodology and does not preclude relevant work outside the candidate set.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: long-horizon reasoning in large reasoning models. The field organizes around four main branches that capture distinct facets of how models tackle extended reasoning problems. Chain-of-Thought Mechanisms and Training focuses on designing and learning step-by-step reasoning strategies, including works that explore how to elicit or train models to produce coherent multi-step rationales (e.g., Demystifying Long CoT[1], Step-DPO[16]). Inference Efficiency and Optimization addresses the computational challenges of generating lengthy reasoning traces, with studies like Efficient Reasoning Inference[6] examining trade-offs between depth and speed. Application Domains and Task-Specific Reasoning encompasses diverse settings, from mathematical problem-solving (Scaling Mathematical Reasoning[20], Olympiad Math Agent[15]) to web navigation (WebThinker[2], WebResearcher[45]) and embodied or temporal tasks (VideoTree[12], Temporal Reasoning Framework[36]), where domain structure shapes reasoning demands. Finally, Evaluation and Benchmarking develops metrics and testbeds to measure reasoning quality over extended horizons, including benchmarks for web agents (HeroBench[24], WebAgent Long Context[29]), memory-intensive tasks (BABILong[31]), and multimodal theory-of-mind scenarios (Multimodal Theory-of-Mind[41]).

Several active lines of work highlight contrasting priorities and open questions. One thread investigates how to scale reasoning depth without prohibitive costs, balancing longer chains against inference overhead (System 1 to 2[3], Scaling Reasoning Survey[17]). Another examines domain-specific adaptations, asking whether general-purpose reasoning transfers or requires tailored architectures and data (Large Reasoning Models[21], Multi-step Reasoning Survey[9]).

R-Horizon[0] sits squarely within the Evaluation and Benchmarking branch, proposing new ways to assess reasoning over extended horizons.
It shares thematic ground with HeroBench[24] and WebAgent Long Context[29], which also probe long-context agent capabilities, yet R-Horizon[0] emphasizes a broader evaluation framework rather than focusing exclusively on web tasks. Compared to BABILong[31], which stresses memory and retrieval, R-Horizon[0] appears more concerned with the interplay between reasoning depth and task complexity across varied domains.

Claimed Contributions

R-HORIZON method for stimulating long-horizon reasoning via query composition

The authors introduce R-HORIZON, a technique that concatenates existing single-horizon tasks into complex multi-horizon reasoning scenarios by establishing dependencies between problems. This method transforms isolated problems into sequential, interdependent tasks requiring models to solve multiple problems in order.

10 retrieved papers
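The composition step described above can be sketched in a few lines. The sketch below is illustrative only: the `compose_queries` helper, the `{X}` placeholder convention, and the `(template, solve_fn)` problem schema are assumptions for the example, not the paper's actual data format. The idea it demonstrates is the core one: each problem after the first refers to the previous answer, so the chain can only be solved in order, and the gold answers for the composed query are derived automatically.

```python
def compose_queries(problems):
    """Concatenate single-horizon problems into one multi-horizon query.

    `problems` is a list of (template, solve_fn) pairs, a hypothetical
    schema: `template` may contain a '{X}' slot standing for the previous
    answer, and `solve_fn` maps the previous answer to this problem's
    answer. Returns the composed query text and the chained gold answers.
    """
    query_parts, answers = [], []
    x = None  # running answer threaded through the chain
    for i, (template, solve) in enumerate(problems, start=1):
        text = template.replace("{X}", "the answer to the previous problem")
        query_parts.append(f"Problem {i}: {text}")
        x = solve(x)
        answers.append(x)
    return "\n".join(query_parts), answers

# Three isolated arithmetic problems become one interdependent chain.
query, gold = compose_queries([
    ("Compute 3 + 4.", lambda _: 3 + 4),
    ("Multiply {X} by 2.", lambda x: x * 2),
    ("Subtract 5 from {X}.", lambda x: x - 5),
])
# gold == [7, 14, 9]
```

Because the gold answers are computed while composing, any pool of verifiable single-horizon problems can be scaled to arbitrary horizon lengths without manual annotation, which is what makes the paradigm low-cost.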
R-HORIZON Benchmark for evaluating long-horizon reasoning capabilities

The authors construct an evaluation benchmark spanning 6 datasets across mathematics, code generation, and agent applications. This benchmark comprises multi-step reasoning tasks with interdependent problems designed to assess models' abilities in extended reasoning scenarios.

10 retrieved papers
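Scoring such interdependent items requires checking every sub-answer, not just the last one. A minimal sketch follows, assuming sub-answers appear as `\boxed{...}` spans in order and are compared by exact string match; both assumptions are illustrative conventions, not necessarily the benchmark's actual protocol.

```python
import re

def score_multi_horizon(model_output, gold_answers):
    """Score one composed item: extract one boxed answer per sub-problem
    (assumed format: \\boxed{...}, in order) and compare against gold.
    Returns (all_correct, per_problem_correct)."""
    found = re.findall(r"\\boxed\{([^}]*)\}", model_output)
    per = [i < len(found) and found[i].strip() == str(gold)
           for i, gold in enumerate(gold_answers)]
    return all(per), per

ok, per = score_multi_horizon(
    r"First, \boxed{7}. Then \boxed{14}. Finally \boxed{9}.", [7, 14, 9])
# ok is True; an output that stops early or miscomputes any step fails
```

The per-problem vector also supports the paper's kind of analysis, e.g. measuring where along the horizon accuracy collapses.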
R-HORIZON training data construction for reinforcement learning with verified rewards

The authors leverage R-HORIZON to generate training data for reinforcement learning that includes multi-horizon problems. Training with this data substantially improves performance on both multi-horizon reasoning tasks and standard reasoning benchmarks compared to training with single-horizon data.

10 retrieved papers
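The same verification logic yields a rule-based reward for RLVR. The sketch below is an assumption-laden stand-in for the paper's actual reward design: it assumes boxed answers and exact matching, and offers two plausible shapes, an all-or-nothing reward and a partial-credit fraction over sub-answers.

```python
import re

def verified_reward(model_output, gold_answers, partial=False):
    """Rule-based outcome reward for RL on composed multi-horizon problems.
    A sketch, not the paper's exact scheme: returns 1.0 only if every
    sub-answer verifies, or the verified fraction when partial=True."""
    found = re.findall(r"\\boxed\{([^}]*)\}", model_output)
    n_correct = sum(
        1 for i, gold in enumerate(gold_answers)
        if i < len(found) and found[i].strip() == str(gold)
    )
    if partial:
        return n_correct / len(gold_answers)
    return 1.0 if n_correct == len(gold_answers) else 0.0

full = verified_reward(r"\boxed{7} then \boxed{14} then \boxed{9}", [7, 14, 9])
part = verified_reward(r"\boxed{7} then \boxed{14} then \boxed{8}",
                       [7, 14, 9], partial=True)
# full == 1.0; part == 2/3
```

Either shape plugs into a standard RLVR loop in place of a single-horizon answer check; the choice between strict and partial credit affects how the model is incentivized to allocate its thinking budget across the chain.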

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
