R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?

ICLR 2026 Conference Submission | Anonymous Authors
Keywords: Large Reasoning Models, Long-Horizon Reasoning
Abstract:

Recent trends in test-time scaling for reasoning models (e.g., OpenAI o1, DeepSeek-R1) have led to remarkable improvements through long Chain-of-Thought (CoT). However, existing benchmarks mainly focus on immediate, single-horizon tasks, failing to adequately evaluate models’ ability to understand and respond to complex, long-horizon scenarios. To address this incomplete evaluation of Large Reasoning Models (LRMs), we propose R-HORIZON, a method designed to stimulate long-horizon reasoning behaviors in LRMs through query composition. Based on R-HORIZON, we construct a long-horizon reasoning benchmark, comprising complex multi-step reasoning tasks with interdependent problems that span long reasoning horizons. Through comprehensive evaluation of LRMs using the R-HORIZON benchmark, we find that even the most advanced LRMs suffer significant performance degradation. Our analysis reveals that LRMs exhibit limited effective reasoning length and struggle to allocate thinking budget across multiple problems appropriately. Recognizing these limitations, we use R-HORIZON to construct long-horizon reasoning data for reinforcement learning with verified rewards (RLVR). Compared to training with single-horizon data, RLVR with R-HORIZON not only substantially improves performance on the multi-horizon reasoning tasks, but also promotes accuracy on standard reasoning tasks (+7.5 on AIME2024). These results position R-HORIZON as a scalable, controllable, and low-cost paradigm for enhancing and evaluating the long-horizon reasoning capabilities of LRMs.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces R-HORIZON, a method and benchmark for evaluating long-horizon reasoning in large reasoning models through query composition. It resides in the Evaluation and Benchmarking leaf, which contains five papers total, making this a relatively sparse research direction within the broader taxonomy of fifty papers. The work addresses a gap in existing benchmarks that primarily assess single-horizon tasks, proposing instead to test models on interdependent multi-step problems that require sustained reasoning across extended horizons.

The taxonomy reveals that R-HORIZON sits adjacent to several related but distinct research directions. The Chain-of-Thought Mechanisms branch explores how models generate extended reasoning traces, while Application Domains encompasses task-specific reasoning in areas like web navigation and mathematical problem-solving. R-HORIZON bridges these areas by providing evaluation infrastructure rather than proposing new training methods or domain-specific architectures. Its sibling papers in Evaluation and Benchmarking include HeroBench and WebAgent Long Context, which focus on web-agent capabilities, and BABILong, which emphasizes memory and retrieval—suggesting R-HORIZON occupies a broader, less domain-constrained evaluation niche.

Among thirty candidates examined across three contributions, none were found to clearly refute the proposed work. For each of the three contributions (the query-composition method, the benchmark construction, and the training-data generation for reinforcement learning), ten candidates were examined and none was a refutable match. This suggests that within the limited search scope, the specific combination of query composition for long-horizon evaluation, the resulting benchmark design, and the application to verified-reward reinforcement learning appears relatively unexplored. However, the analysis explicitly covers only top-K semantic matches and does not represent an exhaustive literature review.

Based on the limited search scope of thirty semantically similar papers, R-HORIZON appears to occupy a distinct position within the sparse Evaluation and Benchmarking leaf. The absence of refutable prior work among examined candidates suggests novelty in its specific approach to long-horizon assessment, though this conclusion is constrained by the search methodology and does not preclude relevant work outside the candidate set.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: long-horizon reasoning in large reasoning models. The field organizes around four main branches that capture distinct facets of how models tackle extended reasoning problems. Chain-of-Thought Mechanisms and Training focuses on designing and learning step-by-step reasoning strategies, including works that explore how to elicit or train models to produce coherent multi-step rationales (e.g., Demystifying Long CoT[1], Step-DPO[16]). Inference Efficiency and Optimization addresses the computational challenges of generating lengthy reasoning traces, with studies like Efficient Reasoning Inference[6] examining trade-offs between depth and speed. Application Domains and Task-Specific Reasoning encompasses diverse settings, from mathematical problem-solving (Scaling Mathematical Reasoning[20], Olympiad Math Agent[15]) to web navigation (WebThinker[2], WebResearcher[45]) and embodied or temporal tasks (VideoTree[12], Temporal Reasoning Framework[36]), where domain structure shapes reasoning demands. Finally, Evaluation and Benchmarking develops metrics and testbeds to measure reasoning quality over extended horizons, including benchmarks for web agents (HeroBench[24], WebAgent Long Context[29]), memory-intensive tasks (BABILong[31]), and multimodal theory-of-mind scenarios (Multimodal Theory-of-Mind[41]).

Several active lines of work highlight contrasting priorities and open questions. One thread investigates how to scale reasoning depth without prohibitive costs, balancing longer chains against inference overhead (System 1 to 2[3], Scaling Reasoning Survey[17]). Another examines domain-specific adaptations, asking whether general-purpose reasoning transfers or requires tailored architectures and data (Large Reasoning Models[21], Multi-step Reasoning Survey[9]).

R-Horizon[0] sits squarely within the Evaluation and Benchmarking branch, proposing new ways to assess reasoning over extended horizons.
It shares thematic ground with HeroBench[24] and WebAgent Long Context[29], which also probe long-context agent capabilities, yet R-Horizon[0] emphasizes a broader evaluation framework rather than focusing exclusively on web tasks. Compared to BABILong[31], which stresses memory and retrieval, R-Horizon[0] appears more concerned with the interplay between reasoning depth and task complexity across varied domains.

Claimed Contributions

R-HORIZON method for stimulating long-horizon reasoning via query composition

The authors introduce R-HORIZON, a technique that concatenates existing single-horizon tasks into complex multi-horizon reasoning scenarios by establishing dependencies between problems. This method transforms isolated problems into sequential, interdependent tasks requiring models to solve multiple problems in order.

10 retrieved papers
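The composition step described above can be sketched in a few lines. The sketch below is illustrative only: the `compose_queries` helper, the `{X}` placeholder convention, and the `(template, solve_fn)` problem schema are assumptions for the example, not the paper's actual data format. The idea it demonstrates is the core one: each problem after the first refers to the previous answer, so the chain can only be solved in order, and the gold answers for the composed query are derived automatically.

```python
def compose_queries(problems):
    """Concatenate single-horizon problems into one multi-horizon query.

    `problems` is a list of (template, solve_fn) pairs, a hypothetical
    schema: `template` may contain a '{X}' slot standing for the previous
    answer, and `solve_fn` maps the previous answer to this problem's
    answer. Returns the composed query text and the chained gold answers.
    """
    query_parts, answers = [], []
    x = None  # running answer threaded through the chain
    for i, (template, solve) in enumerate(problems, start=1):
        text = template.replace("{X}", "the answer to the previous problem")
        query_parts.append(f"Problem {i}: {text}")
        x = solve(x)
        answers.append(x)
    return "\n".join(query_parts), answers

# Three isolated arithmetic problems become one interdependent chain.
query, gold = compose_queries([
    ("Compute 3 + 4.", lambda _: 3 + 4),
    ("Multiply {X} by 2.", lambda x: x * 2),
    ("Subtract 5 from {X}.", lambda x: x - 5),
])
# gold == [7, 14, 9]
```

Because the gold answers are computed while composing, any pool of verifiable single-horizon problems can be scaled to arbitrary horizon lengths without manual annotation, which is what makes the paradigm low-cost.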
R-HORIZON Benchmark for evaluating long-horizon reasoning capabilities

The authors construct an evaluation benchmark spanning 6 datasets across mathematics, code generation, and agent applications. This benchmark comprises multi-step reasoning tasks with interdependent problems designed to assess models' abilities in extended reasoning scenarios.

10 retrieved papers
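Scoring such interdependent items requires checking every sub-answer, not just the last one. A minimal sketch follows, assuming sub-answers appear as `\boxed{...}` spans in order and are compared by exact string match; both assumptions are illustrative conventions, not necessarily the benchmark's actual protocol.

```python
import re

def score_multi_horizon(model_output, gold_answers):
    """Score one composed item: extract one boxed answer per sub-problem
    (assumed format: \\boxed{...}, in order) and compare against gold.
    Returns (all_correct, per_problem_correct)."""
    found = re.findall(r"\\boxed\{([^}]*)\}", model_output)
    per = [i < len(found) and found[i].strip() == str(gold)
           for i, gold in enumerate(gold_answers)]
    return all(per), per

ok, per = score_multi_horizon(
    r"First, \boxed{7}. Then \boxed{14}. Finally \boxed{9}.", [7, 14, 9])
# ok is True; an output that stops early or miscomputes any step fails
```

The per-problem vector also supports the paper's kind of analysis, e.g. measuring where along the horizon accuracy collapses.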
R-HORIZON training data construction for reinforcement learning with verified rewards

The authors leverage R-HORIZON to generate training data for reinforcement learning that includes multi-horizon problems. Training with this data substantially improves performance on both multi-horizon reasoning tasks and standard reasoning benchmarks compared to training with single-horizon data.

10 retrieved papers
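The same verification logic yields a rule-based reward for RLVR. The sketch below is an assumption-laden stand-in for the paper's actual reward design: it assumes boxed answers and exact matching, and offers two plausible shapes, an all-or-nothing reward and a partial-credit fraction over sub-answers.

```python
import re

def verified_reward(model_output, gold_answers, partial=False):
    """Rule-based outcome reward for RL on composed multi-horizon problems.
    A sketch, not the paper's exact scheme: returns 1.0 only if every
    sub-answer verifies, or the verified fraction when partial=True."""
    found = re.findall(r"\\boxed\{([^}]*)\}", model_output)
    n_correct = sum(
        1 for i, gold in enumerate(gold_answers)
        if i < len(found) and found[i].strip() == str(gold)
    )
    if partial:
        return n_correct / len(gold_answers)
    return 1.0 if n_correct == len(gold_answers) else 0.0

full = verified_reward(r"\boxed{7} then \boxed{14} then \boxed{9}", [7, 14, 9])
part = verified_reward(r"\boxed{7} then \boxed{14} then \boxed{8}",
                       [7, 14, 9], partial=True)
# full == 1.0; part == 2/3
```

Either shape plugs into a standard RLVR loop in place of a single-horizon answer check; the choice between strict and partial credit affects how the model is incentivized to allocate its thinking budget across the chain.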

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
