R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?
Overview
Overall Novelty Assessment
The paper introduces R-HORIZON, a method and benchmark for evaluating long-horizon reasoning in large reasoning models through query composition. It resides in the Evaluation and Benchmarking leaf, which contains five papers total, making this a relatively sparse research direction within the broader taxonomy of fifty papers. The work addresses a gap in existing benchmarks that primarily assess single-horizon tasks, proposing instead to test models on interdependent multi-step problems that require sustained reasoning across extended horizons.
The taxonomy reveals that R-HORIZON sits adjacent to several related but distinct research directions. The Chain-of-Thought Mechanisms branch explores how models generate extended reasoning traces, while Application Domains encompasses task-specific reasoning in areas such as web navigation and mathematical problem-solving. R-HORIZON bridges these areas by providing evaluation infrastructure rather than proposing new training methods or domain-specific architectures. Its sibling papers in Evaluation and Benchmarking include HeroBench, which targets long-horizon planning in virtual worlds; WebAgent Long Context, which focuses on web-agent capabilities; and BABILong, which emphasizes memory and retrieval. Together, these suggest that R-HORIZON occupies a broader, less domain-constrained evaluation niche.
Among the thirty candidates examined across the three contributions, none clearly refuted the proposed work. For each contribution (the query-composition method, the benchmark construction, and the reinforcement-learning training-data generation), ten candidates were examined, with zero refutable matches. This suggests that, within the limited search scope, the specific combination of query composition for long-horizon evaluation, the resulting benchmark design, and its application to reinforcement learning with verified rewards appears relatively unexplored. However, the analysis covers only the top-K semantic matches and does not constitute an exhaustive literature review.
Based on the limited search scope of thirty semantically similar papers, R-HORIZON appears to occupy a distinct position within the sparse Evaluation and Benchmarking leaf. The absence of refutable prior work among examined candidates suggests novelty in its specific approach to long-horizon assessment, though this conclusion is constrained by the search methodology and does not preclude relevant work outside the candidate set.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce R-HORIZON, a technique that concatenates existing single-horizon tasks into complex multi-horizon reasoning scenarios by establishing dependencies between problems. This method transforms isolated problems into sequential, interdependent tasks requiring models to solve multiple problems in order.
The authors construct an evaluation benchmark spanning six datasets across mathematics, code generation, and agent applications. This benchmark comprises multi-step reasoning tasks with interdependent problems designed to assess models' abilities in extended reasoning scenarios.
The authors leverage R-HORIZON to generate training data for reinforcement learning that includes multi-horizon problems. Training with this data substantially improves performance on both multi-horizon reasoning tasks and standard reasoning benchmarks compared to training with single-horizon data.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[24] HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds
[29] Evaluating Long-Context Reasoning in LLM-Based WebAgents
[31] BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack
[41] Overcoming Multi-step Complexity in Multimodal Theory-of-Mind Reasoning: A Scalable Bayesian Planner
Contribution Analysis
Detailed comparisons for each claimed contribution
R-HORIZON method for stimulating long-horizon reasoning via query composition
The authors introduce R-HORIZON, a technique that concatenates existing single-horizon tasks into complex multi-horizon reasoning scenarios by establishing dependencies between problems. This method transforms isolated problems into sequential, interdependent tasks requiring models to solve multiple problems in order.
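The composition step described above can be illustrated with a minimal sketch. All names here (`Problem`, `compose`, the `{prev}` placeholder convention) are hypothetical illustrations of the general idea of chaining single-horizon problems so each answer feeds the next statement, not the paper's actual implementation.

```python
# Hedged sketch of query composition in the spirit of R-HORIZON:
# chain single-horizon problems so that each problem's verified answer
# becomes a variable in the next problem's statement.
from dataclasses import dataclass

@dataclass
class Problem:
    text: str    # statement; may contain a "{prev}" placeholder
    answer: int  # verified ground-truth answer

def compose(problems: list[Problem]) -> tuple[str, int]:
    """Concatenate problems into one multi-horizon query.

    Each statement after the first references the previous answer via
    the "{prev}" placeholder, so the problems must be solved in order.
    """
    parts, prev = [], None
    for i, p in enumerate(problems):
        stmt = p.text if prev is None else p.text.format(prev=prev)
        parts.append(f"Step {i + 1}: {stmt}")
        prev = p.answer  # the dependency linking step i to step i+1
    query = "\n".join(parts) + "\nReport the final answer."
    return query, prev  # gold label is the last problem's answer

chain = [
    Problem("What is 7 * 6?", 42),
    Problem("Add 8 to {prev}.", 50),
    Problem("Divide {prev} by 5.", 10),
]
query, gold = compose(chain)
```

Because the gold answer of the composed query is simply the final problem's answer, correctness remains automatically checkable, which is what makes the construction reusable for both evaluation and training.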
[69] Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
[70] Tree-of-reasoning question decomposition for complex question answering with large language models
[71] Quantifying risks in multi-turn conversation with large language models
[72] Raw text is all you need: Knowledge-intensive multi-turn instruction tuning for large language model
[73] Interactive-kbqa: Multi-turn interactions for knowledge base question answering with large language models
[74] Autoprm: Automating procedural supervision for multi-step reasoning via controllable question decomposition
[75] Prompting Is Programming: A Query Language for Large Language Models
[76] Reinforcement Fine-Tuning for Reasoning towards Multi-Step Multi-Source Search in Large Language Models
[77] Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
[78] From Query to Logic: Ontology-Driven Multi-Hop Reasoning in LLMs
R-HORIZON Benchmark for evaluating long-horizon reasoning capabilities
The authors construct an evaluation benchmark spanning six datasets across mathematics, code generation, and agent applications. This benchmark comprises multi-step reasoning tasks with interdependent problems designed to assess models' abilities in extended reasoning scenarios.
[8] Llamav-o1: Rethinking step-by-step visual reasoning in llms
[59] RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning
[61] MMMU: A Massive Multi-Discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
[62] MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback
[63] Complexity-Based Prompting for Multi-Step Reasoning
[64] A sequential matching framework for multi-turn response selection in retrieval-based chatbots
[65] Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues
[66] VerifyBench: A Systematic Benchmark for Evaluating Reasoning Verifiers Across Domains
[67] MCoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought
[68] MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark
R-HORIZON training data construction for reinforcement learning with verified rewards
The authors leverage R-HORIZON to generate training data for reinforcement learning that includes multi-horizon problems. Training with this data substantially improves performance on both multi-horizon reasoning tasks and standard reasoning benchmarks compared to training with single-horizon data.
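A verifiable reward over composed multi-horizon data could look like the following minimal sketch. It assumes, purely for illustration, that each sub-problem's answer is extracted from the completion and checked against the known answer chain; the helper names (`extract_answers`, `verified_reward`) and the dense per-step reward are assumptions, since the paper may instead score only the final answer.

```python
# Hedged sketch of a verifiable reward for composed multi-horizon
# training data: score a completion by how many of its extracted
# intermediate answers match the gold answer chain.
import re

def extract_answers(completion: str) -> list[int]:
    """Pull integers tagged as 'Answer: <n>' from a model completion."""
    return [int(m) for m in re.findall(r"Answer:\s*(-?\d+)", completion)]

def verified_reward(completion: str, gold_chain: list[int]) -> float:
    """Fraction of sub-problems whose answers match the gold chain."""
    pred = extract_answers(completion)
    hits = sum(p == g for p, g in zip(pred, gold_chain))
    return hits / len(gold_chain)

# A completion that solves the first two steps but misses the third.
completion = "Answer: 42\nAnswer: 50\nAnswer: 9"
reward = verified_reward(completion, [42, 50, 10])  # 2 of 3 correct
```

Because every sub-answer in a composed problem is known in advance, this kind of rule-based check yields a reward signal without a learned reward model, which is what makes the composed data directly usable for reinforcement learning with verified rewards.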