LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts

ICLR 2026 Conference Submission, Anonymous Authors
Long-Context Reasoning, Reinforcement Learning
Abstract:

Reasoning over long contexts is essential for large language models. While reinforcement learning (RL) enhances short-context reasoning by inducing "Aha" moments in chain-of-thought, the advanced thinking patterns required for long-context reasoning remain largely unexplored, and high-difficulty RL data are scarce. In this paper, we introduce LoongRL, a data-driven RL method for advanced long-context reasoning. Central to LoongRL is KeyChain, a synthesis approach that transforms short multi-hop QA into high-difficulty long-context tasks by inserting UUID chains that hide the true question among large collections of distracting documents. Solving these tasks requires the model to trace the correct chain step by step, identify the true question, retrieve the relevant facts, and reason over them to answer correctly. RL training on KeyChain data induces an emergent plan–retrieve–reason–recheck reasoning pattern that generalizes far beyond the training length. Models trained at 16K effectively solve 128K tasks without prohibitive full-length RL rollout costs. On Qwen2.5-7B and 14B, LoongRL substantially improves long-context multi-hop QA accuracy, with absolute gains of +23.5% and +21.1%, respectively. The resulting LoongRL-14B reaches a score of 74.2, rivaling much larger frontier models such as o3-mini (74.5) and DeepSeek-R1 (74.9). It also improves long-context retrieval, passes all 128K needle-in-a-haystack stress tests, and preserves short-context reasoning capabilities.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces LoongRL, a data-driven RL method for long-context reasoning, alongside KeyChain, a synthesis approach that transforms short multi-hop QA into high-difficulty long-context tasks using UUID chains. Within the taxonomy, this work resides in the 'RL-Based Long-Context Training' leaf under 'Long-Context Reasoning and Memory Management'. This leaf contains only three papers total, including the original work, indicating a relatively sparse research direction. The sibling papers (QwenLong Recipe and QwenLong-L1) also focus on RL-driven optimization for extended context windows, suggesting this is an emerging but not yet crowded subfield.

The taxonomy reveals that neighboring leaves address complementary challenges: 'Memory Architectures and External Memory Banks' explores external memory systems for long-horizon reasoning, while 'Context Compression and Summarization' focuses on reducing context length. The broader parent branch 'Long-Context Reasoning and Memory Management' sits alongside 'RL Training Methodologies for Reasoning Enhancement', which houses general-purpose RL frameworks and process-level supervision methods. LoongRL bridges these areas by applying RL specifically to long-context scenarios, diverging from general short-context RL methods and memory-based architectures that do not emphasize RL training.

Among 30 candidates examined, none clearly refute the three core contributions: the LoongRL method (10 candidates, 0 refutable), the KeyChain synthesis approach (10 candidates, 0 refutable), and the two-way substring exact match verifier (10 candidates, 0 refutable). This suggests that within the limited search scope, the specific combination of data-driven RL for long-context reasoning, UUID-chain-based task synthesis, and the proposed verification mechanism appears novel. However, the search scale is modest, and the analysis does not claim exhaustive coverage of all potentially relevant prior work.

Given the sparse taxonomy leaf and the absence of refuting candidates among the 30 examined, the work appears to occupy a relatively unexplored niche within long-context RL training. The limited search scope means this assessment is provisional; a broader literature review might uncover additional overlapping methods. Nonetheless, the combination of RL-driven training, synthetic task generation via UUID chains, and emergent reasoning patterns at extended lengths represents a distinctive approach within the current field structure.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: reinforcement learning for long-context reasoning in language models. The field has organized itself around several complementary branches that address different facets of this challenge. RL Training Methodologies for Reasoning Enhancement focuses on algorithmic innovations—policy gradient techniques, reward shaping, and self-play mechanisms—that enable models to learn complex reasoning behaviors, as seen in works like Spell Self-play[2] and Effective RL Reasoning[10]. Long-Context Reasoning and Memory Management tackles the architectural and data-handling side, exploring how models can maintain coherent reasoning over extended sequences through memory mechanisms and context compression strategies, exemplified by Kimi[5] and QwenLong-L1[25]. Inference-Time Computation and Scaling examines how to allocate computational resources during test time—via search, iterative refinement, or adaptive depth—to improve reasoning quality without retraining, while Application Domains and Task-Specific Adaptations and Post-Training and Model Optimization address deployment contexts and fine-tuning recipes that bridge research prototypes and production systems.

Within this landscape, a particularly active line of work centers on integrating RL directly into long-context training pipelines, balancing the need for extended memory with the sample efficiency and stability challenges of reinforcement learning. LoongRL[0] sits squarely in this cluster, emphasizing RL-based training specifically designed for long-context scenarios, closely aligned with QwenLong Recipe[49] and QwenLong-L1[25], which also explore how to scale context windows while maintaining reasoning fidelity through RL-driven optimization. In contrast, broader surveys like RL Large Reasoning Survey[1] and LLM Post-training Deep Dive[3] provide overarching perspectives on reasoning enhancement and post-training strategies, situating long-context RL as one specialized direction among many.
The central tension across these branches remains how to efficiently train models that reason deeply over long horizons without prohibitive computational costs or unstable learning dynamics, a question that LoongRL[0] addresses by focusing on RL techniques tailored to extended context lengths.

Claimed Contributions

LoongRL: data-driven RL method for advanced long-context reasoning

The authors propose LoongRL, a reinforcement learning approach that enables models to acquire effective thinking patterns for long-context reasoning tasks. The method trains models to develop emergent plan-retrieve-reason-recheck reasoning patterns that generalize beyond training length.

10 retrieved papers
KeyChain: synthesis approach for high-difficulty long-context tasks

The authors introduce KeyChain, a data synthesis method that converts short multi-hop question-answering tasks into challenging long-context problems by inserting UUID chains that hide the true question among distracting documents, requiring models to trace chains step-by-step.

10 retrieved papers
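To make the KeyChain idea concrete, the following is a minimal sketch of how such a synthesis step might look. The function name, the hop phrasing, and the chain length are our illustrative assumptions; the paper specifies only the general mechanism of hiding the true question behind a chain of UUID hops scattered among distractor documents.

```python
import random
import uuid

def synthesize_keychain(question: str, distractors: list[str], chain_len: int = 4) -> str:
    """Hypothetical KeyChain-style synthesis: hide `question` behind a chain
    of UUID hops interleaved with distracting documents."""
    keys = [str(uuid.uuid4()) for _ in range(chain_len)]
    # Each hop points to the next key; only the final hop reveals the true question.
    hops = [f"Key {keys[i]} points to key {keys[i + 1]}." for i in range(chain_len - 1)]
    hops.append(f"Key {keys[-1]} holds the true question: {question}")
    # Shuffle hops among distractors so the chain must be traced step by step.
    docs = hops + list(distractors)
    random.shuffle(docs)
    context = "\n\n".join(docs)
    return f"{context}\n\nStart from key {keys[0]} and answer the question it leads to."
```

Padding with many distractor documents is what stretches the task to arbitrary context lengths (e.g., 16K tokens at training time, 128K at evaluation) while keeping the underlying multi-hop QA pair unchanged.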
Two-way substring exact match verifier for RL training

The authors design a rule-based reward verification method that checks whether the extracted answer contains the ground truth as a substring or vice versa, enabling reliable reinforcement learning training on general question-answering tasks without requiring LLM-based judgment.

10 retrieved papers
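The two-way substring check described above can be sketched as follows. The normalization step (lowercasing, stripping punctuation, collapsing whitespace) is our assumption; the paper states only that the extracted answer and the ground truth are compared as substrings in both directions.

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace (assumed preprocessing)."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def two_way_substring_match(prediction: str, ground_truth: str) -> float:
    """Reward 1.0 if either normalized string contains the other, else 0.0."""
    pred, gold = normalize(prediction), normalize(ground_truth)
    if not pred or not gold:
        return 0.0
    return 1.0 if (gold in pred or pred in gold) else 0.0
```

Checking containment in both directions tolerates predictions that are either more verbose ("The answer is Paris.") or more terse ("Paris") than the reference, which is what makes a purely rule-based reward usable without an LLM judge.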

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
