LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts

ICLR 2026 Conference Submission, Anonymous Authors
Long-Context Reasoning, Reinforcement Learning
Abstract:

Reasoning over long contexts is essential for large language models. While reinforcement learning (RL) enhances short-context reasoning by inducing "Aha" moments in chain-of-thought, the advanced thinking patterns required for long-context reasoning remain largely unexplored, and high-difficulty RL data are scarce. In this paper, we introduce LoongRL, a data-driven RL method for advanced long-context reasoning. Central to LoongRL is KeyChain, a synthesis approach that transforms short multi-hop QA into high-difficulty long-context tasks by inserting UUID chains that hide the true question among large collections of distracting documents. Solving these tasks requires the model to trace the correct chain step by step, identify the true question, retrieve the relevant facts, and reason over them to answer correctly. RL training on KeyChain data induces an emergent plan–retrieve–reason–recheck reasoning pattern that generalizes far beyond the training length. Models trained at 16K effectively solve 128K tasks without prohibitive full-length RL rollout costs. On Qwen2.5-7B and 14B, LoongRL substantially improves long-context multi-hop QA accuracy, with absolute gains of +23.5% and +21.1%, respectively. The resulting LoongRL-14B reaches a score of 74.2, rivaling much larger frontier models such as o3-mini (74.5) and DeepSeek-R1 (74.9). It also improves long-context retrieval, passes all 128K needle-in-a-haystack stress tests, and preserves short-context reasoning capabilities.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces LoongRL, a data-driven RL method for long-context reasoning, alongside KeyChain, a synthesis approach that transforms short multi-hop QA into high-difficulty long-context tasks using UUID chains. Within the taxonomy, this work resides in the 'RL-Based Long-Context Training' leaf under 'Long-Context Reasoning and Memory Management'. This leaf contains only three papers total, including the original work, indicating a relatively sparse research direction. The sibling papers (QwenLong Recipe and QwenLong-L1) also focus on RL-driven optimization for extended context windows, suggesting this is an emerging but not yet crowded subfield.

The taxonomy reveals that neighboring leaves address complementary challenges: 'Memory Architectures and External Memory Banks' explores external memory systems for long-horizon reasoning, while 'Context Compression and Summarization' focuses on reducing context length. The broader parent branch 'Long-Context Reasoning and Memory Management' sits alongside 'RL Training Methodologies for Reasoning Enhancement', which houses general-purpose RL frameworks and process-level supervision methods. LoongRL bridges these areas by applying RL specifically to long-context scenarios, diverging from general short-context RL methods and memory-based architectures that do not emphasize RL training.

Among 30 candidates examined, none clearly refute the three core contributions: the LoongRL method (10 candidates, 0 refutable), the KeyChain synthesis approach (10 candidates, 0 refutable), and the two-way substring exact match verifier (10 candidates, 0 refutable). This suggests that within the limited search scope, the specific combination of data-driven RL for long-context reasoning, UUID-chain-based task synthesis, and the proposed verification mechanism appears novel. However, the search scale is modest, and the analysis does not claim exhaustive coverage of all potentially relevant prior work.

Given the sparse taxonomy leaf and the absence of refuting candidates among the 30 examined, the work appears to occupy a relatively unexplored niche within long-context RL training. The limited search scope means this assessment is provisional; a broader literature review might uncover additional overlapping methods. Nonetheless, the combination of RL-driven training, synthetic task generation via UUID chains, and emergent reasoning patterns at extended lengths represents a distinctive approach within the current field structure.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: reinforcement learning for long-context reasoning in language models. The field has organized itself around several complementary branches that address different facets of this challenge. RL Training Methodologies for Reasoning Enhancement focuses on algorithmic innovations—policy gradient techniques, reward shaping, and self-play mechanisms—that enable models to learn complex reasoning behaviors, as seen in works like Spell Self-play[2] and Effective RL Reasoning[10]. Long-Context Reasoning and Memory Management tackles the architectural and data-handling side, exploring how models can maintain coherent reasoning over extended sequences through memory mechanisms and context compression strategies, exemplified by Kimi[5] and QwenLong-L1[25]. Inference-Time Computation and Scaling examines how to allocate computational resources during test time—via search, iterative refinement, or adaptive depth—to improve reasoning quality without retraining, while Application Domains and Task-Specific Adaptations and Post-Training and Model Optimization address deployment contexts and fine-tuning recipes that bridge research prototypes and production systems.

Within this landscape, a particularly active line of work centers on integrating RL directly into long-context training pipelines, balancing the need for extended memory with the sample efficiency and stability challenges of reinforcement learning. LoongRL[0] sits squarely in this cluster, emphasizing RL-based training specifically designed for long-context scenarios, closely aligned with QwenLong Recipe[49] and QwenLong-L1[25], which also explore how to scale context windows while maintaining reasoning fidelity through RL-driven optimization. In contrast, broader surveys like RL Large Reasoning Survey[1] and LLM Post-training Deep Dive[3] provide overarching perspectives on reasoning enhancement and post-training strategies, situating long-context RL as one specialized direction among many.
The central tension across these branches remains how to efficiently train models that reason deeply over long horizons without prohibitive computational costs or unstable learning dynamics, a question that LoongRL[0] addresses by focusing on RL techniques tailored to extended context lengths.

Claimed Contributions

LoongRL: data-driven RL method for advanced long-context reasoning

The authors propose LoongRL, a reinforcement learning approach that enables models to acquire effective thinking patterns for long-context reasoning tasks. The method trains models to develop emergent plan-retrieve-reason-recheck reasoning patterns that generalize beyond training length.

10 retrieved papers
KeyChain: synthesis approach for high-difficulty long-context tasks

The authors introduce KeyChain, a data synthesis method that converts short multi-hop question-answering tasks into challenging long-context problems by inserting UUID chains that hide the true question among distracting documents, requiring models to trace chains step-by-step.

10 retrieved papers
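To make the KeyChain idea concrete, the following is a minimal sketch of how such a synthesis step might look. The function name, the hop phrasing, and the chain length are our illustrative assumptions; the paper specifies only the general mechanism of hiding the true question behind a chain of UUID hops scattered among distractor documents.

```python
import random
import uuid

def synthesize_keychain(question: str, distractors: list[str], chain_len: int = 4) -> str:
    """Hypothetical KeyChain-style synthesis: hide `question` behind a chain
    of UUID hops interleaved with distracting documents."""
    keys = [str(uuid.uuid4()) for _ in range(chain_len)]
    # Each hop points to the next key; only the final hop reveals the true question.
    hops = [f"Key {keys[i]} points to key {keys[i + 1]}." for i in range(chain_len - 1)]
    hops.append(f"Key {keys[-1]} holds the true question: {question}")
    # Shuffle hops among distractors so the chain must be traced step by step.
    docs = hops + list(distractors)
    random.shuffle(docs)
    context = "\n\n".join(docs)
    return f"{context}\n\nStart from key {keys[0]} and answer the question it leads to."
```

Padding with many distractor documents is what stretches the task to arbitrary context lengths (e.g., 16K tokens at training time, 128K at evaluation) while keeping the underlying multi-hop QA pair unchanged.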
Two-way substring exact match verifier for RL training

The authors design a rule-based reward verification method that checks whether the extracted answer contains the ground truth as a substring or vice versa, enabling reliable reinforcement learning training on general question-answering tasks without requiring LLM-based judgment.

10 retrieved papers
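The two-way substring check described above can be sketched as follows. The normalization step (lowercasing, stripping punctuation, collapsing whitespace) is our assumption; the paper states only that the extracted answer and the ground truth are compared as substrings in both directions.

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace (assumed preprocessing)."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def two_way_substring_match(prediction: str, ground_truth: str) -> float:
    """Reward 1.0 if either normalized string contains the other, else 0.0."""
    pred, gold = normalize(prediction), normalize(ground_truth)
    if not pred or not gold:
        return 0.0
    return 1.0 if (gold in pred or pred in gold) else 0.0
```

Checking containment in both directions tolerates predictions that are either more verbose ("The answer is Paris.") or more terse ("Paris") than the reference, which is what makes a purely rule-based reward usable without an LLM judge.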

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
