SPELL: Self-Play Reinforcement Learning for evolving Long-Context Language Models

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Self-Play, Reinforcement Learning, Long-Context Reasoning, Large Language Models
Abstract:

Progress in long-context reasoning for large language models (LLMs) has lagged behind other recent advances. This gap arises not only from the intrinsic difficulty of processing long texts, but also from the scarcity of reliable human annotations and programmatically verifiable reward signals. In this paper, we propose SPELL, a multi-role self-play reinforcement learning framework that enables scalable, label-free optimization for long-context reasoning. SPELL integrates three cyclical roles—questioner, responder, and verifier—within a single model to enable continual self-improvement. The questioner generates questions from raw documents paired with reference answers; the responder learns to solve these questions based on the documents; and the verifier evaluates semantic equivalence between the responder’s output and the questioner's reference answer, producing reward signals to guide continual training. To stabilize training, we introduce an automated curriculum that gradually increases document length and a reward function that adapts question difficulty to the model’s evolving capabilities. Extensive experiments on six long-context benchmarks show that SPELL consistently improves performance across diverse LLMs and outperforms equally sized models fine-tuned on large-scale annotated data. Notably, SPELL achieves an average 7.6-point gain in pass@8 on the strong reasoning model Qwen3-30B-A3B-Thinking, raising its performance ceiling and showing promise for scaling to even more capable models.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces SPELL, a multi-role self-play reinforcement learning framework designed to improve long-context reasoning in large language models without human annotations. Within the taxonomy, it resides in the 'Deep Learning and Representation Learning' leaf under 'Computational Methods and Machine Learning'. This leaf contains only three papers total, including the original work, indicating a relatively sparse research direction focused on novel training paradigms and architectural innovations rather than a crowded subfield with extensive prior work.

The taxonomy reveals that SPELL's parent branch, 'Computational Methods and Machine Learning', encompasses diverse neighboring areas including computer vision surveys, task analysis and system design, and text analysis methods. While sibling papers like Complement Training and Hyperbolic Deep Learning explore augmented training regimes and non-Euclidean geometries respectively, SPELL diverges by addressing self-play reinforcement learning for long-context tasks. The taxonomy's scope notes clarify that this leaf excludes computer vision surveys and task-specific applications, positioning SPELL within a methodological innovation space rather than domain-specific problem-solving.

Among the three contributions analyzed against 25 candidate papers, the core SPELL framework was compared with 5 candidates and no clear refutations were found, suggesting relative novelty of its multi-role self-play approach. However, the automated-curriculum contribution was compared with 10 candidates, of which 1 was judged refutable, and the self-consistency-based verifier training was likewise compared with 10 candidates, also with 1 refutable match. This indicates that while the overall framework may be novel, individual components such as curriculum learning and self-consistency-based verification overlap more substantially with prior work within the limited search scope examined.

Based on the limited literature search of 25 candidates, SPELL appears to offer a moderately novel contribution, particularly in its integrated multi-role framework, though individual technical components show some overlap with existing methods. The sparse taxonomy leaf and low refutation rate for the core framework suggest meaningful differentiation from prior work, though the analysis does not cover exhaustive literature beyond top-K semantic matches and citation expansion.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Papers: 2

Research Landscape Overview

Core task: The paper addresses an unspecified research problem within the broader landscape of computational methods and machine learning. The taxonomy organizes research into several major branches, including Research Problem Identification and Formulation, Research Methodology and Paradigms, Applied Research Studies with Defined Objectives, Computational Methods and Machine Learning, Standardization and Guidelines, Specialized Domain Applications, and Unspecified or Generic Study Descriptions. Within Computational Methods and Machine Learning, a substantial body of work focuses on Deep Learning and Representation Learning, where researchers develop novel architectures, training strategies, and theoretical foundations. This branch contrasts with Applied Research Studies, which emphasize domain-specific objectives such as healthcare optimization or precision agriculture, and with Research Problem Identification, which centers on formulating research questions and methodological frameworks. Representative works like Study Objectives[1] and Main Objectives[2] illustrate how researchers articulate goals, while Theory Problem[3] and Research Methodology Guide[20] provide foundational perspectives on problem formulation. A particularly active line of work explores the intersection of deep learning techniques with representation learning challenges, where trade-offs between model expressiveness, computational efficiency, and generalization remain central.

SPELL[0] situates itself within this Deep Learning and Representation Learning cluster, sharing methodological concerns with nearby efforts such as Complement Training[24] and Hyperbolic Deep Learning[36]. While Complement Training[24] emphasizes augmenting standard training regimes and Hyperbolic Deep Learning[36] explores non-Euclidean geometries for hierarchical data, SPELL[0] appears to address representation learning from a distinct angle, potentially focusing on novel learning paradigms or architectural innovations. This positioning reflects ongoing debates about how to balance theoretical rigor with practical applicability, a theme that resonates across many studies in this branch of computational research.

Claimed Contributions

SPELL: Multi-role self-play RL framework for long-context reasoning

The authors introduce SPELL, a framework where a single language model alternates among three roles (questioner, responder, and verifier) to autonomously generate questions from documents, solve them, and evaluate solutions. This enables continual self-improvement in long-context reasoning without requiring human annotations or programmatically verifiable rewards.

5 retrieved papers
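The three-role loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the stub model and the method names (`generate_question`, `answer`, `judge`) are hypothetical stand-ins for role-specific prompting of a single policy.

```python
class ToyModel:
    """Stub standing in for one LLM that plays all three roles;
    SPELL itself prompts a single policy with role-specific templates."""
    def generate_question(self, document):
        # questioner: pose a question paired with a reference answer
        return "What is the first word?", document.split()[0]
    def answer(self, document, question):
        # responder: attempt to solve the question from the document
        return document.split()[0]
    def judge(self, question, answer, reference):
        # verifier: semantic-equivalence check (here, exact match)
        return answer == reference

def spell_step(model, document, n_rollouts=8):
    """One illustrative SPELL iteration on a raw document."""
    question, reference = model.generate_question(document)
    answers = [model.answer(document, question) for _ in range(n_rollouts)]
    judgments = [model.judge(question, a, reference) for a in answers]
    # the pass rate over rollouts can feed a difficulty-adaptive reward
    return question, sum(judgments) / n_rollouts
```

The key design point is that all three roles share one set of weights, so improvements in answering, questioning, and judging compound across training iterations.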
Automated curriculum with adaptive difficulty control

The framework incorporates a history memory mechanism that progressively increases context length and a Gaussian-shaped reward function that calibrates question difficulty around the responder's competence frontier. This ensures questions remain neither too easy nor impossibly difficult, maintaining optimal learning efficiency throughout training.

10 retrieved papers
Can Refute
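A Gaussian-shaped difficulty reward of this kind can be sketched as below, assuming the questioner is rewarded as a function of the responder's empirical pass rate on its question; the target pass rate and width are illustrative values, not taken from the paper.

```python
import math

def difficulty_reward(pass_rate, target=0.5, sigma=0.2):
    """Reward for the questioner, peaking when the responder's pass
    rate sits near the target difficulty and decaying for questions
    that are too easy (pass_rate -> 1) or too hard (pass_rate -> 0).
    `target` and `sigma` are assumed hyperparameters."""
    return math.exp(-((pass_rate - target) ** 2) / (2 * sigma ** 2))
```

Under this shaping, the questioner is pushed toward the responder's competence frontier: questions the responder always or never solves earn little reward.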
Verifier trained via self-consistency for stable reward signals

The authors develop a verification mechanism where the verifier learns to produce reliable semantic equivalence judgments through majority voting and self-consistency training on rule-verifiable tasks. This overcomes the brittleness of string matching and provides stable reward signals for non-verifiable outputs in long-context reasoning.

10 retrieved papers
Can Refute
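The majority-voting aggregation described above can be sketched as follows, assuming binary "yes"/"no" equivalence judgments from independent verifier samples; the tie-breaking rule and reward values are illustrative assumptions rather than the paper's specification.

```python
from collections import Counter

def majority_verdict(judgments):
    """Aggregate N independent verifier judgments ('yes'/'no' semantic
    equivalence calls) by majority vote; ties count as 'no'."""
    counts = Counter(judgments)
    return counts["yes"] > counts["no"]

def verifier_reward(judgments):
    """Binary reward for the responder derived from the vote."""
    return 1.0 if majority_verdict(judgments) else 0.0
```

Sampling the verifier several times and voting smooths out individual judgment noise, which is what makes the resulting reward signal more stable than brittle string matching on free-form answers.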

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: SPELL: Multi-role self-play RL framework for long-context reasoning

Contribution 2: Automated curriculum with adaptive difficulty control

Contribution 3: Verifier trained via self-consistency for stable reward signals

Descriptions for each contribution are given under Claimed Contributions above.
