SPELL: Self-Play Reinforcement Learning for evolving Long-Context Language Models
Overview
Overall Novelty Assessment
The paper introduces SPELL, a multi-role self-play reinforcement learning framework designed to improve long-context reasoning in large language models without human annotations. Within the taxonomy, it resides in the 'Deep Learning and Representation Learning' leaf under 'Computational Methods and Machine Learning'. This leaf contains only three papers total, including the original work, indicating a relatively sparse research direction focused on novel training paradigms and architectural innovations rather than a crowded subfield with extensive prior work.
The taxonomy shows that SPELL's parent branch, 'Computational Methods and Machine Learning', encompasses diverse neighboring areas including computer vision surveys, task analysis and system design, and text analysis methods. While sibling papers such as Complement Objective Training and Hyperbolic Deep Learning explore augmented training objectives and non-Euclidean geometries respectively, SPELL diverges by addressing self-play reinforcement learning for long-context tasks. The taxonomy's scope notes clarify that this leaf excludes computer vision surveys and task-specific applications, positioning SPELL within a methodological-innovation space rather than domain-specific problem solving.
Among the three contributions analyzed against 25 candidate papers, the core SPELL framework was compared with 5 candidates and drew no clear refutations, suggesting relative novelty in its multi-role self-play approach. However, the automated-curriculum contribution was compared with 10 candidates and found 1 refutable match, and the verifier trained via self-consistency was likewise compared with 10 candidates, also with 1 refutable match. While the overall framework thus appears novel, individual components such as curriculum learning and self-consistency-based verification overlap more substantially with prior work, at least within the limited search scope examined.
Based on the limited literature search of 25 candidates, SPELL appears to offer a moderately novel contribution, particularly in its integrated multi-role framework, though individual technical components show some overlap with existing methods. The sparse taxonomy leaf and the absence of refutations for the core framework suggest meaningful differentiation from prior work; note, however, that the analysis does not extend beyond top-K semantic matches and citation expansion, so it is not exhaustive.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce SPELL, a framework where a single language model alternates among three roles (questioner, responder, and verifier) to autonomously generate questions from documents, solve them, and evaluate solutions. This enables continual self-improvement in long-context reasoning without requiring human annotations or programmatically verifiable rewards.
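The three-role alternation described above can be sketched as a toy loop; all class and method names here are hypothetical illustrations, not the paper's actual API:

```python
class ToyModel:
    """Stand-in for a single LLM that changes behavior by role.
    In SPELL one set of weights is prompted with role-specific
    instructions; this toy just branches on the role string."""
    def act(self, role, context):
        if role == "questioner":
            return f"What claim does the document support? [{len(context)} chars]"
        if role == "responder":
            document, _question = context
            return document[:40]  # trivial placeholder "answer"
        if role == "verifier":
            return 1.0  # this toy accepts every answer
        raise ValueError(f"unknown role: {role}")

def self_play_step(model, document, n_rollouts=4):
    """One schematic self-play iteration: the same model generates a
    question from the document, samples several candidate answers,
    and scores each answer to produce reward signals."""
    question = model.act("questioner", document)
    answers = [model.act("responder", (document, question))
               for _ in range(n_rollouts)]
    rewards = [model.act("verifier", (question, a)) for a in answers]
    return question, answers, rewards
```

In the actual framework the rewards would feed a reinforcement-learning update for all three roles; the toy above only traces the data flow.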
The framework incorporates a history memory mechanism that progressively increases context length and a Gaussian-shaped reward function that calibrates question difficulty around the responder's competence frontier. This keeps questions neither trivially easy nor impossibly difficult, preserving learning efficiency throughout training.
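A Gaussian-shaped difficulty reward of this kind can be sketched in a few lines; the parameter names and the target/width values below are assumptions for illustration, not the paper's actual settings:

```python
import math

def questioner_reward(success_rate, target=0.5, width=0.25):
    """Reward the questioner most when the responder solves the
    question about half the time (success_rate is the fraction of
    responder rollouts judged correct, in [0, 1])."""
    return math.exp(-((success_rate - target) ** 2) / (2 * width ** 2))
```

A question solved in roughly half the rollouts earns the maximum reward of 1.0, while trivially easy (success_rate near 1) or hopeless (near 0) questions are penalized symmetrically, steering generation toward the responder's competence frontier.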
The authors develop a verification mechanism where the verifier learns to produce reliable semantic equivalence judgments through majority voting and self-consistency training on rule-verifiable tasks. This overcomes the brittleness of string matching and provides stable reward signals for non-verifiable outputs in long-context reasoning.
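The majority-vote aggregation can be sketched as follows; the label strings and the tie-breaking behavior are illustrative assumptions, not the paper's exact protocol:

```python
from collections import Counter

def vote_reward(judgments, positive="equivalent"):
    """Aggregate k independent verifier judgments on whether a
    candidate answer is semantically equivalent to the reference.
    Returns a binary reward plus the vote share of the winning label
    (a rough self-consistency signal)."""
    if not judgments:
        raise ValueError("need at least one judgment")
    label, count = Counter(judgments).most_common(1)[0]
    reward = 1.0 if label == positive else 0.0
    return reward, count / len(judgments)
```

On rule-verifiable tasks the majority label can be checked against the rule-based answer, giving the verifier itself a training signal; that is the self-consistency idea summarized above.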
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[24] Complement objective training
[36] Hyperbolic Deep Learning in Computer Vision: A Survey
Contribution Analysis
Detailed comparisons for each claimed contribution
SPELL: Multi-role self-play RL framework for long-context reasoning
The authors introduce SPELL, a framework where a single language model alternates among three roles (questioner, responder, and verifier) to autonomously generate questions from documents, solve them, and evaluate solutions. This enables continual self-improvement in long-context reasoning without requiring human annotations or programmatically verifiable rewards.
[71] Language Model Self-improvement by Reinforcement Learning Contemplation
[72] Self-playing Adversarial Language Game Enhances LLM Reasoning
[73] Large language model-based data science agent: A survey
[74] MARSHAL: Incentivizing Multi-Agent Reasoning via Self-Play with Strategic LLMs
[75] Large Reasoning Models: A Survey of Techniques, Applications, and Future Challenges in Structured AI Reasoning
Automated curriculum with adaptive difficulty control
The framework incorporates a history memory mechanism that progressively increases context length and a Gaussian-shaped reward function that calibrates question difficulty around the responder's competence frontier. This keeps questions neither trivially easy nor impossibly difficult, preserving learning efficiency throughout training.
[70] Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning
[61] Self-Adapting Language Models
[62] Fisher information-based efficient curriculum federated learning with large language models
[63] Writing-RL: Advancing Long-form Writing via Adaptive Curriculum Reinforcement Learning
[64] Automatic curriculum expert iteration for reliable LLM reasoning
[65] Learning Like Humans: Advancing LLM Reasoning Capabilities via Adaptive Difficulty Curriculum Learning and Expert-Guided Self-Reformulation
[66] Your pretrained model tells the difficulty itself: A self-adaptive curriculum learning paradigm for natural language understanding
[67] EvoCurr: Self-evolving Curriculum with Behavior Code Generation for Complex Decision-making
[68] Review and arrange: Curriculum learning for natural language understanding
[69] GHPO: Adaptive guidance for stable and efficient LLM reinforcement learning
Verifier trained via self-consistency for stable reward signals
The authors develop a verification mechanism where the verifier learns to produce reliable semantic equivalence judgments through majority voting and self-consistency training on rule-verifiable tasks. This overcomes the brittleness of string matching and provides stable reward signals for non-verifiable outputs in long-context reasoning.