R-Zero: Self-Evolving Reasoning LLM from Zero Data
Overview
Overall Novelty Assessment
The paper introduces R-Zero, a framework for autonomous self-evolving reasoning in LLMs through co-evolutionary curriculum generation. It resides in the 'Dual-Role Co-Evolution for Reasoning' leaf of the taxonomy, which contains only three papers in total: R-Zero itself and two sibling works (Socratic-zero and Agent0). This leaf sits within the broader 'Co-Evolutionary Multi-Agent Self-Improvement' branch, indicating the paper occupies a relatively sparse but conceptually well-defined research direction focused on adversarial or complementary agent dynamics for reasoning improvement.
The taxonomy tree reveals that R-Zero's immediate neighbors explore similar dual-role dynamics: Socratic-zero emphasizes Socratic questioning for reasoning refinement, while Agent0 investigates zero-shot generalization through self-play. Adjacent leaves include 'Self-Play and Corpus-Grounded Curriculum Generation' (2 papers) and 'Morphology-Environment Co-Evolution' (1 paper), both addressing curriculum emergence through interaction but in different contexts. The broader 'Curriculum-Based Reinforcement Learning for LLM Reasoning' branch (3 leaves, 6 papers) tackles similar goals via structured task progression, though without the explicit co-evolutionary agent architecture that defines R-Zero's category.
Among the 23 candidates examined across three contributions, two contributions show potential overlap with prior work. The core 'R-Zero framework' contribution was checked against 10 candidates, 1 of which appears to refute its novelty, suggesting some precedent for zero-data self-evolution exists within the limited search scope. The 'co-evolutionary training mechanism' contribution was checked against 9 candidates with 1 refuting match, indicating the Challenger-Solver architecture may have conceptual predecessors. The 'uncertainty-based reward function' contribution was checked against 4 candidates with no refutations, and appears more novel within the sampled literature. These statistics reflect a targeted semantic search, not exhaustive coverage.
Based on the limited search scope of 23 candidates, R-Zero appears to make incremental architectural contributions within an emerging but not yet crowded research direction. The dual-role co-evolution concept has precedent among its sibling papers, though the specific implementation details and reward mechanisms may offer differentiation. The analysis captures semantic neighbors and taxonomy-defined relatives but does not claim comprehensive coverage of all relevant prior work in autonomous curriculum generation or self-evolving LLMs.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose R-Zero, a framework that enables large language models to self-evolve without any pre-existing tasks or human labels. Starting from a single base LLM, it initializes two independent models (Challenger and Solver) that co-evolve through interaction to create a self-improving curriculum.
The framework employs a dual-role mechanism where the Challenger generates questions targeted at the Solver's capability edge using uncertainty-based rewards, while the Solver is trained on these challenging questions using pseudo-labels from majority voting. Both models are optimized separately but co-evolve through their interaction.
The authors design a reward function that guides the Challenger to generate questions where the Solver exhibits maximum uncertainty (around 50% empirical accuracy). This uncertainty is measured through self-consistency of multiple sampled responses, enabling automatic difficulty calibration without external verification.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[4] Socratic-zero: Bootstrapping reasoning via data-free agent co-evolution PDF
[6] Agent0: Unleashing self-evolving agents from zero data via tool-integrated reasoning PDF
Contribution Analysis
Detailed comparisons for each claimed contribution
R-Zero framework for self-evolving reasoning LLMs from zero data
The authors propose R-Zero, a framework that enables large language models to self-evolve without any pre-existing tasks or human labels. Starting from a single base LLM, it initializes two independent models (Challenger and Solver) that co-evolve through interaction to create a self-improving curriculum.
[33] Large language models can self-improve PDF
[31] Training language models to self-correct via reinforcement learning PDF
[32] Toolformer: Language models can teach themselves to use tools PDF
[34] Pre-trained language models in biomedical domain: A systematic survey PDF
[35] Self-Rewarding Language Models PDF
[36] Large language models for automated open-domain scientific hypotheses discovery PDF
[37] Self-Instruct: Aligning Language Models with Self-Generated Instructions PDF
[38] DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines PDF
[39] Principle-driven self-alignment of language models from scratch with minimal human supervision PDF
[40] Enhancing large vision language models with self-training on image comprehension PDF
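The self-evolving loop claimed here can be illustrated with a toy numerical simulation. Everything below (the scalar "skill", the linear accuracy model, all function names, and the fixed skill increment standing in for Solver training) is our own simplification for illustration, not the paper's implementation:

```python
import random

def run_toy_co_evolution(rounds=5, seed=0):
    """Toy sketch of the Challenger/Solver loop: the Challenger picks
    question difficulties that land near 50% Solver accuracy, and the
    Solver's capability rises after training on those questions."""
    rng = random.Random(seed)
    skill = 0.3  # Solver's initial capability (our invention)

    def empirical_accuracy(difficulty, n=200):
        # Estimate accuracy from n simulated Solver attempts: success
        # probability falls linearly as difficulty exceeds skill.
        p = max(0.0, min(1.0, 1.0 - (difficulty - skill)))
        return sum(rng.random() < p for _ in range(n)) / n

    history = []
    for _ in range(rounds):
        # Challenger: choose the difficulty maximizing the uncertainty
        # reward r(d) = 1 - 2 * |acc(d) - 0.5| over a candidate grid.
        candidates = [i / 20 for i in range(21)]
        d = max(candidates,
                key=lambda c: 1 - 2 * abs(empirical_accuracy(c) - 0.5))
        # Solver: a fixed skill bump stands in for RL training on
        # majority-vote pseudo-labels.
        skill += 0.1
        history.append((d, round(skill, 2)))
    return history
```

In this toy setting the chosen difficulty tracks the Solver's capability edge (roughly `skill + 0.5` under the linear model), mirroring the paper's claim that the curriculum stays pinned to the frontier as the Solver improves.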
Co-evolutionary training mechanism with Challenger and Solver roles
The framework employs a dual-role mechanism where the Challenger generates questions targeted at the Solver's capability edge using uncertainty-based rewards, while the Solver is trained on these challenging questions using pseudo-labels from majority voting. Both models are optimized separately but co-evolve through their interaction.
[4] Socratic-zero: Bootstrapping reasoning via data-free agent co-evolution PDF
[23] Question Answering and Question Generation as Dual Tasks PDF
[24] Joint learning of question answering and question generation PDF
[25] Retrieval Augmented Visual Question Answering with Outside Knowledge PDF
[26] Agent AI for Finance: From Financial Argument Mining to Agent-Based Modeling PDF
[27] Learning to collaborate for question answering and asking PDF
[28] MuGER: Multi-Granularity Evidence Retrieval and Reasoning for Hybrid Question Answering PDF
[29] Joint generation and bi-encoder for situated interactive multimodal conversations PDF
[30] CoCQA: Co-training over questions and answers with an application to predicting question subjectivity orientation PDF
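The majority-voting pseudo-label step described for this contribution can be sketched in a few lines. The function name, signature, and the optional agreement threshold are our assumptions, not the paper's code:

```python
from collections import Counter

def majority_vote_pseudo_label(sampled_answers, min_agreement=0.0):
    """Derive a training label for a question from the Solver's own
    sampled answers: the majority answer serves as the pseudo-label.
    Returns (pseudo_label, agreement_rate); callers may discard questions
    whose agreement falls below `min_agreement` (our assumed filter)."""
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    answer, count = Counter(sampled_answers).most_common(1)[0]
    agreement = count / len(sampled_answers)
    if agreement < min_agreement:
        return None, agreement
    return answer, agreement
```

For example, `majority_vote_pseudo_label(["42", "42", "41", "42"])` labels the question "42" with 0.75 agreement, with no external verifier involved.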
Uncertainty-based reward function for curriculum generation
The authors design a reward function that guides the Challenger to generate questions where the Solver exhibits maximum uncertainty (around 50% empirical accuracy). This uncertainty is measured through self-consistency of multiple sampled responses, enabling automatic difficulty calibration without external verification.
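One natural reading of this reward can be sketched as follows; the linear shaping `r = 1 - 2 * |p_hat - 0.5|` is our assumption (the paper's exact functional form may differ), with `p_hat` estimated as the fraction of sampled answers agreeing with the majority-vote (self-consistency) answer:

```python
from collections import Counter

def uncertainty_reward(sampled_answers):
    """Hedged sketch of the uncertainty-based Challenger reward.
    p_hat approximates the Solver's empirical accuracy via self-consistency:
    the share of sampled answers matching the majority-vote answer.
    The reward peaks at p_hat = 0.5 (maximum Solver uncertainty) and
    falls linearly to 0 at full consensus."""
    _, majority_count = Counter(sampled_answers).most_common(1)[0]
    p_hat = majority_count / len(sampled_answers)
    return 1.0 - 2.0 * abs(p_hat - 0.5)
```

A question on which the Solver splits 50/50 earns the Challenger the maximum reward of 1.0, while a question the Solver answers unanimously (too easy or trivially consistent) earns 0.0, pushing generated questions toward the Solver's capability edge.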