R-Zero: Self-Evolving Reasoning LLM from Zero Data

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: large language model · reinforcement learning · self-evolving · reasoning
Abstract:

Self-evolving Large Language Models (LLMs) offer a scalable path toward super-intelligence by autonomously generating, refining, and learning from their own experiences. However, existing methods for training such models still rely heavily on vast human-curated tasks and labels, typically via fine-tuning or reinforcement learning, which poses a fundamental bottleneck to advancing AI systems toward capabilities beyond human intelligence. To overcome this limitation, we introduce R-Zero, a fully autonomous framework that generates its own training data from scratch. Starting from a single base LLM, R-Zero initializes two independent models with distinct roles, a Challenger and a Solver. These models are optimized separately and co-evolve through interaction: the Challenger is rewarded for proposing tasks near the edge of the Solver's capability, and the Solver is rewarded for solving increasingly challenging tasks posed by the Challenger. This process yields a targeted, self-improving curriculum without any pre-existing tasks or labels. Empirically, R-Zero substantially improves reasoning capability across different backbone LLMs, e.g., boosting Qwen3-4B-Base by +6.49 points on math-reasoning benchmarks and +7.54 points on general-domain reasoning benchmarks.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces R-Zero, a framework for autonomous self-evolving reasoning in LLMs through co-evolutionary curriculum generation. It resides in the 'Dual-Role Co-Evolution for Reasoning' leaf of the taxonomy, which contains only three papers total, including R-Zero itself and two sibling works (Socratic-zero and Agent0). This leaf sits within the broader 'Co-Evolutionary Multi-Agent Self-Improvement' branch, indicating the paper occupies a relatively sparse but conceptually well-defined research direction focused on adversarial or complementary agent dynamics for reasoning improvement.

The taxonomy tree reveals that R-Zero's immediate neighbors explore similar dual-role dynamics: Socratic-zero emphasizes Socratic questioning for reasoning refinement, while Agent0 investigates zero-shot generalization through self-play. Adjacent leaves include 'Self-Play and Corpus-Grounded Curriculum Generation' (2 papers) and 'Morphology-Environment Co-Evolution' (1 paper), both addressing curriculum emergence through interaction but in different contexts. The broader 'Curriculum-Based Reinforcement Learning for LLM Reasoning' branch (3 leaves, 6 papers) tackles similar goals via structured task progression, though without the explicit co-evolutionary agent architecture that defines R-Zero's category.

Among the 23 candidates examined across the three claimed contributions, two contributions show potential overlap with prior work. For the core 'R-Zero framework' contribution, 10 candidates were examined and 1 was found that can refute the claim, suggesting some precedent for zero-data self-evolution exists within the limited search scope. For the 'co-evolutionary training mechanism' contribution, 9 candidates were examined with 1 refuting match, indicating the Challenger-Solver architecture may have conceptual predecessors. For the 'uncertainty-based reward function' contribution, 4 candidates were examined with no refutations, so it appears more novel within the sampled literature. These statistics reflect a targeted semantic search, not exhaustive coverage.

Based on the limited search scope of 23 candidates, R-Zero appears to make incremental architectural contributions within an emerging but not yet crowded research direction. The dual-role co-evolution concept has precedent among its sibling papers, though the specific implementation details and reward mechanisms may offer differentiation. The analysis captures semantic neighbors and taxonomy-defined relatives but does not claim comprehensive coverage of all relevant prior work in autonomous curriculum generation or self-evolving LLMs.

Taxonomy

Core-task Taxonomy Papers: 22
Claimed Contributions: 3
Contribution Candidate Papers Compared: 23
Refutable Papers: 2

Research Landscape Overview

Core task: autonomous self-evolving reasoning through co-evolutionary curriculum generation. The field explores how agents can improve their reasoning capabilities by dynamically generating training curricula that adapt alongside the learner itself. The taxonomy reveals several major branches: Co-Evolutionary Multi-Agent Self-Improvement focuses on systems where multiple roles (e.g., problem generators and solvers) evolve together, exemplified by works like Socratic-zero[4] and Agent0[6]. Curriculum-Based Reinforcement Learning for LLM Reasoning emphasizes structured progression in training language models for complex reasoning tasks, with methods such as WebRL[1] and MCTS Self-Improvement[2]. Guided and Controlled Self-Evolution addresses how external signals or constraints can steer autonomous learning, as seen in Guided Self-Evolving[5] and C2-Evo[7]. Other branches cover curriculum design for planning (Self-Evolving Curriculum[3]), skill abstraction (PolySkill[20]), symbolic reasoning architectures (Recursive Logic Analysis[21]), distributed systems (Morphology-Environment Co-Evolution[13]), and theoretical perspectives (RL in AI[18], Adaptive Intelligence[12]).

A particularly active line of work centers on dual-role co-evolution, where one agent generates challenges while another solves them, creating a feedback loop that drives continuous improvement. R-Zero[0] sits squarely within this cluster, sharing conceptual ground with Socratic-zero[4] and Agent0[6], all of which leverage adversarial or cooperative dynamics between generator and solver roles. While Socratic-zero[4] emphasizes Socratic questioning to refine reasoning, and Agent0[6] explores zero-shot generalization through self-play, R-Zero[0] appears to integrate curriculum generation more tightly with the co-evolutionary process itself.

Contrasting approaches like Guided Self-Evolving[5] introduce external guidance to prevent runaway complexity, whereas R-Zero[0] and its neighbors rely more heavily on emergent curriculum structures. Open questions remain around balancing exploration versus exploitation in curriculum design, ensuring diversity in generated tasks, and scaling these methods to broader reasoning domains beyond their initial testbeds.

Claimed Contributions

R-Zero framework for self-evolving reasoning LLMs from zero data

The authors propose R-Zero, a framework that enables large language models to self-evolve without any pre-existing tasks or human labels. Starting from a single base LLM, it initializes two independent models (Challenger and Solver) that co-evolve through interaction to create a self-improving curriculum.

10 retrieved papers — Can Refute
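To make the described loop concrete, the alternation between the two roles can be sketched as a runnable toy: a stand-in Challenger that proposes small arithmetic questions and a noisy stand-in Solver whose sampled answers are pseudo-labeled by majority vote. All names and dynamics here (`challenger_propose`, `solver_sample`, the error rate) are illustrative assumptions, not the authors' implementation.

```python
import random
from collections import Counter

def challenger_propose(rng, n=4):
    """Stand-in Challenger: propose n addition questions (pairs of ints)."""
    return [(rng.randint(0, 9), rng.randint(0, 9)) for _ in range(n)]

def solver_sample(rng, question, k=8, error=0.3):
    """Stand-in Solver: sample k answers; each is off by one
    with probability `error` to mimic an imperfect model."""
    a, b = question
    return [a + b + (0 if rng.random() > error else 1) for _ in range(k)]

def r_zero_round(rng):
    """One co-evolution round: the Challenger proposes questions, the
    Solver's sampled answers are reduced to majority-vote pseudo-labels,
    and the resulting (question, label) pairs would form the Solver's
    training batch for this round."""
    batch = []
    for q in challenger_propose(rng):
        answers = solver_sample(rng, q)
        label, _ = Counter(answers).most_common(1)[0]
        batch.append((q, label))
    return batch

rng = random.Random(0)
batch = r_zero_round(rng)
```

In the actual framework both roles would be LLMs updated by reinforcement learning between rounds; this skeleton only shows the data flow that requires no pre-existing tasks or labels.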
Co-evolutionary training mechanism with Challenger and Solver roles

The framework employs a dual-role mechanism where the Challenger generates questions targeted at the Solver's capability edge using uncertainty-based rewards, while the Solver is trained on these challenging questions using pseudo-labels from majority voting. Both models are optimized separately but co-evolve through their interaction.

9 retrieved papers — Can Refute
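The majority-voting step this mechanism relies on can be sketched in a few lines. The function name and its tie-breaking behavior are illustrative assumptions, not the paper's exact procedure:

```python
from collections import Counter

def majority_vote_pseudo_label(samples):
    """Derive a pseudo-label for one question from multiple sampled
    Solver answers: the most frequent answer becomes the training
    label, and its vote share serves as a confidence estimate.

    Ties are broken by first occurrence (Counter.most_common order);
    a production system might instead discard tied questions.
    """
    counts = Counter(samples)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(samples)

# e.g., 8 sampled answers to one generated question
label, confidence = majority_vote_pseudo_label(
    ["12", "12", "12", "15", "12", "12", "15", "12"]
)
```

The vote share doubles as the self-consistency signal that the uncertainty-based reward (described below) is built on.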
Uncertainty-based reward function for curriculum generation

The authors design a reward function that guides the Challenger to generate questions where the Solver exhibits maximum uncertainty (around 50% empirical accuracy). This uncertainty is measured through self-consistency of multiple sampled responses, enabling automatic difficulty calibration without external verification.

4 retrieved papers — No Refutation
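One way to realize the described behavior (reward peaking where the Solver's empirical accuracy is near 50%) is a tent-shaped function of the majority answer's vote share. This exact functional form is an assumption for illustration, not necessarily the paper's formula:

```python
from collections import Counter

def uncertainty_reward(sampled_answers):
    """Reward for a Challenger-generated question, computed from the
    Solver's sampled answers to it.

    p_hat is the empirical frequency of the majority answer, used as a
    self-consistency proxy for accuracy (no external verifier). The
    reward is maximal (1.0) when p_hat = 0.5 (maximum disagreement)
    and zero when the Solver answers unanimously, steering the
    Challenger toward the Solver's capability edge.
    """
    counts = Counter(sampled_answers)
    _, majority_votes = counts.most_common(1)[0]
    p_hat = majority_votes / len(sampled_answers)
    return 1.0 - 2.0 * abs(p_hat - 0.5)

easy = uncertainty_reward(["4"] * 8)              # unanimous: reward 0.0
edge = uncertainty_reward(["4"] * 4 + ["5"] * 4)  # 50/50 split: reward 1.0
```

Questions the Solver finds trivially easy (or answers randomly but consistently) earn the Challenger nothing, while maximally ambiguous questions earn the most, which is the calibration behavior the contribution describes.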

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

R-Zero framework for self-evolving reasoning LLMs from zero data


Contribution

Co-evolutionary training mechanism with Challenger and Solver roles


Contribution

Uncertainty-based reward function for curriculum generation
