Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: LLM Reasoning; Reinforcement Learning; Self-Evolving
Abstract:

Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a key paradigm for post-training Large Language Models (LLMs), particularly for complex reasoning tasks. However, vanilla RLVR training has been shown to improve Pass@1 performance at the expense of policy entropy, leading to reduced generation diversity and limiting the Pass@k performance, which typically represents the upper bound of LLM reasoning capability. In this paper, we systematically analyze the policy's generation diversity from the perspective of training problems and find that augmenting and updating training problems helps mitigate entropy collapse during training. Based on these observations, we propose an online Self-play with Variational problem Synthesis (SvS) strategy for RLVR training, which uses the policy's correct solutions to synthesize variational problems while ensuring their reference answers remain identical to the originals. This self-improving strategy effectively maintains policy entropy during training and substantially improves Pass@k compared with standard RLVR, sustaining prolonged improvements and achieving absolute gains of 18.3% and 22.8% in Pass@32 performance on the competition-level AIME24 and AIME25 benchmarks. Experiments on 12 reasoning benchmarks across varying model sizes from 3B to 32B consistently demonstrate the generalizability and robustness of SvS.
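The Pass@k metric referenced throughout (e.g., the Pass@32 gains on AIME24/25) is conventionally computed with the standard unbiased estimator from the HumanEval evaluation literature. A minimal sketch follows; the function name is ours, not from the paper:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of
    k samples, drawn without replacement from n generations of which
    c are correct, solves the problem."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 4 correct out of 32 rollouts, as in a Pass@32 evaluation:
print(pass_at_k(32, 4, 1))   # 0.125
print(pass_at_k(32, 4, 32))  # 1.0
```

Averaging this quantity over all benchmark problems yields the reported Pass@k score; at k = 1 it reduces to the fraction of correct rollouts.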

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a Self-play with Variational problem Synthesis (SvS) strategy to maintain policy entropy during RLVR training by synthesizing variational problems from correct solutions. It resides in the 'Problem Synthesis and Augmentation' leaf under 'Data Curation and Sample Selection Strategies', which contains only one other sibling paper. This leaf represents a relatively sparse research direction within the broader taxonomy of 28 papers across multiple diversity-preserving approaches, suggesting the specific focus on problem synthesis for entropy preservation is less crowded than alternative strategies like divergence design or exploration mechanisms.

The taxonomy reveals neighboring leaves addressing diversity through different mechanisms: 'Offline Data Selection and Filtering' curates existing data, 'Online Rollout Selection' strategically samples during training, and 'Zero-Variance Prompt Exploitation' extracts feedback from uniform-reward prompts. The parent branch 'Data Curation and Sample Selection Strategies' excludes exploration mechanisms and objective modifications, which are handled by sibling branches 'Exploration Strategy and Policy Dynamics' and 'Diversity-Preserving Training Objectives'. This structural separation indicates the paper's data-level intervention approach diverges from training-objective or exploration-bonus methods prevalent in other branches.

Among 30 candidates examined, the core SvS strategy (Contribution 1) shows no clear refutation across 10 examined papers. However, Contributions 2 and 3—concerning entropy preservation and generalizability—each face 2 refutable candidates among 10 examined. The limited search scope means these statistics reflect top-K semantic matches rather than exhaustive coverage. The single sibling paper in the same leaf suggests prior work on problem synthesis for RLVR diversity is sparse, though the refutable candidates for performance claims indicate overlapping empirical findings may exist in the broader 30-paper search space.

Based on the limited 30-candidate search, the SvS mechanism appears relatively novel within its specific leaf, though the performance benefits overlap with some examined work. The taxonomy structure shows problem synthesis is one of several parallel strategies for diversity preservation, and the sparse population of this leaf suggests less prior exploration compared to divergence-based or exploration-focused approaches. The analysis does not cover exhaustive literature beyond top-K semantic retrieval and citation expansion.

Taxonomy

Core-task Taxonomy Papers: 28
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 4

Research Landscape Overview

Core task: maintaining generation diversity in reinforcement learning with verifiable rewards. The field addresses a fundamental tension in RL-based language model training: how to improve performance on tasks with automatic verifiers (such as code generation or mathematical reasoning) without collapsing the policy into a narrow set of high-reward solutions.

The taxonomy reveals several complementary research directions. Diversity-Preserving Training Objectives and Divergence Design explores regularization schemes and divergence measures (e.g., Divergence Choice[2], Alpha Divergence Preference[25]) that explicitly penalize mode collapse. Exploration Strategy and Policy Dynamics investigates how to encourage broader search through exploration bonuses or ensemble methods (e.g., Negatively Correlated Ensemble[6], Diversity Incentivized Exploration[12]). Data Curation and Sample Selection Strategies focuses on curating or synthesizing training data to maintain coverage of the problem space, while Sample Polarity and Policy Update Mechanisms examines how to balance positive and negative examples during updates (Sample Polarity[3]). Alternative Reward Paradigms and Policy Optimization considers non-standard reward formulations (Optimal Reward Baseline[8], Token Hidden Reward[15]), and Scaling and Prolonged Training Studies investigates whether diversity issues persist or resolve under extended training (Prolonged Training[14], RLHF Data Scaling[4]). Domain-Specific Applications and Extensions applies these ideas to specialized settings such as web interaction or visual environments.

A particularly active line of work centers on data curation and problem synthesis, where researchers generate or augment training problems to prevent overfitting to a narrow distribution. Variational Problem Synthesis[0] sits squarely in this branch, proposing a variational approach to synthesize diverse problem instances that maintain a verifiable reward structure. This contrasts with neighboring efforts like SHARP[9], which also addresses problem augmentation but may emphasize different synthesis mechanisms or diversity metrics. Meanwhile, works such as Diversity Quality Joint[5] and Diversity Enhanced Reasoning[7] explore how to jointly optimize for both solution quality and diversity, raising the question of whether explicit diversity objectives can be integrated into the reward signal itself or must remain separate regularizers. The interplay between data-level interventions (as in Variational Problem Synthesis[0]) and training-level divergence penalties (Divergence Choice[2]) remains open, with some studies suggesting that combining both strategies yields the most robust diversity preservation across prolonged training regimes.

Claimed Contributions

Self-play with Variational problem Synthesis (SvS) strategy for RLVR training

SvS is an online training strategy in which the policy model synthesizes variational problems from its own correct solutions to underperforming training problems. These synthetic problems preserve the original reference answers while varying structure and description, enabling self-improvement without external guidance or distillation.

10 retrieved papers
Preservation of policy entropy and improvement in Pass@k performance

The SvS framework maintains stable policy entropy during RLVR training through online data augmentation, preventing entropy collapse. This leads to sustained improvements in Pass@k performance, with substantial gains on competition-level benchmarks such as AIME.

10 retrieved papers
Can Refute
Generalizability across model sizes and benchmarks

The authors validate SvS through extensive experiments showing consistent improvements across model scales from 3B to 32B parameters and 12 reasoning benchmarks, demonstrating that the approach generalizes across settings.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Self-play with Variational problem Synthesis (SvS) strategy for RLVR training

SvS is an online training strategy in which the policy model synthesizes variational problems from its own correct solutions to underperforming training problems. These synthetic problems preserve the original reference answers while varying structure and description, enabling self-improvement without external guidance or distillation.
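As a rough illustration of this loop, the sketch below augments underperforming problems with policy-written variants that keep the original reference answer. All names are hypothetical: `policy_generate` and `verify` stand in for the policy model and the answer verifier, and the prompts are placeholders, none taken from the paper.

```python
import random

def svs_step(policy_generate, verify, problems, k=4):
    """One SvS-style iteration (illustrative sketch, not the paper's code):
    for problems the policy only sometimes solves, seed a variational
    problem from a correct solution so the reference answer is preserved."""
    augmented = []
    for prob in problems:
        # sample k rollouts and keep the verifiably correct ones
        rollouts = [policy_generate(f"Solve: {prob['question']}") for _ in range(k)]
        correct = [s for s in rollouts if verify(s, prob['answer'])]
        # underperforming problem: some, but not all, rollouts succeed
        if correct and len(correct) < k:
            seed = random.choice(correct)
            variant = policy_generate(
                f"Rewrite this solution into a new problem with the same answer:\n{seed}")
            augmented.append({'question': variant, 'answer': prob['answer']})
    return problems + augmented
```

Because the variant is synthesized from a verified-correct solution, its reference answer is known for free, which is what keeps the synthetic problems verifiable without external labeling.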

Contribution

Preservation of policy entropy and improvement in Pass@k performance

The SvS framework maintains stable policy entropy during RLVR training through online data augmentation, preventing entropy collapse. This leads to sustained improvements in Pass@k performance, with substantial gains on competition-level benchmarks such as AIME.
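The policy entropy tracked in such claims is, at the token level, the Shannon entropy of the next-token distribution, typically averaged over a rollout; entropy collapse means this average falls toward zero as the policy becomes near-deterministic. A small illustration (ours, not the paper's code):

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_rollout_entropy(step_distributions):
    """Per-token entropy averaged over a generation: the quantity
    that collapses when RLVR sharpens the policy onto few modes."""
    return sum(token_entropy(d) for d in step_distributions) / len(step_distributions)

# a sharpened, near-deterministic policy has low entropy:
print(token_entropy([0.97, 0.01, 0.01, 0.01]))  # ~0.17 nats
print(token_entropy([0.25] * 4))                # uniform: ln 4 ~ 1.386 nats
```

Low average entropy implies rollouts concentrate on a few solution paths, which is why Pass@k (which rewards diverse correct samples) suffers even when Pass@1 improves.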

Contribution

Generalizability across model sizes and benchmarks

The authors validate SvS through extensive experiments showing consistent improvements across model scales from 3B to 32B parameters and 12 reasoning benchmarks, demonstrating that the approach generalizes across settings.