Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: LLM Reasoning; Reinforcement Learning; Self-Evolving
Abstract:

Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a key paradigm for post-training Large Language Models (LLMs), particularly for complex reasoning tasks. However, vanilla RLVR training has been shown to improve Pass@1 performance at the expense of policy entropy, leading to reduced generation diversity and limiting the Pass@k performance, which typically represents the upper bound of LLM reasoning capability. In this paper, we systematically analyze the policy's generation diversity from the perspective of training problems and find that augmenting and updating training problems helps mitigate entropy collapse during training. Based on these observations, we propose an online Self-play with Variational problem Synthesis (SvS) strategy for RLVR training, which uses the policy's correct solutions to synthesize variational problems while ensuring their reference answers remain identical to the originals. This self-improving strategy effectively maintains policy entropy during training and substantially improves Pass@k compared with standard RLVR, sustaining prolonged improvements and achieving absolute gains of 18.3% and 22.8% in Pass@32 performance on the competition-level AIME24 and AIME25 benchmarks. Experiments on 12 reasoning benchmarks across varying model sizes from 3B to 32B consistently demonstrate the generalizability and robustness of SvS.
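The Pass@k metric referenced throughout (e.g., the Pass@32 gains on AIME24/25) is conventionally computed with the standard unbiased estimator from the HumanEval evaluation literature. A minimal sketch follows; the function name is ours, not from the paper:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of
    k samples, drawn without replacement from n generations of which
    c are correct, solves the problem."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 4 correct out of 32 rollouts, as in a Pass@32 evaluation:
print(pass_at_k(32, 4, 1))   # 0.125
print(pass_at_k(32, 4, 32))  # 1.0
```

Averaging this quantity over all benchmark problems yields the reported Pass@k score; at k = 1 it reduces to the fraction of correct rollouts.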

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a Self-play with Variational problem Synthesis (SvS) strategy to maintain policy entropy during RLVR training by synthesizing variational problems from correct solutions. It resides in the 'Problem Synthesis and Augmentation' leaf under 'Data Curation and Sample Selection Strategies', which contains only one other sibling paper. This leaf represents a relatively sparse research direction within the broader taxonomy of 28 papers across multiple diversity-preserving approaches, suggesting the specific focus on problem synthesis for entropy preservation is less crowded than alternative strategies like divergence design or exploration mechanisms.

The taxonomy reveals neighboring leaves addressing diversity through different mechanisms: 'Offline Data Selection and Filtering' curates existing data, 'Online Rollout Selection' strategically samples during training, and 'Zero-Variance Prompt Exploitation' extracts feedback from uniform-reward prompts. The parent branch 'Data Curation and Sample Selection Strategies' excludes exploration mechanisms and objective modifications, which are handled by sibling branches 'Exploration Strategy and Policy Dynamics' and 'Diversity-Preserving Training Objectives'. This structural separation indicates the paper's data-level intervention approach diverges from training-objective or exploration-bonus methods prevalent in other branches.

Among 30 candidates examined, the core SvS strategy (Contribution 1) shows no clear refutation across 10 examined papers. However, Contributions 2 and 3—concerning entropy preservation and generalizability—each face 2 refutable candidates among 10 examined. The limited search scope means these statistics reflect top-K semantic matches rather than exhaustive coverage. The single sibling paper in the same leaf suggests prior work on problem synthesis for RLVR diversity is sparse, though the refutable candidates for performance claims indicate overlapping empirical findings may exist in the broader 30-paper search space.

Based on the limited 30-candidate search, the SvS mechanism appears relatively novel within its specific leaf, though the performance benefits overlap with some examined work. The taxonomy structure shows problem synthesis is one of several parallel strategies for diversity preservation, and the sparse population of this leaf suggests less prior exploration compared to divergence-based or exploration-focused approaches. The analysis does not cover exhaustive literature beyond top-K semantic retrieval and citation expansion.

Taxonomy

Core-task Taxonomy Papers: 28
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 4

Research Landscape Overview

Core task: maintaining generation diversity in reinforcement learning with verifiable rewards. The field addresses a fundamental tension in RL-based language model training: how to improve performance on tasks with automatic verifiers (such as code generation or mathematical reasoning) without collapsing the policy into a narrow set of high-reward solutions.

The taxonomy reveals several complementary research directions. Diversity-Preserving Training Objectives and Divergence Design explores regularization schemes and divergence measures (e.g., Divergence Choice[2], Alpha Divergence Preference[25]) that explicitly penalize mode collapse. Exploration Strategy and Policy Dynamics investigates how to encourage broader search through exploration bonuses or ensemble methods (e.g., Negatively Correlated Ensemble[6], Diversity Incentivized Exploration[12]). Data Curation and Sample Selection Strategies focuses on curating or synthesizing training data to maintain coverage of the problem space, while Sample Polarity and Policy Update Mechanisms examines how to balance positive and negative examples during updates (Sample Polarity[3]). Alternative Reward Paradigms and Policy Optimization considers non-standard reward formulations (Optimal Reward Baseline[8], Token Hidden Reward[15]), and Scaling and Prolonged Training Studies investigates whether diversity issues persist or resolve under extended training (Prolonged Training[14], RLHF Data Scaling[4]). Domain-Specific Applications and Extensions applies these ideas to specialized settings such as web interaction or visual environments.

A particularly active line of work centers on data curation and problem synthesis, where researchers generate or augment training problems to prevent overfitting to a narrow distribution. Variational Problem Synthesis[0] sits squarely in this branch, proposing a variational approach to synthesize diverse problem instances that maintain a verifiable reward structure. This contrasts with neighboring efforts like SHARP[9], which also addresses problem augmentation but may emphasize different synthesis mechanisms or diversity metrics. Meanwhile, works such as Diversity Quality Joint[5] and Diversity Enhanced Reasoning[7] explore how to jointly optimize for both solution quality and diversity, raising the question of whether explicit diversity objectives can be integrated into the reward signal itself or must remain separate regularizers. The interplay between data-level interventions (as in Variational Problem Synthesis[0]) and training-level divergence penalties (Divergence Choice[2]) remains open, with some studies suggesting that combining both strategies yields the most robust diversity preservation across prolonged training regimes.

Claimed Contributions

Self-play with Variational problem Synthesis (SvS) strategy for RLVR training

SvS is an online training strategy in which the policy model synthesizes variational problems from its own correct solutions to underperforming training problems. These synthetic problems preserve the original reference answers while varying structure and description, enabling self-improvement without external guidance or distillation.

10 retrieved papers
Preservation of policy entropy and improvement in Pass@k performance

The SvS framework maintains stable policy entropy during RLVR training through online data augmentation, preventing entropy collapse. This leads to sustained improvements in Pass@k performance, with substantial gains on competition-level benchmarks such as AIME.

10 retrieved papers
Can Refute
Generalizability across model sizes and benchmarks

The authors validate SvS through extensive experiments showing consistent improvements across model scales from 3B to 32B parameters and 12 reasoning benchmarks, demonstrating that the approach generalizes across settings.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Self-play with Variational problem Synthesis (SvS) strategy for RLVR training

SvS is an online training strategy in which the policy model synthesizes variational problems from its own correct solutions to underperforming training problems. These synthetic problems preserve the original reference answers while varying structure and description, enabling self-improvement without external guidance or distillation.
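As a rough illustration of this loop, the sketch below augments underperforming problems with policy-written variants that keep the original reference answer. All names are hypothetical: `policy_generate` and `verify` stand in for the policy model and the answer verifier, and the prompts are placeholders, none taken from the paper.

```python
import random

def svs_step(policy_generate, verify, problems, k=4):
    """One SvS-style iteration (illustrative sketch, not the paper's code):
    for problems the policy only sometimes solves, seed a variational
    problem from a correct solution so the reference answer is preserved."""
    augmented = []
    for prob in problems:
        # sample k rollouts and keep the verifiably correct ones
        rollouts = [policy_generate(f"Solve: {prob['question']}") for _ in range(k)]
        correct = [s for s in rollouts if verify(s, prob['answer'])]
        # underperforming problem: some, but not all, rollouts succeed
        if correct and len(correct) < k:
            seed = random.choice(correct)
            variant = policy_generate(
                f"Rewrite this solution into a new problem with the same answer:\n{seed}")
            augmented.append({'question': variant, 'answer': prob['answer']})
    return problems + augmented
```

Because the variant is synthesized from a verified-correct solution, its reference answer is known for free, which is what keeps the synthetic problems verifiable without external labeling.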

Contribution

Preservation of policy entropy and improvement in Pass@k performance

The SvS framework maintains stable policy entropy during RLVR training through online data augmentation, preventing entropy collapse. This leads to sustained improvements in Pass@k performance, with substantial gains on competition-level benchmarks such as AIME.
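The policy entropy tracked in such claims is, at the token level, the Shannon entropy of the next-token distribution, typically averaged over a rollout; entropy collapse means this average falls toward zero as the policy becomes near-deterministic. A small illustration (ours, not the paper's code):

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_rollout_entropy(step_distributions):
    """Per-token entropy averaged over a generation: the quantity
    that collapses when RLVR sharpens the policy onto few modes."""
    return sum(token_entropy(d) for d in step_distributions) / len(step_distributions)

# a sharpened, near-deterministic policy has low entropy:
print(token_entropy([0.97, 0.01, 0.01, 0.01]))  # ~0.17 nats
print(token_entropy([0.25] * 4))                # uniform: ln 4 ~ 1.386 nats
```

Low average entropy implies rollouts concentrate on a few solution paths, which is why Pass@k (which rewards diverse correct samples) suffers even when Pass@1 improves.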

Contribution

Generalizability across model sizes and benchmarks

The authors validate SvS through extensive experiments showing consistent improvements across model scales from 3B to 32B parameters and 12 reasoning benchmarks, demonstrating that the approach generalizes across settings.