ExGRPO: Learning to Reason from Prior Successes

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Reinforcement Learning, Large Reasoning Model, Reinforcement Learning with Verifiable Rewards
Abstract:

Reinforcement learning from verifiable rewards (RLVR) is an emerging paradigm for improving the reasoning ability of large language models. However, standard on-policy training discards rollout experiences after a single update, leading to computational inefficiency and instability. While prior work on RL has highlighted the benefits of reusing past experience, the role of experience characteristics in shaping learning dynamics of large reasoning models remains underexplored. In this paper, we are the first to investigate what makes a reasoning experience valuable and identify rollout correctness and entropy as effective indicators of experience value. Based on these insights, we propose ExGRPO (Experiential Group Relative Policy Optimization), a framework that organizes and prioritizes valuable experiences, and employs a mixed-policy objective to balance exploration with experience exploitation. Experiments on five backbone models (1.5B-8B parameters) show that ExGRPO consistently improves reasoning performance on mathematical/general benchmarks, with an average gain of +3.5/7.6 points over on-policy RLVR. Moreover, ExGRPO stabilizes training on both stronger and weaker models where on-policy methods fail. These results highlight principled experience management as a key ingredient for efficient and scalable RLVR.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes ExGRPO, a framework for managing and replaying reasoning experiences in RLVR training, identifying rollout correctness and entropy as indicators of experience value. According to the taxonomy, this work sits in the 'Experience Management and Replay Strategies' leaf under 'Core RLVR Algorithms and Training Dynamics'. Notably, this leaf contains only one paper (the original submission itself), indicating a relatively sparse research direction within the broader RLVR landscape. The taxonomy shows 41 total papers across the field, with most concentrated in verification design, domain applications, and policy optimization methods.

The taxonomy reveals that neighboring leaves focus on policy optimization theory (3 papers) and exploration strategies (2 papers), suggesting that experience management has received less direct attention than algorithmic foundations or diversity mechanisms. The scope note for this leaf explicitly excludes diversity-focused exploration, positioning ExGRPO as complementary to works like Diversity Exploration that incentivize broad sampling. The broader 'Core RLVR Algorithms' branch contains multiple active directions, but experience replay specifically appears underexplored compared to verification design (5 sub-categories) and domain applications (4 sub-categories).

Among 30 candidates examined, the contribution-level analysis shows mixed novelty signals. The identification of valuable experience characteristics (Contribution 1) examined 10 candidates with 3 appearing to provide overlapping prior work. The ExGRPO framework itself (Contribution 2) and performance improvements (Contribution 3) each examined 10 candidates with 1 refutable match apiece. These statistics suggest that while some aspects of experience characterization have precedent in the limited search scope, the integrated framework and empirical validation may offer incremental advances. The relatively low refutation counts should be interpreted cautiously given the 30-candidate search scale.

Based on the limited literature search, the work appears to address a gap in experience management for RLVR, though the search scope (30 candidates from semantic matching) cannot confirm exhaustive novelty. The taxonomy structure suggests this direction is less crowded than verification design or domain applications, but the contribution-level statistics indicate that key ideas around experience value and replay have some precedent among examined candidates. A more comprehensive search would be needed to definitively assess originality.

Taxonomy

Core-task Taxonomy Papers: 41
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 5

Research Landscape Overview

Core task: reinforcement learning from verifiable rewards for language model reasoning. The field has organized itself around several major branches that reflect different facets of this challenge. Core RLVR Algorithms and Training Dynamics focuses on the mechanics of policy optimization and experience management, exploring how to efficiently leverage verifiable feedback signals. Reasoning Capability Analysis and Evaluation examines what models learn and how well they generalize, while Verification and Reward Design addresses the construction of reliable reward signals across domains. Domain-Specific Applications and Extensions targets particular problem settings such as mathematical theorem proving and code generation, whereas Beyond Verifiable Domains and Multi-Objective and Multi-Domain Training consider settings where clean verification is unavailable or multiple objectives must be balanced. Training Challenges and Mitigation Strategies tackles practical issues like reward hacking and distribution shift, and Safety and Alignment ensures that capability gains do not compromise model safety. Surveys and Methodological Reviews provide broader perspectives, with works like RL LLM Survey[8] and Verification Design Survey[12] synthesizing key themes.

Within this landscape, a comparatively underexplored line of work concerns experience management and replay strategies, where ExGRPO[0] sits. This branch investigates how to make the most of collected trajectories during training, contrasting with approaches that emphasize success amplification such as GRPO Success Amplification[3] or those that focus on diversity and exploration like Diversity Exploration[10]. ExGRPO[0] emphasizes efficient reuse of experience through replay mechanisms, addressing the sample-efficiency challenges that arise when verifiable rewards are expensive to obtain. Nearby works such as Rewarding Progress[6] and Long Chain Thought[5] explore complementary themes around credit assignment and extended reasoning chains, while RL Incentivize Reasoning[1] and Verifiable Rewards Reasoning[2] provide broader algorithmic perspectives on how reinforcement learning can be tailored to reasoning tasks. The positioning of ExGRPO[0] reflects an ongoing tension between maximizing data efficiency and maintaining exploration breadth, a trade-off that remains central to scaling RLVR methods.

Claimed Contributions

Identification of valuable reasoning experience characteristics

The authors systematically analyze reasoning experiences in RLVR and identify two key properties that determine experience value: rollout correctness for questions (with medium-difficulty questions being most valuable) and trajectory entropy (with lower entropy indicating better reasoning quality). This analysis provides empirical guidelines for experience selection in reinforcement learning for reasoning models.
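As a concrete illustration of these two indicators, the sketch below scores one question's rollout group by its closeness to 50% correctness and by the entropy of its correct trajectories. The functional form, thresholds, and combination rule are illustrative choices of this report, not the authors' exact criteria.

```python
def experience_value(rollouts):
    """Score one question's rollout group by the two indicators above.

    `rollouts` holds (is_correct, mean_token_entropy) pairs for one
    question. The scoring form is a hypothetical sketch, not the
    paper's reported selection rule.
    """
    accuracy = sum(1 for correct, _ in rollouts if correct) / len(rollouts)
    # Rollout correctness: medium-difficulty questions (accuracy near 0.5)
    # are reported to be most valuable, so this term peaks at 50%.
    difficulty_score = 1.0 - 2.0 * abs(accuracy - 0.5)

    correct_entropies = [h for correct, h in rollouts if correct]
    if not correct_entropies:
        return 0.0  # no correct trajectory to replay
    # Trajectory entropy: lower entropy is taken to indicate a more
    # confident, higher-quality reasoning trace.
    entropy_score = 1.0 / (1.0 + min(correct_entropies))
    return difficulty_score * entropy_score
```

Under this toy scoring, a question the model solves in half of its rollouts with a confident trace outranks one it always solves, which contributes no learning signal.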

10 retrieved papers
Can Refute
ExGRPO framework for experience management and replay

The authors introduce ExGRPO, a novel framework that maintains a replay buffer of reasoning trajectories, organizes them into buckets by correctness level, and uses a sampling strategy that prioritizes beneficial experiences, selecting the lowest-entropy trajectories within each bucket. The framework combines on-policy exploration with strategic experience replay through a mixed-policy optimization objective.
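A minimal sketch of such a buffer, assuming equal-width correctness buckets and a greedy sampling rule that visits medium-correctness buckets first and takes lowest-entropy trajectories within each (both assumptions of this report, not the paper's reported design):

```python
from collections import defaultdict

class ExperienceBuffer:
    """Illustrative replay buffer for ExGRPO-style experience management.

    Trajectories are bucketed by their question's rollout correctness;
    sampling walks buckets nearest 50% correctness first and, within a
    bucket, takes the lowest-entropy trajectories. Bucket count and the
    sampling rule are assumptions, not the paper's hyperparameters.
    """

    def __init__(self, n_buckets=4):
        self.n_buckets = n_buckets
        self.buckets = defaultdict(list)  # bucket index -> [(entropy, trajectory)]

    def add(self, accuracy, entropy, trajectory):
        bucket = min(int(accuracy * self.n_buckets), self.n_buckets - 1)
        self.buckets[bucket].append((entropy, trajectory))

    def sample(self, k):
        # Visit buckets whose correctness range is nearest 50% first
        # (medium difficulty), then take lowest-entropy trajectories.
        order = sorted(self.buckets,
                       key=lambda b: abs((b + 0.5) / self.n_buckets - 0.5))
        out = []
        for b in order:
            for _, traj in sorted(self.buckets[b], key=lambda item: item[0]):
                out.append(traj)
                if len(out) == k:
                    return out
        return out
```

In the mixed-policy objective described above, each training batch would then combine fresh on-policy rollouts with trajectories drawn from such a buffer.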

10 retrieved papers
Can Refute
Consistent performance improvements and training stabilization

The authors demonstrate that ExGRPO achieves substantial performance gains across five backbone models (1.5B-8B parameters) on both in-distribution mathematical reasoning benchmarks and out-of-distribution ones. Notably, ExGRPO successfully stabilizes training on models where standard on-policy RLVR collapses, such as the Llama-3.1 8B base model.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is a partial signal of novelty, though one still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Identification of valuable reasoning experience characteristics

The authors systematically analyze reasoning experiences in RLVR and identify two key properties that determine experience value: rollout correctness for questions (with medium-difficulty questions being most valuable) and trajectory entropy (with lower entropy indicating better reasoning quality). This analysis provides empirical guidelines for experience selection in reinforcement learning for reasoning models.

Contribution

ExGRPO framework for experience management and replay

The authors introduce ExGRPO, a novel framework that maintains a replay buffer of reasoning trajectories, organizes them into buckets by correctness level, and uses a sampling strategy that prioritizes beneficial experiences, selecting the lowest-entropy trajectories within each bucket. The framework combines on-policy exploration with strategic experience replay through a mixed-policy optimization objective.
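The mixed-policy objective described here can be sketched, under assumptions of this report about the exact form (the notation below is an illustrative GRPO-style loss, not taken from the paper), as a clipped policy-gradient objective over the union of on-policy and replayed trajectory groups, with the importance ratio taken against whichever behavior policy generated each trajectory:

```latex
J(\theta) = \mathbb{E}\!\left[ \frac{1}{|G_{\text{on}} \cup G_{\text{exp}}|}
  \sum_{\tau \in G_{\text{on}} \cup G_{\text{exp}}} \frac{1}{|\tau|} \sum_{t}
  \min\!\Big( r_t(\theta)\,\hat{A}_\tau,\;
  \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_\tau \Big) \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\mu}(a_t \mid s_t)},
```

where $G_{\text{on}}$ are fresh on-policy rollouts, $G_{\text{exp}}$ are replayed experiences, $\pi_{\mu}$ is the policy that generated trajectory $\tau$ (the current policy for on-policy rollouts, an older snapshot for replayed ones), and $\hat{A}_\tau$ is the group-relative advantage.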

Contribution

Consistent performance improvements and training stabilization

The authors demonstrate that ExGRPO achieves substantial performance gains across five backbone models (1.5B-8B parameters) on both in-distribution mathematical reasoning and out-of-distribution benchmarks. Notably, ExGRPO successfully stabilizes training on models where standard on-policy RLVR collapses, such as Llama-3.1 8B base model.