ExGRPO: Learning to Reason from Prior Successes

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Reinforcement Learning, Large Reasoning Model, Reinforcement Learning with Verifiable Rewards
Abstract:

Reinforcement learning from verifiable rewards (RLVR) is an emerging paradigm for improving the reasoning ability of large language models. However, standard on-policy training discards rollout experiences after a single update, leading to computational inefficiency and instability. While prior work on RL has highlighted the benefits of reusing past experience, the role of experience characteristics in shaping learning dynamics of large reasoning models remains underexplored. In this paper, we are the first to investigate what makes a reasoning experience valuable and identify rollout correctness and entropy as effective indicators of experience value. Based on these insights, we propose ExGRPO (Experiential Group Relative Policy Optimization), a framework that organizes and prioritizes valuable experiences, and employs a mixed-policy objective to balance exploration with experience exploitation. Experiments on five backbone models (1.5B-8B parameters) show that ExGRPO consistently improves reasoning performance on mathematical/general benchmarks, with an average gain of +3.5/7.6 points over on-policy RLVR. Moreover, ExGRPO stabilizes training on both stronger and weaker models where on-policy methods fail. These results highlight principled experience management as a key ingredient for efficient and scalable RLVR.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes ExGRPO, a framework for managing and replaying reasoning experiences in RLVR training, identifying rollout correctness and entropy as indicators of experience value. According to the taxonomy, this work sits in the 'Experience Management and Replay Strategies' leaf under 'Core RLVR Algorithms and Training Dynamics'. Notably, this leaf contains only one paper (the original submission itself), indicating a relatively sparse research direction within the broader RLVR landscape. The taxonomy shows 41 total papers across the field, with most concentrated in verification design, domain applications, and policy optimization methods.

The taxonomy reveals that neighboring leaves focus on policy optimization theory (3 papers) and exploration strategies (2 papers), suggesting that experience management has received less direct attention than algorithmic foundations or diversity mechanisms. The scope note for this leaf explicitly excludes diversity-focused exploration, positioning ExGRPO as complementary to works like Diversity Exploration that incentivize broad sampling. The broader 'Core RLVR Algorithms' branch contains multiple active directions, but experience replay specifically appears underexplored compared to verification design (5 sub-categories) and domain applications (4 sub-categories).

Among 30 candidates examined, the contribution-level analysis shows mixed novelty signals. The identification of valuable experience characteristics (Contribution 1) examined 10 candidates with 3 appearing to provide overlapping prior work. The ExGRPO framework itself (Contribution 2) and performance improvements (Contribution 3) each examined 10 candidates with 1 refutable match apiece. These statistics suggest that while some aspects of experience characterization have precedent in the limited search scope, the integrated framework and empirical validation may offer incremental advances. The relatively low refutation counts should be interpreted cautiously given the 30-candidate search scale.

Based on the limited literature search, the work appears to address a gap in experience management for RLVR, though the search scope (30 candidates from semantic matching) cannot confirm exhaustive novelty. The taxonomy structure suggests this direction is less crowded than verification design or domain applications, but the contribution-level statistics indicate that key ideas around experience value and replay have some precedent among examined candidates. A more comprehensive search would be needed to definitively assess originality.

Taxonomy

Core-task Taxonomy Papers: 41
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 5

Research Landscape Overview

Core task: reinforcement learning from verifiable rewards for language model reasoning. The field has organized itself around several major branches that reflect different facets of this challenge. Core RLVR Algorithms and Training Dynamics focuses on the mechanics of policy optimization and experience management, exploring how to efficiently leverage verifiable feedback signals. Reasoning Capability Analysis and Evaluation examines what models learn and how well they generalize, while Verification and Reward Design addresses the construction of reliable reward signals across domains. Domain-Specific Applications and Extensions targets particular problem settings such as mathematical theorem proving and code generation, whereas Beyond Verifiable Domains and Multi-Objective and Multi-Domain Training consider settings where clean verification is unavailable or multiple objectives must be balanced. Training Challenges and Mitigation Strategies tackles practical issues like reward hacking and distribution shift, and Safety and Alignment ensures that capability gains do not compromise model safety. Surveys and Methodological Reviews provide broader perspectives, with works like RL LLM Survey[8] and Verification Design Survey[12] synthesizing key themes.

Within this landscape, a comparatively underexplored line of work concerns experience management and replay strategies, where ExGRPO[0] sits. This branch investigates how to make the most of collected trajectories during training, contrasting with approaches that emphasize success amplification such as GRPO Success Amplification[3] or those that focus on diversity and exploration like Diversity Exploration[10]. ExGRPO[0] emphasizes efficient reuse of experience through replay mechanisms, addressing the sample-efficiency challenges that arise when verifiable rewards are expensive to obtain. Nearby works such as Rewarding Progress[6] and Long Chain Thought[5] explore complementary themes around credit assignment and extended reasoning chains, while RL Incentivize Reasoning[1] and Verifiable Rewards Reasoning[2] provide broader algorithmic perspectives on how reinforcement learning can be tailored to reasoning tasks. The positioning of ExGRPO[0] reflects an ongoing tension between maximizing data efficiency and maintaining exploration breadth, a trade-off that remains central to scaling RLVR methods.

Claimed Contributions

Identification of valuable reasoning experience characteristics

The authors systematically analyze reasoning experiences in RLVR and identify two key properties that determine experience value: rollout correctness for questions (with medium-difficulty questions being most valuable) and trajectory entropy (with lower entropy indicating better reasoning quality). This analysis provides empirical guidelines for experience selection in reinforcement learning for reasoning models.
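As a concrete illustration of these two indicators, the sketch below scores one question's rollout group by its closeness to 50% correctness and by the entropy of its correct trajectories. The functional form, thresholds, and combination rule are illustrative choices of this report, not the authors' exact criteria.

```python
def experience_value(rollouts):
    """Score one question's rollout group by the two indicators above.

    `rollouts` holds (is_correct, mean_token_entropy) pairs for one
    question. The scoring form is a hypothetical sketch, not the
    paper's reported selection rule.
    """
    accuracy = sum(1 for correct, _ in rollouts if correct) / len(rollouts)
    # Rollout correctness: medium-difficulty questions (accuracy near 0.5)
    # are reported to be most valuable, so this term peaks at 50%.
    difficulty_score = 1.0 - 2.0 * abs(accuracy - 0.5)

    correct_entropies = [h for correct, h in rollouts if correct]
    if not correct_entropies:
        return 0.0  # no correct trajectory to replay
    # Trajectory entropy: lower entropy is taken to indicate a more
    # confident, higher-quality reasoning trace.
    entropy_score = 1.0 / (1.0 + min(correct_entropies))
    return difficulty_score * entropy_score
```

Under this toy scoring, a question the model solves in half of its rollouts with a confident trace outranks one it always solves, which contributes no learning signal.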

10 retrieved papers
Can Refute
ExGRPO framework for experience management and replay

The authors introduce ExGRPO, a novel framework that maintains a replay buffer of reasoning trajectories, organizes them into buckets by correctness level, and uses a sampling strategy that prioritizes beneficial experiences, selecting the lowest-entropy trajectories within each bucket. The framework combines on-policy exploration with strategic experience replay through a mixed-policy optimization objective.
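A minimal sketch of such a buffer, assuming equal-width correctness buckets and a greedy sampling rule that visits medium-correctness buckets first and takes lowest-entropy trajectories within each (both assumptions of this report, not the paper's reported design):

```python
from collections import defaultdict

class ExperienceBuffer:
    """Illustrative replay buffer for ExGRPO-style experience management.

    Trajectories are bucketed by their question's rollout correctness;
    sampling walks buckets nearest 50% correctness first and, within a
    bucket, takes the lowest-entropy trajectories. Bucket count and the
    sampling rule are assumptions, not the paper's hyperparameters.
    """

    def __init__(self, n_buckets=4):
        self.n_buckets = n_buckets
        self.buckets = defaultdict(list)  # bucket index -> [(entropy, trajectory)]

    def add(self, accuracy, entropy, trajectory):
        bucket = min(int(accuracy * self.n_buckets), self.n_buckets - 1)
        self.buckets[bucket].append((entropy, trajectory))

    def sample(self, k):
        # Visit buckets whose correctness range is nearest 50% first
        # (medium difficulty), then take lowest-entropy trajectories.
        order = sorted(self.buckets,
                       key=lambda b: abs((b + 0.5) / self.n_buckets - 0.5))
        out = []
        for b in order:
            for _, traj in sorted(self.buckets[b], key=lambda item: item[0]):
                out.append(traj)
                if len(out) == k:
                    return out
        return out
```

In the mixed-policy objective described above, each training batch would then combine fresh on-policy rollouts with trajectories drawn from such a buffer.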

10 retrieved papers
Can Refute
Consistent performance improvements and training stabilization

The authors demonstrate that ExGRPO achieves substantial performance gains across five backbone models (1.5B-8B parameters) on both in-distribution mathematical reasoning benchmarks and out-of-distribution ones. Notably, ExGRPO successfully stabilizes training on models where standard on-policy RLVR collapses, such as the Llama-3.1 8B base model.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is a partial signal of novelty, though one still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Identification of valuable reasoning experience characteristics

The authors systematically analyze reasoning experiences in RLVR and identify two key properties that determine experience value: rollout correctness for questions (with medium-difficulty questions being most valuable) and trajectory entropy (with lower entropy indicating better reasoning quality). This analysis provides empirical guidelines for experience selection in reinforcement learning for reasoning models.

Contribution

ExGRPO framework for experience management and replay

The authors introduce ExGRPO, a novel framework that maintains a replay buffer of reasoning trajectories, organizes them into buckets by correctness level, and uses a sampling strategy that prioritizes beneficial experiences, selecting the lowest-entropy trajectories within each bucket. The framework combines on-policy exploration with strategic experience replay through a mixed-policy optimization objective.
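The mixed-policy objective described here can be sketched, under assumptions of this report about the exact form (the notation below is an illustrative GRPO-style loss, not taken from the paper), as a clipped policy-gradient objective over the union of on-policy and replayed trajectory groups, with the importance ratio taken against whichever behavior policy generated each trajectory:

```latex
J(\theta) = \mathbb{E}\!\left[ \frac{1}{|G_{\text{on}} \cup G_{\text{exp}}|}
  \sum_{\tau \in G_{\text{on}} \cup G_{\text{exp}}} \frac{1}{|\tau|} \sum_{t}
  \min\!\Big( r_t(\theta)\,\hat{A}_\tau,\;
  \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_\tau \Big) \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\mu}(a_t \mid s_t)},
```

where $G_{\text{on}}$ are fresh on-policy rollouts, $G_{\text{exp}}$ are replayed experiences, $\pi_{\mu}$ is the policy that generated trajectory $\tau$ (the current policy for on-policy rollouts, an older snapshot for replayed ones), and $\hat{A}_\tau$ is the group-relative advantage.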

Contribution

Consistent performance improvements and training stabilization

The authors demonstrate that ExGRPO achieves substantial performance gains across five backbone models (1.5B-8B parameters) on both in-distribution mathematical reasoning and out-of-distribution benchmarks. Notably, ExGRPO successfully stabilizes training on models where standard on-policy RLVR collapses, such as Llama-3.1 8B base model.