ExGRPO: Learning to Reason from Prior Successes
Overview
Overall Novelty Assessment
The paper proposes ExGRPO, a framework for managing and replaying reasoning experiences in RLVR training, identifying rollout correctness and entropy as indicators of experience value. According to the taxonomy, this work sits in the 'Experience Management and Replay Strategies' leaf under 'Core RLVR Algorithms and Training Dynamics'. Notably, this leaf contains only one paper (the original submission itself), indicating a relatively sparse research direction within the broader RLVR landscape. The taxonomy shows 41 total papers across the field, with most concentrated in verification design, domain applications, and policy optimization methods.
The taxonomy reveals that neighboring leaves focus on policy optimization theory (3 papers) and exploration strategies (2 papers), suggesting that experience management has received less direct attention than algorithmic foundations or diversity mechanisms. The scope note for this leaf explicitly excludes diversity-focused exploration, positioning ExGRPO as complementary to works like Diversity Exploration that incentivize broad sampling. The broader 'Core RLVR Algorithms' branch contains multiple active directions, but experience replay specifically appears underexplored compared to verification design (5 sub-categories) and domain applications (4 sub-categories).
Among the 30 candidates examined, the contribution-level analysis shows mixed novelty signals. For the identification of valuable experience characteristics (Contribution 1), 10 candidates were examined, of which 3 appear to provide overlapping prior work. The ExGRPO framework itself (Contribution 2) and the performance improvements (Contribution 3) were each compared against 10 candidates, with 1 potentially refuting match apiece. These statistics suggest that while some aspects of experience characterization have precedent within the limited search scope, the integrated framework and its empirical validation may offer incremental advances. The low refutation counts should be interpreted cautiously given the 30-candidate search scale.
Based on the limited literature search, the work appears to address a gap in experience management for RLVR, though the search scope (30 candidates from semantic matching) cannot confirm exhaustive novelty. The taxonomy structure suggests this direction is less crowded than verification design or domain applications, but the contribution-level statistics indicate that key ideas around experience value and replay have some precedent among examined candidates. A more comprehensive search would be needed to definitively assess originality.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors systematically analyze reasoning experiences in RLVR and identify two key properties that determine experience value: rollout correctness for questions (with medium-difficulty questions being most valuable) and trajectory entropy (with lower entropy indicating better reasoning quality). This analysis provides empirical guidelines for experience selection in reinforcement learning for reasoning models.
The authors introduce ExGRPO, a novel framework that maintains a replay buffer of reasoning trajectories, organizes them into buckets by correctness level, and uses a sampling strategy that prioritizes the most beneficial experiences, replaying the lowest-entropy trajectory for each question. The framework combines on-policy exploration with strategic experience replay through a mixed-policy optimization objective.
The authors demonstrate that ExGRPO achieves substantial performance gains across five backbone models (1.5B-8B parameters) on both in-distribution mathematical reasoning and out-of-distribution benchmarks. Notably, ExGRPO successfully stabilizes training on models where standard on-policy RLVR collapses, such as Llama-3.1 8B base model.
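The replay mechanics claimed in Contributions 1 and 2 can be illustrated with a minimal sketch. Note that the bucket boundaries, the priority weight given to medium-correctness questions, the use of minimum entropy as a trajectory-quality proxy, and the on-policy/replay mixing fraction below are all illustrative assumptions for this sketch, not the paper's exact design or hyperparameters.

```python
import random
from collections import defaultdict

class ExperienceBuffer:
    """Illustrative replay buffer: groups questions by rollout correctness
    and replays the lowest-entropy trajectory per question. Bucket edges
    and priority weights are assumptions, not the paper's values."""

    def __init__(self):
        # bucket label -> {question_id: list of (entropy, trajectory)}
        self.buckets = defaultdict(lambda: defaultdict(list))

    @staticmethod
    def _bucket(correctness):
        # correctness = fraction of correct rollouts for the question
        if correctness < 0.25:
            return "hard"
        if correctness < 0.75:
            return "medium"  # assumed most valuable (Contribution 1)
        return "easy"

    def add(self, question_id, trajectory, correctness, entropy):
        self.buckets[self._bucket(correctness)][question_id].append(
            (entropy, trajectory)
        )

    def sample(self, k):
        """Sample k questions, favoring the medium-correctness bucket,
        and return the lowest-entropy trajectory for each."""
        weights = {"medium": 0.6, "easy": 0.2, "hard": 0.2}  # assumed
        picks = []
        for _ in range(k):
            labels = [b for b in self.buckets if self.buckets[b]]
            if not labels:
                break
            label = random.choices(
                labels, weights=[weights[b] for b in labels]
            )[0]
            qid = random.choice(list(self.buckets[label]))
            # lowest entropy = assumed proxy for better reasoning quality
            _, traj = min(self.buckets[label][qid])
            picks.append((qid, traj))
        return picks

def mixed_batch(on_policy, buffer, replay_fraction=0.5):
    """Mix fresh on-policy rollouts with replayed experiences;
    replay_fraction is an illustrative assumption."""
    k = int(len(on_policy) * replay_fraction)
    return list(on_policy) + buffer.sample(k)
```

The sketch separates the two claimed ingredients cleanly: correctness determines which bucket (and hence how often) a question is revisited, while entropy determines which of its stored trajectories is replayed.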
Core Task Comparisons
Comparisons with papers in the same taxonomy category
Contribution Analysis
Detailed comparisons for each claimed contribution
Identification of valuable reasoning experience characteristics
The authors systematically analyze reasoning experiences in RLVR and identify two key properties that determine experience value: rollout correctness for questions (with medium-difficulty questions being most valuable) and trajectory entropy (with lower entropy indicating better reasoning quality). This analysis provides empirical guidelines for experience selection in reinforcement learning for reasoning models.
[53] ExGRPO: Learning to reason from experience
[55] First return, entropy-eliciting explore
[60] Decomposing the entropy-performance exchange: The missing keys to unlocking effective reinforcement learning
[52] Ares: Multimodal adaptive reasoning via difficulty-aware token-level entropy shaping
[54] Staying in the Sweet Spot: Responsive Reasoning Evolution via Capability-Adaptive Hint Scaffolding
[56] EFRame: Deeper Reasoning via Exploration-Filtering-Replay Reinforcement Learning Framework
[57] Exploring Multi-Temperature Strategies for Token- and Rollout-Level Control in RLVR
[58] PEAR: Phase Entropy Aware Reward for Efficient Reasoning
[59] Unlocking exploration in RLVR: Uncertainty-aware advantage shaping for deeper reasoning
[61] Evolving language models without labels: Majority drives selection, novelty promotes variation
ExGRPO framework for experience management and replay
The authors introduce ExGRPO, a novel framework that maintains a replay buffer of reasoning trajectories, organizes them into buckets by correctness level, and uses a sampling strategy that prioritizes the most beneficial experiences, replaying the lowest-entropy trajectory for each question. The framework combines on-policy exploration with strategic experience replay through a mixed-policy optimization objective.
[68] Sample-efficient LLM Optimization with Reset Replay
[62] Experience Replay for Continual Learning
[63] Query-Policy Misalignment in Preference-Based Reinforcement Learning
[64] Cooperative Traffic Scheduling in Transportation Network: A Knowledge Transfer Method
[65] A prioritized objective actor-critic method for deep reinforcement learning
[66] Hybrid attention-oriented experience replay for deep reinforcement learning and its application to a multi-robot cooperative hunting problem
[67] Experience Replay-based Deep Reinforcement Learning for Dialogue Management Optimisation
[69] Experience Consistency Distillation Continual Reinforcement Learning for Robotic Manipulation Tasks
[70] Anti-jamming routing for internet of satellites: a reinforcement learning approach
[71] Relay Hindsight Experience Replay: Continual Reinforcement Learning for Robot Manipulation Tasks with Sparse Rewards
Consistent performance improvements and training stabilization
The authors demonstrate that ExGRPO achieves substantial performance gains across five backbone models (1.5B-8B parameters) on both in-distribution mathematical reasoning and out-of-distribution benchmarks. Notably, ExGRPO successfully stabilizes training on models where standard on-policy RLVR collapses, such as Llama-3.1 8B base model.