Hybrid Reinforcement: when reward is sparse, better to be dense

ICLR 2026 Conference Submission
Anonymous Authors
Hybrid rewards for reinforcement learning
Abstract:

Post-training for reasoning in large language models has increasingly relied on verifiable rewards: deterministic checkers that provide 0/1 correctness signals. While reliable, such binary feedback is brittle—many tasks admit partially correct or alternative answers that verifiers under-credit, and the resulting all-or-nothing supervision limits learning. Reward models offer richer, continuous feedback, which can serve as a complementary supervisory signal to verifiers. We introduce HERO (Hybrid Ensemble Reward Optimization), a reinforcement learning framework that integrates sparse verifier signals with dense reward model scores in a structured way. HERO employs stratified normalization to bound reward-model scores within verifier-defined groups, preserving correctness while refining quality distinctions, and variance-aware weighting to emphasize challenging prompts where dense signals matter most. Across diverse mathematical reasoning benchmarks, HERO consistently outperforms reward-model-only and verifier-only baselines, with strong gains on both verifiable and hard-to-verify tasks. Our results show that hybrid reward design retains the stability of verifiers while leveraging the nuance of reward models to advance reasoning.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces HERO, a framework that combines sparse verifier signals with dense reward model scores through stratified normalization and variance-aware weighting. Within the taxonomy, it resides in the 'Stratified Normalization and Weighting Schemes' leaf under 'Hybrid Reward Integration Frameworks'. This leaf contains only two papers, indicating a relatively sparse research direction focused specifically on structured integration mechanisms that normalize rewards within verifier-defined groups. The positioning suggests the work addresses a targeted gap in hybrid reward design rather than entering a crowded subfield.

The taxonomy reveals that HERO's parent branch—Hybrid Reward Integration Frameworks—sits alongside Dense Reward Model Design (with four subtopics including process-level and generative approaches) and Sparse Verifiable Reward Optimization (covering outcome-based RL and exploration challenges). Neighboring leaves include 'Multi-Stage Dense-to-Sparse Reward Transitions', which explores temporal curriculum strategies rather than static integration. The taxonomy's scope notes clarify that HERO's structured normalization distinguishes it from general hybrid methods and from purely dense or sparse approaches, positioning it at the intersection of reliability-focused verification and richness-focused learned feedback.

Among the three contributions analyzed, the HERO framework and stratified normalization show no clear refutation across the ten and one candidates examined, respectively. However, the variance-aware weighting mechanism encountered four refutable candidates among the nine examined, suggesting this component has more substantial prior exploration. The analysis examined twenty total candidates from top-K semantic search, a limited scope that captures nearby work but does not constitute exhaustive coverage. The statistics indicate that while the overall framework appears novel within this search scope, the weighting mechanism builds on more established techniques for emphasizing challenging examples in RL training.

Based on the limited search scope of twenty candidates, the work appears to occupy a relatively underexplored niche within hybrid reward integration. The stratified normalization approach shows stronger novelty signals than the weighting mechanism, which has more documented precedents. The taxonomy structure confirms that structured integration methods remain less densely populated than pure dense or sparse reward approaches, though the analysis cannot rule out relevant work outside the top-K semantic matches examined.

Taxonomy

Core-task Taxonomy Papers: 42
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Papers: 4

Research Landscape Overview

Core task: Integrating sparse verifiable rewards with dense reward model signals for reasoning. This field addresses a fundamental challenge in training reasoning agents: how to combine infrequent but reliable outcome signals (such as final answer correctness) with continuous learned feedback that guides intermediate steps. The taxonomy reveals several complementary research directions. Hybrid Reward Integration Frameworks explore architectural strategies for blending sparse and dense signals, including normalization schemes and weighting methods exemplified by works like Hybrid Reinforcement[0] and Hybrid Reinforcement Dense[22]. Dense Reward Model Design focuses on training process reward models and step-level critics, as seen in Teaching LLMs Reason[2] and Dense Reward MCTS[4]. Sparse Verifiable Reward Optimization investigates outcome supervision and verification strategies, while Reward Model Evaluation examines design principles such as those discussed in Designing RL Reward[5]. Domain-Specific Applications demonstrate these techniques across reasoning tasks, dialogue systems, and embodied agents, with Advanced RL Optimization covering policy gradient methods and search-based approaches.

A central tension emerges between relying on dense learned rewards—which provide rich training signals but may introduce bias or reward hacking—and sparse verifiable outcomes that are trustworthy but offer limited guidance during exploration. Recent work explores various middle grounds: some studies like RewardMap[3] investigate mapping strategies between reward types, while others such as Outcome Reward Limit[1] examine the boundaries of outcome-only supervision.

Hybrid Reinforcement[0] sits within the stratified normalization branch, focusing on principled methods to balance and weight heterogeneous reward sources during training. This approach contrasts with purely dense methods like Dense Reward MCTS[4] that emphasize continuous guidance, and differs from outcome-focused strategies by explicitly addressing how to normalize and combine signals of different sparsity levels. The positioning reflects ongoing efforts to retain the reliability of verifiable rewards while leveraging dense models to accelerate learning and improve sample efficiency in complex reasoning domains.

Claimed Contributions

HERO framework for hybrid reward optimization

The authors propose a reinforcement learning framework that combines binary verifier signals with continuous reward model scores through stratified normalization and variance-aware weighting. This approach preserves correctness guarantees from verifiers while exploiting nuanced quality distinctions from reward models.

10 retrieved papers
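As a rough illustration of this kind of hybrid signal, the sketch below lets a binary verifier fix the base reward level while a bounded reward-model refinement reorders responses only within a correctness group, never across groups. This is a toy example under stated assumptions, not HERO's actual formulation: the function name, the assumed reward-model score range, and the `band` width are all illustrative choices.

```python
def hybrid_reward(verifier_correct, rm_score, rm_lo=-1.0, rm_hi=1.0, band=0.4):
    """Toy hybrid reward: the verifier decides the base level (0 or 1),
    and the reward-model score adds only a bounded refinement within it."""
    # Squash the raw reward-model score into [0, 1].
    s = (rm_score - rm_lo) / (rm_hi - rm_lo)
    s = min(max(s, 0.0), 1.0)
    base = 1.0 if verifier_correct else 0.0
    # Because the refinement lives in [-band/2, +band/2] with band < 1,
    # every verified-correct response still outscores every incorrect one.
    return base + band * (s - 0.5)
```

With the defaults, correct responses land in [0.8, 1.2] and incorrect ones in [-0.2, 0.2], so the dense signal adds gradations without ever overturning the verifier's correctness ordering.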
Stratified normalization for reward integration

A technique that rescales continuous reward model scores within correctness groups defined by binary verifiers. This ensures dense feedback refines learning only within verified correct responses, maintaining correctness semantics while adding gradations.

1 retrieved paper
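The idea can be sketched as min-max rescaling of reward-model scores separately inside each verifier-defined group, with a fixed offset keeping the groups apart. The paper's actual normalization may differ; the `band` width and group offsets below are assumptions made for the example.

```python
def stratified_normalize(scores, correct, band=0.5):
    """Toy stratified normalization: rescale reward-model scores to
    [0, band] within each correctness group, then offset the correct
    group so it strictly dominates the incorrect one."""
    out = [0.0] * len(scores)
    for group_flag, offset in ((True, 1.0), (False, 0.0)):
        idx = [i for i, c in enumerate(correct) if c == group_flag]
        if not idx:
            continue
        vals = [scores[i] for i in idx]
        lo, hi = min(vals), max(vals)
        span = (hi - lo) or 1.0  # degenerate group: all scores equal
        for i in idx:
            out[i] = offset + band * (scores[i] - lo) / span
    return out
```

Correct responses map into [1.0, 1.5] and incorrect ones into [0.0, 0.5], so the dense score distinguishes responses within a group while the verifier's correctness semantics stay intact.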
Variance-aware weighting mechanism

An adaptive reweighting scheme that adjusts the contribution of different prompts during training based on reward-model score variance. It emphasizes harder prompts with high variance while down-weighting easy prompts with uniform responses.

9 retrieved papers
Can Refute
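A toy version of such a scheme weights each prompt by the variance of its sampled responses' reward-model scores: prompts where responses disagree get more weight, near-uniform prompts are down-weighted. The `floor` parameter and the linear normalization are illustrative assumptions, not the paper's actual weighting rule.

```python
def variance_weights(score_groups, floor=0.1):
    """Toy variance-aware weighting: one weight per prompt, growing with
    the variance of that prompt's sampled-response scores."""
    variances = []
    for scores in score_groups:
        m = sum(scores) / len(scores)
        variances.append(sum((s - m) ** 2 for s in scores) / len(scores))
    v_max = max(variances) or 1.0  # all-zero variances: avoid divide-by-zero
    # Map variances linearly into (floor, 1], keeping easy prompts above zero.
    return [floor + (1.0 - floor) * (v / v_max) for v in variances]
```

For example, a prompt whose four responses all score the same gets the floor weight, while a prompt with an even split of high and low scores gets the maximum weight of 1.0.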

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is a partial signal of novelty, though one still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: HERO framework for hybrid reward optimization

The authors propose a reinforcement learning framework that combines binary verifier signals with continuous reward model scores through stratified normalization and variance-aware weighting. This approach preserves correctness guarantees from verifiers while exploiting nuanced quality distinctions from reward models.

Contribution: Stratified normalization for reward integration

A technique that rescales continuous reward model scores within correctness groups defined by binary verifiers. This ensures dense feedback refines learning only within verified correct responses, maintaining correctness semantics while adding gradations.

Contribution: Variance-aware weighting mechanism

An adaptive reweighting scheme that adjusts the contribution of different prompts during training based on reward-model score variance. It emphasizes harder prompts with high variance while down-weighting easy prompts with uniform responses.