Rethinking Reward Models for Multi-Domain Test-Time Scaling

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: reward model, multi-domain, test-time scaling
Abstract:

The reliability of large language models (LLMs) during test-time scaling is often assessed with external verifiers or reward models that distinguish correct reasoning from flawed logic. Prior work generally assumes that process reward models (PRMs), which score every intermediate reasoning step, outperform outcome reward models (ORMs), which assess only the final answer. This view is based mainly on evidence from narrow, math-adjacent domains. We present the first unified evaluation of four reward model variants, namely discriminative ORMs and PRMs (dORM, dPRM) and generative ORMs and PRMs (gORM, gPRM), across 14 diverse domains. Contrary to conventional wisdom, we find that (i) dORM performs on par with dPRM, (ii) gPRM is not competitive, and (iii) overall, gORM is the most robust, yielding significant and consistent gains across every tested domain. We attribute this to PRM-style stepwise scoring, which inherits label noise from LLM auto-labeling and has difficulty evaluating long reasoning trajectories, including those involving self-correcting reasoning. Our theoretical analysis shows that stepwise aggregation compounds errors as reasoning length grows, and our empirical observations confirm this effect. These findings challenge the prevailing assumption that fine-grained supervision is always better and support generative outcome verification for multi-domain deployment. We publicly release our code, datasets, and checkpoints at this anonymous repository to facilitate future research in multi-domain settings.

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. The results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper contributes a unified evaluation of four reward model variants (discriminative and generative, outcome and process) across 14 diverse domains, challenging the assumption that process reward models universally outperform outcome models. It resides in the Discriminative Outcome Reward Models leaf, which contains only two papers total, indicating a relatively sparse research direction within the broader taxonomy. This positioning suggests the work addresses an underexplored area, as most prior attention has focused on process-level supervision or domain-specific applications rather than systematic cross-domain comparisons of outcome-based approaches.

The taxonomy reveals neighboring research in Discriminative Process Reward Models (four papers) and Generative Process Reward Models (three papers), reflecting the field's historical emphasis on step-level supervision. The broader Reward Model Architectures branch encompasses training methodologies and multimodal extensions, while sibling branches cover test-time algorithms like tree search and sampling strategies. The paper's cross-domain scope bridges the architectural focus of its immediate leaf with the Domain-Specific Applications branch, which examines mathematical reasoning, code generation, and other specialized tasks. This positioning highlights the work's attempt to synthesize insights across traditionally siloed evaluation contexts.

Among the 30 candidate papers examined (10 per contribution), the unified evaluation contribution appears novel, with zero refutable candidates found among its 10 papers. The empirical findings challenging PRM superiority have one refutable candidate among 10, and the theoretical analysis of PRM error growth has two refutable candidates among 10. These statistics suggest the cross-domain evaluation framework itself is relatively unexplored, while specific claims about PRM limitations and error propagation have some precedent in the limited literature examined. Because the search returns only top-K semantic matches, these assessments are not exhaustive.

Based on the limited search of 30 candidates, the work appears to occupy a sparse position within outcome reward model research, with its primary novelty lying in systematic cross-domain comparison rather than individual technical claims. The analysis does not cover the full breadth of reward model literature, particularly recent work in specialized domains or alternative training paradigms that may exist outside the semantic search radius.

Taxonomy

Core-task Taxonomy Papers: 45
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: evaluating reward models for test-time scaling across multiple domains.

The field organizes around several complementary branches. Reward Model Architectures and Training Methodologies explores how to design and train discriminative or generative reward signals, including outcome-based models that score final solutions and process-based models that evaluate intermediate reasoning steps. Test-Time Scaling Algorithms and Search Strategies investigates inference-time techniques such as best-of-N sampling, tree search, and iterative refinement methods that leverage these reward models to improve generation quality. Domain-Specific Applications and Evaluations examines how these approaches perform in areas like mathematical reasoning, code generation, and vision-language tasks, while Generative Models Beyond Language extends reward-guided scaling to diffusion models and multimodal settings. Benchmarking and Evaluation Frameworks provides standardized testbeds for comparing reward model quality and scaling behavior, and Theoretical Foundations and Survey Studies synthesizes emerging principles and open questions across the landscape.

Recent work highlights tensions between different reward paradigms and scaling strategies. Some studies focus on discriminative outcome reward models that classify final answers, as seen in Rethinking Reward Models[0] and Logical Outcome Rewards[37], which emphasize efficiency and interpretability for verifiable domains. Others explore process-level feedback or hybrid approaches that guide intermediate steps, trading off annotation cost against finer-grained supervision. Test-time methods range from simple reranking schemes like Best of N[24] to sophisticated search algorithms such as Hierarchical MCTS[21] and Dual Phase Search[39], each balancing computational budget with solution quality.

Rethinking Reward Models[0] sits within the discriminative outcome branch, examining how outcome-based verifiers scale across domains and contrasting their behavior with process-oriented alternatives. This positioning reflects ongoing debates about whether coarse outcome signals suffice for effective test-time scaling or whether richer intermediate feedback becomes essential as tasks grow more complex.
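The Best-of-N reranking mentioned above is the simplest of these test-time strategies: sample N candidate answers, score each with a reward model, and keep the highest-scoring one. A minimal sketch under stated assumptions (`toy_orm` below is an illustrative stand-in, not a trained reward model):

```python
def best_of_n(candidates, reward_fn):
    """Rerank N sampled candidates with a reward model and return the best.

    `candidates` is a list of candidate answer strings; `reward_fn` maps a
    candidate to a scalar score. In a real pipeline, the candidates come
    from an LLM sampler and `reward_fn` is a trained ORM or PRM.
    """
    scored = [(reward_fn(c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[0][1]

# Toy stand-in scorer: prefers longer, declarative answers. A real outcome
# reward model would score the full (question, answer) pair.
def toy_orm(candidate):
    return len(candidate) - candidate.count("?")

samples = ["42?", "The answer is 42.", "maybe 41"]
print(best_of_n(samples, toy_orm))  # prints: The answer is 42.
```

Because Best-of-N needs only one score per completed answer, it is the natural setting for outcome reward models; step-searching methods like MCTS are where process-level scores are typically exploited.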

Claimed Contributions

Unified evaluation of four reward model variants across 14 diverse domains

The authors conduct the first comprehensive and controlled comparison of four types of reward models (discriminative and generative, outcome-based and process-based) evaluated across 14 different domains, going beyond the narrow math-focused evaluations in prior work.

Retrieved papers: 10

Empirical findings challenging conventional wisdom about process reward models

The authors report empirical results showing that in multi-domain settings, outcome reward models perform comparably or better than process reward models, contradicting the prevailing assumption from math-domain studies that fine-grained process supervision is always superior.

Retrieved papers: 10 (Can Refute)

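Concretely, the ORM/PRM contrast above comes down to how scores are combined: an ORM emits one score for the final answer, while a PRM scores every intermediate step and aggregates, commonly by product or minimum. A minimal sketch of why product aggregation penalizes long trajectories (the 0.9 per-step score and the aggregation choice are illustrative assumptions, not the paper's exact setup):

```python
def orm_score(final_score):
    """Outcome RM: a single judgment on the final answer only."""
    return final_score

def prm_score(step_scores):
    """Process RM: score every intermediate step and aggregate;
    product is a common choice (min is another)."""
    total = 1.0
    for s in step_scores:
        total *= s
    return total

# Even when every step of a correct trajectory scores 0.9, the product
# decays geometrically with trajectory length, while an ORM's single
# score is unaffected by length.
for n_steps in (2, 8, 32):
    print(n_steps, round(prm_score([0.9] * n_steps), 3))
# prints: 2 0.81 / 8 0.43 / 32 0.034
```

This is one mechanical reading of the paper's finding that PRMs struggle with long and self-correcting trajectories: mild per-step skepticism compounds into a near-zero trajectory score.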
Theoretical and empirical analysis of PRM error growth with reasoning length

The authors provide theoretical bounds demonstrating that process reward model errors increase with chain-of-thought length, and support this with empirical evidence showing PRMs struggle with longer reasoning trajectories and self-correcting reasoning patterns.

Retrieved papers: 10 (Can Refute)

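The shape of such a bound can be sketched informally. Assuming each step judgment is independently correct with probability at least 1 - ε (for instance, ε absorbing the auto-labeling noise; the paper's exact assumptions and notation may differ), multiplicative all-steps-must-pass aggregation over a length-T trajectory gives:

```latex
% Per-step verification succeeds w.p. at least 1 - \epsilon, independently:
\Pr[\text{all } T \text{ steps scored correctly}] \ge (1-\epsilon)^{T},
% so the stepwise-aggregation error can grow with trajectory length,
% up to linearly in T by the union bound:
\Pr[\text{aggregation error}] \le 1 - (1-\epsilon)^{T} \le T\epsilon .
```

An ORM makes a single judgment, so its error does not scale with T; this is consistent with the abstract's claim that stepwise aggregation compounds errors as reasoning length grows.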

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
