Rethinking Reward Models for Multi-Domain Test-Time Scaling
Overview
Overall Novelty Assessment
The paper contributes a unified evaluation of four reward model variants (discriminative and generative, outcome and process) across 14 diverse domains, challenging the assumption that process reward models universally outperform outcome models. It resides in the Discriminative Outcome Reward Models leaf, which contains only two papers total, indicating a relatively sparse research direction within the broader taxonomy. This positioning suggests the work addresses an underexplored area, as most prior attention has focused on process-level supervision or domain-specific applications rather than systematic cross-domain comparisons of outcome-based approaches.
The taxonomy reveals neighboring research in Discriminative Process Reward Models (four papers) and Generative Process Reward Models (three papers), reflecting the field's historical emphasis on step-level supervision. The broader Reward Model Architectures branch encompasses training methodologies and multimodal extensions, while sibling branches cover test-time algorithms like tree search and sampling strategies. The paper's cross-domain scope bridges the architectural focus of its immediate leaf with the Domain-Specific Applications branch, which examines mathematical reasoning, code generation, and other specialized tasks. This positioning highlights the work's attempt to synthesize insights across traditionally siloed evaluation contexts.
Among the 30 candidates examined (10 per contribution), the unified evaluation contribution appears novel, with zero refutable candidates found. The empirical findings challenging PRM superiority yielded one refutable candidate, and the theoretical analysis of PRM error growth yielded two. These statistics suggest the cross-domain evaluation framework itself is relatively unexplored, while the specific claims about PRM limitations and error propagation have some precedent in the literature examined. Note that these assessments reflect top-K semantic matches rather than exhaustive coverage of the field.
Based on the limited search of 30 candidates, the work appears to occupy a sparse position within outcome reward model research, with its primary novelty lying in systematic cross-domain comparison rather than individual technical claims. The analysis does not cover the full breadth of reward model literature, particularly recent work in specialized domains or alternative training paradigms that may exist outside the semantic search radius.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors conduct the first comprehensive and controlled comparison of four types of reward models (discriminative and generative, outcome-based and process-based) evaluated across 14 different domains, going beyond the narrow math-focused evaluations in prior work.
The authors report empirical results showing that in multi-domain settings, outcome reward models perform comparably or better than process reward models, contradicting the prevailing assumption from math-domain studies that fine-grained process supervision is always superior.
The authors provide theoretical bounds demonstrating that process reward model errors increase with chain-of-thought length, and support this with empirical evidence showing PRMs struggle with longer reasoning trajectories and self-correcting reasoning patterns.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[37] Logical Reasoning with Outcome Reward Models for Test-Time Scaling
Contribution Analysis
Detailed comparisons for each claimed contribution
Unified evaluation of four reward model variants across 14 diverse domains
The authors conduct the first comprehensive and controlled comparison of four types of reward models (discriminative and generative, outcome-based and process-based) evaluated across 14 different domains, going beyond the narrow math-focused evaluations in prior work.
[6] Process Reward Models That Think
[55] Generative Verifiers: Reward Modeling as Next-Token Prediction
[56] SelfEval: Leveraging the Discriminative Nature of Generative Models for Evaluation
[57] InternVideo: General Video Foundation Models via Generative and Discriminative Learning
[58] BaseReward: A Strong Baseline for Multimodal Reward Model
[59] Discriminative Policy Optimization for Token-Level Reward Models
[60] GRAM: A Generative Foundation Reward Model for Reward Generalization
[61] A Unified Pairwise Framework for RLHF: Bridging Generative Reward Modeling and Policy Optimization
[62] SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient
[63] IRGAN: A Minimax Game for Unifying Generative and Discriminative Information Retrieval Models
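The 2×2 design space compared above (discriminative vs. generative scoring, outcome-level vs. process-level granularity) can be summarized as a minimal scoring interface. A sketch with hypothetical stub functions, not the paper's implementation; real reward models are trained networks, and the stub bodies here only stand in for learned scores:

```python
# Illustrative sketch of the 2x2 reward-model design space.
# All function names and stub bodies are hypothetical.

from typing import List

# A trajectory is a list of reasoning steps; the last step holds the answer.
Trajectory = List[str]

def discriminative_outcome_rm(traj: Trajectory) -> float:
    """Scalar head over the full solution: one score for the final outcome."""
    return float(len(traj[-1]))  # stub: stands in for a learned scalar score

def discriminative_process_rm(traj: Trajectory) -> List[float]:
    """Scalar head applied per step: one score per reasoning step."""
    return [float(len(step)) for step in traj]

def generative_outcome_rm(traj: Trajectory) -> float:
    """LLM-as-judge: generate a verdict for the answer, map it to a score."""
    verdict = "yes" if traj[-1] else "no"  # stub for a generated critique
    return 1.0 if verdict == "yes" else 0.0

def generative_process_rm(traj: Trajectory) -> List[float]:
    """LLM-as-judge per step: a generated verdict for every step."""
    return [1.0 if step else 0.0 for step in traj]
```

The key contrast the paper evaluates is along both axes at once: outcome variants return one score per trajectory, while process variants return a score per step that must then be aggregated before trajectories can be compared.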
Empirical findings challenging conventional wisdom about process reward models
The authors report empirical results showing that in multi-domain settings, outcome reward models perform comparably or better than process reward models, contradicting the prevailing assumption from math-domain studies that fine-grained process supervision is always superior.
[48] Solving Math Word Problems with Process- and Outcome-Based Feedback
[11] Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators
[46] VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
[47] The Lessons of Developing Process Reward Models in Mathematical Reasoning
[49] ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search
[50] Free Process Rewards without Process Labels
[51] Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning
[52] VLRMBench: A Comprehensive and Challenging Benchmark for Vision-Language Reward Models
[53] Posterior-GRPO: Rewarding Reasoning Processes in Code Generation
[54] Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training
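One way to see why aggregation matters in this ORM-versus-PRM comparison: in best-of-N reranking, an ORM scores each candidate once, while a PRM must combine per-step scores (commonly by min or mean) before candidates can be compared, which can penalize self-correcting trajectories that contain one weak intermediate step. A minimal sketch, with hypothetical hand-set scores standing in for model outputs:

```python
# Minimal best-of-N reranking sketch. Scores are hypothetical stand-ins
# for reward-model outputs; in practice they come from a trained RM.

def best_of_n_orm(candidates, orm_score):
    """Pick the candidate whose final outcome scores highest."""
    return max(candidates, key=orm_score)

def best_of_n_prm(candidates, prm_scores, aggregate=min):
    """Pick the candidate whose aggregated step scores are highest.
    `aggregate` is commonly min (weakest step) or the mean."""
    return max(candidates, key=lambda c: aggregate(prm_scores(c)))

# Toy example: candidate B has a weak intermediate step but recovers and
# finishes strongly; min-aggregation penalizes it, outcome scoring does not.
step_scores = {
    "A": [0.9, 0.8, 0.6],   # uniformly decent steps, weaker finish
    "B": [0.9, 0.2, 0.95],  # self-correction: one bad step, strong finish
}
orm = lambda c: step_scores[c][-1]  # score the final step only
prm = lambda c: step_scores[c]      # all per-step scores

print(best_of_n_orm(["A", "B"], orm))                 # -> B
print(best_of_n_prm(["A", "B"], prm, aggregate=min))  # -> A
```

This toy divergence mirrors the paper's observation that PRMs struggle with self-correcting reasoning patterns: a trajectory that errs and then recovers is fine by its outcome but poor by its weakest step.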
Theoretical and empirical analysis of PRM error growth with reasoning length
The authors provide theoretical bounds demonstrating that process reward model errors increase with chain-of-thought length, and support this with empirical evidence showing PRMs struggle with longer reasoning trajectories and self-correcting reasoning patterns.
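A back-of-the-envelope version of this compounding argument (our notation, not necessarily the paper's exact bound): if the PRM misjudges each step independently with probability at most $\epsilon$, then over a chain of $T$ steps,

```latex
% Sketch of the compounding-error intuition (notation is illustrative).
% Assume the PRM errs on any single step with probability at most \epsilon,
% independently across steps. For a chain of T steps,
\Pr[\text{all } T \text{ step judgments correct}] \ge (1-\epsilon)^T ,
% so the probability of at least one erroneous step judgment satisfies
\Pr[\text{some step misjudged}] \le 1 - (1-\epsilon)^T \le \epsilon T .
```

The misjudgment bound grows roughly linearly in $T$ for small $\epsilon$, whereas an ORM issues a single judgment whose error does not scale with chain-of-thought length, consistent with the empirical finding that PRMs degrade on longer reasoning trajectories.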