Rethinking Reward Models for Multi-Domain Test-Time Scaling

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: reward model, multi-domain, test-time scaling
Abstract:

The reliability of large language models (LLMs) during test-time scaling is often assessed with external verifiers or reward models that distinguish correct reasoning from flawed logic. Prior work generally assumes that process reward models (PRMs), which score every intermediate reasoning step, outperform outcome reward models (ORMs), which assess only the final answer. This view is based mainly on evidence from narrow, math-adjacent domains. We present the first unified evaluation of four reward model variants, namely discriminative ORMs and PRMs (dORM, dPRM) and generative ORMs and PRMs (gORM, gPRM), across 14 diverse domains. Contrary to conventional wisdom, we find that (i) dORM performs on par with dPRM, (ii) gPRM is not competitive, and (iii) overall, gORM is the most robust, yielding significant and consistent gains across every tested domain. We attribute this to PRM-style stepwise scoring, which inherits label noise from LLM auto-labeling and has difficulty evaluating long reasoning trajectories, including those involving self-correcting reasoning. Our theoretical analysis shows that stepwise aggregation compounds errors as reasoning length grows, and our empirical observations confirm this effect. These findings challenge the prevailing assumption that fine-grained supervision is always better and support generative outcome verification for multi-domain deployment. We publicly release our code, datasets, and checkpoints at this anonymous repository to facilitate future research in multi-domain settings.

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. The results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper contributes a unified evaluation of four reward model variants (discriminative and generative, outcome and process) across 14 diverse domains, challenging the assumption that process reward models universally outperform outcome models. It resides in the Discriminative Outcome Reward Models leaf, which contains only two papers total, indicating a relatively sparse research direction within the broader taxonomy. This positioning suggests the work addresses an underexplored area, as most prior attention has focused on process-level supervision or domain-specific applications rather than systematic cross-domain comparisons of outcome-based approaches.

The taxonomy reveals neighboring research in Discriminative Process Reward Models (four papers) and Generative Process Reward Models (three papers), reflecting the field's historical emphasis on step-level supervision. The broader Reward Model Architectures branch encompasses training methodologies and multimodal extensions, while sibling branches cover test-time algorithms like tree search and sampling strategies. The paper's cross-domain scope bridges the architectural focus of its immediate leaf with the Domain-Specific Applications branch, which examines mathematical reasoning, code generation, and other specialized tasks. This positioning highlights the work's attempt to synthesize insights across traditionally siloed evaluation contexts.

Among the 30 candidate papers examined (10 per contribution), the unified evaluation contribution appears novel, with zero refutable candidates found among its 10 papers. The empirical findings challenging PRM superiority have one refutable candidate among 10, and the theoretical analysis of PRM error growth has two refutable candidates among 10. These statistics suggest the cross-domain evaluation framework itself is relatively unexplored, while specific claims about PRM limitations and error propagation have some precedent in the limited literature examined. Because the search returns only top-K semantic matches, these assessments are not exhaustive.

Based on the limited search of 30 candidates, the work appears to occupy a sparse position within outcome reward model research, with its primary novelty lying in systematic cross-domain comparison rather than individual technical claims. The analysis does not cover the full breadth of reward model literature, particularly recent work in specialized domains or alternative training paradigms that may exist outside the semantic search radius.

Taxonomy

Core-task Taxonomy Papers: 45
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: evaluating reward models for test-time scaling across multiple domains.

The field organizes around several complementary branches. Reward Model Architectures and Training Methodologies explores how to design and train discriminative or generative reward signals, including outcome-based models that score final solutions and process-based models that evaluate intermediate reasoning steps. Test-Time Scaling Algorithms and Search Strategies investigates inference-time techniques such as best-of-N sampling, tree search, and iterative refinement methods that leverage these reward models to improve generation quality. Domain-Specific Applications and Evaluations examines how these approaches perform in areas like mathematical reasoning, code generation, and vision-language tasks, while Generative Models Beyond Language extends reward-guided scaling to diffusion models and multimodal settings. Benchmarking and Evaluation Frameworks provides standardized testbeds for comparing reward model quality and scaling behavior, and Theoretical Foundations and Survey Studies synthesizes emerging principles and open questions across the landscape.

Recent work highlights tensions between different reward paradigms and scaling strategies. Some studies focus on discriminative outcome reward models that classify final answers, as seen in Rethinking Reward Models[0] and Logical Outcome Rewards[37], which emphasize efficiency and interpretability for verifiable domains. Others explore process-level feedback or hybrid approaches that guide intermediate steps, trading off annotation cost against finer-grained supervision. Test-time methods range from simple reranking schemes like Best of N[24] to sophisticated search algorithms such as Hierarchical MCTS[21] and Dual Phase Search[39], each balancing computational budget with solution quality.

Rethinking Reward Models[0] sits within the discriminative outcome branch, examining how outcome-based verifiers scale across domains and contrasting their behavior with process-oriented alternatives. This positioning reflects ongoing debates about whether coarse outcome signals suffice for effective test-time scaling or whether richer intermediate feedback becomes essential as tasks grow more complex.
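The Best-of-N reranking mentioned above is the simplest of these test-time strategies: sample N candidate answers, score each with a reward model, and keep the highest-scoring one. A minimal sketch under stated assumptions (`toy_orm` below is an illustrative stand-in, not a trained reward model):

```python
def best_of_n(candidates, reward_fn):
    """Rerank N sampled candidates with a reward model and return the best.

    `candidates` is a list of candidate answer strings; `reward_fn` maps a
    candidate to a scalar score. In a real pipeline, the candidates come
    from an LLM sampler and `reward_fn` is a trained ORM or PRM.
    """
    scored = [(reward_fn(c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[0][1]

# Toy stand-in scorer: prefers longer, declarative answers. A real outcome
# reward model would score the full (question, answer) pair.
def toy_orm(candidate):
    return len(candidate) - candidate.count("?")

samples = ["42?", "The answer is 42.", "maybe 41"]
print(best_of_n(samples, toy_orm))  # prints: The answer is 42.
```

Because Best-of-N needs only one score per completed answer, it is the natural setting for outcome reward models; step-searching methods like MCTS are where process-level scores are typically exploited.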

Claimed Contributions

Unified evaluation of four reward model variants across 14 diverse domains

The authors conduct the first comprehensive and controlled comparison of four types of reward models (discriminative and generative, outcome-based and process-based) evaluated across 14 different domains, going beyond the narrow math-focused evaluations in prior work.

Retrieved papers: 10

Empirical findings challenging conventional wisdom about process reward models

The authors report empirical results showing that in multi-domain settings, outcome reward models perform comparably or better than process reward models, contradicting the prevailing assumption from math-domain studies that fine-grained process supervision is always superior.

Retrieved papers: 10 (Can Refute)

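Concretely, the ORM/PRM contrast above comes down to how scores are combined: an ORM emits one score for the final answer, while a PRM scores every intermediate step and aggregates, commonly by product or minimum. A minimal sketch of why product aggregation penalizes long trajectories (the 0.9 per-step score and the aggregation choice are illustrative assumptions, not the paper's exact setup):

```python
def orm_score(final_score):
    """Outcome RM: a single judgment on the final answer only."""
    return final_score

def prm_score(step_scores):
    """Process RM: score every intermediate step and aggregate;
    product is a common choice (min is another)."""
    total = 1.0
    for s in step_scores:
        total *= s
    return total

# Even when every step of a correct trajectory scores 0.9, the product
# decays geometrically with trajectory length, while an ORM's single
# score is unaffected by length.
for n_steps in (2, 8, 32):
    print(n_steps, round(prm_score([0.9] * n_steps), 3))
# prints: 2 0.81 / 8 0.43 / 32 0.034
```

This is one mechanical reading of the paper's finding that PRMs struggle with long and self-correcting trajectories: mild per-step skepticism compounds into a near-zero trajectory score.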
Theoretical and empirical analysis of PRM error growth with reasoning length

The authors provide theoretical bounds demonstrating that process reward model errors increase with chain-of-thought length, and support this with empirical evidence showing PRMs struggle with longer reasoning trajectories and self-correcting reasoning patterns.

Retrieved papers: 10 (Can Refute)

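The shape of such a bound can be sketched informally. Assuming each step judgment is independently correct with probability at least 1 - ε (for instance, ε absorbing the auto-labeling noise; the paper's exact assumptions and notation may differ), multiplicative all-steps-must-pass aggregation over a length-T trajectory gives:

```latex
% Per-step verification succeeds w.p. at least 1 - \epsilon, independently:
\Pr[\text{all } T \text{ steps scored correctly}] \ge (1-\epsilon)^{T},
% so the stepwise-aggregation error can grow with trajectory length,
% up to linearly in T by the union bound:
\Pr[\text{aggregation error}] \le 1 - (1-\epsilon)^{T} \le T\epsilon .
```

An ORM makes a single judgment, so its error does not scale with T; this is consistent with the abstract's claim that stepwise aggregation compounds errors as reasoning length grows.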

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
