Linking Process to Outcome: Conditional Reward Modeling for LLM Reasoning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: LLM reasoning, process reward model, reinforcement learning
Abstract:

Process Reward Models (PRMs) have emerged as a promising approach to enhance the reasoning capabilities of large language models (LLMs) by guiding their step-by-step reasoning toward a final answer. However, existing PRMs either treat each reasoning step in isolation, failing to capture inter-step dependencies, or struggle to align process rewards with the final outcome. Consequently, the reward signal fails to respect temporal causality in sequential reasoning and faces ambiguous credit assignment. These limitations make downstream models vulnerable to reward hacking and lead to suboptimal performance. In this work, we propose Conditional Reward Modeling (CRM) that frames LLM reasoning as a temporal process leading to a correct answer. The reward of each reasoning step is not only conditioned on the preceding steps but also explicitly linked to the final outcome of the reasoning trajectory. By enforcing conditional probability rules, our design captures the causal relationships among reasoning steps, with the link to the outcome allowing precise attribution of each intermediate step, thereby resolving credit assignment ambiguity. Further, through this consistent probabilistic modeling, the rewards produced by CRM enable more reliable cross-sample comparison. Experiments across Best-of-N sampling, beam search and reinforcement learning demonstrate that CRM consistently outperforms existing reward models, offering a principled framework for enhancing LLM reasoning. In particular, CRM is more robust to reward hacking and delivers stable downstream improvements without relying on verifiable rewards derived from ground truth.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Conditional Reward Modeling (CRM), a framework that conditions each reasoning step's reward on preceding steps and explicitly links it to the final outcome. It resides in the Hierarchical and Conditional Reward Modeling leaf, which contains only three papers total. This is a relatively sparse research direction within the broader taxonomy of fifty papers, suggesting that hierarchical and conditional approaches to process reward modeling remain an emerging area compared to more crowded topics like inference-time search or automated annotation.

The taxonomy reveals that CRM's immediate neighbors explore related but distinct mechanisms: one sibling investigates hierarchical multi-step rewards, while another examines hierarchical working memory. Nearby leaves include Generative and Reasoning-Based Reward Models, which generate chain-of-thought rationales before scoring, and Uncertainty and Reliability in Reward Models, which address robustness to reward hacking. The taxonomy's scope and exclude notes clarify that CRM's focus on inter-step dependencies distinguishes it from Independent Step Evaluation methods, while its process-level supervision separates it from Outcome-Based and Self-Rewarding Approaches that rely solely on final-answer signals.

Of the thirty candidates examined (ten per claimed contribution), the precise credit assignment mechanism has one refutable candidate, indicating that some prior work addresses similar attribution challenges. The CRM framework itself and the probabilistically consistent cross-sample comparison each had zero refutations among their ten candidates, suggesting these contributions may be more distinctive within the limited search scope. These statistics reflect a targeted semantic search rather than an exhaustive survey, so the absence of refutations does not guarantee novelty, but it does indicate that closely related work is not immediately apparent among the top-ranked candidates.

Given the sparse population of the Hierarchical and Conditional Reward Modeling leaf and the limited overlap found among thirty candidates, the work appears to occupy a relatively underexplored niche. However, the analysis is constrained by the search scope and does not cover the full breadth of reinforcement learning or causal inference literature that might address similar temporal credit assignment problems outside the LLM reasoning context.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: Reward modeling for multi-step reasoning in large language models. The field has organized itself around several complementary directions. Process Reward Model Design and Training focuses on architectures and learning strategies that assign credit at intermediate reasoning steps, including hierarchical and conditional variants. Process Reward Model Applications explores how these models guide search, verification, and inference-time computation. Outcome-Based and Self-Rewarding Approaches investigate end-to-end signals and models that generate their own supervision. Multi-Turn and Interactive Agent Reasoning examines credit assignment in conversational or agentic settings, while Domain-Specific and Multimodal Reasoning adapts reward modeling to specialized tasks such as mathematics, code, or vision-language problems. Evaluation, Analysis, and Benchmarking studies the reliability and failure modes of reward models, and Long-Horizon Skill Learning and Robotics extends these ideas to embodied agents. Surveys and Theoretical Foundations provide broader perspectives, and Fairness and Trustworthiness in Reasoning addresses ethical dimensions.

Within Process Reward Model Design and Training, a particularly active line of work explores hierarchical and conditional reward structures that decompose complex reasoning into manageable subgoals. Conditional Reward Modeling[0] sits squarely in this cluster, emphasizing how reward signals can be conditioned on problem structure or intermediate states. Nearby efforts such as Hierarchical Multi-Step Rewards[4] and Hierarchical Working Memory[3] similarly investigate multi-level credit assignment, though they differ in whether they prioritize explicit memory mechanisms or purely reward-based decomposition. Another contrast emerges between works that automate process supervision—such as Automated Process Supervision[9]—and those that rely on human annotations or model-generated labels.
Open questions include how to balance granularity and scalability in hierarchical designs, and whether conditional rewards generalize robustly across diverse reasoning domains. Conditional Reward Modeling[0] contributes to this landscape by proposing a flexible framework for context-dependent credit assignment, complementing the structural hierarchies explored by its neighbors.

Claimed Contributions

Conditional Reward Modeling (CRM) framework

The authors introduce CRM, a framework that models each reasoning step's reward as a conditional probability dependent on all preceding steps, thereby capturing inter-step dependencies in sequential reasoning. This addresses the limitation of prior PRMs that treat steps in isolation.

10 retrieved papers (0 refutable)

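The conditioning described above can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: the `step_prob` scorer below is hypothetical and stands in for the learned reward model, which would return the probability that a step is a correct continuation given the question and all preceding steps.

```python
from typing import Callable, List

def conditional_step_rewards(
    question: str,
    steps: List[str],
    step_prob: Callable[[str, List[str], str], float],
) -> List[float]:
    """Score every reasoning step conditioned on ALL preceding steps.

    step_prob(question, prefix, step) stands in for the learned reward
    model and returns P(step is a correct continuation | question, prefix).
    A PRM that treats steps in isolation would ignore the prefix instead.
    """
    rewards: List[float] = []
    prefix: List[str] = []
    for step in steps:
        rewards.append(step_prob(question, list(prefix), step))
        prefix.append(step)
    return rewards

# Toy scorer: a step inherits distrust once the prefix contains an error.
def toy_scorer(question: str, prefix: List[str], step: str) -> float:
    if any("ERROR" in s for s in prefix):
        return 0.1  # an earlier mistake makes later steps unreliable
    return 0.2 if "ERROR" in step else 0.9

rewards = conditional_step_rewards(
    "2 + 2 * 3 = ?",
    ["multiply first: 2 * 3 = 6", "ERROR: 2 + 6 = 9", "answer: 9"],
    toy_scorer,
)
```

Note how the third step receives a low reward even though its text looks innocuous: the prefix already contains an error, which an independent step evaluator would miss.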
Precise credit assignment mechanism

CRM explicitly links process rewards to the final outcome through the conditional probability chain rule, enabling precise attribution of the final result to individual reasoning steps. This resolves the credit assignment ambiguity prevalent in existing PRMs.

10 retrieved papers (1 refutable)

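One plausible reading of this chain-rule linkage (a sketch under our own assumptions, not the paper's code) is a telescoping decomposition of the outcome probability: each step's credit is the change it induces in the estimated probability of a correct final answer. The `outcome_prob` estimator below is hypothetical.

```python
from typing import Callable, List

def stepwise_credit(
    outcome_prob: Callable[[List[str]], float],
    steps: List[str],
) -> List[float]:
    """Attribute the final outcome to individual reasoning steps.

    outcome_prob(prefix) stands in for an estimator of
    P(final answer is correct | reasoning so far). The credit for step t
    is the change that step induces in this probability; the credits
    telescope, summing exactly to outcome_prob(steps) - outcome_prob([]).
    """
    credits: List[float] = []
    prev = outcome_prob([])
    for t in range(1, len(steps) + 1):
        cur = outcome_prob(steps[:t])
        credits.append(cur - prev)
        prev = cur
    return credits

# Toy estimator: confidence grows with sound steps, collapses after "ERROR".
def toy_outcome_prob(prefix: List[str]) -> float:
    if any("ERROR" in s for s in prefix):
        return 0.05
    return min(0.5 + 0.2 * len(prefix), 1.0)

credits = stepwise_credit(
    toy_outcome_prob, ["expand terms", "ERROR: sign flip", "simplify"]
)
```

Under this toy estimator the erroneous second step receives a large negative credit, while the harmless third step receives none, illustrating how linking rewards to the outcome disambiguates attribution.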
Probabilistically consistent cross-sample comparison

The consistent probabilistic formulation ensures that reward signals carry the same semantic meaning across different reasoning trajectories, making them directly comparable across samples. This facilitates downstream tasks such as Best-of-N sampling, beam search, and RL optimization.

10 retrieved papers (0 refutable)
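The downstream benefit can be illustrated with Best-of-N selection. In this hypothetical sketch (not the paper's implementation), each step reward is a probability with shared semantics, so trajectory scores (products of step probabilities, computed as log-sums for numerical stability) are directly comparable across samples:

```python
import math
from typing import List, Tuple

def best_of_n(candidates: List[Tuple[str, List[float]]]) -> str:
    """Select the answer whose trajectory has the highest probability.

    Each candidate is (final_answer, per-step conditional probabilities).
    Because every step score is a probability with the same semantics,
    trajectory-level products are directly comparable across samples.
    """
    def log_score(step_probs: List[float]) -> float:
        # Sum of logs == log of the product, but numerically stable.
        return sum(math.log(p) for p in step_probs)

    best_answer, _ = max(candidates, key=lambda c: log_score(c[1]))
    return best_answer

winner = best_of_n([
    ("answer: 9",  [0.9, 0.2, 0.1]),  # one weak step sinks the trajectory
    ("answer: 8",  [0.9, 0.9, 0.8]),
    ("answer: 12", [0.7, 0.6, 0.9]),
])
```

The same comparability argument applies to ranking partial trajectories in beam search or weighting rollouts in RL, since prefixes of different samples are scored on the same probability scale.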

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
