Linking Process to Outcome: Conditional Reward Modeling for LLM Reasoning
Overview
Overall Novelty Assessment
The paper proposes Conditional Reward Modeling (CRM), a framework that conditions each reasoning step's reward on preceding steps and explicitly links it to the final outcome. It resides in the Hierarchical and Conditional Reward Modeling leaf, which contains only three papers total. This is a relatively sparse research direction within the broader taxonomy of fifty papers, suggesting that hierarchical and conditional approaches to process reward modeling remain an emerging area compared to more crowded topics like inference-time search or automated annotation.
The taxonomy reveals that CRM's immediate neighbors explore related but distinct mechanisms: one sibling investigates hierarchical multi-step rewards, while another examines hierarchical working memory. Nearby leaves include Generative and Reasoning-Based Reward Models, whose methods generate chain-of-thought rationales before scoring, and Uncertainty and Reliability in Reward Models, which addresses robustness to reward hacking. The taxonomy's scope and exclude notes clarify that CRM's focus on inter-step dependencies distinguishes it from Independent Step Evaluation methods, while its process-level supervision separates it from Outcome-Based and Self-Rewarding Approaches that rely solely on final-answer signals.
Of the thirty candidates examined (ten per contribution), only the precise credit assignment mechanism produced a refutable candidate, one out of its ten, indicating that some prior work addresses similar attribution challenges. The CRM framework itself and the probabilistically consistent cross-sample comparison each yielded zero refutations, suggesting these contributions may be more distinctive within the limited search scope. These statistics reflect a targeted semantic search rather than an exhaustive survey, so the absence of refutations does not guarantee novelty, but it does indicate that closely related work is not immediately apparent among the top-ranked candidates.
Given the sparse population of the Hierarchical and Conditional Reward Modeling leaf and the limited overlap found among thirty candidates, the work appears to occupy a relatively underexplored niche. However, the analysis is constrained by the search scope and does not cover the full breadth of reinforcement learning or causal inference literature that might address similar temporal credit assignment problems outside the LLM reasoning context.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce CRM, a framework that models each reasoning step's reward as a conditional probability dependent on all preceding steps, thereby capturing inter-step dependencies in sequential reasoning. This addresses the limitation of prior PRMs that treat steps in isolation.
CRM explicitly links process rewards to the final outcome through the conditional probability chain rule, enabling precise attribution of the final result to individual reasoning steps. This resolves the credit assignment ambiguity prevalent in existing PRMs.
The consistent probabilistic formulation ensures that reward signals carry the same semantic meaning across different reasoning trajectories, making them directly comparable across samples. This facilitates downstream tasks such as Best-of-N sampling, beam search, and RL optimization.
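The chain-rule linkage running through these claims can be written out explicitly. The notation below is an illustrative reconstruction, not the paper's exact formulation: \(x\) is the question, \(s_1, \ldots, s_T\) the reasoning steps, and \(a^\ast\) the correct final answer.

```latex
% Illustrative formalization (notation assumed, not taken from the paper).
% Conditional step reward: probability of a correct outcome given the prefix.
r_t \;=\; P\!\left(a^\ast \mid x,\, s_1, \ldots, s_t\right)

% Chain rule linking the trajectory to the outcome, so each step's
% contribution is defined relative to all preceding steps:
P\!\left(a^\ast,\, s_{1:T} \mid x\right)
  \;=\; P\!\left(a^\ast \mid x,\, s_{1:T}\right)\,
        \prod_{t=1}^{T} P\!\left(s_t \mid x,\, s_{<t}\right)
```

Under such a reading, every \(r_t\) is a probability with a single shared semantics, which is what would make rewards comparable across trajectories.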
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[4] Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models
[49] ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs
Contribution Analysis
Detailed comparisons for each claimed contribution
Conditional Reward Modeling (CRM) framework
The authors introduce CRM, a framework that models each reasoning step's reward as a conditional probability dependent on all preceding steps, thereby capturing inter-step dependencies in sequential reasoning. This addresses the limitation of prior PRMs that treat steps in isolation.
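The dependency structure this claim describes can be sketched minimally, assuming a hypothetical prefix-conditioned success estimator `success_prob` (in CRM this role would be played by a learned reward model):

```python
from typing import Callable, Sequence

def conditional_step_rewards(
    steps: Sequence[str],
    success_prob: Callable[[Sequence[str]], float],
) -> list[float]:
    """Score each step conditioned on its full prefix.

    `success_prob(prefix)` is a hypothetical estimator of the probability
    that a trajectory beginning with `prefix` reaches a correct final
    answer; it stands in for CRM's learned conditional reward model.
    """
    rewards = []
    for t in range(1, len(steps) + 1):
        prefix = steps[:t]
        # The reward of step t depends on all preceding steps,
        # not on step t evaluated in isolation.
        rewards.append(success_prob(prefix))
    return rewards
```

The contrast with independent step evaluation is that `success_prob` always sees the prefix, so an individually plausible step can still score low if it follows a flawed derivation.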
[59] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
[60] The Impact of Reasoning Step Length on Large Language Models
[61] Multimodal Chain-of-Thought Reasoning in Language Models
[62] ReAct: Synergizing Reasoning and Acting in Language Models
[63] Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
[64] ART: Automatic Multi-Step Reasoning and Tool-Use for Large Language Models
[65] TimeBench: A Comprehensive Evaluation of Temporal Reasoning Abilities in Large Language Models
[66] Multi-Step Reasoning with Large Language Models, a Survey
[67] Understanding Social Reasoning in Language Models with Language Models
[68] Learning Adaptive Parallel Reasoning with Language Models
Precise credit assignment mechanism
CRM explicitly links process rewards to the final outcome through the conditional probability chain rule, enabling precise attribution of the final result to individual reasoning steps. This resolves the credit assignment ambiguity prevalent in existing PRMs.
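One way to make the attribution concrete is a difference-based sketch: each step is credited with the change in estimated success probability it induces. The names `step_scores` and `prior` are hypothetical, and this telescoping construction is an illustrative assumption, not necessarily the paper's exact chain-rule mechanism:

```python
def step_credits(step_scores: list[float], prior: float = 0.5) -> list[float]:
    """Attribute the final outcome to individual steps.

    `step_scores[t]` is the conditional success probability after step t,
    as a conditional reward model might produce; `prior` is the estimate
    before any step is taken. Each credit is the change in estimated
    success probability that the step induces, so credits telescope:
    their sum equals (final score - prior).
    """
    credits = []
    prev = prior
    for score in step_scores:
        credits.append(score - prev)
        prev = score
    return credits
```

A step that derails an otherwise promising derivation receives negative credit, while the step that recovers it receives a large positive one, which is the kind of fine-grained attribution the claim targets.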
[74] GRPO-: Credit Assignment Improves LLM Reasoning
[11] Process Reinforcement through Implicit Rewards
[22] Let's Verify Step by Step
[69] Step-DPO: Step-wise Preference Optimization for Long-Chain Reasoning of LLMs
[70] VAGEN: Reinforcing World Model Reasoning for Multi-Turn VLM Agents
[71] Complexity-Based Prompting for Multi-Step Reasoning
[72] LLaVA-CoT: Let Vision Language Models Reason Step-by-Step
[73] REFINER: Reasoning Feedback on Intermediate Representations
[75] R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
[76] VinePPO: Accurate Credit Assignment in RL for LLM Mathematical Reasoning
Probabilistically consistent cross-sample comparison
The consistent probabilistic formulation ensures that reward signals carry the same semantic meaning across different reasoning trajectories, making them directly comparable across samples. This facilitates downstream tasks such as Best-of-N sampling, beam search, and RL optimization.
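Because each score is claimed to carry the same probabilistic meaning, candidate trajectories can be ranked directly against one another. A minimal Best-of-N sketch under that assumption (aggregating by the final step's score is one simple choice; min or product over steps are common alternatives, and the paper may aggregate differently):

```python
def best_of_n(trajectories: list[str], step_scores: list[list[float]]) -> str:
    """Pick the trajectory with the highest final conditional success score.

    `step_scores[i]` holds the per-step scores for `trajectories[i]`.
    Comparing scores across samples is only meaningful if every score
    shares one probabilistic semantics (estimated probability of a
    correct outcome given the prefix), the property claimed here.
    """
    best = max(range(len(trajectories)), key=lambda i: step_scores[i][-1])
    return trajectories[best]

# Usage: the second candidate wins because its final score (0.7) is highest.
winner = best_of_n(["A", "B", "C"], [[0.2, 0.4], [0.3, 0.7], [0.6, 0.5]])
```

The same comparability argument extends to beam search (pruning partial trajectories by their latest conditional score) and to RL, where the scores serve as a shared-scale reward signal.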