R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Multimodal Large Language Model, Multimodal Reward Model, Stable Reinforcement Learning, Long-CoT Reasoning
Abstract:

Multimodal Reward Models (MRMs) play a crucial role in enhancing the performance of Multimodal Large Language Models (MLLMs). While recent advancements have primarily focused on improving the model structure and training data of MRMs, there has been limited exploration into the effectiveness of long-term reasoning capabilities for reward modeling and how to activate these capabilities in MRMs. In this paper, we explore how Reinforcement Learning (RL) can be used to improve reward modeling. Specifically, we reformulate the reward modeling problem as a rule-based RL task. However, we observe that directly applying existing RL algorithms, such as Reinforce++, to reward modeling often leads to training instability or even collapse due to the inherent limitations of these algorithms. To address this issue, we propose the StableReinforce algorithm, which refines the training loss, advantage estimation strategy, and reward design of existing RL methods. These refinements result in more stable training dynamics and superior performance. To facilitate MRM training, we collect 200K preference data from diverse datasets. Our reward model, R1-Reward, trained using the StableReinforce algorithm on this dataset, significantly improves performance on multimodal reward modeling benchmarks. Compared to previous SOTA models, R1-Reward achieves an 8.4% improvement on the VL Reward-Bench and a 14.3% improvement on the Multimodal Reward Bench. Moreover, with more inference compute, R1-Reward's performance is further enhanced, highlighting the potential of RL algorithms in optimizing MRMs.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes StableReinforce, a refinement of policy gradient methods for training multimodal reward models, and introduces R1-Reward trained on 200K preference pairs. It sits within the Policy Optimization Methods leaf, which contains six papers exploring policy gradient variants (GRPO, PPO, custom optimizers) for multimodal RL. This is a moderately populated research direction within the broader Reinforcement Learning Algorithms and Optimization Methods branch, indicating active but not overcrowded exploration of policy-based training for reward models.

The taxonomy reveals closely related work in neighboring leaves: Hybrid and Value-Based RL Approaches explores alternative optimization paradigms, while Reinforcement Learning from Human Feedback focuses on preference-based alignment (though the paper's rule-based formulation differs). The Chain-of-Thought and Reasoning-Enhanced Reward Models branch addresses explicit reasoning mechanisms, which connects to this work's emphasis on long-term reasoning capabilities. The sibling papers in Policy Optimization Methods share the core focus on policy gradients but vary in architectural choices and feedback granularity, suggesting this leaf represents a coherent but diverse research cluster.

Among 29 candidates examined, the StableReinforce algorithm contribution showed no clear refutation across 10 candidates, suggesting potential novelty in its specific refinements to loss, advantage estimation, and reward design. The rule-based RL reformulation similarly found no refuting work among 10 candidates. However, the R1-Reward-200K dataset contribution encountered one refutable candidate among nine examined, indicating some overlap in progressive difficulty training strategies for preference data. The limited search scope means these findings reflect top-K semantic matches rather than exhaustive coverage.

Based on the 29-candidate search, the algorithmic contributions appear more distinctive than the dataset contribution within the examined literature. The taxonomy context suggests the work occupies a moderately explored niche, with sibling papers pursuing related but distinct policy optimization approaches. A broader search beyond top-K semantic similarity might reveal additional relevant work, particularly in the reasoning-enhanced reward models direction or in general RL stability techniques adapted to multimodal settings.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 1

Research Landscape Overview

Core task: training multimodal reward models through reinforcement learning. The field has evolved into a structured landscape organized around five major branches. Reward Model Architecture and Training Paradigms explores how to design and train reward functions that can handle vision, language, and other modalities, with works like Unified Multimodal CoT Reward[1] and Unified Reward Model[2] proposing architectures that unify cross-modal signals. Reinforcement Learning Algorithms and Optimization Methods focuses on policy optimization techniques and algorithmic innovations, including approaches such as R1-omni[3] and Skywork R1v2[12] that refine training dynamics. Application Domains and Task-Specific Implementations addresses concrete use cases ranging from video understanding (Tuning Video Models RLAIF[4]) to document reasoning (DocThinker[34]) and even medical image registration (Multimodal Image Registration RL[7]). Evaluation and Benchmarking provides standardized testbeds like VideoRewardBench[41] to measure progress, while Safety and Security tackles alignment and robustness concerns exemplified by Safe RLHF-V[36].

Within the policy optimization branch, a particularly active line of work centers on integrating reasoning traces and iterative refinement into multimodal reward learning. R1-Reward[0] sits squarely in this cluster, emphasizing policy-level optimization methods that leverage reinforcement signals to improve both reasoning quality and multimodal alignment. Nearby efforts such as EchoInk-R1[5] and R1-VL[8] share a similar focus on refining vision-language models through RL-driven reward shaping, though they differ in architectural choices and the granularity of feedback. A key trade-off across these methods involves balancing sample efficiency against the richness of multimodal supervision: some approaches rely on dense process-level rewards (VRPRM[18]), while others adopt sparser outcome-based signals (Mixed-R1[15]).
Open questions remain around scaling these techniques to longer reasoning chains and ensuring that learned rewards generalize robustly across diverse visual and linguistic contexts.

Claimed Contributions

StableReinforce algorithm for stable reward model training

The authors introduce StableReinforce, a novel reinforcement learning algorithm that addresses training instability in reward modeling through three key refinements: pre-clipping to prevent numerical overflow, advantage filtering using the 3-sigma rule to handle outliers, and a consistency reward mechanism that ensures alignment between reasoning processes and final outputs.
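The first two refinements can be sketched in isolation. The function names and the numpy batch representation below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def pre_clip_log_ratio(logp_new, logp_old, clip=10.0):
    # Pre-clipping as described above: bound the log-probability ratio
    # *before* exponentiating, so exp() cannot overflow to inf/nan.
    log_ratio = np.clip(logp_new - logp_old, -clip, clip)
    return np.exp(log_ratio)

def filter_advantages(advantages, k=3.0):
    # Advantage filtering via the 3-sigma rule: drop samples whose
    # advantage lies more than k standard deviations from the batch mean.
    mu, sigma = advantages.mean(), advantages.std()
    mask = np.abs(advantages - mu) <= k * sigma
    return advantages[mask], mask
```

The pre-clip bound keeps the importance ratio finite even when the new and old policies diverge sharply; the 3-sigma filter removes the extreme-advantage outliers that would otherwise dominate the gradient.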

10 retrieved papers
Reformulation of reward modeling as rule-based RL task

The paper reformulates multimodal reward modeling as a reinforcement learning problem where the policy decides which of two answers is better, with rewards based on consistency with ground truth. This enables the application of RL techniques to activate long-term reasoning capabilities in reward models.
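A minimal sketch of such a rule-based reward follows; the `<answer>` tag format and the 1/2 choice labels are assumptions for illustration, not the paper's exact prompt format:

```python
import re

def rule_based_reward(model_output: str, ground_truth: str) -> float:
    # Hypothetical rule-based reward: the policy compares two candidate
    # answers and emits its choice inside <answer>...</answer> tags.
    # Reward is +1 if the extracted choice matches the ground-truth
    # preference label, otherwise 0; unparseable outputs earn nothing.
    match = re.search(r"<answer>\s*([12])\s*</answer>", model_output)
    if match is None:
        return 0.0
    return 1.0 if match.group(1) == ground_truth else 0.0
```

Because the reward is computed by a fixed rule against ground truth rather than by a learned critic, any policy-gradient method can be applied directly to the comparison task.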

10 retrieved papers
R1-Reward-200K dataset with progressive difficulty training strategy

The authors construct a 200K preference dataset from multiple sources and implement a progressive difficulty training strategy. They use GPT-4o to generate thinking processes for cold-start data and select challenging samples (requiring multiple attempts) for RL training, improving data efficiency.
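The "requiring multiple attempts" selection heuristic might be sketched as follows; `judge`, `needed_attempts`, and the threshold of two attempts are hypothetical stand-ins, not the authors' exact procedure:

```python
def needed_attempts(judge, sample, max_tries=5):
    # Count how many sampled judgments it takes until one is correct;
    # returns max_tries + 1 if the judge never succeeds within budget.
    for i in range(1, max_tries + 1):
        if judge(sample, try_idx=i):
            return i
    return max_tries + 1

def select_hard_samples(samples, judge, max_tries=5):
    # Keep only the samples the model could not judge correctly on its
    # first attempt; these "challenging" samples are reserved for RL.
    return [s for s in samples
            if needed_attempts(judge, s, max_tries) >= 2]
```

Filtering out samples the model already solves in one shot concentrates the RL budget on examples that still carry learning signal, which is the data-efficiency gain described above.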

9 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

StableReinforce algorithm for stable reward model training

The authors introduce StableReinforce, a novel reinforcement learning algorithm that addresses training instability in reward modeling through three key refinements: pre-clipping to prevent numerical overflow, advantage filtering using the 3-sigma rule to handle outliers, and a consistency reward mechanism that ensures alignment between reasoning processes and final outputs.

Contribution

Reformulation of reward modeling as rule-based RL task

The paper reformulates multimodal reward modeling as a reinforcement learning problem where the policy decides which of two answers is better, with rewards based on consistency with ground truth. This enables the application of RL techniques to activate long-term reasoning capabilities in reward models.

Contribution

R1-Reward-200K dataset with progressive difficulty training strategy

The authors construct a 200K preference dataset from multiple sources and implement a progressive difficulty training strategy. They use GPT-4o to generate thinking processes for cold-start data and select challenging samples (requiring multiple attempts) for RL training, improving data efficiency.