R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Multimodal Large Language Model, Multimodal Reward Model, Stable Reinforcement Learning, Long-CoT Reasoning
Abstract:

Multimodal Reward Models (MRMs) play a crucial role in enhancing the performance of Multimodal Large Language Models (MLLMs). While recent advancements have primarily focused on improving the model structure and training data of MRMs, there has been limited exploration into the effectiveness of long-term reasoning capabilities for reward modeling and how to activate these capabilities in MRMs. In this paper, we explore how Reinforcement Learning (RL) can be used to improve reward modeling. Specifically, we reformulate the reward modeling problem as a rule-based RL task. However, we observe that directly applying existing RL algorithms, such as Reinforce++, to reward modeling often leads to training instability or even collapse due to the inherent limitations of these algorithms. To address this issue, we propose the StableReinforce algorithm, which refines the training loss, advantage estimation strategy, and reward design of existing RL methods. These refinements result in more stable training dynamics and superior performance. To facilitate MRM training, we collect 200K preference data from diverse datasets. Our reward model, R1-Reward, trained using the StableReinforce algorithm on this dataset, significantly improves performance on multimodal reward modeling benchmarks. Compared to previous SOTA models, R1-Reward achieves an 8.4% improvement on the VL Reward-Bench and a 14.3% improvement on the Multimodal Reward Bench. Moreover, with more inference compute, R1-Reward's performance is further enhanced, highlighting the potential of RL algorithms in optimizing MRMs.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes StableReinforce, a refinement of policy gradient methods for training multimodal reward models, and introduces R1-Reward trained on 200K preference pairs. It sits within the Policy Optimization Methods leaf, which contains six papers exploring policy gradient variants (GRPO, PPO, custom optimizers) for multimodal RL. This is a moderately populated research direction within the broader Reinforcement Learning Algorithms and Optimization Methods branch, indicating active but not overcrowded exploration of policy-based training for reward models.

The taxonomy reveals closely related work in neighboring leaves: Hybrid and Value-Based RL Approaches explores alternative optimization paradigms, while Reinforcement Learning from Human Feedback focuses on preference-based alignment (though the paper's rule-based formulation differs). The Chain-of-Thought and Reasoning-Enhanced Reward Models branch addresses explicit reasoning mechanisms, which connects to this work's emphasis on long-term reasoning capabilities. The sibling papers in Policy Optimization Methods share the core focus on policy gradients but vary in architectural choices and feedback granularity, suggesting this leaf represents a coherent but diverse research cluster.

Among 29 candidates examined, the StableReinforce algorithm contribution showed no clear refutation across 10 candidates, suggesting potential novelty in its specific refinements to loss, advantage estimation, and reward design. The rule-based RL reformulation similarly found no refuting work among 10 candidates. However, the R1-Reward-200K dataset contribution encountered one refutable candidate among nine examined, indicating some overlap in progressive difficulty training strategies for preference data. The limited search scope means these findings reflect top-K semantic matches rather than exhaustive coverage.

Based on the 29-candidate search, the algorithmic contributions appear more distinctive than the dataset contribution within the examined literature. The taxonomy context suggests the work occupies a moderately explored niche, with sibling papers pursuing related but distinct policy optimization approaches. A broader search beyond top-K semantic similarity might reveal additional relevant work, particularly in the reasoning-enhanced reward models direction or in general RL stability techniques adapted to multimodal settings.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 1

Research Landscape Overview

Core task: training multimodal reward models through reinforcement learning. The field has evolved into a structured landscape organized around five major branches. Reward Model Architecture and Training Paradigms explores how to design and train reward functions that can handle vision, language, and other modalities, with works like Unified Multimodal CoT Reward[1] and Unified Reward Model[2] proposing architectures that unify cross-modal signals. Reinforcement Learning Algorithms and Optimization Methods focuses on policy optimization techniques and algorithmic innovations, including approaches such as R1-omni[3] and Skywork R1v2[12] that refine training dynamics. Application Domains and Task-Specific Implementations addresses concrete use cases ranging from video understanding (Tuning Video Models RLAIF[4]) to document reasoning (DocThinker[34]) and even medical image registration (Multimodal Image Registration RL[7]). Evaluation and Benchmarking provides standardized testbeds like VideoRewardBench[41] to measure progress, while Safety and Security tackles alignment and robustness concerns exemplified by Safe RLHF-V[36].

Within the policy optimization branch, a particularly active line of work centers on integrating reasoning traces and iterative refinement into multimodal reward learning. R1-Reward[0] sits squarely in this cluster, emphasizing policy-level optimization methods that leverage reinforcement signals to improve both reasoning quality and multimodal alignment. Nearby efforts such as EchoInk-R1[5] and R1-VL[8] share a similar focus on refining vision-language models through RL-driven reward shaping, though they differ in architectural choices and the granularity of feedback. A key trade-off across these methods involves balancing sample efficiency against the richness of multimodal supervision: some approaches rely on dense process-level rewards (VRPRM[18]), while others adopt sparser outcome-based signals (Mixed-R1[15]).
Open questions remain around scaling these techniques to longer reasoning chains and ensuring that learned rewards generalize robustly across diverse visual and linguistic contexts.

Claimed Contributions

StableReinforce algorithm for stable reward model training

The authors introduce StableReinforce, a novel reinforcement learning algorithm that addresses training instability in reward modeling through three key refinements: pre-clipping to prevent numerical overflow, advantage filtering using the 3-sigma rule to handle outliers, and a consistency reward mechanism that ensures alignment between reasoning processes and final outputs.
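The first two refinements can be sketched in isolation. The function names and the numpy batch representation below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def pre_clip_log_ratio(logp_new, logp_old, clip=10.0):
    # Pre-clipping as described above: bound the log-probability ratio
    # *before* exponentiating, so exp() cannot overflow to inf/nan.
    log_ratio = np.clip(logp_new - logp_old, -clip, clip)
    return np.exp(log_ratio)

def filter_advantages(advantages, k=3.0):
    # Advantage filtering via the 3-sigma rule: drop samples whose
    # advantage lies more than k standard deviations from the batch mean.
    mu, sigma = advantages.mean(), advantages.std()
    mask = np.abs(advantages - mu) <= k * sigma
    return advantages[mask], mask
```

The pre-clip bound keeps the importance ratio finite even when the new and old policies diverge sharply; the 3-sigma filter removes the extreme-advantage outliers that would otherwise dominate the gradient.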

10 retrieved papers
Reformulation of reward modeling as rule-based RL task

The paper reformulates multimodal reward modeling as a reinforcement learning problem where the policy decides which of two answers is better, with rewards based on consistency with ground truth. This enables the application of RL techniques to activate long-term reasoning capabilities in reward models.
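A minimal sketch of such a rule-based reward follows; the `<answer>` tag format and the 1/2 choice labels are assumptions for illustration, not the paper's exact prompt format:

```python
import re

def rule_based_reward(model_output: str, ground_truth: str) -> float:
    # Hypothetical rule-based reward: the policy compares two candidate
    # answers and emits its choice inside <answer>...</answer> tags.
    # Reward is +1 if the extracted choice matches the ground-truth
    # preference label, otherwise 0; unparseable outputs earn nothing.
    match = re.search(r"<answer>\s*([12])\s*</answer>", model_output)
    if match is None:
        return 0.0
    return 1.0 if match.group(1) == ground_truth else 0.0
```

Because the reward is computed by a fixed rule against ground truth rather than by a learned critic, any policy-gradient method can be applied directly to the comparison task.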

10 retrieved papers
R1-Reward-200K dataset with progressive difficulty training strategy

The authors construct a 200K preference dataset from multiple sources and implement a progressive difficulty training strategy. They use GPT-4o to generate thinking processes for cold-start data and select challenging samples (requiring multiple attempts) for RL training, improving data efficiency.
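The "requiring multiple attempts" selection heuristic might be sketched as follows; `judge`, `needed_attempts`, and the threshold of two attempts are hypothetical stand-ins, not the authors' exact procedure:

```python
def needed_attempts(judge, sample, max_tries=5):
    # Count how many sampled judgments it takes until one is correct;
    # returns max_tries + 1 if the judge never succeeds within budget.
    for i in range(1, max_tries + 1):
        if judge(sample, try_idx=i):
            return i
    return max_tries + 1

def select_hard_samples(samples, judge, max_tries=5):
    # Keep only the samples the model could not judge correctly on its
    # first attempt; these "challenging" samples are reserved for RL.
    return [s for s in samples
            if needed_attempts(judge, s, max_tries) >= 2]
```

Filtering out samples the model already solves in one shot concentrates the RL budget on examples that still carry learning signal, which is the data-efficiency gain described above.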

9 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

StableReinforce algorithm for stable reward model training

The authors introduce StableReinforce, a novel reinforcement learning algorithm that addresses training instability in reward modeling through three key refinements: pre-clipping to prevent numerical overflow, advantage filtering using the 3-sigma rule to handle outliers, and a consistency reward mechanism that ensures alignment between reasoning processes and final outputs.

Contribution

Reformulation of reward modeling as rule-based RL task

The paper reformulates multimodal reward modeling as a reinforcement learning problem where the policy decides which of two answers is better, with rewards based on consistency with ground truth. This enables the application of RL techniques to activate long-term reasoning capabilities in reward models.

Contribution

R1-Reward-200K dataset with progressive difficulty training strategy

The authors construct a 200K preference dataset from multiple sources and implement a progressive difficulty training strategy. They use GPT-4o to generate thinking processes for cold-start data and select challenging samples (requiring multiple attempts) for RL training, improving data efficiency.