SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: Multimodal Large Language Models, Reinforcement Learning, Reasoning
Abstract:

Recent advances have shown success in eliciting strong reasoning abilities in multimodal large language models (MLLMs) through rule-based reinforcement learning (RL) with outcome rewards. However, this paradigm typically lacks supervision over the thinking process leading to the final outcome. As a result, the model may learn sub-optimal reasoning strategies, which can hinder its generalization ability. In light of this, we propose SophiaVL-R1, an attempt to add reward signals for the thinking process to this paradigm. To achieve this, we first train a thinking reward model that evaluates the quality of the entire thinking process. Given that the thinking reward may be unreliable for certain samples due to reward hacking, we propose the Trust-GRPO method, which assigns a trustworthiness weight to the thinking reward during training. This weight is computed by comparing the thinking rewards of responses leading to correct answers versus incorrect answers, helping to mitigate the impact of potentially unreliable thinking rewards. Moreover, we design an annealing training strategy that gradually reduces the thinking reward over time, allowing the model to rely more on the accurate rule-based outcome reward in later training stages. Experiments show that our SophiaVL-R1 surpasses a series of reasoning MLLMs on various benchmarks (e.g., MathVista, MMMU), demonstrating strong reasoning and generalization capabilities. Notably, our SophiaVL-R1-7B even outperforms LLaVA-OneVision-72B on most benchmarks, despite the latter having 10× more parameters. All code, models, and datasets will be made publicly available.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes SophiaVL-R1, which introduces a thinking reward model to evaluate the entire reasoning process in multimodal large language models, combined with a Trust-GRPO training method that weights process rewards by trustworthiness. This work resides in the 'Policy Optimization Methods' leaf under 'Reinforcement Learning Frameworks for Multimodal Reasoning', which contains four papers total. The leaf focuses on policy gradient and relative policy optimization techniques for MLLM training, explicitly excluding outcome-only reward models. This places the paper in a moderately populated research direction within a broader taxonomy of fifty papers across thirty-six topics, suggesting active but not overcrowded exploration of RL-based reasoning enhancement.

The taxonomy reveals several neighboring research directions that contextualize this work. The sibling leaf 'Multi-Domain and Multi-Agent RL Frameworks' addresses heterogeneous tasks and agent interactions, while the parent branch also includes 'RL Paradigms and Theoretical Surveys'. Adjacent branches cover 'Process Reward Models and Step-Level Supervision' (with visual PRMs and generative PRMs) and 'Chain-of-Thought and Structured Reasoning Paradigms' (including autonomous multi-stage reasoning). The paper bridges policy optimization methods with process-level supervision concepts, drawing on ideas from both the RL frameworks branch and the process reward models branch, though it sits formally within the former.

Among twenty-three candidates examined, the analysis found three refutable pairs for the 'SophiaVL-R1 multimodal reasoning model' contribution (ten candidates examined), while the 'thinking reward model' (ten candidates) and 'Trust-GRPO algorithm' (three candidates) showed no clear refutations. The limited search scope—top-K semantic matches plus citation expansion—means these statistics reflect a targeted rather than exhaustive review. The thinking reward model and Trust-GRPO components appear more distinctive within the examined literature, whereas the overall model architecture encounters some overlapping prior work among the candidates reviewed.

Based on the limited search of twenty-three candidates, the paper's process-level reward modeling and trustworthiness weighting mechanisms appear relatively novel, while the integrated model faces more substantial prior work. The taxonomy structure suggests this research direction—combining RL policy optimization with process supervision—remains an active area with room for methodological contributions. However, the analysis does not cover the full landscape of multimodal reasoning research, and a broader literature search might reveal additional related work in adjacent branches such as 'Step-Level Reasoning with Fine-Grained Rewards' or 'Generative Process Reward Models'.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 23
Refutable Papers: 3

Research Landscape Overview

Core task: reinforcing multimodal large language model reasoning with process-level supervision. The field has organized itself around several complementary directions. Reinforcement learning frameworks for multimodal reasoning explore policy optimization methods that align vision-language models with human preferences and reasoning objectives, often drawing on techniques like mixed preference optimization (Mixed Preference Optimization[3]) and step-level reward signals. Process reward models and step-level supervision form a dense branch focused on training verifiers that evaluate intermediate reasoning steps rather than only final answers, with works such as Vision Process Rewards[20] and MM-PRM[37] developing multimodal extensions of process-level feedback.

Chain-of-thought and structured reasoning paradigms investigate how to elicit and represent explicit reasoning traces in vision-language settings, spanning methods like Llava-cot[7] and Insight-v[6]. Data construction and training strategies address the challenge of generating high-quality reasoning annotations at scale, while domain-specific applications target areas such as scientific reasoning, medical diagnosis, and robotic grasping. Additional branches cover reasoning enhancement through external knowledge, hallucination mitigation, architectural efficiency, and comprehensive surveys that synthesize emerging trends.

Particularly active lines of work center on integrating reinforcement learning with process-level rewards to improve step-by-step reasoning quality in multimodal contexts. SophiaVL[0] sits within the policy optimization cluster, emphasizing how RL techniques can refine reasoning trajectories by leveraging fine-grained supervision signals at each reasoning step. This approach contrasts with outcome-based methods and aligns closely with Vision-r1[1] and R1-vl[2], which similarly apply policy gradient or actor-critic frameworks to vision-language reasoning.
Compared to Mixed Preference Optimization[3], which blends multiple preference signals, SophiaVL[0] focuses more directly on process-level reward shaping to guide intermediate reasoning decisions. A key open question across these works is how to balance the cost of annotating step-level feedback with the gains in reasoning reliability, and whether learned process reward models can generalize across diverse multimodal tasks without extensive domain-specific tuning.

Claimed Contributions

Thinking reward model for holistic reasoning quality evaluation

The authors introduce a thinking reward model trained on annotated reasoning responses that evaluates the entire thinking process holistically rather than step-by-step. This model assesses reasoning quality across dimensions such as logical soundness, consistency, and redundancy to help distinguish favorable from flawed reasoning patterns.

10 retrieved papers
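To make the "holistic rather than step-by-step" distinction concrete, the toy snippet below aggregates hypothetical per-dimension quality scores for an entire reasoning trace into a single scalar. This is illustrative only: the paper trains a neural thinking reward model, and the dimension names, score values, and unweighted-mean aggregation here are all assumptions, not the authors' method.

```python
# Illustrative only: the paper trains a neural thinking reward model;
# this toy aggregator just shows the holistic, multi-dimension idea.

def thinking_reward(scores, weights=None):
    """Combine per-dimension scores (each in [0, 1]) into one scalar reward."""
    if weights is None:
        weights = {dim: 1.0 for dim in scores}  # unweighted mean by default
    total = sum(weights[d] * scores[d] for d in scores)
    return total / sum(weights[d] for d in scores)

# Hypothetical scores for one full reasoning trace (higher is better;
# "redundancy" here is scored as absence of redundancy).
trace_scores = {"logical_soundness": 0.9, "consistency": 0.8, "redundancy": 0.6}
print(round(thinking_reward(trace_scores), 3))  # → 0.767
```

The key point the sketch captures is that one score is assigned to the whole trace, in contrast to process reward models that emit a score per intermediate step.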
Trust-GRPO algorithm with trustworthiness weighting and annealing

The authors propose Trust-GRPO, a training algorithm that assigns a trustworthiness weight to thinking rewards by comparing rewards of correct versus incorrect responses. It includes a time-based annealing strategy that gradually reduces thinking reward influence, allowing the model to rely more on accurate rule-based outcome rewards in later training stages.

3 retrieved papers
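A minimal sketch of the two mechanisms described above, reconstructed from this report's wording alone. The sigmoid-of-the-gap trust weight, the linear annealing schedule, and the additive reward combination are all assumptions on my part; the paper's exact formulas may differ.

```python
import math

def trust_weight(think_rewards, correct_mask):
    """Weight in (0, 1): high when responses with correct answers also
    receive higher thinking rewards than incorrect ones in the same group."""
    correct = [r for r, ok in zip(think_rewards, correct_mask) if ok]
    wrong = [r for r, ok in zip(think_rewards, correct_mask) if not ok]
    if not correct or not wrong:          # no contrast available in this group
        return 0.5
    gap = sum(correct) / len(correct) - sum(wrong) / len(wrong)
    return 1.0 / (1.0 + math.exp(-gap))   # sigmoid of the reward gap (assumed)

def anneal(step, total_steps):
    """Linearly decay the thinking-reward coefficient toward 0 (assumed schedule)."""
    return max(0.0, 1.0 - step / total_steps)

def combined_reward(outcome_r, think_r, w_trust, w_anneal):
    """Rule-based outcome reward plus down-weighted thinking reward."""
    return outcome_r + w_trust * w_anneal * think_r
```

Under this reading, a group where correct answers also earn higher thinking rewards gets a trust weight above 0.5, so its thinking reward counts more; late in training the annealing factor shrinks regardless, leaving the outcome reward dominant.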
SophiaVL-R1 multimodal reasoning model

The authors develop SophiaVL-R1, a multimodal large language model that enhances reasoning by integrating model-generated thinking rewards with rule-based outcome rewards during reinforcement learning training. The model demonstrates strong reasoning and generalization capabilities across various benchmarks.

10 retrieved papers
Can Refute (overlapping prior work was found among the retrieved candidates)
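For context on the training loop this contribution plugs into: GRPO-family methods compute advantages by normalizing each sampled response's reward within its group. The sketch below is the standard group-relative normalization, not the paper's code; in SophiaVL-R1's setting the per-response reward fed in would be the combined outcome-plus-weighted-thinking reward.

```python
import statistics

def group_advantages(rewards):
    """Standard GRPO-style advantage: normalize each response's reward
    against the mean and std of its own sampled group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)   # population std over the group
    if sigma == 0:                       # all rewards equal: no learning signal
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# Combined rewards for a group of 4 sampled responses (hypothetical values).
print(group_advantages([1.2, 0.4, 1.0, 0.4]))
```

Because advantages are relative within a group, adding a thinking reward changes which responses in the group are pushed up or down, which is exactly where an unreliable thinking reward could mislead training and why a trust weight is useful.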

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Thinking reward model for holistic reasoning quality evaluation

Contribution

Trust-GRPO algorithm with trustworthiness weighting and annealing

Contribution

SophiaVL-R1 multimodal reasoning model
