SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: Multimodal Large Language Models, Reinforcement Learning, Reasoning
Abstract:

Recent advances have shown success in eliciting strong reasoning abilities in multimodal large language models (MLLMs) through rule-based reinforcement learning (RL) with outcome rewards. However, this paradigm typically lacks supervision over the thinking process leading to the final outcome. As a result, the model may learn sub-optimal reasoning strategies, which can hinder its generalization ability. In light of this, we propose SophiaVL-R1, an attempt to add reward signals for the thinking process to this paradigm. To achieve this, we first train a thinking reward model that evaluates the quality of the entire thinking process. Given that the thinking reward may be unreliable for certain samples due to reward hacking, we propose the Trust-GRPO method, which assigns a trustworthiness weight to the thinking reward during training. This weight is computed by comparing the thinking rewards of responses leading to correct answers versus incorrect answers, helping to mitigate the impact of potentially unreliable thinking rewards. Moreover, we design an annealing training strategy that gradually reduces the thinking reward over time, allowing the model to rely more on the accurate rule-based outcome reward in later training stages. Experiments show that our SophiaVL-R1 surpasses a series of reasoning MLLMs on various benchmarks (e.g., MathVista, MMMU), demonstrating strong reasoning and generalization capabilities. Notably, our SophiaVL-R1-7B even outperforms LLaVA-OneVision-72B on most benchmarks, despite the latter having 10× more parameters. All code, models, and datasets will be made publicly available.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes SophiaVL-R1, which introduces a thinking reward model to evaluate the entire reasoning process in multimodal large language models, combined with a Trust-GRPO training method that weights process rewards by trustworthiness. This work resides in the 'Policy Optimization Methods' leaf under 'Reinforcement Learning Frameworks for Multimodal Reasoning', which contains four papers total. The leaf focuses on policy gradient and relative policy optimization techniques for MLLM training, explicitly excluding outcome-only reward models. This places the paper in a moderately populated research direction within a broader taxonomy of fifty papers across thirty-six topics, suggesting active but not overcrowded exploration of RL-based reasoning enhancement.

The taxonomy reveals several neighboring research directions that contextualize this work. The sibling leaf 'Multi-Domain and Multi-Agent RL Frameworks' addresses heterogeneous tasks and agent interactions, while the parent branch also includes 'RL Paradigms and Theoretical Surveys'. Adjacent branches cover 'Process Reward Models and Step-Level Supervision' (with visual PRMs and generative PRMs) and 'Chain-of-Thought and Structured Reasoning Paradigms' (including autonomous multi-stage reasoning). The paper bridges policy optimization methods with process-level supervision concepts, drawing on ideas from both the RL frameworks branch and the process reward models branch, though it sits formally within the former.

Among twenty-three candidates examined, the analysis found three refutable pairs for the 'SophiaVL-R1 multimodal reasoning model' contribution (ten candidates examined), while the 'thinking reward model' (ten candidates) and 'Trust-GRPO algorithm' (three candidates) showed no clear refutations. The limited search scope—top-K semantic matches plus citation expansion—means these statistics reflect a targeted rather than exhaustive review. The thinking reward model and Trust-GRPO components appear more distinctive within the examined literature, whereas the overall model architecture encounters some overlapping prior work among the candidates reviewed.

Based on the limited search of twenty-three candidates, the paper's process-level reward modeling and trustworthiness weighting mechanisms appear relatively novel, while the integrated model faces more substantial prior work. The taxonomy structure suggests this research direction—combining RL policy optimization with process supervision—remains an active area with room for methodological contributions. However, the analysis does not cover the full landscape of multimodal reasoning research, and a broader literature search might reveal additional related work in adjacent branches such as 'Step-Level Reasoning with Fine-Grained Rewards' or 'Generative Process Reward Models'.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 23
Refutable Papers: 3

Research Landscape Overview

Core task: reinforcing multimodal large language model reasoning with process-level supervision. The field has organized itself around several complementary directions. Reinforcement learning frameworks for multimodal reasoning explore policy optimization methods that align vision-language models with human preferences and reasoning objectives, often drawing on techniques like mixed preference optimization (Mixed Preference Optimization[3]) and step-level reward signals. Process reward models and step-level supervision form a dense branch focused on training verifiers that evaluate intermediate reasoning steps rather than only final answers, with works such as Vision Process Rewards[20] and MM-PRM[37] developing multimodal extensions of process-level feedback.

Chain-of-thought and structured reasoning paradigms investigate how to elicit and represent explicit reasoning traces in vision-language settings, spanning methods like Llava-cot[7] and Insight-v[6]. Data construction and training strategies address the challenge of generating high-quality reasoning annotations at scale, while domain-specific applications target areas such as scientific reasoning, medical diagnosis, and robotic grasping. Additional branches cover reasoning enhancement through external knowledge, hallucination mitigation, architectural efficiency, and comprehensive surveys that synthesize emerging trends.

Particularly active lines of work center on integrating reinforcement learning with process-level rewards to improve step-by-step reasoning quality in multimodal contexts. SophiaVL[0] sits within the policy optimization cluster, emphasizing how RL techniques can refine reasoning trajectories by leveraging fine-grained supervision signals at each reasoning step. This approach contrasts with outcome-based methods and aligns closely with Vision-r1[1] and R1-vl[2], which similarly apply policy gradient or actor-critic frameworks to vision-language reasoning.
Compared to Mixed Preference Optimization[3], which blends multiple preference signals, SophiaVL[0] focuses more directly on process-level reward shaping to guide intermediate reasoning decisions. A key open question across these works is how to balance the cost of annotating step-level feedback with the gains in reasoning reliability, and whether learned process reward models can generalize across diverse multimodal tasks without extensive domain-specific tuning.

Claimed Contributions

Thinking reward model for holistic reasoning quality evaluation

The authors introduce a thinking reward model trained on annotated reasoning responses that evaluates the entire thinking process holistically rather than step-by-step. This model assesses reasoning quality across dimensions such as logical soundness, consistency, and redundancy to help distinguish favorable from flawed reasoning patterns.

10 retrieved papers
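To make the "holistic rather than step-by-step" distinction concrete, the toy snippet below aggregates hypothetical per-dimension quality scores for an entire reasoning trace into a single scalar. This is illustrative only: the paper trains a neural thinking reward model, and the dimension names, score values, and unweighted-mean aggregation here are all assumptions, not the authors' method.

```python
# Illustrative only: the paper trains a neural thinking reward model;
# this toy aggregator just shows the holistic, multi-dimension idea.

def thinking_reward(scores, weights=None):
    """Combine per-dimension scores (each in [0, 1]) into one scalar reward."""
    if weights is None:
        weights = {dim: 1.0 for dim in scores}  # unweighted mean by default
    total = sum(weights[d] * scores[d] for d in scores)
    return total / sum(weights[d] for d in scores)

# Hypothetical scores for one full reasoning trace (higher is better;
# "redundancy" here is scored as absence of redundancy).
trace_scores = {"logical_soundness": 0.9, "consistency": 0.8, "redundancy": 0.6}
print(round(thinking_reward(trace_scores), 3))  # → 0.767
```

The key point the sketch captures is that one score is assigned to the whole trace, in contrast to process reward models that emit a score per intermediate step.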
Trust-GRPO algorithm with trustworthiness weighting and annealing

The authors propose Trust-GRPO, a training algorithm that assigns a trustworthiness weight to thinking rewards by comparing rewards of correct versus incorrect responses. It includes a time-based annealing strategy that gradually reduces thinking reward influence, allowing the model to rely more on accurate rule-based outcome rewards in later training stages.

3 retrieved papers
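A minimal sketch of the two mechanisms described above, reconstructed from this report's wording alone. The sigmoid-of-the-gap trust weight, the linear annealing schedule, and the additive reward combination are all assumptions on my part; the paper's exact formulas may differ.

```python
import math

def trust_weight(think_rewards, correct_mask):
    """Weight in (0, 1): high when responses with correct answers also
    receive higher thinking rewards than incorrect ones in the same group."""
    correct = [r for r, ok in zip(think_rewards, correct_mask) if ok]
    wrong = [r for r, ok in zip(think_rewards, correct_mask) if not ok]
    if not correct or not wrong:          # no contrast available in this group
        return 0.5
    gap = sum(correct) / len(correct) - sum(wrong) / len(wrong)
    return 1.0 / (1.0 + math.exp(-gap))   # sigmoid of the reward gap (assumed)

def anneal(step, total_steps):
    """Linearly decay the thinking-reward coefficient toward 0 (assumed schedule)."""
    return max(0.0, 1.0 - step / total_steps)

def combined_reward(outcome_r, think_r, w_trust, w_anneal):
    """Rule-based outcome reward plus down-weighted thinking reward."""
    return outcome_r + w_trust * w_anneal * think_r
```

Under this reading, a group where correct answers also earn higher thinking rewards gets a trust weight above 0.5, so its thinking reward counts more; late in training the annealing factor shrinks regardless, leaving the outcome reward dominant.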
SophiaVL-R1 multimodal reasoning model

The authors develop SophiaVL-R1, a multimodal large language model that enhances reasoning by integrating model-generated thinking rewards with rule-based outcome rewards during reinforcement learning training. The model demonstrates strong reasoning and generalization capabilities across various benchmarks.

10 retrieved papers
Can Refute (overlapping prior work was found among the retrieved candidates)
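For context on the training loop this contribution plugs into: GRPO-family methods compute advantages by normalizing each sampled response's reward within its group. The sketch below is the standard group-relative normalization, not the paper's code; in SophiaVL-R1's setting the per-response reward fed in would be the combined outcome-plus-weighted-thinking reward.

```python
import statistics

def group_advantages(rewards):
    """Standard GRPO-style advantage: normalize each response's reward
    against the mean and std of its own sampled group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)   # population std over the group
    if sigma == 0:                       # all rewards equal: no learning signal
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# Combined rewards for a group of 4 sampled responses (hypothetical values).
print(group_advantages([1.2, 0.4, 1.0, 0.4]))
```

Because advantages are relative within a group, adding a thinking reward changes which responses in the group are pushed up or down, which is exactly where an unreliable thinking reward could mislead training and why a trust weight is useful.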

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Thinking reward model for holistic reasoning quality evaluation

Contribution

Trust-GRPO algorithm with trustworthiness weighting and annealing

Contribution

SophiaVL-R1 multimodal reasoning model
