Perception-R1: Advancing Multimodal Reasoning Capabilities of MLLMs via Visual Perception Reward

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Multimodal Large Language Models, Multimodal Reasoning, Reinforcement Learning
Abstract:

Enhancing the multimodal reasoning capabilities of Multimodal Large Language Models (MLLMs) is a challenging task that has attracted increasing attention in the community. Recently, several studies have applied Reinforcement Learning with Verifiable Rewards (RLVR) to the multimodal domain to enhance the reasoning abilities of MLLMs. However, these works largely overlook the enhancement of multimodal perception capabilities, which serve as a core prerequisite and foundational component of complex multimodal reasoning. Through McNemar's test, we find that existing RLVR methods fail to effectively enhance the multimodal perception capabilities of MLLMs, thereby limiting further improvement in multimodal reasoning. To address this limitation, we propose Perception-R1, which introduces a novel visual perception reward that explicitly encourages MLLMs to perceive visual content accurately, thereby effectively incentivizing both their multimodal perception and reasoning capabilities. Specifically, we first collect textual visual annotations from the CoT trajectories of multimodal problems, which serve as visual references for reward assignment. During RLVR training, we employ a judging LLM to assess the consistency between the visual annotations and the responses generated by the MLLM, and assign the visual perception reward based on these consistency judgments. Extensive experiments on several multimodal math and general benchmarks demonstrate the effectiveness and robustness of Perception-R1, which achieves superior performance on all benchmarks using only 1,442 training samples.
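The reward assignment the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the binary reward scale, and the keyword-matching `toy_judge` (standing in for the actual judging LLM) are all assumptions of this sketch.

```python
from typing import Callable

def perception_reward(annotation: str, response: str,
                      judge: Callable[[str, str], bool]) -> float:
    """Assign a binary visual perception reward.

    `judge` stands in for the judging LLM: it returns True when the
    model response is consistent with the reference visual annotation
    extracted from a CoT trajectory.
    """
    return 1.0 if judge(annotation, response) else 0.0

def toy_judge(annotation: str, response: str) -> bool:
    """Toy stand-in for an LLM judge: naive keyword containment."""
    return all(tok in response.lower() for tok in annotation.lower().split())

# A response consistent with the reference annotation earns the reward.
r = perception_reward("two red circles",
                      "The image shows two red circles overlapping.",
                      toy_judge)
```

In practice the judge would be a prompted LLM call rather than string matching; the point of the sketch is only the reward interface, which keeps the perception signal separable from the standard accuracy reward.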

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Perception-R1, a reinforcement learning method that introduces visual perception rewards to enhance both perception accuracy and reasoning quality in multimodal large language models. It resides in the 'Perception-Aware Reward Design' leaf under 'Reinforcement Learning-Based Reasoning Enhancement', which contains only two papers total. This represents a relatively sparse research direction within the broader taxonomy of fifty papers across thirty-six topics, suggesting the specific focus on perception-aware reward mechanisms remains an emerging area rather than a crowded subfield.

The taxonomy tree reveals neighboring leaves including 'Pure RL Reasoning Emergence' (two papers on emergent reasoning without supervised rationales), 'Preference-Based Optimization' (one paper on preference datasets), and 'RL for Vision-Language-Action Models' (two papers on embodied agents). The paper's emphasis on explicit visual perception rewards distinguishes it from these adjacent directions, which either focus on general policy learning or action-grounded reasoning. The scope note for this leaf explicitly excludes general reward mechanisms without perception-specific components, positioning the work at the intersection of visual grounding and reinforcement learning.

Among thirty candidates examined through semantic search and citation expansion, the contribution-level analysis shows mixed novelty signals. The investigation of perception limitations in RLVR-trained models examined ten candidates with zero refutations, suggesting this diagnostic angle appears relatively unexplored. However, the core Perception-R1 method examined ten candidates with one refutable match, and the performance claims examined ten candidates with two refutable matches. These statistics indicate that while the perception-focused diagnostic work may be novel, the technical approach and empirical contributions face more substantial prior work within the limited search scope.

The analysis reflects a targeted literature search rather than exhaustive coverage, examining thirty semantically similar papers from a field of fifty total works. The relatively sparse taxonomy leaf and limited refutations for the diagnostic contribution suggest potential novelty in identifying perception bottlenecks, though the method itself encounters overlapping prior work. A broader search beyond top-K semantic matches might reveal additional relevant comparisons, particularly in adjacent reinforcement learning or visual grounding literature not captured in this taxonomy.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: enhancing multimodal reasoning capabilities of multimodal large language models. The field has evolved into a rich landscape with several major branches. Reinforcement Learning-Based Reasoning Enhancement explores how reward signals and policy optimization can guide models toward more robust reasoning, often incorporating perception-aware reward design or cold-start strategies. Chain-of-Thought and Rationale-Based Reasoning focuses on eliciting step-by-step explanations, with works like Multimodal Chain-of-Thought[12] and DDCoT[14] pioneering structured intermediate reasoning. Supervised and Hybrid Training Approaches blend traditional fine-tuning with preference optimization methods, while Architectural and Representation Innovations introduce novel encoders or cross-modal fusion mechanisms. Specialized Reasoning Capabilities target domain-specific challenges such as mathematical problem-solving (Math-LLaVA[3]) or spatial understanding (SpatialRGPT[9]), and Application-Driven Reasoning Systems deploy these models in real-world contexts like autonomous driving (DriveVLM[29]) or navigation (NavGPT[32]). Evaluation, Benchmarking, and Analysis branches provide critical assessment tools (VHELM[11], NPHardEval4V[4]), while Comprehensive Surveys and Reviews (Multimodal Reasoning Survey[41], Reinforced MLLM Survey[34]) synthesize emerging trends.

Within the reinforcement learning branch, a particularly active line of work centers on designing reward functions that account for perceptual grounding and reasoning quality. Perception-R1[0] exemplifies this direction by integrating perception-aware rewards to align visual understanding with logical inference, contrasting with more general RL frameworks like Vision-R1[1] or VLA-R1[5] that may prioritize broader policy learning. Cold Start Reasoning[37] addresses the challenge of bootstrapping reasoning from limited initial supervision, highlighting a key trade-off between sample efficiency and reasoning depth.

Meanwhile, works such as Mixed Preference Optimization[6] and DeepThinkVLA[7] explore hybrid strategies that combine supervised signals with reinforcement feedback. Perception-R1[0] sits at the intersection of these themes, emphasizing how carefully crafted perceptual rewards can guide models to produce more interpretable and accurate multimodal reasoning, a focus that distinguishes it from neighboring efforts that treat perception and reasoning as more loosely coupled components.

Claimed Contributions

Investigation of multimodal perception limitations in RLVR-trained MLLMs

The authors conduct McNemar's test and error analysis to demonstrate that existing accuracy-only RLVR methods fail to significantly improve the multimodal perception capabilities of MLLMs compared to base models, identifying this as a key limitation for multimodal reasoning advancement.

10 retrieved papers
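The diagnostic above rests on McNemar's test for paired binary outcomes: each benchmark item is answered by both the base model and the RLVR-trained model, and the test asks whether the two discordant counts differ significantly. A minimal sketch, assuming only the two discordant cell counts are available (the variable names and example counts below are hypothetical, not the paper's data):

```python
import math

def mcnemar_test(b: int, c: int) -> tuple[float, float]:
    """McNemar's test on paired binary outcomes.

    b: items the base model got right but the RLVR model got wrong.
    c: items the base model got wrong but the RLVR model got right.
    Returns (chi-square statistic, two-sided p-value), 1 degree of
    freedom. Uses the uncorrected statistic; a continuity-corrected or
    exact binomial variant is preferable when b + c is small.
    """
    chi2 = (b - c) ** 2 / (b + c)
    # Survival function of chi-square with 1 df via the error function:
    # P(X > x) = erfc(sqrt(x / 2)) for X ~ chi2(1).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Hypothetical discordant counts: no significant difference would
# support the report's claim that accuracy-only RLVR leaves
# perception largely unchanged.
chi2, p = mcnemar_test(b=15, c=5)
```

A non-significant p-value here (at a conventional 0.05 threshold) is what would indicate that RLVR training did not measurably change perception accuracy relative to the base model.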
Perception-R1 method with visual perception reward

The authors introduce Perception-R1, a novel training approach that incorporates a visual perception reward alongside standard accuracy rewards in RLVR. This reward is computed by extracting visual annotations from CoT trajectories and using a judging LLM to assess consistency between these annotations and model-generated responses, thereby explicitly encouraging accurate visual perception.

10 retrieved papers
Can Refute
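One plausible way the perception reward could be folded into RLVR training is as an additive term next to the accuracy reward, with group-normalized advantages as in GRPO-style methods. The weighting `lam` and the normalization scheme below are assumptions of this sketch; the report does not specify the paper's exact formulation.

```python
def total_reward(answer_correct: bool, perception_consistent: bool,
                 lam: float = 0.5) -> float:
    """Combine the standard accuracy reward with the visual perception
    reward. `lam` is a hypothetical weighting knob, not a value from
    the paper."""
    acc = 1.0 if answer_correct else 0.0
    perc = 1.0 if perception_consistent else 0.0
    return acc + lam * perc

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: standardize rewards within a group of
    rollouts for the same prompt (mean 0, unit std). Whether
    Perception-R1 uses exactly this normalization is an assumption."""
    mu = sum(rewards) / len(rewards)
    var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard against a zero-variance group
    return [(r - mu) / std for r in rewards]

# Four hypothetical rollouts for one prompt: (correct?, consistent?)
rollouts = [(True, True), (True, False), (False, False), (False, True)]
advs = group_advantages([total_reward(a, p) for a, p in rollouts])
```

The design point the sketch illustrates is that a rollout with a correct answer but inconsistent perception earns less than one that is both correct and visually grounded, so the policy gradient pushes toward accurate perception rather than answer-only shortcuts.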
Superior performance with exceptional data efficiency

The authors demonstrate through comprehensive experiments that Perception-R1 achieves state-of-the-art performance across multiple multimodal benchmarks while using substantially fewer training samples (1,442) compared to existing methods, showcasing both effectiveness and exceptional data efficiency.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Investigation of multimodal perception limitations in RLVR-trained MLLMs

The authors conduct McNemar's test and error analysis to demonstrate that existing accuracy-only RLVR methods fail to significantly improve the multimodal perception capabilities of MLLMs compared to base models, identifying this as a key limitation for multimodal reasoning advancement.

Contribution

Perception-R1 method with visual perception reward

The authors introduce Perception-R1, a novel training approach that incorporates a visual perception reward alongside standard accuracy rewards in RLVR. This reward is computed by extracting visual annotations from CoT trajectories and using a judging LLM to assess consistency between these annotations and model-generated responses, thereby explicitly encouraging accurate visual perception.

Contribution

Superior performance with exceptional data efficiency

The authors demonstrate through comprehensive experiments that Perception-R1 achieves state-of-the-art performance across multiple multimodal benchmarks while using substantially fewer training samples (1,442) compared to existing methods, showcasing both effectiveness and exceptional data efficiency.