Perception-R1: Advancing Multimodal Reasoning Capabilities of MLLMs via Visual Perception Reward

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Multimodal Large Language Models, Multimodal Reasoning, Reinforcement Learning
Abstract:

Enhancing the multimodal reasoning capabilities of Multimodal Large Language Models (MLLMs) is a challenging task that has attracted increasing attention in the community. Recently, several studies have applied Reinforcement Learning with Verifiable Rewards (RLVR) to the multimodal domain to enhance the reasoning abilities of MLLMs. However, these works largely overlook the enhancement of multimodal perception capabilities, which serve as a core prerequisite and foundational component of complex multimodal reasoning. Through McNemar's test, we find that existing RLVR methods fail to effectively enhance the multimodal perception capabilities of MLLMs, thereby limiting further improvement in multimodal reasoning. To address this limitation, we propose Perception-R1, which introduces a novel visual perception reward that explicitly encourages MLLMs to perceive visual content accurately, thereby effectively incentivizing both their multimodal perception and reasoning capabilities. Specifically, we first collect textual visual annotations from the CoT trajectories of multimodal problems, which serve as visual references for reward assignment. During RLVR training, we employ a judging LLM to assess the consistency between the visual annotations and the responses generated by the MLLM, and assign the visual perception reward based on these consistency judgments. Extensive experiments on several multimodal math and general benchmarks demonstrate the effectiveness and robustness of Perception-R1, which achieves superior performance on all benchmarks using only 1,442 training samples.
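The reward assignment the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the binary reward scale, and the keyword-matching `toy_judge` (standing in for the actual judging LLM) are all assumptions of this sketch.

```python
from typing import Callable

def perception_reward(annotation: str, response: str,
                      judge: Callable[[str, str], bool]) -> float:
    """Assign a binary visual perception reward.

    `judge` stands in for the judging LLM: it returns True when the
    model response is consistent with the reference visual annotation
    extracted from a CoT trajectory.
    """
    return 1.0 if judge(annotation, response) else 0.0

def toy_judge(annotation: str, response: str) -> bool:
    """Toy stand-in for an LLM judge: naive keyword containment."""
    return all(tok in response.lower() for tok in annotation.lower().split())

# A response consistent with the reference annotation earns the reward.
r = perception_reward("two red circles",
                      "The image shows two red circles overlapping.",
                      toy_judge)
```

In practice the judge would be a prompted LLM call rather than string matching; the point of the sketch is only the reward interface, which keeps the perception signal separable from the standard accuracy reward.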

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Perception-R1, a reinforcement learning method that introduces visual perception rewards to enhance both perception accuracy and reasoning quality in multimodal large language models. It resides in the 'Perception-Aware Reward Design' leaf under 'Reinforcement Learning-Based Reasoning Enhancement', which contains only two papers total. This represents a relatively sparse research direction within the broader taxonomy of fifty papers across thirty-six topics, suggesting the specific focus on perception-aware reward mechanisms remains an emerging area rather than a crowded subfield.

The taxonomy tree reveals neighboring leaves including 'Pure RL Reasoning Emergence' (two papers on emergent reasoning without supervised rationales), 'Preference-Based Optimization' (one paper on preference datasets), and 'RL for Vision-Language-Action Models' (two papers on embodied agents). The paper's emphasis on explicit visual perception rewards distinguishes it from these adjacent directions, which either focus on general policy learning or action-grounded reasoning. The scope note for this leaf explicitly excludes general reward mechanisms without perception-specific components, positioning the work at the intersection of visual grounding and reinforcement learning.

Among thirty candidates examined through semantic search and citation expansion, the contribution-level analysis shows mixed novelty signals. The investigation of perception limitations in RLVR-trained models examined ten candidates with zero refutations, suggesting this diagnostic angle appears relatively unexplored. However, the core Perception-R1 method examined ten candidates with one refutable match, and the performance claims examined ten candidates with two refutable matches. These statistics indicate that while the perception-focused diagnostic work may be novel, the technical approach and empirical contributions face more substantial prior work within the limited search scope.

The analysis reflects a targeted literature search rather than exhaustive coverage, examining thirty semantically similar papers from a field of fifty total works. The relatively sparse taxonomy leaf and limited refutations for the diagnostic contribution suggest potential novelty in identifying perception bottlenecks, though the method itself encounters overlapping prior work. A broader search beyond top-K semantic matches might reveal additional relevant comparisons, particularly in adjacent reinforcement learning or visual grounding literature not captured in this taxonomy.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: enhancing multimodal reasoning capabilities of multimodal large language models. The field has evolved into a rich landscape with several major branches. Reinforcement Learning-Based Reasoning Enhancement explores how reward signals and policy optimization can guide models toward more robust reasoning, often incorporating perception-aware reward design or cold-start strategies. Chain-of-Thought and Rationale-Based Reasoning focuses on eliciting step-by-step explanations, with works like Multimodal Chain-of-Thought[12] and DDCoT[14] pioneering structured intermediate reasoning. Supervised and Hybrid Training Approaches blend traditional fine-tuning with preference optimization methods, while Architectural and Representation Innovations introduce novel encoders or cross-modal fusion mechanisms. Specialized Reasoning Capabilities target domain-specific challenges such as mathematical problem-solving (Math-LLaVA[3]) or spatial understanding (SpatialRGPT[9]), and Application-Driven Reasoning Systems deploy these models in real-world contexts like autonomous driving (DriveVLM[29]) or navigation (NavGPT[32]). Evaluation, Benchmarking, and Analysis branches provide critical assessment tools (VHELM[11], NPHardEval4V[4]), while Comprehensive Surveys and Reviews (Multimodal Reasoning Survey[41], Reinforced MLLM Survey[34]) synthesize emerging trends.

Within the reinforcement learning branch, a particularly active line of work centers on designing reward functions that account for perceptual grounding and reasoning quality. Perception-R1[0] exemplifies this direction by integrating perception-aware rewards to align visual understanding with logical inference, contrasting with more general RL frameworks like Vision-R1[1] or VLA-R1[5] that may prioritize broader policy learning. Cold Start Reasoning[37] addresses the challenge of bootstrapping reasoning from limited initial supervision, highlighting a key trade-off between sample efficiency and reasoning depth.

Meanwhile, works such as Mixed Preference Optimization[6] and DeepThinkVLA[7] explore hybrid strategies that combine supervised signals with reinforcement feedback. Perception-R1[0] sits at the intersection of these themes, emphasizing how carefully crafted perceptual rewards can guide models to produce more interpretable and accurate multimodal reasoning, a focus that distinguishes it from neighboring efforts that treat perception and reasoning as more loosely coupled components.

Claimed Contributions

Investigation of multimodal perception limitations in RLVR-trained MLLMs

The authors conduct McNemar's test and error analysis to demonstrate that existing accuracy-only RLVR methods fail to significantly improve the multimodal perception capabilities of MLLMs compared to base models, identifying this as a key limitation for multimodal reasoning advancement.

10 retrieved papers
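The diagnostic above rests on McNemar's test for paired binary outcomes: each benchmark item is answered by both the base model and the RLVR-trained model, and the test asks whether the two discordant counts differ significantly. A minimal sketch, assuming only the two discordant cell counts are available (the variable names and example counts below are hypothetical, not the paper's data):

```python
import math

def mcnemar_test(b: int, c: int) -> tuple[float, float]:
    """McNemar's test on paired binary outcomes.

    b: items the base model got right but the RLVR model got wrong.
    c: items the base model got wrong but the RLVR model got right.
    Returns (chi-square statistic, two-sided p-value), 1 degree of
    freedom. Uses the uncorrected statistic; a continuity-corrected or
    exact binomial variant is preferable when b + c is small.
    """
    chi2 = (b - c) ** 2 / (b + c)
    # Survival function of chi-square with 1 df via the error function:
    # P(X > x) = erfc(sqrt(x / 2)) for X ~ chi2(1).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Hypothetical discordant counts: no significant difference would
# support the report's claim that accuracy-only RLVR leaves
# perception largely unchanged.
chi2, p = mcnemar_test(b=15, c=5)
```

A non-significant p-value here (at a conventional 0.05 threshold) is what would indicate that RLVR training did not measurably change perception accuracy relative to the base model.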
Perception-R1 method with visual perception reward

The authors introduce Perception-R1, a novel training approach that incorporates a visual perception reward alongside standard accuracy rewards in RLVR. This reward is computed by extracting visual annotations from CoT trajectories and using a judging LLM to assess consistency between these annotations and model-generated responses, thereby explicitly encouraging accurate visual perception.

10 retrieved papers
Can Refute
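One plausible way the perception reward could be folded into RLVR training is as an additive term next to the accuracy reward, with group-normalized advantages as in GRPO-style methods. The weighting `lam` and the normalization scheme below are assumptions of this sketch; the report does not specify the paper's exact formulation.

```python
def total_reward(answer_correct: bool, perception_consistent: bool,
                 lam: float = 0.5) -> float:
    """Combine the standard accuracy reward with the visual perception
    reward. `lam` is a hypothetical weighting knob, not a value from
    the paper."""
    acc = 1.0 if answer_correct else 0.0
    perc = 1.0 if perception_consistent else 0.0
    return acc + lam * perc

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: standardize rewards within a group of
    rollouts for the same prompt (mean 0, unit std). Whether
    Perception-R1 uses exactly this normalization is an assumption."""
    mu = sum(rewards) / len(rewards)
    var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard against a zero-variance group
    return [(r - mu) / std for r in rewards]

# Four hypothetical rollouts for one prompt: (correct?, consistent?)
rollouts = [(True, True), (True, False), (False, False), (False, True)]
advs = group_advantages([total_reward(a, p) for a, p in rollouts])
```

The design point the sketch illustrates is that a rollout with a correct answer but inconsistent perception earns less than one that is both correct and visually grounded, so the policy gradient pushes toward accurate perception rather than answer-only shortcuts.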
Superior performance with exceptional data efficiency

The authors demonstrate through comprehensive experiments that Perception-R1 achieves state-of-the-art performance across multiple multimodal benchmarks while using substantially fewer training samples (1,442) compared to existing methods, showcasing both effectiveness and exceptional data efficiency.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Investigation of multimodal perception limitations in RLVR-trained MLLMs

The authors conduct McNemar's test and error analysis to demonstrate that existing accuracy-only RLVR methods fail to significantly improve the multimodal perception capabilities of MLLMs compared to base models, identifying this as a key limitation for multimodal reasoning advancement.

Contribution

Perception-R1 method with visual perception reward

The authors introduce Perception-R1, a novel training approach that incorporates a visual perception reward alongside standard accuracy rewards in RLVR. This reward is computed by extracting visual annotations from CoT trajectories and using a judging LLM to assess consistency between these annotations and model-generated responses, thereby explicitly encouraging accurate visual perception.

Contribution

Superior performance with exceptional data efficiency

The authors demonstrate through comprehensive experiments that Perception-R1 achieves state-of-the-art performance across multiple multimodal benchmarks while using substantially fewer training samples (1,442) compared to existing methods, showcasing both effectiveness and exceptional data efficiency.