Perception-Aware Policy Optimization for Multimodal Reasoning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: multimodal reasoning, reinforcement learning, policy optimization, large language models, visual perception, GRPO, DAPO
Abstract:

Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for empowering Large Language Models (LLMs) with long chain-of-thought reasoning abilities. However, its design and optimization remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error (67%) in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose PAPO, a novel policy gradient algorithm that encourages the model to generate visually grounded reasoning without external supervision. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term, which maximizes the difference between two probability distributions over the same rollout sequence, conditioned on either the original or corrupted visual input. Notably, PAPO does not rely on any additional data annotation, reward models, or stronger teacher models, and can therefore be seamlessly integrated into mainstream RLVR algorithms such as GRPO and DAPO. To further enhance the training stability of PAPO, we introduce the Double Entropy Loss, which effectively regularizes the new KL objective without compromising performance. Despite its simplicity, PAPO yields significant overall improvements of 4.4%-17.5% on diverse multimodal benchmarks. The improvements are more pronounced, reaching 8.0%-19.1%, on tasks with high vision dependency. We also observe a substantial reduction of 30.5% in perception errors, indicating improved perceptual capabilities with PAPO. Overall, PAPO offers a new perspective on advancing multimodal RLVR via the optimization objective, moving beyond rollout or reward design and pointing toward deeper integration of perception and reasoning.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces PAPO, a policy gradient algorithm that addresses visual perception errors in multimodal reasoning by incorporating an Implicit Perception Loss based on KL divergence between original and corrupted visual inputs. It resides in the 'Visual Grounding and Perception Optimization' leaf alongside three sibling papers (Visual Perception Reward, Perception in Reflection, and one other). This leaf sits within the broader 'Perception-Centric RL Methods' branch, which contains only three leaves total, suggesting a moderately populated but not overcrowded research direction focused specifically on perception as the primary optimization target.

The taxonomy reveals neighboring work in adjacent leaves: 'Iterative Perception Refinement' (three papers) explores feedback loops for progressive visual understanding, while 'Perception-Reasoning Decoupling' (two papers) advocates separating perception from reasoning. The parent branch 'Reinforcement Learning Frameworks for Multimodal Reasoning' also contains 'General-Purpose Multimodal RL Frameworks' with eight papers addressing cross-modal reasoning without perception-specific focus. PAPO's approach of jointly optimizing perception within the policy loop contrasts with decoupled architectures and differs from iterative refinement methods by embedding perceptual grounding directly into the gradient signal rather than through separate feedback stages.

Among 30 candidates examined across three contributions, none were identified as clearly refuting the work. For each contribution, 10 candidates were examined and none were found refutable: the PAPO algorithm with Implicit Perception Loss, the Double Entropy Loss regularization, and the integration of perception into RLVR objectives. This suggests that, within the limited search scope, the specific mechanism of using KL divergence over corrupted visual inputs to drive perceptual grounding appears distinct from examined prior work, though the search scale means potentially relevant papers outside the top-30 semantic matches may exist.

Based on the limited literature search covering 30 candidates, the work appears to occupy a recognizable but not densely populated niche within perception-centric multimodal RL. The taxonomy structure shows related directions exist (iterative refinement, decoupled architectures), but the specific technical approach of implicit perception loss through input corruption was not matched in the examined candidates. A more exhaustive search beyond top-K semantic similarity might reveal closer precedents, particularly in adjacent leaves or in domain-specific applications not fully captured here.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Improving visual perception in multimodal reasoning through policy optimization. This field addresses how reinforcement learning can enhance the way multimodal models extract and utilize visual information when solving complex reasoning problems. The taxonomy reveals several major branches: Reinforcement Learning Frameworks for Multimodal Reasoning explores perception-centric methods and visual grounding techniques, often emphasizing how agents learn to attend to relevant visual features through reward signals (e.g., Visual Perception Reward[22], Perception in Reflection[23]). Training Paradigms and Optimization Algorithms investigates algorithmic innovations such as group relative policy optimization and reward modeling strategies that guide perceptual improvements. Domain-Specific Applications targets concrete settings like medical imaging (Med-R1[30], Patho-R1[33]) and embodied navigation (Embodied AI Vehicular[10]), while Specialized Reasoning Tasks focuses on challenges such as spatial reasoning (Spatial-SSRL[3]) and document understanding (DocThinker[28]). Foundational Studies and Benchmarking provides evaluation frameworks and baseline comparisons across these diverse problem settings.

A particularly active line of work centers on integrating perception optimization directly into the policy learning loop, where models learn not only what reasoning steps to take but also which visual cues to prioritize. Perception-Aware Policy[0] exemplifies this approach by coupling perceptual attention mechanisms with policy gradients, situating itself within the Visual Grounding and Perception Optimization cluster alongside neighbors like Visual Perception Reward[22] and Perception in Reflection[23]. While Perception in Reflection[23] emphasizes iterative refinement of visual interpretations through self-critique, Perception-Aware Policy[0] more tightly integrates perceptual decisions into the forward reasoning trajectory.

Contrasting approaches such as Perception before Reasoning[29] advocate for decoupling perception from downstream reasoning, raising open questions about whether joint optimization or modular pipelines better balance sample efficiency and interpretability. Across branches, trade-offs emerge between end-to-end learning (VideoChat-R1[4], Vision-r1[8]) and structured decomposition (Perceptual Decoupling[17]), reflecting ongoing debates about how best to scale visual understanding in policy-driven multimodal systems.

Claimed Contributions

PAPO algorithm with Implicit Perception Loss

The authors introduce PAPO (Perception-Aware Policy Optimization), a new policy gradient algorithm that integrates an Implicit Perception Loss term into RLVR frameworks. This loss maximizes KL divergence between probability distributions conditioned on original versus corrupted visual inputs, encouraging visually grounded responses without requiring additional annotations, reward models, or teacher models.
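The loss described above can be sketched in a few lines. The sketch below is a minimal illustration under assumptions, not the authors' implementation: it presumes per-token log-probabilities of the same rollout under the original and a corrupted (e.g., masked) image are already available, and uses the mean per-token log-ratio as the standard Monte Carlo estimate of the KL divergence.

```python
import numpy as np

def implicit_perception_loss(logp_orig, logp_corrupt, mask):
    """Hypothetical sketch of an Implicit Perception Loss.

    logp_orig:    (B, T) log-probs of rollout tokens given the original image
    logp_corrupt: (B, T) log-probs of the same tokens given a corrupted image
    mask:         (B, T) 1.0 for valid rollout tokens, 0.0 for padding

    The per-token log-ratio is a Monte Carlo estimate of the KL divergence
    between the two conditional distributions over the rollout; PAPO
    maximizes this divergence, so the returned loss is its negation.
    """
    log_ratio = (logp_orig - logp_corrupt) * mask
    kl_estimate = log_ratio.sum(axis=-1) / np.maximum(mask.sum(axis=-1), 1.0)
    return -kl_estimate.mean()  # minimizing the loss maximizes the divergence
```

Intuitively, a large divergence means that corrupting the image changes the rollout's likelihood substantially, i.e., the response actually depends on the visual input rather than on text-only priors.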

10 retrieved papers

Double Entropy Loss regularization

The authors propose a Double Entropy Loss regularization technique that stabilizes training by preventing model collapse during optimization of the unbounded KL divergence objective. This regularization encourages low entropy in both the original and corrupted policy distributions.
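As described, the regularizer penalizes high entropy in both conditional distributions, bounding the otherwise unbounded KL objective. A minimal sketch follows; the shapes, the equal weighting of the two terms, and the function name are assumptions, not the paper's exact formulation.

```python
import numpy as np

def double_entropy_loss(probs_orig, probs_corrupt, eps=1e-12):
    """Hypothetical sketch of a Double Entropy Loss.

    probs_orig, probs_corrupt: (T, V) per-token next-token distributions
    under the original and the corrupted visual input. Minimizing the sum
    of the two entropies keeps both distributions peaked, discouraging the
    degenerate solution of inflating the KL term by spreading probability
    mass under the corrupted input.
    """
    def mean_entropy(p):
        # eps avoids log(0); mean over the T token positions
        return -(p * np.log(p + eps)).sum(axis=-1).mean()
    return mean_entropy(probs_orig) + mean_entropy(probs_corrupt)
```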

10 retrieved papers

Integration of perception into RLVR optimization objective

The authors present a new perspective on multimodal RLVR by modifying the core optimization objective itself rather than only adjusting data, rollout, or reward components. This represents the first work to explore deeper integration of perception-aware supervision signals beyond reward-level modifications in multimodal reasoning.
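Structurally, the claim is that the perception term enters the loss itself rather than the reward. The schematic below shows where such a term would sit relative to an RLVR surrogate; the coefficient names and the additive form are placeholders, not the paper's exact objective.

```python
def papo_objective(rlvr_loss, perception_kl, double_entropy,
                   gamma=0.01, tau=0.01):
    """Schematic PAPO-style training loss (placeholder coefficients).

    rlvr_loss:      standard RLVR surrogate (e.g., the GRPO clipped loss)
    perception_kl:  Implicit Perception Loss estimate (maximized, hence
                    subtracted from the loss)
    double_entropy: Double Entropy regularizer (minimized)
    """
    return rlvr_loss - gamma * perception_kl + tau * double_entropy
```

Because nothing here touches the reward, rollouts, or data, the same term can in principle be added on top of GRPO or DAPO losses unchanged, which is the portability this contribution claims.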

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

PAPO algorithm with Implicit Perception Loss

The authors introduce PAPO (Perception-Aware Policy Optimization), a new policy gradient algorithm that integrates an Implicit Perception Loss term into RLVR frameworks. This loss maximizes KL divergence between probability distributions conditioned on original versus corrupted visual inputs, encouraging visually grounded responses without requiring additional annotations, reward models, or teacher models.

Contribution

Double Entropy Loss regularization

The authors propose a Double Entropy Loss regularization technique that stabilizes training by preventing model collapse during optimization of the unbounded KL divergence objective. This regularization encourages low entropy in both the original and corrupted policy distributions.

Contribution

Integration of perception into RLVR optimization objective

The authors present a new perspective on multimodal RLVR by modifying the core optimization objective itself rather than only adjusting data, rollout, or reward components. This represents the first work to explore deeper integration of perception-aware supervision signals beyond reward-level modifications in multimodal reasoning.
