Perception-Aware Policy Optimization for Multimodal Reasoning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: multimodal reasoning, reinforcement learning, policy optimization, large language models, visual perception, GRPO, DAPO
Abstract:

Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for empowering Large Language Models (LLMs) with long chain-of-thought reasoning abilities. However, its design and optimization remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error (67%) in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose PAPO, a novel policy gradient algorithm that encourages the model to generate visually grounded reasoning without external supervision. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term, which maximizes the difference between two probability distributions over the same rollout sequence, conditioned on either the original or corrupted visual input. Notably, PAPO does not rely on any additional data annotation, reward models, or stronger teacher models, and can therefore be seamlessly integrated into mainstream RLVR algorithms such as GRPO and DAPO. To further enhance the training stability of PAPO, we introduce the Double Entropy Loss, which effectively regularizes the new KL objective without compromising performance. Despite its simplicity, PAPO yields significant overall improvements of 4.4%-17.5% on diverse multimodal benchmarks. The improvements are more pronounced, reaching 8.0%-19.1%, on tasks with high vision dependency. We also observe a substantial reduction of 30.5% in perception errors, indicating improved perceptual capabilities with PAPO. Overall, PAPO offers a new perspective on advancing multimodal RLVR via the optimization objective, moving beyond rollout or reward design and pointing toward deeper integration of perception and reasoning.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces PAPO, a policy gradient algorithm that addresses visual perception errors in multimodal reasoning by incorporating an Implicit Perception Loss based on KL divergence between original and corrupted visual inputs. It resides in the 'Visual Grounding and Perception Optimization' leaf alongside three sibling papers (Visual Perception Reward, Perception in Reflection, and one other). This leaf sits within the broader 'Perception-Centric RL Methods' branch, which contains only three leaves total, suggesting a moderately populated but not overcrowded research direction focused specifically on perception as the primary optimization target.

The taxonomy reveals neighboring work in adjacent leaves: 'Iterative Perception Refinement' (three papers) explores feedback loops for progressive visual understanding, while 'Perception-Reasoning Decoupling' (two papers) advocates separating perception from reasoning. The parent branch 'Reinforcement Learning Frameworks for Multimodal Reasoning' also contains 'General-Purpose Multimodal RL Frameworks' with eight papers addressing cross-modal reasoning without perception-specific focus. PAPO's approach of jointly optimizing perception within the policy loop contrasts with decoupled architectures and differs from iterative refinement methods by embedding perceptual grounding directly into the gradient signal rather than through separate feedback stages.

Among 30 candidates examined across three contributions, none were identified as clearly refuting the work. For each contribution, 10 candidates were examined and none were found refutable: the PAPO algorithm with Implicit Perception Loss, the Double Entropy Loss regularization, and the integration of perception into RLVR objectives. This suggests that, within the limited search scope, the specific mechanism of using KL divergence over corrupted visual inputs to drive perceptual grounding appears distinct from examined prior work, though the search scale means potentially relevant papers outside the top-30 semantic matches may exist.

Based on the limited literature search covering 30 candidates, the work appears to occupy a recognizable but not densely populated niche within perception-centric multimodal RL. The taxonomy structure shows related directions exist (iterative refinement, decoupled architectures), but the specific technical approach of implicit perception loss through input corruption was not matched in the examined candidates. A more exhaustive search beyond top-K semantic similarity might reveal closer precedents, particularly in adjacent leaves or in domain-specific applications not fully captured here.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Improving visual perception in multimodal reasoning through policy optimization. This field addresses how reinforcement learning can enhance the way multimodal models extract and utilize visual information when solving complex reasoning problems. The taxonomy reveals several major branches: Reinforcement Learning Frameworks for Multimodal Reasoning explores perception-centric methods and visual grounding techniques, often emphasizing how agents learn to attend to relevant visual features through reward signals (e.g., Visual Perception Reward[22], Perception in Reflection[23]). Training Paradigms and Optimization Algorithms investigates algorithmic innovations such as group relative policy optimization and reward modeling strategies that guide perceptual improvements. Domain-Specific Applications targets concrete settings like medical imaging (Med-R1[30], Patho-R1[33]) and embodied navigation (Embodied AI Vehicular[10]), while Specialized Reasoning Tasks focuses on challenges such as spatial reasoning (Spatial-SSRL[3]) and document understanding (DocThinker[28]). Foundational Studies and Benchmarking provides evaluation frameworks and baseline comparisons across these diverse problem settings.

A particularly active line of work centers on integrating perception optimization directly into the policy learning loop, where models learn not only what reasoning steps to take but also which visual cues to prioritize. Perception-Aware Policy[0] exemplifies this approach by coupling perceptual attention mechanisms with policy gradients, situating itself within the Visual Grounding and Perception Optimization cluster alongside neighbors like Visual Perception Reward[22] and Perception in Reflection[23]. While Perception in Reflection[23] emphasizes iterative refinement of visual interpretations through self-critique, Perception-Aware Policy[0] more tightly integrates perceptual decisions into the forward reasoning trajectory.

Contrasting approaches such as Perception before Reasoning[29] advocate for decoupling perception from downstream reasoning, raising open questions about whether joint optimization or modular pipelines better balance sample efficiency and interpretability. Across branches, trade-offs emerge between end-to-end learning (VideoChat-R1[4], Vision-r1[8]) and structured decomposition (Perceptual Decoupling[17]), reflecting ongoing debates about how best to scale visual understanding in policy-driven multimodal systems.

Claimed Contributions

PAPO algorithm with Implicit Perception Loss

The authors introduce PAPO (Perception-Aware Policy Optimization), a new policy gradient algorithm that integrates an Implicit Perception Loss term into RLVR frameworks. This loss maximizes KL divergence between probability distributions conditioned on original versus corrupted visual inputs, encouraging visually grounded responses without requiring additional annotations, reward models, or teacher models.
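The loss described above can be sketched in a few lines. The sketch below is a minimal illustration under assumptions, not the authors' implementation: it presumes per-token log-probabilities of the same rollout under the original and a corrupted (e.g., masked) image are already available, and uses the mean per-token log-ratio as the standard Monte Carlo estimate of the KL divergence.

```python
import numpy as np

def implicit_perception_loss(logp_orig, logp_corrupt, mask):
    """Hypothetical sketch of an Implicit Perception Loss.

    logp_orig:    (B, T) log-probs of rollout tokens given the original image
    logp_corrupt: (B, T) log-probs of the same tokens given a corrupted image
    mask:         (B, T) 1.0 for valid rollout tokens, 0.0 for padding

    The per-token log-ratio is a Monte Carlo estimate of the KL divergence
    between the two conditional distributions over the rollout; PAPO
    maximizes this divergence, so the returned loss is its negation.
    """
    log_ratio = (logp_orig - logp_corrupt) * mask
    kl_estimate = log_ratio.sum(axis=-1) / np.maximum(mask.sum(axis=-1), 1.0)
    return -kl_estimate.mean()  # minimizing the loss maximizes the divergence
```

Intuitively, a large divergence means that corrupting the image changes the rollout's likelihood substantially, i.e., the response actually depends on the visual input rather than on text-only priors.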

10 retrieved papers

Double Entropy Loss regularization

The authors propose a Double Entropy Loss regularization technique that stabilizes training by preventing model collapse during optimization of the unbounded KL divergence objective. This regularization encourages low entropy in both the original and corrupted policy distributions.
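As described, the regularizer penalizes high entropy in both conditional distributions, bounding the otherwise unbounded KL objective. A minimal sketch follows; the shapes, the equal weighting of the two terms, and the function name are assumptions, not the paper's exact formulation.

```python
import numpy as np

def double_entropy_loss(probs_orig, probs_corrupt, eps=1e-12):
    """Hypothetical sketch of a Double Entropy Loss.

    probs_orig, probs_corrupt: (T, V) per-token next-token distributions
    under the original and the corrupted visual input. Minimizing the sum
    of the two entropies keeps both distributions peaked, discouraging the
    degenerate solution of inflating the KL term by spreading probability
    mass under the corrupted input.
    """
    def mean_entropy(p):
        # eps avoids log(0); mean over the T token positions
        return -(p * np.log(p + eps)).sum(axis=-1).mean()
    return mean_entropy(probs_orig) + mean_entropy(probs_corrupt)
```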

10 retrieved papers

Integration of perception into RLVR optimization objective

The authors present a new perspective on multimodal RLVR by modifying the core optimization objective itself rather than only adjusting data, rollout, or reward components. This represents the first work to explore deeper integration of perception-aware supervision signals beyond reward-level modifications in multimodal reasoning.
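Structurally, the claim is that the perception term enters the loss itself rather than the reward. The schematic below shows where such a term would sit relative to an RLVR surrogate; the coefficient names and the additive form are placeholders, not the paper's exact objective.

```python
def papo_objective(rlvr_loss, perception_kl, double_entropy,
                   gamma=0.01, tau=0.01):
    """Schematic PAPO-style training loss (placeholder coefficients).

    rlvr_loss:      standard RLVR surrogate (e.g., the GRPO clipped loss)
    perception_kl:  Implicit Perception Loss estimate (maximized, hence
                    subtracted from the loss)
    double_entropy: Double Entropy regularizer (minimized)
    """
    return rlvr_loss - gamma * perception_kl + tau * double_entropy
```

Because nothing here touches the reward, rollouts, or data, the same term can in principle be added on top of GRPO or DAPO losses unchanged, which is the portability this contribution claims.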

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

PAPO algorithm with Implicit Perception Loss

The authors introduce PAPO (Perception-Aware Policy Optimization), a new policy gradient algorithm that integrates an Implicit Perception Loss term into RLVR frameworks. This loss maximizes KL divergence between probability distributions conditioned on original versus corrupted visual inputs, encouraging visually grounded responses without requiring additional annotations, reward models, or teacher models.

Contribution

Double Entropy Loss regularization

The authors propose a Double Entropy Loss regularization technique that stabilizes training by preventing model collapse during optimization of the unbounded KL divergence objective. This regularization encourages low entropy in both the original and corrupted policy distributions.

Contribution

Integration of perception into RLVR optimization objective

The authors present a new perspective on multimodal RLVR by modifying the core optimization objective itself rather than only adjusting data, rollout, or reward components. This represents the first work to explore deeper integration of perception-aware supervision signals beyond reward-level modifications in multimodal reasoning.
