Spotlight on Token Perception for Multimodal Reinforcement Learning

ICLR 2026 Conference Submission · Anonymous Authors
Multimodal Reasoning · LVLM · Reinforcement Learning
Abstract:

While Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capabilities of Large Vision-Language Models (LVLMs), most existing methods in multimodal reasoning neglect the critical role of visual perception within the RLVR optimization process. In this paper, we undertake a pioneering exploration of multimodal RLVR through the novel perspective of token perception, which measures the visual dependency of each generated token. With a granular analysis of Chain-of-Thought (CoT) processes, we uncover two key insights: first, token perception in a rollout trajectory is sparsely distributed, where only a small fraction of tokens have high visual dependency for visually-grounded reasoning; second, different trajectories exhibit significant divergence in their overall visual dependency. Based on these observations, we propose Visually-Perceptive Policy Optimization (VPPO), a novel policy gradient algorithm that explicitly leverages token perception to refine the learning signal. Specifically, VPPO achieves this through a dual mechanism: it reweights a trajectory's advantage by its overall visual dependency, and focuses policy updates exclusively on perceptually pivotal tokens. On a comprehensive suite of eight perception and reasoning benchmarks, VPPO demonstrates substantial gains over leading open-source RL-tuned models, with its effectiveness consistently validated across 7B and 32B model scales. Our findings not only establish a new token-level perceptual perspective for analyzing multimodal RLVR but also present a novel and effective optimization strategy to significantly enhance the multimodal reasoning capabilities of LVLMs.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a token perception perspective for analyzing multimodal RLVR and proposes VPPO, a policy gradient algorithm that reweights trajectories and filters tokens based on visual dependency. It resides in the Perception-Focused RL for VLMs leaf, which contains six papers total. This leaf sits within the broader Vision-Language Model Reinforcement Learning branch, distinguishing itself from the Visual Reasoning and Grounding sibling leaf by focusing on perceptual alignment and hallucination reduction rather than explicit chain-of-thought reasoning. The leaf is moderately populated, suggesting an active but not overcrowded research direction.

The taxonomy reveals that the paper's closest neighbors include Vision-R1 and Perception Before Reasoning, both of which treat visual encoding as a learnable RL component. The sibling Visual Reasoning and Grounding leaf contains ten papers emphasizing spatial grounding and step-by-step reasoning, while the Hybrid and Multi-Stage RL Frameworks leaf explores multi-reward or multi-stage training. The scope note for Perception-Focused RL explicitly excludes reasoning-centric methods with explicit chain-of-thought, positioning this work at the boundary: it analyzes CoT processes but optimizes perception rather than reasoning structure. This placement suggests the paper bridges perceptual optimization and reasoning analysis.

Among the seventeen candidates examined, the token perception perspective (Contribution A) showed no clear refutation across its ten candidates, indicating limited prior work on granular token-level visual dependency analysis in multimodal RLVR. The VPPO algorithm (Contribution B) was compared against four candidates, one of which constitutes a refutable match, suggesting some overlap with existing policy gradient methods that incorporate perceptual signals. The dual-mechanism strategy (Contribution C) was compared against three candidates with no refutation, implying that the specific combination of trajectory shaping and token filtering is less explored. Note that these findings reflect only the top seventeen semantic matches, not exhaustive coverage.

Given the moderate leaf population and the limited search scope, the work appears to occupy a relatively novel position within perception-focused VLM reinforcement learning. The token perception lens and dual-mechanism design show fewer overlaps among the examined candidates, though the VPPO algorithm has at least one prior method with similar perceptual weighting. The analysis is constrained by the seventeen-candidate scope and does not capture the full landscape of multimodal RL or broader vision-language optimization literature.

Taxonomy

50 Core-task Taxonomy Papers
3 Claimed Contributions
17 Contribution Candidate Papers Compared
1 Refutable Paper

Research Landscape Overview

Core task: multimodal reinforcement learning with visual perception optimization. The field addresses how agents can learn effective policies when visual inputs must be processed, refined, or actively controlled to support decision-making. The taxonomy reveals five main branches:

- Vision-Language Model Reinforcement Learning integrates linguistic and visual reasoning within RL frameworks, often leveraging pretrained VLMs and refining their perceptual grounding through reward signals (e.g., Visual RFT[2], Vision-R1[22]).
- Embodied Agent Reinforcement Learning emphasizes physical or simulated robots that must navigate and manipulate objects using vision (e.g., DreamVLA[8], MemoryVLA[50]).
- Visual Representation Learning for RL explores how to construct or adapt visual encodings, whether through self-supervision, world models, or entity abstraction, to improve sample efficiency and generalization (e.g., Self-Supervised 3D[4], Entity Abstraction[43]).
- Specialized RL Applications in Vision targets domain-specific challenges such as autonomous driving perception or robotic manipulation, where visual feedback directly shapes control (e.g., QT-Opt[14], Robotic Arm Perception[29]).
- Cognitive and Perceptual Modeling with RL investigates how agents can learn perceptual strategies themselves, adjusting attention or sensor parameters adaptively (e.g., Active Visual Perception[6], Adaptive Perception Control[32]).

A particularly active line of work within Vision-Language Model RL examines whether and how to optimize perception before or during reasoning. Token Perception Multimodal[0] sits squarely in this Perception-Focused RL for VLMs cluster, sharing thematic ground with Vision-R1[22] and Perception Before Reasoning[24], all of which treat visual encoding as a learnable component that can be refined via RL to better support downstream reasoning or action selection. In contrast, nearby efforts like DeepPerception[37] and Visual Perception Reward[9] emphasize reward shaping or auxiliary objectives that guide perceptual modules without restructuring the VLM's internal token flow. The central trade-off across these branches is whether to treat vision as a fixed preprocessing step, a jointly learned representation, or an actively controlled process; each choice influences sample complexity, interpretability, and the degree to which perceptual improvements transfer across tasks.

Claimed Contributions

Token perception perspective for analyzing multimodal RLVR

The authors introduce a novel analytical framework that examines multimodal reinforcement learning through token-level visual dependency. They discover that only a small fraction of tokens exhibit high visual dependency and that trajectories show significant divergence in overall visual grounding.

10 retrieved papers
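To make the token perception idea concrete, the sketch below scores each generated token by the KL divergence between the policy's next-token distributions with and without the visual input. The metric choice is an assumption for illustration only; the paper's actual dependency measure may differ, and the function names here are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def token_visual_dependency(probs_with_image, probs_without_image):
    """Per-token visual dependency: divergence between the policy's next-token
    distribution conditioned on the image vs. text only (illustrative metric)."""
    return [kl_divergence(p_img, p_txt)
            for p_img, p_txt in zip(probs_with_image, probs_without_image)]

# Toy rollout of 3 tokens over a 3-word vocabulary.
with_image    = [[0.7, 0.2, 0.1], [0.34, 0.33, 0.33], [0.1, 0.8, 0.1]]
without_image = [[0.2, 0.4, 0.4], [0.33, 0.34, 0.33], [0.1, 0.7, 0.2]]

deps = token_visual_dependency(with_image, without_image)
# Sparse pattern: token 0 shifts strongly when the image is present,
# while tokens 1 and 2 barely depend on it.
```

Under this toy metric, the sparsity finding corresponds to most tokens having near-zero divergence while a few carry almost all of the visual dependency.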
Visually-Perceptive Policy Optimization (VPPO) algorithm

The authors propose VPPO, a new policy gradient method that uses visual dependency scores to reweight trajectory advantages at the macro level and filter gradients to pivotal tokens at the micro level, thereby focusing learning on visually-grounded reasoning.

4 retrieved papers
Can Refute
Dual-mechanism optimization strategy with trajectory shaping and token filtering

The method implements a hierarchical control mechanism combining Trajectory-level Advantage Shaping (TAS) that prioritizes visually-grounded trajectories and Token-level Gradient Filtering (TGF) that concentrates updates on high visual dependency tokens.

3 retrieved papers
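The TAS and TGF mechanisms described above can be sketched in a few lines. This is a minimal illustration under assumed design choices (mean-dependency weighting for TAS, a top-fraction cutoff for TGF); the paper's exact formulas are not reproduced here, and the function name `vppo_signals` and parameter `top_frac` are hypothetical.

```python
def vppo_signals(advantage, token_deps, top_frac=0.4):
    """Sketch of VPPO's dual mechanism on one rollout trajectory.
    TAS: scale the trajectory's advantage by its mean visual dependency.
    TGF: keep gradients only on the top-`top_frac` most visual tokens.
    (Weighting and selection rules are illustrative assumptions.)"""
    mean_dep = sum(token_deps) / len(token_deps)
    shaped_adv = advantage * mean_dep              # trajectory-level shaping (TAS)
    k = max(1, int(len(token_deps) * top_frac))    # number of pivotal tokens
    threshold = sorted(token_deps, reverse=True)[k - 1]
    mask = [1.0 if d >= threshold else 0.0 for d in token_deps]  # gradient filter (TGF)
    # Per-token learning signal: shaped advantage on pivotal tokens, zero elsewhere.
    return [shaped_adv * m for m in mask]

# Example: 5 tokens, two of which have high visual dependency.
signals = vppo_signals(1.0, [0.6, 0.01, 0.04, 0.5, 0.02])
```

The hierarchy matters: a trajectory with weak overall grounding is down-weighted as a whole, and within any trajectory only the perceptually pivotal tokens receive gradient.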

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
