Spotlight on Token Perception for Multimodal Reinforcement Learning

ICLR 2026 Conference Submission · Anonymous Authors
Multimodal Reasoning · LVLM · Reinforcement Learning
Abstract:

While Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capabilities of Large Vision-Language Models (LVLMs), most existing methods in multimodal reasoning neglect the critical role of visual perception within the RLVR optimization process. In this paper, we undertake a pioneering exploration of multimodal RLVR through the novel perspective of token perception, which measures the visual dependency of each generated token. With a granular analysis of Chain-of-Thought (CoT) processes, we uncover two key insights: first, token perception in a rollout trajectory is sparsely distributed, where only a small fraction of tokens have high visual dependency for visually-grounded reasoning; second, different trajectories exhibit significant divergence in their overall visual dependency. Based on these observations, we propose Visually-Perceptive Policy Optimization (VPPO), a novel policy gradient algorithm that explicitly leverages token perception to refine the learning signal. Specifically, VPPO achieves this through a dual mechanism: it reweights a trajectory's advantage by its overall visual dependency, and focuses policy updates exclusively on perceptually pivotal tokens. On a comprehensive suite of eight perception and reasoning benchmarks, VPPO demonstrates substantial gains over leading open-source RL-tuned models, with its effectiveness consistently validated across 7B and 32B model scales. Our findings not only establish a new token-level perceptual perspective for analyzing multimodal RLVR but also present a novel and effective optimization strategy to significantly enhance the multimodal reasoning capabilities of LVLMs.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a token perception perspective for analyzing multimodal RLVR and proposes VPPO, a policy gradient algorithm that reweights trajectories and filters tokens based on visual dependency. It resides in the Perception-Focused RL for VLMs leaf, which contains six papers total. This leaf sits within the broader Vision-Language Model Reinforcement Learning branch, distinguishing itself from the Visual Reasoning and Grounding sibling leaf by focusing on perceptual alignment and hallucination reduction rather than explicit chain-of-thought reasoning. The leaf is moderately populated, suggesting an active but not overcrowded research direction.

The taxonomy reveals that the paper's closest neighbors include Vision-R1 and Perception Before Reasoning, both of which treat visual encoding as a learnable RL component. The sibling Visual Reasoning and Grounding leaf contains ten papers emphasizing spatial grounding and step-by-step reasoning, while the Hybrid and Multi-Stage RL Frameworks leaf explores multi-reward or multi-stage training. The scope note for Perception-Focused RL explicitly excludes reasoning-centric methods with explicit chain-of-thought, positioning this work at the boundary: it analyzes CoT processes but optimizes perception rather than reasoning structure. This placement suggests the paper bridges perceptual optimization and reasoning analysis.

Among the seventeen candidates examined, the token perception perspective (Contribution A) showed no clear refutation across its ten candidates, indicating limited prior work on granular token-level visual dependency analysis in multimodal RLVR. The VPPO algorithm (Contribution B) was compared against four candidates, one of which constitutes a refutable match, suggesting some overlap with existing policy gradient methods that incorporate perceptual signals. The dual-mechanism strategy (Contribution C) was compared against three candidates with no refutation, implying that the specific combination of trajectory shaping and token filtering is less explored. Note that these findings reflect only the top seventeen semantic matches, not exhaustive coverage.

Given the moderate leaf population and the limited search scope, the work appears to occupy a relatively novel position within perception-focused VLM reinforcement learning. The token perception lens and dual-mechanism design show fewer overlaps among the examined candidates, though the VPPO algorithm has at least one prior method with similar perceptual weighting. The analysis is constrained by the seventeen-candidate scope and does not capture the full landscape of multimodal RL or broader vision-language optimization literature.

Taxonomy

50 Core-task Taxonomy Papers
3 Claimed Contributions
17 Contribution Candidate Papers Compared
1 Refutable Paper

Research Landscape Overview

Core task: multimodal reinforcement learning with visual perception optimization. The field addresses how agents can learn effective policies when visual inputs must be processed, refined, or actively controlled to support decision-making. The taxonomy reveals five main branches:

- Vision-Language Model Reinforcement Learning integrates linguistic and visual reasoning within RL frameworks, often leveraging pretrained VLMs and refining their perceptual grounding through reward signals (e.g., Visual RFT[2], Vision-R1[22]).
- Embodied Agent Reinforcement Learning emphasizes physical or simulated robots that must navigate and manipulate objects using vision (e.g., DreamVLA[8], MemoryVLA[50]).
- Visual Representation Learning for RL explores how to construct or adapt visual encodings, whether through self-supervision, world models, or entity abstraction, to improve sample efficiency and generalization (e.g., Self-Supervised 3D[4], Entity Abstraction[43]).
- Specialized RL Applications in Vision targets domain-specific challenges such as autonomous driving perception or robotic manipulation, where visual feedback directly shapes control (e.g., QT-Opt[14], Robotic Arm Perception[29]).
- Cognitive and Perceptual Modeling with RL investigates how agents can learn perceptual strategies themselves, adjusting attention or sensor parameters adaptively (e.g., Active Visual Perception[6], Adaptive Perception Control[32]).

A particularly active line of work within Vision-Language Model RL examines whether and how to optimize perception before or during reasoning. Token Perception Multimodal[0] sits squarely in this Perception-Focused RL for VLMs cluster, sharing thematic ground with Vision-R1[22] and Perception Before Reasoning[24], all of which treat visual encoding as a learnable component that can be refined via RL to better support downstream reasoning or action selection. In contrast, nearby efforts like DeepPerception[37] and Visual Perception Reward[9] emphasize reward shaping or auxiliary objectives that guide perceptual modules without restructuring the VLM's internal token flow. The central trade-off across these branches is whether to treat vision as a fixed preprocessing step, a jointly learned representation, or an actively controlled process; each choice influences sample complexity, interpretability, and the degree to which perceptual improvements transfer across tasks.

Claimed Contributions

Token perception perspective for analyzing multimodal RLVR

The authors introduce a novel analytical framework that examines multimodal reinforcement learning through token-level visual dependency. They discover that only a small fraction of tokens exhibit high visual dependency and that trajectories show significant divergence in overall visual grounding.

10 retrieved papers
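To make the token perception idea concrete, the sketch below scores each generated token by the KL divergence between the policy's next-token distributions with and without the visual input. The metric choice is an assumption for illustration only; the paper's actual dependency measure may differ, and the function names here are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def token_visual_dependency(probs_with_image, probs_without_image):
    """Per-token visual dependency: divergence between the policy's next-token
    distribution conditioned on the image vs. text only (illustrative metric)."""
    return [kl_divergence(p_img, p_txt)
            for p_img, p_txt in zip(probs_with_image, probs_without_image)]

# Toy rollout of 3 tokens over a 3-word vocabulary.
with_image    = [[0.7, 0.2, 0.1], [0.34, 0.33, 0.33], [0.1, 0.8, 0.1]]
without_image = [[0.2, 0.4, 0.4], [0.33, 0.34, 0.33], [0.1, 0.7, 0.2]]

deps = token_visual_dependency(with_image, without_image)
# Sparse pattern: token 0 shifts strongly when the image is present,
# while tokens 1 and 2 barely depend on it.
```

Under this toy metric, the sparsity finding corresponds to most tokens having near-zero divergence while a few carry almost all of the visual dependency.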
Visually-Perceptive Policy Optimization (VPPO) algorithm

The authors propose VPPO, a new policy gradient method that uses visual dependency scores to reweight trajectory advantages at the macro level and filter gradients to pivotal tokens at the micro level, thereby focusing learning on visually-grounded reasoning.

4 retrieved papers
Can Refute
Dual-mechanism optimization strategy with trajectory shaping and token filtering

The method implements a hierarchical control mechanism combining Trajectory-level Advantage Shaping (TAS) that prioritizes visually-grounded trajectories and Token-level Gradient Filtering (TGF) that concentrates updates on high visual dependency tokens.

3 retrieved papers
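The TAS and TGF mechanisms described above can be sketched in a few lines. This is a minimal illustration under assumed design choices (mean-dependency weighting for TAS, a top-fraction cutoff for TGF); the paper's exact formulas are not reproduced here, and the function name `vppo_signals` and parameter `top_frac` are hypothetical.

```python
def vppo_signals(advantage, token_deps, top_frac=0.4):
    """Sketch of VPPO's dual mechanism on one rollout trajectory.
    TAS: scale the trajectory's advantage by its mean visual dependency.
    TGF: keep gradients only on the top-`top_frac` most visual tokens.
    (Weighting and selection rules are illustrative assumptions.)"""
    mean_dep = sum(token_deps) / len(token_deps)
    shaped_adv = advantage * mean_dep              # trajectory-level shaping (TAS)
    k = max(1, int(len(token_deps) * top_frac))    # number of pivotal tokens
    threshold = sorted(token_deps, reverse=True)[k - 1]
    mask = [1.0 if d >= threshold else 0.0 for d in token_deps]  # gradient filter (TGF)
    # Per-token learning signal: shaped advantage on pivotal tokens, zero elsewhere.
    return [shaped_adv * m for m in mask]

# Example: 5 tokens, two of which have high visual dependency.
signals = vppo_signals(1.0, [0.6, 0.01, 0.04, 0.5, 0.02])
```

The hierarchy matters: a trajectory with weak overall grounding is down-weighted as a whole, and within any trajectory only the perceptually pivotal tokens receive gradient.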

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
