From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Multimodal reasoning, Multimodal RL, Multimodal Large Language Model, Attention Analysis
Abstract:

The cold-start initialization stage plays a pivotal role in training Multimodal Large Reasoning Models (MLRMs), yet its mechanisms remain insufficiently understood. To analyze this stage, we introduce the Visual Attention Score (VAS), an attention-based metric that quantifies how much a model attends to visual tokens. We find that reasoning performance is strongly correlated with VAS (r=0.9616): models with higher VAS achieve substantially stronger multimodal reasoning. Surprisingly, multimodal cold-start fails to raise VAS, leaving distributions close to the base model, whereas text-only cold-start induces a clear increase. We term this counter-intuitive phenomenon Lazy Attention Localization. To validate its causal role, we design training-free interventions that directly manipulate attention allocation at inference time, yielding consistent 1–2% gains without retraining. Building on these insights, we propose Attention-Guided Visual Anchoring and Reflection (AVAR), a comprehensive cold-start framework that integrates visual-anchored data synthesis, attention-guided objectives, and visual-anchored reward shaping. Applied to Qwen2.5-VL-7B, AVAR delivers an average gain of 7.0% across 7 multimodal reasoning benchmarks. Ablation studies further confirm that each component of AVAR contributes step-wise to the overall gains.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces the Visual Attention Score (VAS) metric to quantify visual token attention and proposes the AVAR framework for cold-start training in multimodal reasoning models. It occupies the sole position in the 'Cold-Start Training with Attention Optimization' leaf, which sits under the broader 'Attention-Guided Visual Grounding Mechanisms' branch. This leaf contains only the original paper itself, indicating a relatively sparse research direction within the taxonomy. The broader branch includes two sibling leaves focused on reinforcement learning-based iterative focusing and dynamic interactive inference, suggesting the field explores diverse attention-guided approaches but has limited prior work specifically targeting cold-start attention optimization.

The taxonomy reveals neighboring work in zero-shot transfer learning and multimodal fusion for recommendations, but these branches address fundamentally different problems. The zero-shot category focuses on 3D grounding without supervision, while the recommendation branches tackle data sparsity through graph convolution and LLM-guided profiling. The original paper's emphasis on attention allocation during cold-start initialization distinguishes it from sibling leaves that assume pre-existing grounding capabilities or operate at inference time. The taxonomy structure suggests attention-guided visual grounding is an active area, but cold-start training optimization represents a less-explored intersection within this broader landscape.

Among 21 candidates examined, the VAS metric and Lazy Attention Localization phenomenon show no clear refutation across 10 candidates reviewed. The training-free intervention contribution examined 10 candidates and found 1 potentially refutable match, suggesting some overlap with existing inference-time manipulation methods. The AVAR framework, examined against only 1 candidate, shows no refutation but reflects limited search coverage. The statistics indicate the VAS metric and Lazy Attention phenomenon appear more novel within the examined scope, while training-free interventions encounter more substantial prior work among the limited candidates reviewed.

Based on the top-21 semantic matches examined, the work appears to occupy a relatively unexplored niche at the intersection of cold-start training and attention optimization. The single-paper leaf and limited refutation across most contributions suggest novelty within the examined scope, though the small candidate pool and sparse taxonomy leaf prevent definitive claims about broader field coverage. The analysis captures attention-guided grounding methods but may not reflect exhaustive coverage of cold-start training literature or inference-time intervention techniques.

Taxonomy

Core-task Taxonomy Papers: 5
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Paper: 1

Research Landscape Overview

Core task: Multimodal reasoning with attention-guided visual grounding in cold-start training. The field addresses how models can effectively align visual and textual information when limited or no prior training data is available, a challenge that spans several interconnected research directions. The taxonomy reveals three main branches: Attention-Guided Visual Grounding Mechanisms focus on designing attention architectures that explicitly link language tokens to image regions, enabling interpretable cross-modal alignment; Zero-Shot and Cross-Modal Transfer Learning explores how pre-trained models can generalize to novel tasks or domains without task-specific fine-tuning; and Multimodal Fusion for Recommendation Systems investigates how visual and textual signals can be combined to address data sparsity in user preference modeling.

Works like VLM Region Reasoning[3] and VLM Grounder[4] exemplify the grounding mechanisms branch by refining region-level visual understanding, while approaches such as Grounding Reinforcement Learning[1] and GraphFusion HRL[2] demonstrate how structured reasoning and hierarchical representations support transfer and fusion objectives.

A particularly active line of work centers on optimizing attention mechanisms under cold-start conditions, where models must learn effective grounding without extensive labeled examples. Panoramic Vision[0] situates itself within this branch by emphasizing attention optimization strategies that bootstrap visual grounding from minimal supervision, contrasting with VLM Region Reasoning[3], which assumes richer region annotations, and VLM Grounder[4], which relies on pre-existing grounding capabilities. Meanwhile, Heterogeneous Graph LLM[5] explores graph-based fusion for integrating diverse modalities, highlighting a trade-off between structured relational modeling and the direct attention-driven alignment pursued by Panoramic Vision[0].

These contrasting emphases reveal an open question: whether cold-start grounding is best achieved through end-to-end attention learning or by leveraging auxiliary structures like graphs or reinforcement signals to guide early-stage alignment.

Claimed Contributions

Visual Attention Score (VAS) metric and Lazy Attention Localization phenomenon

The authors propose VAS, a metric measuring attention allocation to visual versus system tokens, and discover it strongly predicts reasoning performance. They reveal that multimodal cold-start fails to increase VAS while text-only initialization paradoxically raises it, a phenomenon they term Lazy Attention Localization.

Retrieved papers: 10
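The report does not give the paper's exact VAS formula. As a minimal sketch, assuming VAS is the share of attention mass placed on visual tokens relative to visual plus system tokens (averaged over heads and queries), it could be computed from a decoder attention map like this; the masks and tensor shapes here are illustrative assumptions, not the authors' definitions:

```python
import numpy as np

def visual_attention_score(attn, visual_mask, system_mask):
    """Illustrative VAS: fraction of attention mass on visual tokens
    relative to visual + system tokens, averaged over heads/queries.

    attn:        (num_heads, num_queries, num_keys) attention weights
    visual_mask: (num_keys,) boolean, True for image tokens
    system_mask: (num_keys,) boolean, True for system-prompt tokens
    """
    vis = attn[..., visual_mask].sum(axis=-1)   # mass on visual tokens
    sys_ = attn[..., system_mask].sum(axis=-1)  # mass on system tokens
    return float((vis / (vis + sys_ + 1e-9)).mean())

# Toy example: 2 heads, 1 query, 4 keys (2 visual, 2 system)
attn = np.array([[[0.4, 0.3, 0.2, 0.1]],
                 [[0.1, 0.2, 0.3, 0.4]]])
visual_mask = np.array([True, True, False, False])
system_mask = np.array([False, False, True, True])
print(round(visual_attention_score(attn, visual_mask, system_mask), 3))  # 0.5
```

A score near 1 would indicate attention concentrated on image tokens, near 0 on the system prompt; the paper's reported correlation (r=0.9616) concerns how such a score tracks reasoning accuracy across models.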
Training-free attention manipulation interventions

The authors develop inference-time interventions that reallocate attention from system tokens to visual tokens without any model retraining. These interventions achieve 1–2% performance gains across different models, providing causal evidence for the role of visual attention in multimodal reasoning.

Retrieved papers: 10 (includes 1 potentially refutable match)
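The report does not specify how the interventions reallocate attention. One common inference-time mechanism consistent with the description is to add a bias to pre-softmax attention logits at visual-token positions and renormalize; the function and parameter names below are hypothetical, not taken from the paper:

```python
import numpy as np

def boost_visual_attention(logits, visual_mask, alpha=1.0):
    """Illustrative inference-time intervention: add a constant bias
    to pre-softmax attention logits at visual-token positions, then
    renormalize. alpha=0 recovers the original distribution.

    logits:      (num_queries, num_keys) pre-softmax attention scores
    visual_mask: (num_keys,) boolean, True for image tokens
    """
    biased = logits + alpha * visual_mask.astype(logits.dtype)
    biased -= biased.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(biased)
    return weights / weights.sum(axis=-1, keepdims=True)

logits = np.array([[1.0, 1.0, 1.0, 1.0]])
visual_mask = np.array([True, True, False, False])
before = boost_visual_attention(logits, visual_mask, alpha=0.0)
after = boost_visual_attention(logits, visual_mask, alpha=1.0)
print(before[0, :2].sum(), after[0, :2].sum())  # visual attention mass rises
```

Because the intervention only edits attention weights at decode time, no gradients or retraining are involved, which is what makes the observed 1–2% gains usable as causal evidence.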
Attention-Guided Visual Anchoring and Reflection (AVAR) framework

The authors introduce AVAR, a complete cold-start training framework combining three components: visual-anchored reflection data synthesis that embeds visual grounding into reasoning chains, attention-guided training objectives that reshape attention distribution, and visual-anchored reward shaping for reinforcement learning. Applied to Qwen2.5-VL-7B, it achieves 7.0% average improvement across seven benchmarks.

Retrieved paper: 1
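The report does not detail AVAR's reward-shaping term. As a hedged sketch of the general pattern (base task reward plus a bonus tied to visual anchoring), one could imagine shaping the RL reward with a VAS-based threshold; the formula, threshold, and weights below are illustrative assumptions, not the paper's method:

```python
def shaped_reward(correct, vas, vas_target=0.5, bonus_weight=0.2):
    """Illustrative visual-anchored reward shaping (not the paper's
    exact formula): base task reward plus a small bonus when the
    rollout's visual attention share meets a target threshold.
    """
    base = 1.0 if correct else 0.0
    bonus = bonus_weight if vas >= vas_target else 0.0
    return base + bonus

print(shaped_reward(True, 0.6))   # correct and visually anchored
print(shaped_reward(True, 0.3))   # correct but weakly anchored
print(shaped_reward(False, 0.6))  # anchored but incorrect
```

A shaping term of this kind would push the policy toward responses whose reasoning attends to the image, complementing the data-synthesis and attention-guided training components of the framework.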

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Visual Attention Score (VAS) metric and Lazy Attention Localization phenomenon


Contribution

Training-free attention manipulation interventions


Contribution

Attention-Guided Visual Anchoring and Reflection (AVAR) framework
