To Sink or Not to Sink: Visual Information Pathways in Large Vision-Language Models
Overview
Overall Novelty Assessment
The paper identifies and analyzes ViT attention sinks in large vision-language models, proposing a training-free repositioning strategy and a DIYSink framework with dual MLP projection and dynamic token selection. It resides in the 'Visual Token Attention Sink Discovery' leaf, which contains only two papers: this work and one sibling (Visual Attention Sink). This is a relatively sparse research direction within a 16-paper taxonomy, suggesting that the specific focus on ViT-level attention sinks is an emerging area with limited prior exploration.
The taxonomy reveals that most related work concentrates on downstream interventions rather than ViT-level analysis. Neighboring leaves address cross-modal attention dynamics across layers, hallucination mitigation through attention reallocation, and token compression for efficiency. The paper's focus on visual encoder sinks distinguishes it from these directions: while sibling work examines sink phenomena, the broader field emphasizes LLM-side attention patterns or efficiency gains. The taxonomy's scope notes explicitly separate sink characterization from mitigation methods, positioning this work as foundational analysis rather than application-oriented intervention.
Among 17 candidates examined across three contributions, none were found to clearly refute the work. The ViT sink identification contribution examined 2 candidates with no refutations, the training-free repositioning strategy examined 10 candidates with no refutations, and the DIYSink framework examined 5 candidates with no refutations. This suggests that within the limited search scope, the specific combination of ViT-level sink analysis, training-free repositioning, and the dual-MLP framework appears relatively unexplored. However, the small candidate pool means the analysis captures top semantic matches rather than exhaustive field coverage.
Based on the limited literature search, the work appears to occupy a distinct position by shifting attention sink analysis from LLM layers to the vision encoder. The sparse taxonomy leaf and absence of refuting candidates among 17 examined papers suggest novelty, though the small search scope limits definitive conclusions. The framework's combination of sink identification, repositioning, and dynamic selection represents a multi-faceted approach not clearly anticipated by the examined prior work.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors identify and analyze ViT attention sinks, which are high-norm visual tokens from the Vision Transformer that encapsulate high-level semantic concepts. They show these sinks propagate into the LLM alongside LLM-emerged sinks, and demonstrate that ViT sinks capture coarse-grained contextual information beneficial for certain reasoning tasks.
The authors propose a simple inference-time strategy that repositions ViT sink tokens to the beginning of the visual token sequence. The method requires no additional training and can be applied post hoc to any existing LVLM, showing improvements particularly on tasks requiring high-level understanding.
The authors introduce DIYSink, a training-based framework featuring dual MLP projectors that separately process sink and non-sink tokens, combined with dynamic token selection mechanisms (CoT routing or learned reweighting) to adaptively choose which tokens to use based on task demands and image complexity.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[1] See what you are told: Visual attention sink in large multimodal models
Contribution Analysis
Detailed comparisons for each claimed contribution
Identification and analysis of ViT attention sinks in LVLMs
The authors identify and analyze ViT attention sinks, which are high-norm visual tokens from the Vision Transformer that encapsulate high-level semantic concepts. They show these sinks propagate into the LLM alongside LLM-emerged sinks, and demonstrate that ViT sinks capture coarse-grained contextual information beneficial for certain reasoning tasks.
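As a rough illustration of how high-norm sink tokens might be detected, the sketch below flags visual tokens whose L2 norm lies well above the sequence statistics. The function name `find_vit_sinks` and the mean-plus-k-standard-deviations threshold are illustrative assumptions for this report, not the paper's exact criterion.

```python
import math

def find_vit_sinks(tokens, k=2.0):
    """Return indices of tokens with unusually large L2 norm.

    `tokens` is a list of ViT output feature vectors (lists of floats).
    The mean + k*std cutoff is an illustrative choice, not the
    paper's detection rule.
    """
    norms = [math.sqrt(sum(x * x for x in t)) for t in tokens]
    mean = sum(norms) / len(norms)
    std = math.sqrt(sum((n - mean) ** 2 for n in norms) / len(norms))
    threshold = mean + k * std
    return [i for i, n in enumerate(norms) if n > threshold]
```

On a toy sequence where one token's norm dwarfs the rest, only that token is flagged; in practice the threshold would be tuned per encoder.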
Training-free sink-to-the-front repositioning strategy
The authors propose a simple inference-time strategy that repositions ViT sink tokens to the beginning of the visual token sequence. The method requires no additional training and can be applied post hoc to any existing LVLM, showing improvements particularly on tasks requiring high-level understanding.
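A minimal sketch of this repositioning, assuming the sink indices have already been identified, is a stable reorder of the visual token list so sinks come first; the function name and the choice to preserve relative order within each group are illustrative assumptions.

```python
def reposition_sinks(tokens, sink_indices):
    """Move sink tokens to the front of the visual token sequence,
    keeping the original relative order within each group.

    A training-free, inference-time reordering: no weights change,
    only the order in which visual tokens reach the LLM.
    """
    sink_set = set(sink_indices)
    sinks = [t for i, t in enumerate(tokens) if i in sink_set]
    others = [t for i, t in enumerate(tokens) if i not in sink_set]
    return sinks + others
```

Because the operation touches only the token order, it can in principle be applied post hoc to any existing LVLM pipeline, consistent with the training-free claim above.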
[24] SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference
[25] Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling
[26] Identifying and mitigating position bias of multi-image vision-language models
[27] Token Expand-Merge: Training-Free Token Compression for Vision-Language-Action Models
[28] Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models
[29] Dream2Real: Zero-Shot 3D Object Rearrangement with Vision-Language Models
[30] HiPrune: Training-Free Visual Token Pruning via Hierarchical Attention in Vision-Language Models
[31] SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
[32] Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training
[33] Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models
DIYSink framework with dual MLP projection and dynamic token selection
The authors introduce DIYSink, a training-based framework featuring dual MLP projectors that separately process sink and non-sink tokens, combined with dynamic token selection mechanisms (CoT routing or learned reweighting) to adaptively choose which tokens to use based on task demands and image complexity.
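The dual-projection idea can be sketched as routing each token through one of two small MLPs depending on whether it was flagged as a sink. Everything below is a toy illustration under stated assumptions: the two-layer ReLU projector, the weight shapes, and the names `diysink_project` and `mlp` are this report's inventions, and the dynamic token selection step (CoT routing or learned reweighting) is omitted for brevity.

```python
def mlp(token, w1, w2):
    """Toy two-layer projector: linear -> ReLU -> linear.

    Weights are nested lists; shapes are illustrative only.
    """
    hidden = [max(0.0, sum(w * x for w, x in zip(row, token))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

def diysink_project(tokens, sink_indices, sink_mlp, plain_mlp):
    """Route sink tokens through one projector and all other tokens
    through a second, separately parameterized projector."""
    sink_set = set(sink_indices)
    return [mlp(t, *(sink_mlp if i in sink_set else plain_mlp))
            for i, t in enumerate(tokens)]
```

With a doubling projector for sinks and an identity projector for the rest, sink tokens come out scaled while the others pass through unchanged, which is the essential "separate processing paths" behavior the framework description implies.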