Mitigating Hallucination in Vision-Language Model with Depth and Spatial-aware Key-Value Refinement

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: Hallucination in Vision-Language Models; Depth and Spatial-aware Key-Value Cache Refinement; Key-Value Cache Manipulation; Multimodal
Abstract:

Large vision–language models (VLMs) deliver state-of-the-art results on a wide range of multimodal tasks, yet they remain prone to visual hallucinations, producing content that is not grounded in the input image. Despite progress with visual supervision, reinforcement learning, and post-hoc attention reshaping, the representational origins of hallucinations remain unclear. Our study reveals that successful grounding emerges when adjacent visual tokens exhibit coherent alignment, while hallucinations arise when key vectors scatter isotropically, weakening cross-modal attention and blurring object boundaries. Building on this insight, we propose Depth and Spatial-aware Cache Refinement (DSCR), a lightweight and training-free method that augments the Transformer's key-value (KV) cache with depth cues and 2D spatial proximity. DSCR clusters key vectors within objects and separates those across surfaces, guiding attention toward relevant regions without any fine-tuning. Comprehensive evaluations show that DSCR consistently reduces hallucinations, delivering up to 23% accuracy gains across MME, POPE, RePOPE, CHAIR, and a new depth-sensitive benchmark. Our findings highlight KV-coherence as a core factor behind hallucinations and demonstrate a practical, model-agnostic solution for enhancing VLM reliability.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes DSCR, a training-free method that refines key-value caches using depth and spatial proximity to reduce hallucinations in vision-language models. It resides in the Attention and Representation Analysis leaf under Mechanistic Analysis and Root Cause Investigation, alongside four sibling papers examining attention mechanisms and multimodal alignment. This leaf is relatively sparse within a fifty-paper taxonomy, suggesting that mechanistic investigations into attention-level phenomena remain less crowded than mitigation-focused branches like Training-Free Decoding Strategies or Benchmark Development.

The taxonomy reveals neighboring leaves addressing Component-Level Analysis (visual encoders, architectural choices) and Bias and Prior Effects (language priors, modality biases). DSCR diverges from these by targeting attention coherence rather than component-level redesign or bias correction. Its training-free nature also distinguishes it from sibling work on visual supervision and instruction tuning, positioning it closer to inference-time interventions found in the Training-Free Decoding Strategies branch. The scope notes clarify that attention-level analysis excludes component-level studies, reinforcing DSCR's focus on representational dynamics within existing architectures.

Among twenty candidates examined, none clearly refute the three contributions. The DSCR method itself was compared against eight candidates with zero refutable overlaps, while the KV-coherence analysis examined nine candidates without finding prior work establishing the same representational insight. The depth-sensitive benchmark contribution reviewed three candidates, also yielding no refutations. These statistics reflect a limited semantic search scope rather than exhaustive coverage, indicating that within the examined subset, no directly overlapping prior work emerged. The absence of refutations suggests potential novelty, though broader searches might reveal additional related efforts.

Given the limited search scale and sparse taxonomy leaf, the work appears to introduce a distinct mechanistic perspective on hallucination origins. The training-free cache refinement approach and depth-aware evaluation benchmark occupy underexplored niches within the examined literature. However, the analysis covers only top-twenty semantic matches, leaving open the possibility of related work in adjacent communities or less semantically similar publications. The findings should be interpreted as indicative rather than definitive, pending broader literature review.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 20
Refutable papers: 0

Research Landscape Overview

Core task: mitigating visual hallucination in vision-language models. The field has organized itself into several complementary branches. Hallucination Detection and Evaluation focuses on benchmarking and measuring the extent of hallucinations, with works like Evaluating Object Hallucination in[12] and Hallusionbench[23] establishing metrics and datasets. Mechanistic Analysis and Root Cause Investigation digs into why hallucinations occur, examining attention patterns, representation biases, and modality priors. Mitigation Approaches encompasses a diverse set of techniques ranging from training-time interventions to inference-time corrections, exemplified by methods like Woodpecker[3] and Mitigating Object Hallucinations in[7]. Specialized Hallucination Contexts addresses domain-specific challenges such as multilingual settings or multi-object scenarios, while Survey and Comprehensive Reviews provide broad overviews of the landscape, including A Survey on Hallucination[4] and related syntheses.

Within the mechanistic branch, a handful of works probe how models internally process visual information and where misalignments arise. Mitigating Hallucination in Vision-Language[0] sits squarely in the Attention and Representation Analysis cluster, investigating how attention mechanisms and learned representations contribute to hallucination phenomena. This contrasts with nearby efforts like Hallucination augmented contrastive learning[16], which leverages contrastive objectives to reduce hallucinations, and Mitigating modality prior-induced hallucinations[47], which targets biases stemming from language priors.

A central open question across these lines is whether hallucinations stem primarily from misaligned cross-modal attention, insufficient visual grounding, or overly strong language priors.
By focusing on attention and representation dynamics, Mitigating Hallucination in Vision-Language[0] complements detection-focused studies and offers insights that inform both training strategies and architectural refinements aimed at more faithful vision-language alignment.

Claimed Contributions

Depth and Spatial-aware Cache Refinement (DSCR)

DSCR is a training-free, model-agnostic technique that refines the key-value cache in vision-language models by incorporating depth and spatial proximity information. It clusters key vectors within objects and separates those across surfaces to guide attention toward relevant regions without fine-tuning.

8 retrieved papers
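The report describes DSCR only at a high level. As a rough illustration of the idea, the sketch below (hypothetical, not the authors' implementation; `radius`, `depth_tol`, and `alpha` are invented parameters) pulls each visual token's cached key toward the mean key of its spatially close, depth-consistent neighbors, increasing within-object coherence while leaving depth-separated tokens apart.

```python
# Illustrative sketch of depth- and spatial-aware key refinement.
# NOT the authors' implementation; thresholds and blending are assumptions.
import numpy as np

def refine_keys(keys, positions, depths, radius=2.0, depth_tol=0.1, alpha=0.3):
    """keys: (N, d) cached key vectors for N visual tokens.
    positions: (N, 2) patch-grid coordinates; depths: (N,) per-token depth."""
    n = keys.shape[0]
    refined = keys.copy()
    for i in range(n):
        # Neighbors: close in the 2D grid AND at a similar depth, i.e.
        # likely on the same object surface.
        dist = np.linalg.norm(positions - positions[i], axis=1)
        near = (dist <= radius) & (np.abs(depths - depths[i]) <= depth_tol)
        near[i] = False
        if near.any():
            # Pull the key toward its same-surface neighborhood mean,
            # tightening within-object clusters in the KV cache.
            refined[i] = (1 - alpha) * keys[i] + alpha * keys[near].mean(axis=0)
    return refined

# Toy example: two tokens on a near surface, two on a far surface.
rng = np.random.default_rng(0)
keys = rng.normal(size=(4, 8))
positions = np.array([[0, 0], [0, 1], [5, 5], [5, 6]], dtype=float)
depths = np.array([0.2, 0.2, 0.9, 0.9])
out = refine_keys(keys, positions, depths)
```

After refinement, keys of same-surface neighbors (tokens 0/1 and 2/3) move closer together, while the two surfaces stay separated because no cross-surface pair passes both the spatial and the depth test.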
Analysis of KV-coherence and hallucination relationship

The authors provide the first analysis showing that hallucinations in vision-language models occur when key vectors lose coherence and scatter isotropically, whereas successful grounding requires coherent alignment of adjacent visual tokens. This insight is supported by PCA-based visualizations and attention diagnostics.

9 retrieved papers
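The KV-coherence claim can be made concrete with a simple anisotropy score. The following sketch (our illustration, not the paper's actual diagnostic) rates a group of key vectors by the fraction of variance captured by the top principal component: values near 1 indicate coherent alignment along a shared direction, while values near 1/d indicate isotropic scatter.

```python
# Hypothetical coherence diagnostic in the spirit of the paper's analysis.
import numpy as np

def key_coherence(keys):
    """keys: (N, d) key vectors for a group of adjacent visual tokens.
    Returns the fraction of total variance on the top principal component."""
    centered = keys - keys.mean(axis=0)
    # Squared singular values are proportional to per-component variance.
    s = np.linalg.svd(centered, compute_uv=False)
    var = s ** 2
    return var[0] / var.sum()

rng = np.random.default_rng(1)
direction = rng.normal(size=16)
# Coherent group: keys spread mostly along one shared direction.
coherent = np.outer(rng.normal(size=32), direction) + 0.05 * rng.normal(size=(32, 16))
# Isotropic group: no preferred direction.
scattered = rng.normal(size=(32, 16))
```

Under this score, the coherent group lands close to 1 and the isotropic group near the flat-spectrum baseline, mirroring the grounded-vs-hallucinated contrast the analysis describes.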
Depth-sensitive hallucination benchmark

The authors introduce a new depth-sensitive benchmark designed to evaluate vision-language models in challenging scenarios involving occlusion boundaries and semantically distinct objects at similar depths, complementing existing hallucination evaluation datasets.

3 retrieved papers
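The report does not specify the benchmark's item format. Purely for illustration, a depth-sensitive evaluation item might pair a binary object-presence question with a tag for the depth challenge it probes (all field names below are hypothetical):

```python
# Hypothetical benchmark item; the actual schema is not given in the report.
item = {
    "image_id": "example_0001",               # placeholder identifier
    "question": "Is there a cup behind the laptop?",
    "depth_challenge": "occlusion_boundary",  # or "similar_depth_distinct_objects"
    "answer": "yes",                          # ground-truth binary label
}
```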

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Depth and Spatial-aware Cache Refinement (DSCR)


Contribution

Analysis of KV-coherence and hallucination relationship


Contribution

Depth-sensitive hallucination benchmark
