Mitigating Hallucination in Vision-Language Model with Depth and Spatial-aware Key-Value Refinement

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: Hallucination in Vision-Language Models; Depth and Spatial-aware Key-Value Cache Refinement; Key-Value Cache Manipulation; Multimodal
Abstract:

Large vision–language models (VLMs) deliver state-of-the-art results on a wide range of multimodal tasks, yet they remain prone to visual hallucinations, producing content that is not grounded in the input image. Despite progress with visual supervision, reinforcement learning, and post-hoc attention reshaping, the representational origins of hallucinations remain unclear. Our study reveals that successful grounding emerges when adjacent visual tokens exhibit coherent alignment, while hallucinations arise when key vectors scatter isotropically, weakening cross-modal attention and blurring object boundaries. Building on this insight, we propose Depth and Spatial-aware Cache Refinement (DSCR), a lightweight and training-free method that augments the Transformer's key-value (KV) cache with depth cues and 2D spatial proximity. DSCR clusters key vectors within objects and separates those across surfaces, guiding attention toward relevant regions without any fine-tuning. Comprehensive evaluations show that DSCR consistently reduces hallucinations, delivering up to 23% accuracy gains across MME, POPE, RePOPE, CHAIR, and a new depth-sensitive benchmark. Our findings highlight KV-coherence as a core factor behind hallucinations and demonstrate a practical, model-agnostic solution for enhancing VLM reliability.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes DSCR, a training-free method that refines key-value caches using depth and spatial proximity to reduce hallucinations in vision-language models. It resides in the Attention and Representation Analysis leaf under Mechanistic Analysis and Root Cause Investigation, alongside four sibling papers examining attention mechanisms and multimodal alignment. This leaf is relatively sparse within a fifty-paper taxonomy, suggesting that mechanistic investigations into attention-level phenomena remain less crowded than mitigation-focused branches like Training-Free Decoding Strategies or Benchmark Development.

The taxonomy reveals neighboring leaves addressing Component-Level Analysis (visual encoders, architectural choices) and Bias and Prior Effects (language priors, modality biases). DSCR diverges from these by targeting attention coherence rather than component-level redesign or bias correction. Its training-free nature also distinguishes it from sibling work on visual supervision and instruction tuning, positioning it closer to inference-time interventions found in the Training-Free Decoding Strategies branch. The scope notes clarify that attention-level analysis excludes component-level studies, reinforcing DSCR's focus on representational dynamics within existing architectures.

Among twenty candidates examined, none clearly refute the three contributions. The DSCR method itself was compared against eight candidates with zero refutable overlaps, while the KV-coherence analysis examined nine candidates without finding prior work establishing the same representational insight. The depth-sensitive benchmark contribution reviewed three candidates, also yielding no refutations. These statistics reflect a limited semantic search scope rather than exhaustive coverage, indicating that within the examined subset, no directly overlapping prior work emerged. The absence of refutations suggests potential novelty, though broader searches might reveal additional related efforts.

Given the limited search scale and sparse taxonomy leaf, the work appears to introduce a distinct mechanistic perspective on hallucination origins. The training-free cache refinement approach and depth-aware evaluation benchmark occupy underexplored niches within the examined literature. However, the analysis covers only top-twenty semantic matches, leaving open the possibility of related work in adjacent communities or less semantically similar publications. The findings should be interpreted as indicative rather than definitive, pending broader literature review.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 20
Refutable papers: 0

Research Landscape Overview

Core task: mitigating visual hallucination in vision-language models. The field has organized itself into several complementary branches. Hallucination Detection and Evaluation focuses on benchmarking and measuring the extent of hallucinations, with works like Evaluating Object Hallucination in[12] and Hallusionbench[23] establishing metrics and datasets. Mechanistic Analysis and Root Cause Investigation digs into why hallucinations occur, examining attention patterns, representation biases, and modality priors. Mitigation Approaches encompasses a diverse set of techniques ranging from training-time interventions to inference-time corrections, exemplified by methods like Woodpecker[3] and Mitigating Object Hallucinations in[7]. Specialized Hallucination Contexts addresses domain-specific challenges such as multilingual settings or multi-object scenarios, while Survey and Comprehensive Reviews provide broad overviews of the landscape, including A Survey on Hallucination[4] and related syntheses.

Within the mechanistic branch, a handful of works probe how models internally process visual information and where misalignments arise. Mitigating Hallucination in Vision-Language[0] sits squarely in the Attention and Representation Analysis cluster, investigating how attention mechanisms and learned representations contribute to hallucination phenomena. This contrasts with nearby efforts like Hallucination augmented contrastive learning[16], which leverages contrastive objectives to reduce hallucinations, and Mitigating modality prior-induced hallucinations[47], which targets biases stemming from language priors.

A central open question across these lines is whether hallucinations stem primarily from misaligned cross-modal attention, insufficient visual grounding, or overly strong language priors.
By focusing on attention and representation dynamics, Mitigating Hallucination in Vision-Language[0] complements detection-focused studies and offers insights that inform both training strategies and architectural refinements aimed at more faithful vision-language alignment.

Claimed Contributions

Depth and Spatial-aware Cache Refinement (DSCR)

DSCR is a training-free, model-agnostic technique that refines the key-value cache in vision-language models by incorporating depth and spatial proximity information. It clusters key vectors within objects and separates those across surfaces to guide attention toward relevant regions without fine-tuning.

8 retrieved papers
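The report describes DSCR only at a high level. As a rough illustration of the idea, the sketch below (hypothetical, not the authors' implementation; `radius`, `depth_tol`, and `alpha` are invented parameters) pulls each visual token's cached key toward the mean key of its spatially close, depth-consistent neighbors, increasing within-object coherence while leaving depth-separated tokens apart.

```python
# Illustrative sketch of depth- and spatial-aware key refinement.
# NOT the authors' implementation; thresholds and blending are assumptions.
import numpy as np

def refine_keys(keys, positions, depths, radius=2.0, depth_tol=0.1, alpha=0.3):
    """keys: (N, d) cached key vectors for N visual tokens.
    positions: (N, 2) patch-grid coordinates; depths: (N,) per-token depth."""
    n = keys.shape[0]
    refined = keys.copy()
    for i in range(n):
        # Neighbors: close in the 2D grid AND at a similar depth, i.e.
        # likely on the same object surface.
        dist = np.linalg.norm(positions - positions[i], axis=1)
        near = (dist <= radius) & (np.abs(depths - depths[i]) <= depth_tol)
        near[i] = False
        if near.any():
            # Pull the key toward its same-surface neighborhood mean,
            # tightening within-object clusters in the KV cache.
            refined[i] = (1 - alpha) * keys[i] + alpha * keys[near].mean(axis=0)
    return refined

# Toy example: two tokens on a near surface, two on a far surface.
rng = np.random.default_rng(0)
keys = rng.normal(size=(4, 8))
positions = np.array([[0, 0], [0, 1], [5, 5], [5, 6]], dtype=float)
depths = np.array([0.2, 0.2, 0.9, 0.9])
out = refine_keys(keys, positions, depths)
```

After refinement, keys of same-surface neighbors (tokens 0/1 and 2/3) move closer together, while the two surfaces stay separated because no cross-surface pair passes both the spatial and the depth test.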
Analysis of KV-coherence and hallucination relationship

The authors provide the first analysis showing that hallucinations in vision-language models occur when key vectors lose coherence and scatter isotropically, whereas successful grounding requires coherent alignment of adjacent visual tokens. This insight is supported by PCA-based visualizations and attention diagnostics.

9 retrieved papers
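The KV-coherence claim can be made concrete with a simple anisotropy score. The following sketch (our illustration, not the paper's actual diagnostic) rates a group of key vectors by the fraction of variance captured by the top principal component: values near 1 indicate coherent alignment along a shared direction, while values near 1/d indicate isotropic scatter.

```python
# Hypothetical coherence diagnostic in the spirit of the paper's analysis.
import numpy as np

def key_coherence(keys):
    """keys: (N, d) key vectors for a group of adjacent visual tokens.
    Returns the fraction of total variance on the top principal component."""
    centered = keys - keys.mean(axis=0)
    # Squared singular values are proportional to per-component variance.
    s = np.linalg.svd(centered, compute_uv=False)
    var = s ** 2
    return var[0] / var.sum()

rng = np.random.default_rng(1)
direction = rng.normal(size=16)
# Coherent group: keys spread mostly along one shared direction.
coherent = np.outer(rng.normal(size=32), direction) + 0.05 * rng.normal(size=(32, 16))
# Isotropic group: no preferred direction.
scattered = rng.normal(size=(32, 16))
```

Under this score, the coherent group lands close to 1 and the isotropic group near the flat-spectrum baseline, mirroring the grounded-vs-hallucinated contrast the analysis describes.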
Depth-sensitive hallucination benchmark

The authors introduce a new depth-sensitive benchmark designed to evaluate vision-language models in challenging scenarios involving occlusion boundaries and semantically distinct objects at similar depths, complementing existing hallucination evaluation datasets.

3 retrieved papers
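The report does not specify the benchmark's item format. Purely for illustration, a depth-sensitive evaluation item might pair a binary object-presence question with a tag for the depth challenge it probes (all field names below are hypothetical):

```python
# Hypothetical benchmark item; the actual schema is not given in the report.
item = {
    "image_id": "example_0001",               # placeholder identifier
    "question": "Is there a cup behind the laptop?",
    "depth_challenge": "occlusion_boundary",  # or "similar_depth_distinct_objects"
    "answer": "yes",                          # ground-truth binary label
}
```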

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Depth and Spatial-aware Cache Refinement (DSCR)


Contribution

Analysis of KV-coherence and hallucination relationship


Contribution

Depth-sensitive hallucination benchmark
