Visual Multi-Agent System: Mitigating Hallucination Snowballing via Visual Flow

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: multi-agent system, visual hallucination snowballing
Abstract:

Multi-Agent Systems (MAS) powered by Visual Language Models (VLMs) enable challenging tasks but suffer from a novel failure mode, multi-agent visual hallucination snowballing, in which hallucinations seeded in a single agent are amplified by subsequent ones due to over-reliance on textual flow to relay visual information. Through turn-, layer-, and token-wise attention analyses, we show that hallucination snowballing stems from a reduction in visual attention allocation. This leads us to identify a subset of vision tokens with a unimodal attention peak in middle layers that best preserve visual evidence but gradually diminish in deeper agent turns, producing visual hallucination snowballing in MAS. We therefore propose ViF, a lightweight, plug-and-play mitigation paradigm that relays inter-agent messages with a Visual Flow carried by the selected visual relay tokens, and applies attention reallocation to amplify this pattern. Experimental results demonstrate that our method markedly reduces hallucination snowballing, consistently improving performance across eight benchmarks, four common MAS structures, and ten base models. The implementation source code will be made publicly available.

Disclaimer
This report is AI-generated using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper formalizes multi-agent visual hallucination snowballing—where errors seed in one agent and amplify through subsequent agents—and proposes ViF, a visual flow paradigm using selected relay tokens and attention reallocation. It resides in the Visual Attention and Token-Level Interventions leaf, which contains only two papers total. This sparse population suggests the specific focus on token-level attention manipulation for multi-agent visual hallucination is relatively underexplored compared to broader hallucination mitigation strategies.

The taxonomy reveals a crowded landscape in adjacent areas: Cross-Modal Verification and Debate contains four papers addressing visual-language consistency through debate or external tools, while Text-Based LLM Hallucination Mitigation encompasses multiple leaves with debate, retrieval, and filtering methods. The paper's emphasis on visual token selection and attention reallocation distinguishes it from these neighboring approaches, which typically rely on agent consensus or external verification rather than fine-grained visual evidence preservation across agent turns.

Among thirty candidates examined, none clearly refute the three core contributions: formalizing snowballing (ten candidates, zero refutable), identifying unimodal attention peaks in vision tokens (ten candidates, zero refutable), and the ViF paradigm (ten candidates, zero refutable). This limited search scope suggests the specific combination of multi-agent dynamics, visual token analysis, and attention reallocation appears novel within the examined literature, though the analysis does not cover exhaustive prior work beyond top-K semantic matches.

Based on the restricted search and sparse taxonomy leaf, the work appears to occupy a distinct niche at the intersection of multi-agent systems and vision-language attention mechanisms. The absence of refutable candidates across all contributions within thirty examined papers indicates potential novelty, though broader literature beyond semantic search may reveal related efforts in visual grounding or multi-agent error propagation.

Taxonomy

Core-task Taxonomy Papers: 47
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: mitigating visual hallucination snowballing in multi-agent systems.

The field structure reflects a broad concern with hallucination across different modalities and system architectures. Vision-Language Model Hallucination Mitigation focuses on techniques that address errors arising when models interpret visual inputs, often through attention mechanisms or token-level interventions. Text-Based LLM Hallucination Mitigation tackles purely linguistic errors, employing retrieval augmentation, reasoning verification, and agentic workflows. Multi-Agent System Architectures and Coordination explores how agents collaborate, communicate, and reach consensus, with works like MetaGPT[5] exemplifying structured role-based frameworks. Domain-Specific Multi-Agent Applications demonstrates how these systems are deployed in specialized contexts such as legal intake, gaming, and code generation, while Surveys, Taxonomies, and Theoretical Frameworks provides overarching perspectives on hallucination detection and mitigation strategies. General Multi-Agent Hallucination Mitigation addresses cross-cutting challenges that span both vision and text modalities within collaborative settings.

A particularly active line of work examines how hallucinations propagate and compound when multiple agents interact, raising questions about error detection, correction protocols, and the trade-offs between agent autonomy and oversight. Visual Multi-Agent System[0] sits within the Visual Attention and Token-Level Interventions branch, emphasizing fine-grained control over how vision-language models process visual tokens to prevent hallucination cascades in collaborative scenarios. This contrasts with nearby efforts like Mitigating Large Vision-Language Model[1], which also targets visual hallucinations but may focus more on single-model refinement than on multi-agent dynamics. Meanwhile, works such as Sentinel Agents for Secure[7] and Minimizing hallucinations and communication[8] explore complementary strategies, introducing specialized monitoring agents or optimizing inter-agent communication to curb error propagation. The central challenge remains balancing the benefits of diverse agent perspectives against the risk that one agent's hallucination might mislead others, a theme that Visual Multi-Agent System[0] addresses through targeted visual attention interventions.

Claimed Contributions

Formalization of multi-agent visual hallucination snowballing phenomenon

The authors formally define and characterize a novel failure mode in Visual Language Model-based Multi-Agent Systems where visual hallucinations originating in one agent are amplified through subsequent agents due to over-reliance on textual information flow, and they establish its connection to reduced visual attention allocation across agent turns.

10 retrieved papers
Identification of critical vision tokens with unimodal attention peaks

Through turn-wise, layer-wise, and token-wise attention analyses, the authors identify a specific subset of vision tokens characterized by unimodal attention peaks in middle layers that best preserve visual evidence and are essential for maintaining visual information flow in multi-agent systems.

10 retrieved papers
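The selection criterion this contribution describes (vision tokens whose layer-wise attention profile rises to a single peak in the middle layers) can be sketched in code. The sketch below is an illustration, not the paper's implementation: the shape of the attention matrix, the `mid_band` fraction defining "middle layers", the unimodality test, and the optional `top_k` cap are all assumptions made for the example.

```python
import numpy as np

def is_unimodal(profile, tol=1e-9):
    """True if the sequence rises to a single peak and then falls."""
    diffs = np.diff(profile)
    signs = np.sign(np.where(np.abs(diffs) < tol, 0, diffs))
    signs = signs[signs != 0]
    # Unimodal: the sign sequence never switches from falling back to rising.
    return not np.any(np.diff(signs) > 0)

def select_relay_tokens(attn, mid_band=(0.3, 0.7), top_k=None):
    """attn: (num_layers, num_vision_tokens) array of the mean attention
    mass each layer allocates to each vision token. Returns indices of
    tokens whose layer-wise profile is unimodal with its peak inside the
    middle band of layers."""
    num_layers, num_tokens = attn.shape
    lo = int(mid_band[0] * num_layers)
    hi = int(mid_band[1] * num_layers)
    selected = []
    for t in range(num_tokens):
        profile = attn[:, t]
        peak = int(np.argmax(profile))
        if lo <= peak < hi and is_unimodal(profile):
            selected.append(t)
    if top_k is not None:
        # Keep the tokens with the strongest peaks.
        selected = sorted(selected, key=lambda t: attn[:, t].max(),
                          reverse=True)[:top_k]
    return selected
```

For instance, a token whose attention peaks smoothly around the network's middle layers would be selected, while a token whose attention only grows monotonically toward the last layer, or oscillates, would not.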
ViF: Visual Flow mitigation paradigm with attention reallocation

The authors propose ViF, a lightweight and model-agnostic method that mitigates hallucination snowballing by introducing visual flow powered by selected visual relay tokens to relay inter-agent messages and applying attention reallocation to amplify beneficial attention patterns, rather than relying solely on textual flows.

10 retrieved papers
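One common way to realize the attention-reallocation component described above is to boost the pre-softmax logits of the selected relay tokens, so that extra probability mass flows to the preserved visual evidence. The sketch below assumes this additive log-space formulation; the gain `alpha`, the `relay_idx` argument, and the per-query broadcast are illustrative choices, not necessarily the paper's exact rule.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def reallocate_attention(scores, relay_idx, alpha=1.5):
    """scores: raw attention logits, shape (query_len, key_len).
    Additively boosts the logits of the selected relay tokens by
    log(alpha), which multiplies their unnormalized attention weight
    by alpha before the softmax renormalizes the row."""
    boosted = scores.copy()
    boosted[:, relay_idx] += np.log(alpha)
    return softmax(boosted, axis=-1)
```

With uniform logits over four keys and `alpha=2.0` on the first key, the relay token's attention doubles relative to the others while each row still sums to one.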

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Formalization of multi-agent visual hallucination snowballing phenomenon

The authors formally define and characterize a novel failure mode in Visual Language Model-based Multi-Agent Systems where visual hallucinations originating in one agent are amplified through subsequent agents due to over-reliance on textual information flow, and they establish its connection to reduced visual attention allocation across agent turns.

Contribution

Identification of critical vision tokens with unimodal attention peaks

Through turn-wise, layer-wise, and token-wise attention analyses, the authors identify a specific subset of vision tokens characterized by unimodal attention peaks in middle layers that best preserve visual evidence and are essential for maintaining visual information flow in multi-agent systems.

Contribution

ViF: Visual Flow mitigation paradigm with attention reallocation

The authors propose ViF, a lightweight and model-agnostic method that mitigates hallucination snowballing by introducing visual flow powered by selected visual relay tokens to relay inter-agent messages and applying attention reallocation to amplify beneficial attention patterns, rather than relying solely on textual flows.