Neural Message-Passing on Attention Graphs for Hallucination Detection

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: hallucination detection, graph neural networks, LLMs, attention graphs
Abstract:

Large Language Models (LLMs) often generate incorrect or unsupported content, known as hallucinations. Existing detection methods rely on heuristics or simple models over isolated computational traces such as activations or attention maps. We unify these signals by representing them as attributed graphs, where tokens are nodes, edges follow attention flows, and both carry features derived from attention scores and activations. Our approach, CHARM, casts hallucination detection as a graph learning task and tackles it by applying GNNs over these attributed graphs. We show that CHARM provably subsumes prior attention-based heuristics and, experimentally, that it consistently outperforms other leading approaches across diverse benchmarks. Our results highlight the important role of the graph structure and the benefits of combining computational traces, while showing that CHARM exhibits promising zero-shot performance on cross-dataset transfer.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces CHARM, a graph neural network framework that unifies attention scores and activation features into attributed graphs for hallucination detection. It falls in the 'Graph-Based Attention Analysis' leaf under 'Internal Model State Analysis', a leaf that contains only two papers, including this one. This is a relatively sparse research direction within the broader taxonomy of 50 papers across 36 topics, suggesting that the graph-based formulation of attention mechanisms for hallucination detection remains an emerging approach rather than a crowded subfield.

The taxonomy reveals that CHARM's parent category, 'Internal Model State Analysis', contains two sibling leaves: 'Neural Probe and Feature Fusion Approaches' (2 papers) and 'Uncertainty and Probability-Based Detection' (2 papers). These neighboring directions analyze hidden states through trained probes or exploit output probability distributions, respectively. CHARM diverges by treating attention as relational structure rather than isolated features, positioning it at the intersection of graph learning and internal model diagnostics. The broader 'Detection Methodologies and Frameworks' branch also includes external knowledge-based methods and self-verification approaches, which CHARM does not incorporate.

Among the 24 candidates examined across the three contributions, no paper was identified as clearly refuting any of CHARM's claims. For the 'Unified attributed graph representation' contribution, 4 candidates were examined and none were refuting; for the 'GNN-based framework', 10 were examined and none were refuting; and for the 'Theoretical subsumption', 10 were examined and none were refuting. This suggests that, within the limited search scope (top-K semantic matches plus citation expansion), the specific combination of graph representation, GNN application, and theoretical analysis appears distinct from prior work, though the search was not exhaustive.

The analysis indicates CHARM occupies a novel position within the examined literature, particularly in its unified graph formulation and formal subsumption claims. However, the limited search scope (24 candidates from semantic retrieval) means this assessment reflects novelty relative to closely related work rather than the entire field. The sparse population of the 'Graph-Based Attention Analysis' leaf and absence of refuting candidates among examined papers suggest the approach introduces fresh technical machinery, though broader field coverage would strengthen confidence in this conclusion.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 0

Research Landscape Overview

Core task: Hallucination detection in large language models.

The field has organized itself around several complementary perspectives. Detection Methodologies and Frameworks encompasses techniques that probe internal model states—such as attention patterns, hidden representations, and graph-based analyses—as well as external consistency checks and self-verification protocols. Domain-Specific Hallucination Detection targets particular application areas like vision-language models, code generation, and product listings, recognizing that hallucination patterns vary across modalities and tasks. Mitigation and Correction Strategies explores interventions ranging from fine-tuning and decoding adjustments to post-hoc editing and verification chains, exemplified by works like Chain of Verification[2] and Entity-level Mitigation[11]. Theoretical Foundations and Comprehensive Analyses provides taxonomies, surveys, and conceptual frameworks that map the landscape, while Related Phenomena and Evaluation Challenges addresses measurement difficulties, benchmark design, and the broader trustworthiness context within which hallucination sits.

Recent activity reveals a tension between lightweight, interpretable detection methods and more resource-intensive but comprehensive approaches. Many studies pursue internal model diagnostics—analyzing attention flows, hidden states, or uncertainty signals—to catch hallucinations without external knowledge bases, as seen in Neural Probe Detection[26] and Source Context Attention[28]. Others combine internal and external signals for robustness, such as Internal-External Fusion[50].

Neural Message Passing[0] falls squarely within the graph-based attention analysis cluster, treating attention structures as message-passing graphs to identify anomalous patterns that correlate with hallucinated content. This approach shares conceptual ground with Graph Signal Processing[42], which similarly leverages graph-theoretic tools on attention mechanisms, but Neural Message Passing[0] emphasizes dynamic message flow rather than static signal properties. The broader challenge remains balancing detection accuracy with computational overhead, a trade-off that continues to shape method development across all branches.

Claimed Contributions

Unified attributed graph representation of LLM computational traces

The authors introduce a unified framework that represents LLM computational traces (attention scores and activations) as attributed graphs. In this formulation, tokens become nodes, edges are defined by attention flows, and both nodes and edges carry features derived from computational traces across layers.

4 retrieved papers
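As an illustration of what such an attributed graph might look like in code, here is a minimal sketch. The function name `build_attention_graph`, the threshold `tau`, and the causal edge rule are assumptions for illustration, not details taken from the paper:

```python
def build_attention_graph(attn, acts, tau=0.1):
    """attn: [L][T][T] per-layer attention (averaged over heads);
    acts: [L][T][d] per-layer activations; tau: edge threshold.
    Returns (nodes, edges) of an attributed token graph."""
    L, T = len(attn), len(attn[0])
    # Node features: each token's activations, concatenated across layers.
    nodes = {t: [x for layer in acts for x in layer[t]] for t in range(T)}
    # Edges follow attention flow (a query token attends to earlier keys);
    # edge features are that token pair's per-layer attention scores.
    edges = {}
    for q in range(T):
        for k in range(q + 1):  # causal mask: token q attends to k <= q
            scores = [attn[l][q][k] for l in range(L)]
            if max(scores) >= tau:
                edges[(k, q)] = scores  # information flows k -> q
    return nodes, edges
```

Thresholding on the maximum per-layer score is one plausible way to sparsify the graph; the paper may use a different sparsification rule.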
CHARM: GNN-based hallucination detection framework

The authors propose CHARM, a method that formulates hallucination detection as a graph learning problem and applies Graph Neural Networks (GNNs) with message-passing over computational trace graphs. This approach can handle both token-level and response-level detection granularities.

10 retrieved papers
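A toy sketch of message passing over such a graph, with both detection granularities, might look as follows. The function names, the uniform attention-weighted mean aggregation, and the mean-based readouts are illustrative placeholders for learned GNN parameters, not the paper's architecture:

```python
def message_passing_round(nodes, edges):
    """One round: each node adds the average of its in-neighbours'
    features, each weighted by the mean attention score on its edge."""
    new = {}
    for t, h in nodes.items():
        msgs = [[w * x for x in nodes[src]]
                for (src, dst), scores in edges.items()
                if dst == t
                for w in [sum(scores) / len(scores)]]
        agg = ([sum(col) / len(msgs) for col in zip(*msgs)]
               if msgs else [0.0] * len(h))
        new[t] = [a + b for a, b in zip(h, agg)]
    return new

def token_scores(nodes):
    # Token-level readout: placeholder score per token (mean of features).
    return {t: sum(h) / len(h) for t, h in nodes.items()}

def response_score(nodes):
    # Response-level readout: pool token scores (mean pooling here).
    s = token_scores(nodes)
    return sum(s.values()) / len(s)
```

The two readouts show how one message-passing backbone can serve both granularities: score each node for token-level detection, or pool over all nodes for a single response-level decision.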
Theoretical subsumption of attention-based heuristics

The authors formally prove that CHARM can express and generalize existing attention-based hallucination detection methods, such as Lookback Lens and LLM-Check, demonstrating the expressiveness of their graph-based framework through theoretical analysis.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
