GNN Explanations that do not Explain and How to find Them

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: graph neural networks, explainability, self-explainable, auditing, faithfulness
Abstract:

Explanations provided by Self-explainable Graph Neural Networks (SE-GNNs) are fundamental for understanding the model's inner workings and for identifying potential misuse of sensitive attributes. Although recent works have highlighted that these explanations can be suboptimal and potentially misleading, a characterization of their failure cases is unavailable. In this work, we identify a critical failure of SE-GNN explanations: explanations can be unambiguously unrelated to how the SE-GNNs infer labels. We show that, on the one hand, many SE-GNNs can achieve optimal true risk while producing these degenerate explanations, and on the other, most faithfulness metrics can fail to identify these failure modes. Our empirical analysis reveals that degenerate explanations can be maliciously planted (allowing an attacker to hide the use of sensitive attributes) and can also emerge naturally, highlighting the need for reliable auditing. To address this, we introduce a novel faithfulness metric that reliably marks degenerate explanations as unfaithful, in both malicious and natural settings. Our code is available in the supplemental.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper identifies a critical failure mode in self-explainable GNN explanations—namely, that explanations can be entirely unrelated to the model's actual inference process—and proposes a novel faithfulness metric (EST) to detect such degenerate cases. It resides in the Faithfulness Metric Development leaf alongside two sibling papers: one evaluating explainability for graph neural networks and another assessing attribution methods. This leaf contains only three papers total, suggesting a relatively sparse but focused research direction within the broader faithfulness evaluation landscape.

The taxonomy reveals that faithfulness evaluation comprises three distinct leaves: metric development, comparative evaluation studies, and ground-truth benchmark design. The paper's focus on developing a new metric positions it within the first category, while its empirical analysis of existing metrics' failures connects to comparative evaluation work. Neighboring branches include self-explainable GNN architectures and post-hoc explanation methods, with the paper's critical stance toward self-explainable models bridging these areas. The taxonomy's scope notes clarify that this work differs from empirical benchmarking studies by proposing a novel metric rather than merely comparing existing approaches.

Among eighteen candidates examined across three contributions, none were found to clearly refute the paper's claims. The first contribution (identifying the failure case) examined five candidates with zero refutations; the second (EST metric) examined three with zero refutations; the third (benchmark design) examined ten with zero refutations. This suggests that within the limited search scope—focused on top semantic matches and citation expansion—the specific combination of detecting degenerate explanations and proposing EST appears relatively unexplored. The benchmark contribution examined the largest candidate pool, yet still found no overlapping prior work.

Based on the limited literature search of eighteen candidates, the work appears to occupy a distinct position within faithfulness evaluation. The taxonomy structure indicates this is an active area with critical examination of self-explainable models, yet the specific focus on degenerate explanations and the EST metric shows no clear precedent among examined papers. However, the search scope does not cover the entire field, and the sparse leaf population suggests this direction may benefit from broader contextualization as the area develops.

Taxonomy

Core-task Taxonomy Papers: 23
Claimed Contributions: 3
Contribution Candidate Papers Compared: 18
Refutable Papers: 0

Research Landscape Overview

Core task: Detecting unfaithful explanations in self-explainable graph neural networks. The field organizes around four main branches that reflect distinct but complementary concerns. Faithfulness Evaluation and Metrics focuses on developing rigorous measures to assess whether explanations truly reflect model reasoning, with works like Evaluating explainability for graph[1] and Evaluating attribution for graph[4] establishing foundational benchmarks. Self-Explainable GNN Architectures explores models designed to produce interpretable outputs by construction, such as Discovering Invariant Rationales for[2] and CI-GNN[22], which embed explanation mechanisms directly into the learning process. Post-Hoc Explanation Generation examines methods that extract explanations after training, often trading off computational cost against interpretability guarantees. Domain-Specific Applications tailors these techniques to specialized contexts like vulnerability detection, as seen in Interpreters for GNN-Based Vulnerability[18], where domain constraints shape both explanation needs and evaluation criteria.

A particularly active tension emerges between intrinsic faithfulness guarantees and practical evaluation challenges. Several studies question whether self-explainable models deliver on their promises: How Faithful are Self-Explainable[12] and Reconsidering Faithfulness in Regular[10] critically examine whether built-in explanation mechanisms genuinely align with model decisions, while Is your explanation reliable[5] probes the stability of these interpretations under perturbation. GNN Explanations that do[0] sits squarely within this critical evaluation stream, developing detection methods for unfaithful explanations alongside neighbors like Faithful interpretation for graph[3], which proposes alternative faithfulness criteria, and Reconsidering Faithfulness in Regular[10], which reconsiders foundational assumptions about what faithfulness means in graph contexts.

These works collectively push beyond simply generating explanations toward rigorously validating their trustworthiness, addressing the gap between the appeal of self-explainable architectures and the empirical verification of their interpretability claims.

Claimed Contributions

Identification of a critical failure case in SE-GNN explanations (5 retrieved papers)

The authors identify and characterize a fundamental failure mode in which self-explainable GNNs can produce explanations that are completely unrelated to the model's actual decision-making process, despite achieving optimal predictive performance. They provide theoretical conditions under which this occurs and demonstrate it empirically.
Novel faithfulness metric EST (3 retrieved papers)

The authors propose the Extension Sufficiency Test (EST), a new metric for evaluating explanation faithfulness that holistically considers all supergraphs of an explanation. EST is shown to be more robust than existing metrics at detecting unfaithful explanations in both malicious and natural settings.
Benchmark for evaluating faithfulness metrics (10 retrieved papers)

The authors introduce a controlled benchmark that evaluates faithfulness metrics based on their ability to reject known-unfaithful explanations, using manipulated SE-GNNs that are trained to output degenerate explanations while maintaining high accuracy.

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Identification of a critical failure case in SE-GNN explanations

The authors identify and characterize a fundamental failure mode in which self-explainable GNNs can produce explanations that are completely unrelated to the model's actual decision-making process, despite achieving optimal predictive performance. They provide theoretical conditions under which this occurs and demonstrate it empirically.
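The report does not reproduce the paper's formal construction, but the failure mode it describes can be illustrated with a toy, fully hypothetical example (all names below are invented for illustration): a classifier whose label depends only on a sensitive attribute, paired with a built-in explainer that returns a fixed subgraph the classifier never uses.

```python
def predict(graph):
    # Toy "SE-GNN" stand-in: the label depends only on a sensitive node
    # attribute, not on the graph structure at all.
    return int(graph["sensitive"])

def explain(graph):
    # Degenerate self-explanation: a fixed edge set, independent of
    # everything the classifier actually used.
    return [(0, 1)]

g = {"sensitive": 1, "edges": [(0, 1), (1, 2), (2, 3)]}

# Removing every explanation edge leaves the prediction unchanged ...
pruned = {"sensitive": 1, "edges": [(2, 3)]}
assert predict(pruned) == predict(g)

# ... while flipping the sensitive attribute (untouched by the explanation)
# flips the label: the explanation says nothing about how the model infers
# its output, yet the model can still classify perfectly.
flipped = {"sensitive": 0, "edges": g["edges"]}
assert predict(flipped) != predict(g)
```

This is the auditing concern in miniature: an attacker can hide the use of the sensitive attribute behind a plausible-looking but inert explanation.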

Contribution 2: Novel faithfulness metric EST

The authors propose the Extension Sufficiency Test (EST), a new metric for evaluating explanation faithfulness that holistically considers all supergraphs of an explanation. EST is shown to be more robust than existing metrics at detecting unfaithful explanations in both malicious and natural settings.
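The report does not include EST's formal definition; the following is only a rough sketch of what a sufficiency-style test over all supergraphs of an explanation might look like (function names, signatures, and the scoring convention are assumptions, not the paper's actual method):

```python
import itertools

def est_score(model, graph_edges, explanation_edges):
    """Illustrative extension-sufficiency-style test (hypothetical).

    Measures how often the model's prediction on the full input graph is
    preserved on supergraphs of the explanation, i.e. on every edge set E
    with explanation_edges <= E <= graph_edges. The actual EST definition
    is given in the paper under review, not here.
    """
    target = model(graph_edges)
    extras = sorted(set(graph_edges) - set(explanation_edges))
    agree = total = 0
    # Enumerate every extension of the explanation inside the input graph.
    for r in range(len(extras) + 1):
        for subset in itertools.combinations(extras, r):
            supergraph = set(explanation_edges) | set(subset)
            agree += int(model(supergraph) == target)
            total += 1
    return agree / total  # fraction of supergraphs preserving the label

# Toy demo: a classifier that decides solely from the presence of edge (0, 1).
model = lambda edges: int((0, 1) in edges)
full_graph = [(0, 1), (1, 2), (2, 3)]
assert est_score(model, full_graph, [(0, 1)]) == 1.0  # faithful explanation
assert est_score(model, full_graph, [(1, 2)]) == 0.5  # degenerate explanation
```

Enumerating all supergraphs is exponential in the number of omitted edges, so a practical metric would presumably subsample extensions; the sketch only conveys the "holistically consider all supergraphs" idea.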

Contribution 3: Benchmark for evaluating faithfulness metrics

The authors introduce a controlled benchmark that evaluates faithfulness metrics based on their ability to reject known-unfaithful explanations, using manipulated SE-GNNs that are trained to output degenerate explanations while maintaining high accuracy.
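The report does not reproduce the benchmark protocol; as a minimal, hypothetical sketch of the idea (function names, the case format, and the rejection threshold are all invented here), a faithfulness metric can be scored by its rejection rate on explanations that are degenerate by construction:

```python
def rejection_rate(metric, unfaithful_cases, threshold=0.5):
    """Score a faithfulness metric on known-unfaithful explanations.

    `unfaithful_cases` holds (model, graph, explanation) triples in which
    each explanation is degenerate by construction, e.g. produced by a
    manipulated SE-GNN trained to emit planted explanations while keeping
    high accuracy. A reliable metric should score every such case below
    `threshold`, i.e. mark it as unfaithful. This convention is an
    assumption, not the paper's exact protocol.
    """
    rejected = sum(
        metric(model, graph, explanation) < threshold
        for model, graph, explanation in unfaithful_cases
    )
    return rejected / len(unfaithful_cases)

# Sanity check with stub metrics: one that always flags unfaithfulness,
# one that never does.
cases = [(None, None, None)] * 4
assert rejection_rate(lambda m, g, e: 0.1, cases) == 1.0
assert rejection_rate(lambda m, g, e: 0.9, cases) == 0.0
```

Because the benchmark controls the ground truth (the planted explanations are known to be unrelated to inference), a metric's rejection rate directly measures whether it catches the failure mode identified in Contribution 1.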