GNN Explanations that do not Explain and How to find Them

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: graph neural networks, explainability, self-explainable, auditing, faithfulness
Abstract:

Explanations provided by Self-explainable Graph Neural Networks (SE-GNNs) are fundamental for understanding the model's inner workings and for identifying potential misuse of sensitive attributes. Although recent works have highlighted that these explanations can be suboptimal and potentially misleading, a characterization of their failure cases is unavailable. In this work, we identify a critical failure of SE-GNN explanations: explanations can be unambiguously unrelated to how the SE-GNNs infer labels. We show that, on the one hand, many SE-GNNs can achieve optimal true risk while producing these degenerate explanations, and on the other, most faithfulness metrics can fail to identify these failure modes. Our empirical analysis reveals that degenerate explanations can be maliciously planted (allowing an attacker to hide the use of sensitive attributes) and can also emerge naturally, highlighting the need for reliable auditing. To address this, we introduce a novel faithfulness metric that reliably marks degenerate explanations as unfaithful, in both malicious and natural settings. Our code is available in the supplemental.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper identifies a critical failure mode in self-explainable GNN explanations—namely, that explanations can be entirely unrelated to the model's actual inference process—and proposes a novel faithfulness metric (EST) to detect such degenerate cases. It resides in the Faithfulness Metric Development leaf alongside two sibling papers: one evaluating explainability for graph neural networks and another assessing attribution methods. This leaf contains only three papers total, suggesting a relatively sparse but focused research direction within the broader faithfulness evaluation landscape.

The taxonomy reveals that faithfulness evaluation comprises three distinct leaves: metric development, comparative evaluation studies, and ground-truth benchmark design. The paper's focus on developing a new metric positions it within the first category, while its empirical analysis of existing metrics' failures connects to comparative evaluation work. Neighboring branches include self-explainable GNN architectures and post-hoc explanation methods, with the paper's critical stance toward self-explainable models bridging these areas. The taxonomy's scope notes clarify that this work differs from empirical benchmarking studies by proposing a novel metric rather than merely comparing existing approaches.

Among eighteen candidates examined across three contributions, none were found to clearly refute the paper's claims. The first contribution (identifying the failure case) examined five candidates with zero refutations; the second (EST metric) examined three with zero refutations; the third (benchmark design) examined ten with zero refutations. This suggests that within the limited search scope—focused on top semantic matches and citation expansion—the specific combination of detecting degenerate explanations and proposing EST appears relatively unexplored. The benchmark contribution examined the largest candidate pool, yet still found no overlapping prior work.

Based on the limited literature search of eighteen candidates, the work appears to occupy a distinct position within faithfulness evaluation. The taxonomy structure indicates this is an active area with critical examination of self-explainable models, yet the specific focus on degenerate explanations and the EST metric shows no clear precedent among examined papers. However, the search scope does not cover the entire field, and the sparse leaf population suggests this direction may benefit from broader contextualization as the area develops.

Taxonomy

Core-task Taxonomy Papers: 23
Claimed Contributions: 3
Contribution Candidate Papers Compared: 18
Refutable Papers: 0

Research Landscape Overview

Core task: Detecting unfaithful explanations in self-explainable graph neural networks. The field organizes around four main branches that reflect distinct but complementary concerns. Faithfulness Evaluation and Metrics focuses on developing rigorous measures to assess whether explanations truly reflect model reasoning, with works like Evaluating explainability for graph[1] and Evaluating attribution for graph[4] establishing foundational benchmarks. Self-Explainable GNN Architectures explores models designed to produce interpretable outputs by construction, such as Discovering Invariant Rationales for[2] and CI-GNN[22], which embed explanation mechanisms directly into the learning process. Post-Hoc Explanation Generation examines methods that extract explanations after training, often trading off computational cost against interpretability guarantees. Domain-Specific Applications tailors these techniques to specialized contexts like vulnerability detection, as seen in Interpreters for GNN-Based Vulnerability[18], where domain constraints shape both explanation needs and evaluation criteria.

A particularly active tension emerges between intrinsic faithfulness guarantees and practical evaluation challenges. Several studies question whether self-explainable models deliver on their promises: How Faithful are Self-Explainable[12] and Reconsidering Faithfulness in Regular[10] critically examine whether built-in explanation mechanisms genuinely align with model decisions, while Is your explanation reliable[5] probes the stability of these interpretations under perturbation. GNN Explanations that do[0] sits squarely within this critical evaluation stream, developing detection methods for unfaithful explanations alongside neighbors like Faithful interpretation for graph[3], which proposes alternative faithfulness criteria, and Reconsidering Faithfulness in Regular[10], which reconsiders foundational assumptions about what faithfulness means in graph contexts.

These works collectively push beyond simply generating explanations toward rigorously validating their trustworthiness, addressing the gap between the appeal of self-explainable architectures and the empirical verification of their interpretability claims.

Claimed Contributions

Identification of a critical failure case in SE-GNN explanations (5 retrieved papers)

The authors identify and characterize a fundamental failure mode in which self-explainable GNNs can produce explanations that are completely unrelated to the model's actual decision-making process, despite achieving optimal predictive performance. They provide theoretical conditions under which this occurs and demonstrate it empirically.
Novel faithfulness metric EST (3 retrieved papers)

The authors propose the Extension Sufficiency Test (EST), a new metric for evaluating explanation faithfulness that holistically considers all supergraphs of an explanation. EST is shown to be more robust than existing metrics at detecting unfaithful explanations in both malicious and natural settings.
Benchmark for evaluating faithfulness metrics (10 retrieved papers)

The authors introduce a controlled benchmark that evaluates faithfulness metrics based on their ability to reject known-unfaithful explanations, using manipulated SE-GNNs that are trained to output degenerate explanations while maintaining high accuracy.

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Identification of a critical failure case in SE-GNN explanations

The authors identify and characterize a fundamental failure mode in which self-explainable GNNs can produce explanations that are completely unrelated to the model's actual decision-making process, despite achieving optimal predictive performance. They provide theoretical conditions under which this occurs and demonstrate it empirically.
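The report does not reproduce the paper's formal construction, but the failure mode it describes can be illustrated with a toy, fully hypothetical example (all names below are invented for illustration): a classifier whose label depends only on a sensitive attribute, paired with a built-in explainer that returns a fixed subgraph the classifier never uses.

```python
def predict(graph):
    # Toy "SE-GNN" stand-in: the label depends only on a sensitive node
    # attribute, not on the graph structure at all.
    return int(graph["sensitive"])

def explain(graph):
    # Degenerate self-explanation: a fixed edge set, independent of
    # everything the classifier actually used.
    return [(0, 1)]

g = {"sensitive": 1, "edges": [(0, 1), (1, 2), (2, 3)]}

# Removing every explanation edge leaves the prediction unchanged ...
pruned = {"sensitive": 1, "edges": [(2, 3)]}
assert predict(pruned) == predict(g)

# ... while flipping the sensitive attribute (untouched by the explanation)
# flips the label: the explanation says nothing about how the model infers
# its output, yet the model can still classify perfectly.
flipped = {"sensitive": 0, "edges": g["edges"]}
assert predict(flipped) != predict(g)
```

This is the auditing concern in miniature: an attacker can hide the use of the sensitive attribute behind a plausible-looking but inert explanation.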

Contribution 2: Novel faithfulness metric EST

The authors propose the Extension Sufficiency Test (EST), a new metric for evaluating explanation faithfulness that holistically considers all supergraphs of an explanation. EST is shown to be more robust than existing metrics at detecting unfaithful explanations in both malicious and natural settings.
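The report does not include EST's formal definition; the following is only a rough sketch of what a sufficiency-style test over all supergraphs of an explanation might look like (function names, signatures, and the scoring convention are assumptions, not the paper's actual method):

```python
import itertools

def est_score(model, graph_edges, explanation_edges):
    """Illustrative extension-sufficiency-style test (hypothetical).

    Measures how often the model's prediction on the full input graph is
    preserved on supergraphs of the explanation, i.e. on every edge set E
    with explanation_edges <= E <= graph_edges. The actual EST definition
    is given in the paper under review, not here.
    """
    target = model(graph_edges)
    extras = sorted(set(graph_edges) - set(explanation_edges))
    agree = total = 0
    # Enumerate every extension of the explanation inside the input graph.
    for r in range(len(extras) + 1):
        for subset in itertools.combinations(extras, r):
            supergraph = set(explanation_edges) | set(subset)
            agree += int(model(supergraph) == target)
            total += 1
    return agree / total  # fraction of supergraphs preserving the label

# Toy demo: a classifier that decides solely from the presence of edge (0, 1).
model = lambda edges: int((0, 1) in edges)
full_graph = [(0, 1), (1, 2), (2, 3)]
assert est_score(model, full_graph, [(0, 1)]) == 1.0  # faithful explanation
assert est_score(model, full_graph, [(1, 2)]) == 0.5  # degenerate explanation
```

Enumerating all supergraphs is exponential in the number of omitted edges, so a practical metric would presumably subsample extensions; the sketch only conveys the "holistically consider all supergraphs" idea.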

Contribution 3: Benchmark for evaluating faithfulness metrics

The authors introduce a controlled benchmark that evaluates faithfulness metrics based on their ability to reject known-unfaithful explanations, using manipulated SE-GNNs that are trained to output degenerate explanations while maintaining high accuracy.
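The report does not reproduce the benchmark protocol; as a minimal, hypothetical sketch of the idea (function names, the case format, and the rejection threshold are all invented here), a faithfulness metric can be scored by its rejection rate on explanations that are degenerate by construction:

```python
def rejection_rate(metric, unfaithful_cases, threshold=0.5):
    """Score a faithfulness metric on known-unfaithful explanations.

    `unfaithful_cases` holds (model, graph, explanation) triples in which
    each explanation is degenerate by construction, e.g. produced by a
    manipulated SE-GNN trained to emit planted explanations while keeping
    high accuracy. A reliable metric should score every such case below
    `threshold`, i.e. mark it as unfaithful. This convention is an
    assumption, not the paper's exact protocol.
    """
    rejected = sum(
        metric(model, graph, explanation) < threshold
        for model, graph, explanation in unfaithful_cases
    )
    return rejected / len(unfaithful_cases)

# Sanity check with stub metrics: one that always flags unfaithfulness,
# one that never does.
cases = [(None, None, None)] * 4
assert rejection_rate(lambda m, g, e: 0.1, cases) == 1.0
assert rejection_rate(lambda m, g, e: 0.9, cases) == 0.0
```

Because the benchmark controls the ground truth (the planted explanations are known to be unrelated to inference), a metric's rejection rate directly measures whether it catches the failure mode identified in Contribution 1.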