PolyGraphScore: a classifier-based metric for evaluating graph generative models

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: graph generative models · model evaluation · maximum mean discrepancy · generative models
Abstract:

Existing methods for evaluating graph generative models primarily rely on Maximum Mean Discrepancy (MMD) metrics computed on graph descriptors. While these metrics can rank generative models, they do not provide an absolute measure of performance. Their values are also highly sensitive to extrinsic parameters, namely the kernel and descriptor parametrization, making them incomparable across different graph descriptors. We introduce PolyGraphScore (PGS), a new evaluation framework that addresses these limitations. It approximates the Jensen-Shannon (JS) distance between graph distributions by fitting binary classifiers to distinguish real from generated graphs, featurized by these descriptors. The data log-likelihood of these classifiers yields a variational lower bound on the JS distance between the two distributions. The resulting scores are constrained to the unit interval [0,1] and are comparable across different graph descriptors. We further derive a theoretically grounded summary score that combines these individual metrics to provide a maximally tight lower bound on the distance for the given descriptors. Thorough experiments demonstrate that PGS provides a more robust and insightful evaluation than MMD metrics.
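For context, the variational construction the abstract describes is presumably the standard identity underlying GAN discriminators (Goodfellow et al., 2014); the sketch below is not taken from the paper. For a real distribution P, a generated distribution Q, and a classifier D(x) estimating the probability that x is real under a balanced prior:

```latex
\mathrm{JSD}(P \,\|\, Q) \;=\; \log 2 \;+\; \max_{D}\;\Big[
  \tfrac{1}{2}\,\mathbb{E}_{x \sim P}\,\log D(x)
  \;+\;
  \tfrac{1}{2}\,\mathbb{E}_{x \sim Q}\,\log\big(1 - D(x)\big)
\Big]
```

Hence any fitted classifier $\hat{D}$ yields the lower bound $\mathrm{JSD}(P \,\|\, Q) \ge \log 2 + \mathcal{L}(\hat{D})$, where $\mathcal{L}$ is the balanced log-likelihood, and dividing by $\log 2$ maps the bound into $[0,1]$. Whether PGS normalizes by $\log 2$ or reports the JS distance (the square root of the normalized divergence) is not specified in this report.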

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholarly search engine). It analyzes a paper's claimed tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces PolyGraphScore (PGS), a classifier-based evaluation framework that approximates Jensen-Shannon distance between real and generated graph distributions. It resides in the Classifier-Based and Distribution Distance Metrics leaf, which contains only three papers total, including this work. This is a relatively sparse research direction within the broader Evaluation Metrics and Frameworks branch, suggesting that classifier-based approaches to graph generation evaluation remain an emerging area compared to the more established descriptor-based methods found in neighboring leaves.

The taxonomy reveals that evaluation metrics for graph generative models are organized into three main leaves: Classifier-Based approaches (3 papers), Graph Descriptor and Feature-Based Metrics (2 papers), and Benchmarking Frameworks (5 papers). The paper's sibling works include methods using contrastive learned features and edge dependency analysis. Neighboring leaves contain descriptor-based approaches that rely on Maximum Mean Discrepancy (MMD) metrics, which the paper explicitly critiques for lacking absolute performance measures and comparability across descriptors. This positioning suggests the work bridges classifier-based evaluation with traditional descriptor-based methods.

Among 27 candidates examined through limited semantic search, the analysis identified potential prior work overlap for two of three contributions. The PGS framework itself (7 candidates examined, 1 refutable) and the summary score mechanism (10 candidates examined, 1 refutable) both show evidence of related prior work within the limited search scope. The open-source library contribution (10 candidates examined, 0 refutable) appears more distinctive. These statistics indicate that while the core evaluation approach has some precedent in the examined literature, the specific implementation and theoretical grounding may offer incremental advances over existing classifier-based methods.

Based on the limited search of 27 semantically related papers, the work appears to make incremental contributions to an emerging evaluation paradigm. The sparse population of its taxonomy leaf (3 papers) suggests room for methodological development, though the refutable pairs indicate that key ideas have partial precedent. The analysis does not cover exhaustive citation networks or domain-specific evaluation literature, so additional related work may exist beyond the top-K semantic matches examined.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 27
Refutable Papers: 2

Research Landscape Overview

Core task: Evaluating graph generative models. The field has organized itself around several major branches that reflect both methodological and application-driven concerns. At the highest level, Evaluation Metrics and Frameworks focuses on how to measure the quality of generated graphs, encompassing classifier-based approaches, distribution distance metrics, and domain-specific quality measures. Model Architectures and Approaches covers the diverse generative techniques themselves, ranging from autoregressive and diffusion-based methods to GANs and variational frameworks. Domain-Specific Applications highlights specialized contexts such as molecular design, scene graph generation, and knowledge graph construction, where tailored evaluation criteria often emerge. Surveys and General Reviews provide broad perspectives on the landscape, while Related Graph Learning Tasks addresses adjacent problems like graph prediction and analytics that inform evaluation strategies.

Representative works such as MolGAN[3] and Nevae[2] illustrate how architectural choices intersect with evaluation needs, while benchmarks like Benchmarking Graph Generation[46] and Synthetic Graph Benchmark[47] provide standardized testbeds. Within the Evaluation Metrics and Frameworks branch, a particularly active line of work explores classifier-based and distribution distance metrics, balancing statistical rigor with computational feasibility. Some studies emphasize contrastive or learned feature representations, as seen in Contrastively Learned Features[23], while others investigate the role of edge dependencies and structural properties, exemplified by Edge Dependency Role[33]. PolyGraphScore[0] situates itself in this cluster by proposing a polynomial-time scoring mechanism that addresses scalability challenges inherent in distribution-based evaluation.

Compared to neighbors like Contrastively Learned Features[23], which leverages learned embeddings, and Edge Dependency Role[33], which examines structural correlations, PolyGraphScore[0] emphasizes efficient computation without sacrificing expressiveness. This positioning reflects ongoing tensions in the field between the desire for rich, nuanced metrics and the practical need for scalable evaluation as generative models grow in complexity and output size.

Claimed Contributions

PolyGraphScore (PGS) evaluation framework

The authors propose PolyGraphScore, a novel evaluation framework for graph generative models that estimates the Jensen-Shannon distance between real and generated graph distributions using probabilistic classification on graph descriptors. Unlike MMD metrics, PGS produces scores in the unit interval [0,1] that are directly comparable across different graph descriptors.

Retrieved papers: 7 · Verdict: Can Refute
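To make the classifier-based construction concrete, here is a minimal, hypothetical sketch (not the authors' implementation): each graph is reduced to a single scalar descriptor value, a logistic classifier is fit by gradient ascent on the balanced log-likelihood, and that likelihood is converted into a variational lower bound on the JS divergence. The function name and all parameters are illustrative.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def js_lower_bound(real_feats, gen_feats, steps=1000, lr=0.1):
    """Fit p(real | x) = sigmoid(w * x + b) by gradient ascent on the
    balanced log-likelihood, then return the variational lower bound
        JSD(P || Q) >= log 2 + 0.5 * E_P[log D] + 0.5 * E_Q[log(1 - D)].
    """
    w, b = 0.0, 0.0
    n_r, n_g = len(real_feats), len(gen_feats)
    for _ in range(steps):
        gw = gb = 0.0
        for x in real_feats:               # real graphs carry label 1
            p = sigmoid(w * x + b)
            gw += (1.0 - p) * x / n_r
            gb += (1.0 - p) / n_r
        for x in gen_feats:                # generated graphs carry label 0
            p = sigmoid(w * x + b)
            gw -= p * x / n_g
            gb -= p / n_g
        w += 0.5 * lr * gw                 # 0.5: balanced class weights
        b += 0.5 * lr * gb
    ll = 0.0                               # balanced average log-likelihood
    for x in real_feats:
        ll += 0.5 * math.log(max(sigmoid(w * x + b), 1e-12)) / n_r
    for x in gen_feats:
        ll += 0.5 * math.log(max(1.0 - sigmoid(w * x + b), 1e-12)) / n_g
    return max(0.0, math.log(2.0) + ll)    # lies in [0, log 2]

random.seed(0)
# Identical distributions: the classifier is uninformative, bound near 0.
same = [random.gauss(0.0, 1.0) for _ in range(300)]
bound_same = js_lower_bound(same[:150], same[150:])
# Well-separated distributions: bound approaches log 2 (about 0.693).
real = [random.gauss(5.0, 1.0) for _ in range(150)]
fake = [random.gauss(-5.0, 1.0) for _ in range(150)]
bound_apart = js_lower_bound(real, fake)
print(bound_same, bound_apart)
```

Dividing the returned bound by log 2 gives a score in [0, 1]. A real implementation would evaluate the likelihood on held-out data and use richer descriptor features; this sketch only illustrates the bound's behavior at the two extremes.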
Theoretically grounded summary score combining multiple descriptors

The authors develop a principled method to combine PGS scores from multiple graph descriptors into a single summary score. This combined score provides the tightest available variational lower bound on the Jensen-Shannon distance while identifying the most informative descriptor.

Retrieved papers: 10 · Verdict: Can Refute
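The combination rule this contribution describes admits a one-line sketch: since every per-descriptor PGS value lower-bounds the same JS distance, their maximum is itself a valid lower bound and, by construction, the tightest one the descriptor set offers. The function, dictionary, and descriptor names below are hypothetical; whether the paper's summary score is literally the maximum or a smoothed variant is beyond what this report states.

```python
def summary_score(per_descriptor_bounds):
    """Combine per-descriptor PGS lower bounds by taking their maximum.

    Each value lower-bounds the same JS distance, so the maximum is
    still a valid lower bound -- and the tightest available one. Also
    returns the descriptor achieving it (the most informative one).
    """
    best = max(per_descriptor_bounds, key=per_descriptor_bounds.get)
    return per_descriptor_bounds[best], best

# Hypothetical per-descriptor PGS values, normalized to [0, 1].
bounds = {"degree": 0.12, "clustering": 0.48, "spectral": 0.31}
score, descriptor = summary_score(bounds)
print(score, descriptor)  # 0.48 clustering
```

The maximum is the simplest construction matching "tightest available lower bound": any aggregate below the maximum would discard information the best descriptor already provides.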
Open-source PolyGraph library with new benchmark datasets

The authors provide an open-source library containing implementations of their proposed PolyGraphScore method, MMD estimators, and introduce new larger benchmark datasets to enable more reliable evaluation of graph generative models.

Retrieved papers: 10

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: PolyGraphScore (PGS) evaluation framework

Contribution: Theoretically grounded summary score combining multiple descriptors

Contribution: Open-source PolyGraph library with new benchmark datasets