PolyGraphScore: a classifier-based metric for evaluating graph generative models

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: graph generative models · model evaluation · maximum mean discrepancy · generative models
Abstract:

Existing methods for evaluating graph generative models primarily rely on Maximum Mean Discrepancy (MMD) metrics computed on graph descriptors. While these metrics can rank generative models, they do not provide an absolute measure of performance. Their values are also highly sensitive to extrinsic parameters, namely the kernel and descriptor parametrization, making them incomparable across different graph descriptors. We introduce PolyGraphScore (PGS), a new evaluation framework that addresses these limitations. It approximates the Jensen-Shannon (JS) distance between graph distributions by fitting binary classifiers to distinguish real from generated graphs, featurized by these descriptors. The data log-likelihood of these classifiers yields a variational lower bound on the JS distance between the two distributions. The resulting scores are constrained to the unit interval [0,1] and are comparable across different graph descriptors. We further derive a theoretically grounded summary score that combines these individual metrics to provide a maximally tight lower bound on the distance for the given descriptors. Thorough experiments demonstrate that PGS provides a more robust and insightful evaluation than MMD metrics.
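For context, the variational construction the abstract describes is presumably the standard identity underlying GAN discriminators (Goodfellow et al., 2014); the sketch below is not taken from the paper. For a real distribution P, a generated distribution Q, and a classifier D(x) estimating the probability that x is real under a balanced prior:

```latex
\mathrm{JSD}(P \,\|\, Q) \;=\; \log 2 \;+\; \max_{D}\;\Big[
  \tfrac{1}{2}\,\mathbb{E}_{x \sim P}\,\log D(x)
  \;+\;
  \tfrac{1}{2}\,\mathbb{E}_{x \sim Q}\,\log\big(1 - D(x)\big)
\Big]
```

Hence any fitted classifier $\hat{D}$ yields the lower bound $\mathrm{JSD}(P \,\|\, Q) \ge \log 2 + \mathcal{L}(\hat{D})$, where $\mathcal{L}$ is the balanced log-likelihood, and dividing by $\log 2$ maps the bound into $[0,1]$. Whether PGS normalizes by $\log 2$ or reports the JS distance (the square root of the normalized divergence) is not specified in this report.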

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholarly search engine). It analyzes a paper's claimed tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces PolyGraphScore (PGS), a classifier-based evaluation framework that approximates Jensen-Shannon distance between real and generated graph distributions. It resides in the Classifier-Based and Distribution Distance Metrics leaf, which contains only three papers total, including this work. This is a relatively sparse research direction within the broader Evaluation Metrics and Frameworks branch, suggesting that classifier-based approaches to graph generation evaluation remain an emerging area compared to the more established descriptor-based methods found in neighboring leaves.

The taxonomy reveals that evaluation metrics for graph generative models are organized into three main leaves: Classifier-Based approaches (3 papers), Graph Descriptor and Feature-Based Metrics (2 papers), and Benchmarking Frameworks (5 papers). The paper's sibling works include methods using contrastive learned features and edge dependency analysis. Neighboring leaves contain descriptor-based approaches that rely on Maximum Mean Discrepancy (MMD) metrics, which the paper explicitly critiques for lacking absolute performance measures and comparability across descriptors. This positioning suggests the work bridges classifier-based evaluation with traditional descriptor-based methods.

Among 27 candidates examined through limited semantic search, the analysis identified potential prior work overlap for two of three contributions. The PGS framework itself (7 candidates examined, 1 refutable) and the summary score mechanism (10 candidates examined, 1 refutable) both show evidence of related prior work within the limited search scope. The open-source library contribution (10 candidates examined, 0 refutable) appears more distinctive. These statistics indicate that while the core evaluation approach has some precedent in the examined literature, the specific implementation and theoretical grounding may offer incremental advances over existing classifier-based methods.

Based on the limited search of 27 semantically related papers, the work appears to make incremental contributions to an emerging evaluation paradigm. The sparse population of its taxonomy leaf (3 papers) suggests room for methodological development, though the refutable pairs indicate that key ideas have partial precedent. The analysis does not cover exhaustive citation networks or domain-specific evaluation literature, so additional related work may exist beyond the top-K semantic matches examined.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 27
Refutable Papers: 2

Research Landscape Overview

Core task: Evaluating graph generative models. The field has organized itself around several major branches that reflect both methodological and application-driven concerns. At the highest level, Evaluation Metrics and Frameworks focuses on how to measure the quality of generated graphs, encompassing classifier-based approaches, distribution distance metrics, and domain-specific quality measures. Model Architectures and Approaches covers the diverse generative techniques themselves, ranging from autoregressive and diffusion-based methods to GANs and variational frameworks. Domain-Specific Applications highlights specialized contexts such as molecular design, scene graph generation, and knowledge graph construction, where tailored evaluation criteria often emerge. Surveys and General Reviews provide broad perspectives on the landscape, while Related Graph Learning Tasks addresses adjacent problems like graph prediction and analytics that inform evaluation strategies.

Representative works such as MolGAN[3] and Nevae[2] illustrate how architectural choices intersect with evaluation needs, while benchmarks like Benchmarking Graph Generation[46] and Synthetic Graph Benchmark[47] provide standardized testbeds. Within the Evaluation Metrics and Frameworks branch, a particularly active line of work explores classifier-based and distribution distance metrics, balancing statistical rigor with computational feasibility. Some studies emphasize contrastive or learned feature representations, as seen in Contrastively Learned Features[23], while others investigate the role of edge dependencies and structural properties, exemplified by Edge Dependency Role[33]. PolyGraphScore[0] situates itself in this cluster by proposing a polynomial-time scoring mechanism that addresses scalability challenges inherent in distribution-based evaluation.

Compared to neighbors like Contrastively Learned Features[23], which leverages learned embeddings, and Edge Dependency Role[33], which examines structural correlations, PolyGraphScore[0] emphasizes efficient computation without sacrificing expressiveness. This positioning reflects ongoing tensions in the field between the desire for rich, nuanced metrics and the practical need for scalable evaluation as generative models grow in complexity and output size.

Claimed Contributions

PolyGraphScore (PGS) evaluation framework

The authors propose PolyGraphScore, a novel evaluation framework for graph generative models that estimates the Jensen-Shannon distance between real and generated graph distributions using probabilistic classification on graph descriptors. Unlike MMD metrics, PGS produces scores in the unit interval [0,1] that are directly comparable across different graph descriptors.

Retrieved papers: 7 · Verdict: Can Refute
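To make the classifier-based construction concrete, here is a minimal, hypothetical sketch (not the authors' implementation): each graph is reduced to a single scalar descriptor value, a logistic classifier is fit by gradient ascent on the balanced log-likelihood, and that likelihood is converted into a variational lower bound on the JS divergence. The function name and all parameters are illustrative.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def js_lower_bound(real_feats, gen_feats, steps=1000, lr=0.1):
    """Fit p(real | x) = sigmoid(w * x + b) by gradient ascent on the
    balanced log-likelihood, then return the variational lower bound
        JSD(P || Q) >= log 2 + 0.5 * E_P[log D] + 0.5 * E_Q[log(1 - D)].
    """
    w, b = 0.0, 0.0
    n_r, n_g = len(real_feats), len(gen_feats)
    for _ in range(steps):
        gw = gb = 0.0
        for x in real_feats:               # real graphs carry label 1
            p = sigmoid(w * x + b)
            gw += (1.0 - p) * x / n_r
            gb += (1.0 - p) / n_r
        for x in gen_feats:                # generated graphs carry label 0
            p = sigmoid(w * x + b)
            gw -= p * x / n_g
            gb -= p / n_g
        w += 0.5 * lr * gw                 # 0.5: balanced class weights
        b += 0.5 * lr * gb
    ll = 0.0                               # balanced average log-likelihood
    for x in real_feats:
        ll += 0.5 * math.log(max(sigmoid(w * x + b), 1e-12)) / n_r
    for x in gen_feats:
        ll += 0.5 * math.log(max(1.0 - sigmoid(w * x + b), 1e-12)) / n_g
    return max(0.0, math.log(2.0) + ll)    # lies in [0, log 2]

random.seed(0)
# Identical distributions: the classifier is uninformative, bound near 0.
same = [random.gauss(0.0, 1.0) for _ in range(300)]
bound_same = js_lower_bound(same[:150], same[150:])
# Well-separated distributions: bound approaches log 2 (about 0.693).
real = [random.gauss(5.0, 1.0) for _ in range(150)]
fake = [random.gauss(-5.0, 1.0) for _ in range(150)]
bound_apart = js_lower_bound(real, fake)
print(bound_same, bound_apart)
```

Dividing the returned bound by log 2 gives a score in [0, 1]. A real implementation would evaluate the likelihood on held-out data and use richer descriptor features; this sketch only illustrates the bound's behavior at the two extremes.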
Theoretically grounded summary score combining multiple descriptors

The authors develop a principled method to combine PGS scores from multiple graph descriptors into a single summary score. This combined score provides the tightest available variational lower bound on the Jensen-Shannon distance while identifying the most informative descriptor.

Retrieved papers: 10 · Verdict: Can Refute
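The combination rule this contribution describes admits a one-line sketch: since every per-descriptor PGS value lower-bounds the same JS distance, their maximum is itself a valid lower bound and, by construction, the tightest one the descriptor set offers. The function, dictionary, and descriptor names below are hypothetical; whether the paper's summary score is literally the maximum or a smoothed variant is beyond what this report states.

```python
def summary_score(per_descriptor_bounds):
    """Combine per-descriptor PGS lower bounds by taking their maximum.

    Each value lower-bounds the same JS distance, so the maximum is
    still a valid lower bound -- and the tightest available one. Also
    returns the descriptor achieving it (the most informative one).
    """
    best = max(per_descriptor_bounds, key=per_descriptor_bounds.get)
    return per_descriptor_bounds[best], best

# Hypothetical per-descriptor PGS values, normalized to [0, 1].
bounds = {"degree": 0.12, "clustering": 0.48, "spectral": 0.31}
score, descriptor = summary_score(bounds)
print(score, descriptor)  # 0.48 clustering
```

The maximum is the simplest construction matching "tightest available lower bound": any aggregate below the maximum would discard information the best descriptor already provides.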
Open-source PolyGraph library with new benchmark datasets

The authors provide an open-source library containing implementations of their proposed PolyGraphScore method, MMD estimators, and introduce new larger benchmark datasets to enable more reliable evaluation of graph generative models.

Retrieved papers: 10

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: PolyGraphScore (PGS) evaluation framework

Contribution: Theoretically grounded summary score combining multiple descriptors

Contribution: Open-source PolyGraph library with new benchmark datasets