CheXGenBench: A Unified Benchmark for Fidelity, Privacy and Utility of Synthetic Chest Radiographs

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Biomedical Imaging, Text-to-Image Generation, Medical Image Analysis, Chest Radiographs, Benchmark
Abstract:

Structured benchmarks have advanced text-conditional image generation for real-world imagery; however, no such benchmark exists for synthetic radiograph generation. Despite being a highly active area of research, existing studies continue to adopt inconsistent evaluation protocols and lack a unified assessment of the three most critical criteria: generative fidelity, privacy risk, and downstream utility. To address these limitations, we introduce CheXGenBench, the first unified evaluation framework for synthetic chest radiograph generation that simultaneously assesses fidelity, privacy risks, and clinical utility across frontier text-to-image (T2I) generative models. Our evaluation protocol, comprising over 20 quantitative metrics, covers 11 leading T2I architectures and supports plug-and-play integration of newer models. Through a rigorous and fair evaluation protocol, we establish a new SoTA in synthetic chest X-ray generation. Furthermore, our results uncover several critical limitations of current generative models: (1) even SoTA models struggle with long-tailed medical distributions, (2) models pose high privacy risks regardless of generative fidelity, and (3) synthetic data offers limited utility for downstream multimodal tasks. Drawing on these results, we propose concrete research directions to advance the field. Finally, we curate and release SynthCheX-75K, a high-quality synthetic dataset comprising 75K radiographs generated by our top-performing model (Sana 0.6B). The fine-tuned models and the SynthCheX-75K dataset will be released after acceptance, while the anonymised code is available at https://anonymous.4open.science/r/CheXGenBench-52F0/README.md
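As a rough illustration of the plug-and-play integration mentioned above, the sketch below shows what registering a new T2I model and scoring it along the three axes could look like. The actual interface is defined in the anonymised repository; every class, method, and metric name here is a hypothetical stand-in, not the authors' API.

```python
# Illustrative sketch only: the real CheXGenBench interface lives in the
# anonymised repository linked above; every name below is a hypothetical
# stand-in rather than the authors' API.
from typing import Callable, Dict, Sequence


class SyntheticCXRBenchmark:
    """Hypothetical three-axis harness: fidelity, privacy, downstream utility."""

    def __init__(self, real_images: Sequence, prompts: Sequence[str],
                 metrics: Dict[str, Callable]):
        self.real_images = list(real_images)    # held-out real radiographs
        self.prompts = list(prompts)            # report-derived text prompts
        self.metrics = metrics                  # name -> metric(real, synthetic)
        self.generators: Dict[str, Callable[[str], object]] = {}

    def register_model(self, name: str, generate: Callable[[str], object]) -> None:
        """Plug-and-play hook: any text-to-image callable can be registered."""
        self.generators[name] = generate

    def evaluate(self, name: str) -> Dict[str, float]:
        """Generate one image per prompt, then score the set along every axis."""
        synthetic = [self.generators[name](p) for p in self.prompts]
        return {m: fn(self.real_images, synthetic) for m, fn in self.metrics.items()}


# Hypothetical usage:
#   bench = SyntheticCXRBenchmark(real_imgs, prompts, {"fidelity/fid": my_fid_fn})
#   bench.register_model("sana_0.6b", lambda prompt: pipeline(prompt))
#   print(bench.evaluate("sana_0.6b"))
```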

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces CheXGenBench, a unified evaluation framework for synthetic chest radiograph generation that simultaneously assesses fidelity, privacy risks, and clinical utility across 11 text-to-image architectures. It resides in the 'Unified Benchmarking and Multi-Metric Evaluation' leaf, which contains only three papers total, indicating a relatively sparse research direction. The sibling papers (SPINE and Utility Synthetic Images) similarly advocate for holistic assessment protocols, suggesting this leaf represents an emerging consensus around comprehensive benchmarking rather than isolated metrics.

The taxonomy reveals that while generative architectures (GANs, diffusion models) and application-driven synthesis are well-populated branches, the evaluation methodologies branch remains comparatively underdeveloped. Neighboring leaves focus on single-aspect assessments: 'Fidelity and Clinical Realism Assessment' examines visual quality via radiologist studies, 'Privacy Risk and Memorization Analysis' addresses data leakage concerns, and 'Downstream Task Utility Evaluation' measures classifier performance. CheXGenBench's multi-faceted approach bridges these fragmented evaluation threads, positioning it at the intersection of previously siloed assessment dimensions.

Among the 30 candidates examined, none clearly refutes the three core contributions. For the unified framework contribution, 10 candidates were examined with zero refutable matches, suggesting limited prior work proposing simultaneous fidelity-privacy-utility benchmarks at this scale. For the state-of-the-art model and evaluation protocol contribution, no refutations were found across its 10 candidates, though the search scope cannot confirm exhaustive novelty. The SynthCheX-75K dataset contribution likewise showed zero refutations among its 10 examined papers, indicating potential novelty in dataset scale or composition within the limited search window.

Based on the top-30 semantic matches and taxonomy structure, the work appears to occupy a genuine gap in comprehensive benchmarking for medical image synthesis. However, the limited search scope means adjacent evaluation frameworks in broader computer vision or alternative medical imaging domains may not have been captured. The analysis covers synthetic chest X-ray generation specifically but does not extend to evaluation methodologies in other radiology subfields or general-purpose image synthesis benchmarks.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: synthetic chest radiograph generation evaluation.

The field has matured into a structured landscape organized around six main branches. Generative Model Architectures and Training Approaches explore foundational techniques—ranging from early DCGAN Medical Synthesis[5] and Progressive GAN Framework[4] to more recent diffusion-based methods like Denoising Diffusion Probabilistic[11] and Adapted Latent Diffusion[29]—that establish how synthetic chest X-rays are produced. Controlled and Conditional Synthesis focuses on fine-grained manipulation, enabling disease-specific or attribute-driven generation through works such as Disease Aware StyleGAN[10] and Feature Controlled Synthesis[32]. Evaluation Methodologies and Benchmarking Frameworks address the critical question of how to assess realism and clinical utility, with studies like Visual Turing Test[7] and Clinical Realism Evaluation[19] proposing human-expert protocols alongside automated metrics. Application-Driven Synthesis and Data Augmentation examines practical deployment for tasks like COVID Detection Synthesis[14] and Enhanced Disease Prediction[21], while Cross-Domain and Multi-Modal Synthesis investigates translation between imaging modalities or text-to-image generation as in Text to CXR[15]. Finally, Methodological Reviews and Comparative Studies, including Current State Outlook[22] and Comparative Exploration GANs[39], synthesize progress and identify open challenges across these threads.

Several active lines reveal ongoing tensions between model sophistication, controllability, and rigorous validation. On one hand, generative architectures have grown increasingly powerful, yet concerns about clinical fidelity and potential biases persist, as highlighted by Beware Diffusion Models[25] and Critical Assessment Pneumonia[3]. On the other hand, application-driven efforts demonstrate that synthetic data can improve downstream classifiers when carefully validated, though the gap between perceptual quality and diagnostic utility remains debated.

CheXGenBench[0] situates itself squarely within the Unified Benchmarking and Multi-Metric Evaluation cluster, proposing a comprehensive framework that bridges multiple evaluation dimensions—visual realism, diagnostic consistency, and downstream task performance. This positions it alongside SPINE[17] and Utility Synthetic Images[30], which similarly advocate for holistic assessment protocols rather than isolated metrics, addressing the field's need for standardized benchmarks that can guide both model development and clinical adoption.

Claimed Contributions

CheXGenBench unified evaluation framework

A comprehensive benchmark framework that evaluates synthetic chest X-ray generation models across three dimensions: generative fidelity and mode coverage, privacy and patient re-identification risks, and downstream clinical utility. The framework includes over 20 quantitative metrics and supports plug-and-play integration of new models.

10 retrieved papers
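One common way to make the privacy axis of this contribution concrete is a memorization check based on nearest-neighbour distances: embed synthetic and training images, then flag synthetic samples that land unusually close to a training image, since such near-copies raise re-identification risk. The exact privacy metrics in CheXGenBench are not enumerated in this report, so the snippet below is only a generic sketch of that idea.

```python
# Generic sketch of a nearest-neighbour memorization check; an illustration
# of the idea, not CheXGenBench's actual privacy metric.
import numpy as np


def nearest_neighbour_gaps(train_emb: np.ndarray, synth_emb: np.ndarray) -> np.ndarray:
    """Distance from each synthetic embedding to its closest training embedding.
    Unusually small gaps suggest near-copies of patient data."""
    d2 = (
        np.sum(synth_emb ** 2, axis=1, keepdims=True)
        - 2.0 * synth_emb @ train_emb.T
        + np.sum(train_emb ** 2, axis=1)
    )
    return np.sqrt(np.maximum(d2, 0.0)).min(axis=1)


def memorization_rate(train_emb: np.ndarray, synth_emb: np.ndarray, threshold: float) -> float:
    """Fraction of synthetic samples closer to a training image than `threshold`
    (the threshold is typically calibrated on real held-out images)."""
    return float(np.mean(nearest_neighbour_gaps(train_emb, synth_emb) < threshold))
```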
New state-of-the-art model and evaluation protocol

The authors establish new state-of-the-art performance in synthetic chest radiograph generation by evaluating 11 leading text-to-image architectures using standardized training protocols and identifying Sana 0.6B as the top-performing model through their comprehensive benchmark.

10 retrieved papers
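Fidelity rankings of this kind usually rest on distributional scores such as the Fréchet Inception Distance (FID), computed on feature embeddings of real and synthetic radiographs. Whether the benchmark uses exactly this formulation, or a radiograph-specific feature extractor, is an assumption here; the snippet below is a minimal reference implementation of the standard FID formula.

```python
# Reference implementation of the standard FID formula; whether CheXGenBench
# uses this exact variant or a domain-specific feature extractor is an assumption.
import numpy as np
from scipy import linalg


def frechet_distance(real_feats: np.ndarray, synth_feats: np.ndarray) -> float:
    """FID between two sets of feature embeddings (one row per image)."""
    mu_r, mu_s = real_feats.mean(axis=0), synth_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_s = np.cov(synth_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_s, disp=False)
    covmean = np.asarray(covmean).real  # drop tiny imaginary parts from numerics
    diff = mu_r - mu_s
    return float(diff @ diff + np.trace(cov_r + cov_s - 2.0 * covmean))
```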
SynthCheX-75K synthetic dataset

A curated dataset of 75,000 high-quality synthetic chest radiographs generated using the benchmark's best-performing model. This dataset can serve as a standalone training resource, augment existing datasets for rare conditions, or function as an out-of-distribution test set.

10 retrieved papers
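Downstream clinical utility is typically operationalised with a train-on-synthetic, test-on-real protocol: fit a pathology classifier on synthetic radiographs and report AUROC on held-out real ones. The sketch below illustrates that protocol with simple per-pathology linear probes over precomputed image features; the classifiers, labels, and metrics actually used with SynthCheX-75K may differ.

```python
# Minimal sketch of a train-on-synthetic, test-on-real utility check.
# Feature extraction is assumed to have happened upstream; this is a
# simplification, not the protocol reported for SynthCheX-75K.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


def synthetic_utility_auroc(synth_feats: np.ndarray, synth_labels: np.ndarray,
                            real_feats: np.ndarray, real_labels: np.ndarray) -> float:
    """Train one linear probe per pathology on synthetic features and
    report the mean AUROC on held-out real radiographs."""
    aurocs = []
    for k in range(synth_labels.shape[1]):
        # Skip pathologies with a single class in either split.
        if len(np.unique(synth_labels[:, k])) < 2 or len(np.unique(real_labels[:, k])) < 2:
            continue
        probe = LogisticRegression(max_iter=1000).fit(synth_feats, synth_labels[:, k])
        scores = probe.predict_proba(real_feats)[:, 1]
        aurocs.append(roc_auc_score(real_labels[:, k], scores))
    return float(np.mean(aurocs)) if aurocs else float("nan")
```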

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: CheXGenBench unified evaluation framework

Contribution: New state-of-the-art model and evaluation protocol

Contribution: SynthCheX-75K synthetic dataset