CheXGenBench: A Unified Benchmark for Fidelity, Privacy and Utility of Synthetic Chest Radiographs

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Biomedical Imaging, Text-to-Image Generation, Medical Image Analysis, Chest Radiographs, Benchmark
Abstract:

Structured benchmarks have advanced text-conditional image generation for real-world imagery; however, no such benchmark exists for synthetic radiograph generation. Despite being a highly active area of research, existing studies continue to adopt inconsistent evaluation protocols and lack a unified assessment of the three most critical criteria: generative fidelity, privacy risk, and downstream utility. To address these limitations, we introduce CheXGenBench, the first unified evaluation framework for synthetic chest radiograph generation that simultaneously assesses fidelity, privacy risks, and clinical utility across frontier text-to-image (T2I) generative models. Our evaluation protocol, comprising over 20 quantitative metrics, covers 11 leading T2I architectures and supports plug-and-play integration of newer models. Through a rigorous and fair evaluation protocol, we establish a new SoTA in synthetic chest X-ray generation. Furthermore, our results uncover several critical limitations of current generative models: (1) even SoTA models struggle with long-tailed medical distributions, (2) models pose high privacy risks regardless of generative fidelity, and (3) synthetic data offers limited utility for downstream multimodal tasks. Drawing on these results, we propose concrete research directions to advance the field. Finally, we curate and release SynthCheX-75K, a high-quality synthetic dataset comprising 75K radiographs generated by our top-performing model (Sana 0.6B). The fine-tuned models and the SynthCheX-75K dataset will be released after acceptance, while the anonymised code is available at https://anonymous.4open.science/r/CheXGenBench-52F0/README.md
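As a rough illustration of the plug-and-play integration mentioned above, the sketch below shows what registering a new T2I model and scoring it along the three axes could look like. The actual interface is defined in the anonymised repository; every class, method, and metric name here is a hypothetical stand-in, not the authors' API.

```python
# Illustrative sketch only: the real CheXGenBench interface lives in the
# anonymised repository linked above; every name below is a hypothetical
# stand-in rather than the authors' API.
from typing import Callable, Dict, Sequence


class SyntheticCXRBenchmark:
    """Hypothetical three-axis harness: fidelity, privacy, downstream utility."""

    def __init__(self, real_images: Sequence, prompts: Sequence[str],
                 metrics: Dict[str, Callable]):
        self.real_images = list(real_images)    # held-out real radiographs
        self.prompts = list(prompts)            # report-derived text prompts
        self.metrics = metrics                  # name -> metric(real, synthetic)
        self.generators: Dict[str, Callable[[str], object]] = {}

    def register_model(self, name: str, generate: Callable[[str], object]) -> None:
        """Plug-and-play hook: any text-to-image callable can be registered."""
        self.generators[name] = generate

    def evaluate(self, name: str) -> Dict[str, float]:
        """Generate one image per prompt, then score the set along every axis."""
        synthetic = [self.generators[name](p) for p in self.prompts]
        return {m: fn(self.real_images, synthetic) for m, fn in self.metrics.items()}


# Hypothetical usage:
#   bench = SyntheticCXRBenchmark(real_imgs, prompts, {"fidelity/fid": my_fid_fn})
#   bench.register_model("sana_0.6b", lambda prompt: pipeline(prompt))
#   print(bench.evaluate("sana_0.6b"))
```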

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces CheXGenBench, a unified evaluation framework for synthetic chest radiograph generation that simultaneously assesses fidelity, privacy risks, and clinical utility across 11 text-to-image architectures. It resides in the 'Unified Benchmarking and Multi-Metric Evaluation' leaf, which contains only three papers total, indicating a relatively sparse research direction. The sibling papers (SPINE and Utility Synthetic Images) similarly advocate for holistic assessment protocols, suggesting this leaf represents an emerging consensus around comprehensive benchmarking rather than isolated metrics.

The taxonomy reveals that while generative architectures (GANs, diffusion models) and application-driven synthesis are well-populated branches, the evaluation methodologies branch remains comparatively underdeveloped. Neighboring leaves focus on single-aspect assessments: 'Fidelity and Clinical Realism Assessment' examines visual quality via radiologist studies, 'Privacy Risk and Memorization Analysis' addresses data leakage concerns, and 'Downstream Task Utility Evaluation' measures classifier performance. CheXGenBench's multi-faceted approach bridges these fragmented evaluation threads, positioning it at the intersection of previously siloed assessment dimensions.

Among the 30 candidates examined, none clearly refutes the three core contributions. For the unified framework contribution, 10 candidates were examined with zero refutable matches, suggesting limited prior work proposing simultaneous fidelity-privacy-utility benchmarks at this scale. For the state-of-the-art model and evaluation protocol contribution, no refutations were found across its 10 candidates, though the search scope cannot confirm exhaustive novelty. The SynthCheX-75K dataset contribution likewise showed zero refutations among its 10 examined papers, indicating potential novelty in dataset scale or composition within the limited search window.

Based on the top-30 semantic matches and taxonomy structure, the work appears to occupy a genuine gap in comprehensive benchmarking for medical image synthesis. However, the limited search scope means adjacent evaluation frameworks in broader computer vision or alternative medical imaging domains may not have been captured. The analysis covers synthetic chest X-ray generation specifically but does not extend to evaluation methodologies in other radiology subfields or general-purpose image synthesis benchmarks.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: synthetic chest radiograph generation evaluation.

The field has matured into a structured landscape organized around six main branches. Generative Model Architectures and Training Approaches explore foundational techniques—ranging from early DCGAN Medical Synthesis[5] and Progressive GAN Framework[4] to more recent diffusion-based methods like Denoising Diffusion Probabilistic[11] and Adapted Latent Diffusion[29]—that establish how synthetic chest X-rays are produced. Controlled and Conditional Synthesis focuses on fine-grained manipulation, enabling disease-specific or attribute-driven generation through works such as Disease Aware StyleGAN[10] and Feature Controlled Synthesis[32]. Evaluation Methodologies and Benchmarking Frameworks address the critical question of how to assess realism and clinical utility, with studies like Visual Turing Test[7] and Clinical Realism Evaluation[19] proposing human-expert protocols alongside automated metrics. Application-Driven Synthesis and Data Augmentation examines practical deployment for tasks like COVID Detection Synthesis[14] and Enhanced Disease Prediction[21], while Cross-Domain and Multi-Modal Synthesis investigates translation between imaging modalities or text-to-image generation as in Text to CXR[15]. Finally, Methodological Reviews and Comparative Studies, including Current State Outlook[22] and Comparative Exploration GANs[39], synthesize progress and identify open challenges across these threads.

Several active lines reveal ongoing tensions between model sophistication, controllability, and rigorous validation. On one hand, generative architectures have grown increasingly powerful, yet concerns about clinical fidelity and potential biases persist, as highlighted by Beware Diffusion Models[25] and Critical Assessment Pneumonia[3]. On the other hand, application-driven efforts demonstrate that synthetic data can improve downstream classifiers when carefully validated, though the gap between perceptual quality and diagnostic utility remains debated.

CheXGenBench[0] situates itself squarely within the Unified Benchmarking and Multi-Metric Evaluation cluster, proposing a comprehensive framework that bridges multiple evaluation dimensions—visual realism, diagnostic consistency, and downstream task performance. This positions it alongside SPINE[17] and Utility Synthetic Images[30], which similarly advocate for holistic assessment protocols rather than isolated metrics, addressing the field's need for standardized benchmarks that can guide both model development and clinical adoption.

Claimed Contributions

CheXGenBench unified evaluation framework

A comprehensive benchmark framework that evaluates synthetic chest X-ray generation models across three dimensions: generative fidelity and mode coverage, privacy and patient re-identification risks, and downstream clinical utility. The framework includes over 20 quantitative metrics and supports plug-and-play integration of new models.

10 retrieved papers
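One common way to make the privacy axis of this contribution concrete is a memorization check based on nearest-neighbour distances: embed synthetic and training images, then flag synthetic samples that land unusually close to a training image, since such near-copies raise re-identification risk. The exact privacy metrics in CheXGenBench are not enumerated in this report, so the snippet below is only a generic sketch of that idea.

```python
# Generic sketch of a nearest-neighbour memorization check; an illustration
# of the idea, not CheXGenBench's actual privacy metric.
import numpy as np


def nearest_neighbour_gaps(train_emb: np.ndarray, synth_emb: np.ndarray) -> np.ndarray:
    """Distance from each synthetic embedding to its closest training embedding.
    Unusually small gaps suggest near-copies of patient data."""
    d2 = (
        np.sum(synth_emb ** 2, axis=1, keepdims=True)
        - 2.0 * synth_emb @ train_emb.T
        + np.sum(train_emb ** 2, axis=1)
    )
    return np.sqrt(np.maximum(d2, 0.0)).min(axis=1)


def memorization_rate(train_emb: np.ndarray, synth_emb: np.ndarray, threshold: float) -> float:
    """Fraction of synthetic samples closer to a training image than `threshold`
    (the threshold is typically calibrated on real held-out images)."""
    return float(np.mean(nearest_neighbour_gaps(train_emb, synth_emb) < threshold))
```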
New state-of-the-art model and evaluation protocol

The authors establish new state-of-the-art performance in synthetic chest radiograph generation by evaluating 11 leading text-to-image architectures using standardized training protocols and identifying Sana 0.6B as the top-performing model through their comprehensive benchmark.

10 retrieved papers
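Fidelity rankings of this kind usually rest on distributional scores such as the Fréchet Inception Distance (FID), computed on feature embeddings of real and synthetic radiographs. Whether the benchmark uses exactly this formulation, or a radiograph-specific feature extractor, is an assumption here; the snippet below is a minimal reference implementation of the standard FID formula.

```python
# Reference implementation of the standard FID formula; whether CheXGenBench
# uses this exact variant or a domain-specific feature extractor is an assumption.
import numpy as np
from scipy import linalg


def frechet_distance(real_feats: np.ndarray, synth_feats: np.ndarray) -> float:
    """FID between two sets of feature embeddings (one row per image)."""
    mu_r, mu_s = real_feats.mean(axis=0), synth_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_s = np.cov(synth_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_s, disp=False)
    covmean = np.asarray(covmean).real  # drop tiny imaginary parts from numerics
    diff = mu_r - mu_s
    return float(diff @ diff + np.trace(cov_r + cov_s - 2.0 * covmean))
```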
SynthCheX-75K synthetic dataset

A curated dataset of 75,000 high-quality synthetic chest radiographs generated using the benchmark's best-performing model. This dataset can serve as a standalone training resource, augment existing datasets for rare conditions, or function as an out-of-distribution test set.

10 retrieved papers
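Downstream clinical utility is typically operationalised with a train-on-synthetic, test-on-real protocol: fit a pathology classifier on synthetic radiographs and report AUROC on held-out real ones. The sketch below illustrates that protocol with simple per-pathology linear probes over precomputed image features; the classifiers, labels, and metrics actually used with SynthCheX-75K may differ.

```python
# Minimal sketch of a train-on-synthetic, test-on-real utility check.
# Feature extraction is assumed to have happened upstream; this is a
# simplification, not the protocol reported for SynthCheX-75K.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


def synthetic_utility_auroc(synth_feats: np.ndarray, synth_labels: np.ndarray,
                            real_feats: np.ndarray, real_labels: np.ndarray) -> float:
    """Train one linear probe per pathology on synthetic features and
    report the mean AUROC on held-out real radiographs."""
    aurocs = []
    for k in range(synth_labels.shape[1]):
        # Skip pathologies with a single class in either split.
        if len(np.unique(synth_labels[:, k])) < 2 or len(np.unique(real_labels[:, k])) < 2:
            continue
        probe = LogisticRegression(max_iter=1000).fit(synth_feats, synth_labels[:, k])
        scores = probe.predict_proba(real_feats)[:, 1]
        aurocs.append(roc_auc_score(real_labels[:, k], scores))
    return float(np.mean(aurocs)) if aurocs else float("nan")
```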

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: CheXGenBench unified evaluation framework

Contribution: New state-of-the-art model and evaluation protocol

Contribution: SynthCheX-75K synthetic dataset