Abstract:

The rapid advancement of text-to-image (T2I) models has increased the need for reliable human preference modeling, a demand further amplified by recent progress in reinforcement learning for preference alignment. However, existing approaches typically quantify the quality of a generated image with a single scalar, limiting their ability to provide comprehensive and interpretable feedback. To address this, we introduce ImageDoctor, a unified multi-aspect T2I evaluation framework that assesses image quality across four complementary dimensions: plausibility, semantic alignment, aesthetics, and overall quality. ImageDoctor also provides pixel-level flaw indicators in the form of heatmaps, which highlight misaligned or implausible regions and can be used as a dense reward for T2I preference alignment. Inspired by the diagnostic process, we improve the detail sensitivity and reasoning capability of ImageDoctor by introducing a "look-think-predict" paradigm: the model first localizes potential flaws, then generates reasoning, and finally concludes the evaluation with quantitative scores. Built on top of a vision-language model and trained through a combination of supervised fine-tuning and reinforcement learning, ImageDoctor demonstrates strong alignment with human preferences across multiple datasets, establishing its effectiveness as an evaluation metric. Furthermore, when used as a reward model for preference tuning, ImageDoctor significantly improves generation quality, achieving a 10% improvement over scalar-based reward models.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 2

Research Landscape Overview

Core task: Multi-aspect text-to-image generation quality evaluation with spatial localization. The field has evolved around several interconnected branches that address both generation and assessment challenges. Spatial Control and Layout-Guided Generation encompasses methods that use explicit spatial constraints—such as bounding boxes (Boxdiff[1]), attention mechanisms (Attention Refocusing[6], Localized Cross Attention[7]), or layout priors (Zero-shot Layout[45])—to guide where objects appear in synthesized images. Text-Prompt-Based Spatial Control explores how natural language descriptions can encode spatial relationships (Spatext[5], Spatial Prepositions[39]) without requiring structured annotations. Meanwhile, Generative Model Architectures and Training Paradigms investigates the underlying diffusion and transformer frameworks (Muse[2], Grounding Diffusion[14]) that enable fine-grained control. Domain-Specific and Application-Oriented Generation targets specialized contexts like satellite imagery (Satellite Street View[21]) or design tasks (DesignDiffusion[27]), while Auxiliary Generation and Reasoning Tasks address complementary problems such as visual reasoning and compositional understanding. Evaluation Frameworks and Benchmarks have become central to measuring how well models satisfy complex prompts across multiple quality dimensions—attribute binding, spatial accuracy, object counting, and compositional fidelity. Works like TIFA[4], GenEval[8], and T2I-CompBench[18] introduced systematic test suites, while Quality Assessment Survey[9] and Decade Survey[13] provide broader perspectives on progress and open challenges. Within this evaluation landscape, ImageDoctor[0] emphasizes multi-aspect quality assessment with spatial localization, positioning itself alongside efforts that diagnose specific failure modes and provide interpretable feedback. 
Compared to holistic benchmarks (T2I-CompBench++[15], DALL-EVAL[25]) that aggregate scores across many prompts, ImageDoctor[0] focuses on pinpointing where and why generation quality degrades, offering a more granular diagnostic lens. This contrasts with purely compositional metrics (GenEval[8]) or attribute-binding tests (TIFA[4]), highlighting a shift toward spatially aware, interpretable evaluation that can guide iterative model improvement.

Claimed Contributions

ImageDoctor unified multi-aspect T2I evaluation framework

The authors propose ImageDoctor, a unified framework that evaluates text-to-image generation across four dimensions (plausibility, semantic alignment, aesthetics, overall quality) and provides pixel-level flaw indicators as heatmaps highlighting misaligned or implausible regions.

10 retrieved papers
Look-think-predict paradigm for grounded image reasoning

The authors introduce a diagnostic reasoning paradigm where the model first localizes potential flaw regions (look), analyzes them through structured reasoning (think), and then produces final evaluation scores and heatmaps (predict), mimicking human evaluation processes.
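The three-stage flow described above can be sketched as a simple pipeline. This is a hypothetical illustration only: the function names (`look`, `think`, `predict`), the region format, and the score keys are assumptions for clarity, not the paper's actual VLM prompting scheme or outputs.

```python
# Hypothetical sketch of the "look-think-predict" evaluation flow.
# All stage implementations below are illustrative stubs; the real
# system drives each stage with a vision-language model.

from dataclasses import dataclass

@dataclass
class Evaluation:
    flaw_regions: list   # candidate flaw boxes as (x, y, w, h)
    reasoning: str       # natural-language analysis of those regions
    scores: dict         # per-aspect quantitative scores

def look(prompt: str, image) -> list:
    """Stage 1 (look): localize potential flaw regions (stub)."""
    return [(10, 10, 32, 32)]  # placeholder box

def think(prompt: str, image, regions: list) -> str:
    """Stage 2 (think): generate structured reasoning about the regions (stub)."""
    return f"Region(s) {regions} may conflict with the prompt '{prompt}'."

def predict(reasoning: str) -> dict:
    """Stage 3 (predict): conclude with per-aspect scores (stub values)."""
    return {"plausibility": 4.0, "alignment": 3.5,
            "aesthetics": 4.5, "overall": 4.0}

def evaluate(prompt: str, image) -> Evaluation:
    # Flaws are localized first so that reasoning and scoring are
    # grounded in concrete regions rather than the image as a whole.
    regions = look(prompt, image)
    rationale = think(prompt, image, regions)
    return Evaluation(regions, rationale, predict(rationale))
```

The ordering is the point of the sketch: scores are produced last, conditioned on localized evidence and explicit reasoning, mirroring how a human evaluator inspects before judging.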

2 retrieved papers
DenseFlow-GRPO reinforcement learning framework

The authors present DenseFlow-GRPO, a novel reinforcement learning method that incorporates both image-level and pixel-level dense reward signals from ImageDoctor to provide spatially aligned supervision for text-to-image model training, enabling fine-grained region-aware optimization.

9 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

ImageDoctor unified multi-aspect T2I evaluation framework

The authors propose ImageDoctor, a unified framework that evaluates text-to-image generation across four dimensions (plausibility, semantic alignment, aesthetics, overall quality) and provides pixel-level flaw indicators as heatmaps highlighting misaligned or implausible regions.

Contribution

Look-think-predict paradigm for grounded image reasoning

The authors introduce a diagnostic reasoning paradigm where the model first localizes potential flaw regions (look), analyzes them through structured reasoning (think), and then produces final evaluation scores and heatmaps (predict), mimicking human evaluation processes.

Contribution

DenseFlow-GRPO reinforcement learning framework

The authors present DenseFlow-GRPO, a novel reinforcement learning method that incorporates both image-level and pixel-level dense reward signals from ImageDoctor to provide spatially aligned supervision for text-to-image model training, enabling fine-grained region-aware optimization.