Abstract:

The rapid advancement of text-to-image (T2I) models has increased the need for reliable human preference modeling, a demand further amplified by recent progress in reinforcement learning for preference alignment. However, existing approaches typically quantify the quality of a generated image with a single scalar, limiting their ability to provide comprehensive and interpretable feedback. To address this, we introduce ImageDoctor, a unified multi-aspect T2I evaluation framework that assesses image quality across four complementary dimensions: plausibility, semantic alignment, aesthetics, and overall quality. ImageDoctor also provides pixel-level flaw indicators in the form of heatmaps, which highlight misaligned or implausible regions and can be used as a dense reward for T2I preference alignment. Inspired by the diagnostic process, we improve the detail sensitivity and reasoning capability of ImageDoctor by introducing a "look-think-predict" paradigm: the model first localizes potential flaws, then generates reasoning, and finally concludes the evaluation with quantitative scores. Built on top of a vision-language model and trained through a combination of supervised fine-tuning and reinforcement learning, ImageDoctor demonstrates strong alignment with human preferences across multiple datasets, establishing its effectiveness as an evaluation metric. Furthermore, when used as a reward model for preference tuning, ImageDoctor significantly improves generation quality, achieving a 10% improvement over scalar-based reward models.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 2

Research Landscape Overview

Core task: Multi-aspect text-to-image generation quality evaluation with spatial localization. The field has evolved around several interconnected branches that address both generation and assessment challenges. Spatial Control and Layout-Guided Generation encompasses methods that use explicit spatial constraints—such as bounding boxes (Boxdiff[1]), attention mechanisms (Attention Refocusing[6], Localized Cross Attention[7]), or layout priors (Zero-shot Layout[45])—to guide where objects appear in synthesized images. Text-Prompt-Based Spatial Control explores how natural language descriptions can encode spatial relationships (Spatext[5], Spatial Prepositions[39]) without requiring structured annotations. Meanwhile, Generative Model Architectures and Training Paradigms investigates the underlying diffusion and transformer frameworks (Muse[2], Grounding Diffusion[14]) that enable fine-grained control. Domain-Specific and Application-Oriented Generation targets specialized contexts like satellite imagery (Satellite Street View[21]) or design tasks (DesignDiffusion[27]), while Auxiliary Generation and Reasoning Tasks address complementary problems such as visual reasoning and compositional understanding. Evaluation Frameworks and Benchmarks have become central to measuring how well models satisfy complex prompts across multiple quality dimensions—attribute binding, spatial accuracy, object counting, and compositional fidelity. Works like TIFA[4], GenEval[8], and T2I-CompBench[18] introduced systematic test suites, while Quality Assessment Survey[9] and Decade Survey[13] provide broader perspectives on progress and open challenges. Within this evaluation landscape, ImageDoctor[0] emphasizes multi-aspect quality assessment with spatial localization, positioning itself alongside efforts that diagnose specific failure modes and provide interpretable feedback. 
Compared to holistic benchmarks (T2I-CompBench++[15], DALL-EVAL[25]) that aggregate scores across many prompts, ImageDoctor[0] focuses on pinpointing where and why generation quality degrades, offering a more granular diagnostic lens. This contrasts with purely compositional metrics (GenEval[8]) or attribute-binding tests (TIFA[4]), highlighting a shift toward spatially aware, interpretable evaluation that can guide iterative model improvement.

Claimed Contributions

ImageDoctor unified multi-aspect T2I evaluation framework

The authors propose ImageDoctor, a unified framework that evaluates text-to-image generation across four dimensions (plausibility, semantic alignment, aesthetics, overall quality) and provides pixel-level flaw indicators as heatmaps highlighting misaligned or implausible regions.

10 retrieved papers
Look-think-predict paradigm for grounded image reasoning

The authors introduce a diagnostic reasoning paradigm where the model first localizes potential flaw regions (look), analyzes them through structured reasoning (think), and then produces final evaluation scores and heatmaps (predict), mimicking human evaluation processes.
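The three-stage flow described above can be sketched as a simple pipeline. This is a hypothetical illustration only: the function names (`look`, `think`, `predict`), the region format, and the score keys are assumptions for clarity, not the paper's actual VLM prompting scheme or outputs.

```python
# Hypothetical sketch of the "look-think-predict" evaluation flow.
# All stage implementations below are illustrative stubs; the real
# system drives each stage with a vision-language model.

from dataclasses import dataclass

@dataclass
class Evaluation:
    flaw_regions: list   # candidate flaw boxes as (x, y, w, h)
    reasoning: str       # natural-language analysis of those regions
    scores: dict         # per-aspect quantitative scores

def look(prompt: str, image) -> list:
    """Stage 1 (look): localize potential flaw regions (stub)."""
    return [(10, 10, 32, 32)]  # placeholder box

def think(prompt: str, image, regions: list) -> str:
    """Stage 2 (think): generate structured reasoning about the regions (stub)."""
    return f"Region(s) {regions} may conflict with the prompt '{prompt}'."

def predict(reasoning: str) -> dict:
    """Stage 3 (predict): conclude with per-aspect scores (stub values)."""
    return {"plausibility": 4.0, "alignment": 3.5,
            "aesthetics": 4.5, "overall": 4.0}

def evaluate(prompt: str, image) -> Evaluation:
    # Flaws are localized first so that reasoning and scoring are
    # grounded in concrete regions rather than the image as a whole.
    regions = look(prompt, image)
    rationale = think(prompt, image, regions)
    return Evaluation(regions, rationale, predict(rationale))
```

The ordering is the point of the sketch: scores are produced last, conditioned on localized evidence and explicit reasoning, mirroring how a human evaluator inspects before judging.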

2 retrieved papers
DenseFlow-GRPO reinforcement learning framework

The authors present DenseFlow-GRPO, a novel reinforcement learning method that incorporates both image-level and pixel-level dense reward signals from ImageDoctor to provide spatially aligned supervision for text-to-image model training, enabling fine-grained region-aware optimization.

9 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

ImageDoctor unified multi-aspect T2I evaluation framework

The authors propose ImageDoctor, a unified framework that evaluates text-to-image generation across four dimensions (plausibility, semantic alignment, aesthetics, overall quality) and provides pixel-level flaw indicators as heatmaps highlighting misaligned or implausible regions.

Contribution

Look-think-predict paradigm for grounded image reasoning

The authors introduce a diagnostic reasoning paradigm where the model first localizes potential flaw regions (look), analyzes them through structured reasoning (think), and then produces final evaluation scores and heatmaps (predict), mimicking human evaluation processes.

Contribution

DenseFlow-GRPO reinforcement learning framework

The authors present DenseFlow-GRPO, a novel reinforcement learning method that incorporates both image-level and pixel-level dense reward signals from ImageDoctor to provide spatially aligned supervision for text-to-image model training, enabling fine-grained region-aware optimization.