ImageDoctor: Diagnosing Text-to-Image Generation via Grounded Image Reasoning
Overview
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose ImageDoctor, a unified framework that evaluates text-to-image generation across four dimensions (plausibility, semantic alignment, aesthetics, overall quality) and provides pixel-level flaw indicators as heatmaps highlighting misaligned or implausible regions.
The authors introduce a diagnostic reasoning paradigm where the model first localizes potential flaw regions (look), analyzes them through structured reasoning (think), and then produces final evaluation scores and heatmaps (predict), mimicking human evaluation processes.
The authors present DenseFlow-GRPO, a novel reinforcement learning method that incorporates both image-level and pixel-level dense reward signals from ImageDoctor to provide spatially aligned supervision for text-to-image model training, enabling fine-grained region-aware optimization.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[9] Quality assessment for text-to-image generation: A survey
Contribution Analysis
Detailed comparisons for each claimed contribution
ImageDoctor unified multi-aspect T2I evaluation framework
The authors propose ImageDoctor, a unified framework that evaluates text-to-image generation across four dimensions (plausibility, semantic alignment, aesthetics, overall quality) and provides pixel-level flaw indicators as heatmaps highlighting misaligned or implausible regions.
[9] Quality assessment for text-to-image generation: A survey
[18] T2I-CompBench: A comprehensive benchmark for open-world compositional text-to-image generation
[29] Visual programming for step-by-step text-to-image generation and evaluation
[53] Holistic evaluation of text-to-image models
[54] Evaluating text-to-visual generation with image-to-text generation
[55] Multimodal Benchmarking and Recommendation of Text-to-Image Generation Models
[56] A survey on quality metrics for text-to-image generation
[57] Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis
[58] UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation
[59] HRS-Bench: Holistic, reliable and scalable benchmark for text-to-image models
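To make the shape of the claimed output concrete, the following is a minimal sketch of a container for four scalar scores plus pixel-level flaw heatmaps. The class name `Diagnosis`, the field names, the split into two separate heatmaps, the [0, 1] value range, and the helper `flagged_fraction` are all assumptions made for illustration; the text above does not specify the actual output format.

```python
from dataclasses import dataclass
from typing import List

# Assumption: a heatmap is a row-major grid of flaw intensities in [0, 1].
Heatmap = List[List[float]]


@dataclass
class Diagnosis:
    """Hypothetical container for one evaluation of a generated image."""
    plausibility: float
    semantic_alignment: float
    aesthetics: float
    overall: float
    implausibility_heatmap: Heatmap  # regions that look physically or visually wrong
    misalignment_heatmap: Heatmap    # regions that do not match the prompt


def flagged_fraction(heatmap: Heatmap, threshold: float = 0.5) -> float:
    """Return the share of pixels whose flaw intensity reaches the threshold."""
    pixels = [value for row in heatmap for value in row]
    if not pixels:
        return 0.0
    return sum(value >= threshold for value in pixels) / len(pixels)
```

A summary such as `flagged_fraction(d.misalignment_heatmap)` gives a quick scalar view of how much of an image was flagged, which can be read alongside the four scores.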
Look-think-predict paradigm for grounded image reasoning
The authors introduce a diagnostic reasoning paradigm where the model first localizes potential flaw regions (look), analyzes them through structured reasoning (think), and then produces final evaluation scores and heatmaps (predict), mimicking human evaluation processes.
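The three stages can be read as a simple chain in which each stage consumes the outputs of the earlier ones. The sketch below shows only that control flow. The function `run_diagnosis` and the `toy_*` stand-ins are hypothetical; in the described system each stage would be carried out by the evaluation model, not by hand-written rules like these.

```python
def run_diagnosis(image, prompt, look, think, predict):
    """Chain the three stages; each stage sees the outputs of the earlier ones."""
    regions = look(image, prompt)                       # look: candidate flaw regions
    rationale = think(image, prompt, regions)           # think: reasoning over regions
    return predict(image, prompt, regions, rationale)   # predict: score and heatmap


# Toy stand-ins (assumptions for illustration only).
def toy_look(image, prompt):
    # Flag the bounding box of all pixels brighter than 0.8.
    hits = [(r, c) for r, row in enumerate(image)
            for c, value in enumerate(row) if value > 0.8]
    if not hits:
        return []
    rows = [r for r, _ in hits]
    cols = [c for _, c in hits]
    return [(min(rows), min(cols), max(rows), max(cols))]


def toy_think(image, prompt, regions):
    # One short textual note per flagged region.
    return [f"region {box} may not match '{prompt}'" for box in regions]


def toy_predict(image, prompt, regions, rationale):
    # Paint flagged boxes into a heatmap and derive a score from the flagged area.
    heatmap = [[0.0] * len(row) for row in image]
    for r0, c0, r1, c1 in regions:
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                heatmap[r][c] = 1.0
    total = sum(len(row) for row in image)
    flagged = sum(sum(row) for row in heatmap)
    return {"overall": 1.0 - flagged / total, "heatmap": heatmap, "rationale": rationale}
```

Passing the stages in as callables keeps the ordering explicit: the final prediction cannot be produced without first localizing regions and reasoning about them.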
DenseFlow-GRPO reinforcement learning framework
The authors present DenseFlow-GRPO, a novel reinforcement learning method that incorporates both image-level and pixel-level dense reward signals from ImageDoctor to provide spatially aligned supervision for text-to-image model training, enabling fine-grained region-aware optimization.
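As a rough illustration of how image-level and pixel-level signals could be combined, the sketch below first computes standard GRPO-style group-relative advantages (each reward normalized by the mean and standard deviation of its sampling group), then spreads one image's advantage over a flaw heatmap. The second step is an assumption: the subtraction rule, the `pixel_weight` parameter, and both function names are hypothetical, and the text above does not give the actual DenseFlow-GRPO formulation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each image-level reward against its sampling group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dense_advantage_map(image_advantage, flaw_heatmap, pixel_weight=1.0):
    """Assumed combination rule: lower the advantage where the heatmap flags flaws.

    flaw_heatmap is a row-major grid of flaw intensities in [0, 1].
    """
    return [[image_advantage - pixel_weight * h for h in row] for row in flaw_heatmap]
```

Under this assumed rule, unflagged pixels keep the image-level advantage while flagged regions receive a lower one, so the training signal differs across regions of the same image.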