ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks
Overview
Overall Novelty Assessment
The paper introduces ImagenWorld, a benchmark spanning six core tasks (generation and editing with single or multiple references) across six topical domains, supported by 3.6K condition sets and 20K human annotations. Within the taxonomy, it resides in the 'Multi-Task and Multi-Domain Benchmarks' leaf alongside two sibling papers. This leaf is relatively sparse, containing only three papers in total, which suggests that comprehensive multi-task, multi-domain evaluation frameworks remain underexplored relative to the broader field of 50 papers spread across 15 leaf nodes.
The taxonomy reveals that most evaluation work concentrates in adjacent leaves: 'Task-Specific Evaluation Benchmarks' contains eight papers focusing on narrow editing tasks, while 'Evaluation Metrics and Human Alignment' holds one paper on automated metrics. The sibling papers in ImagenWorld's leaf address multimodal instruction-guided generation and general multi-task assessment, but the scope notes clarify that ImagenWorld's simultaneous coverage of diverse visual domains (artworks, screenshots, information graphics) distinguishes it from single-domain or single-task approaches. Neighboring branches like 'Controllable Generation' and 'Model Architectures' focus on methods rather than evaluation infrastructure.
Among the 30 candidates examined, none clearly refutes any of the three contributions. For Contribution A (diverse tasks and domains), 10 candidates were examined and none was refutable; for Contribution B (large-scale human study revealing failure modes), 10 were examined and none was refutable; for Contribution C (explainable evaluation schema with localized error attribution), 10 were examined and none was refutable. These counts suggest that, within this limited search scope, the combination of multi-task coverage, domain diversity, and fine-grained error tagging is relatively novel, though the small candidate pool means substantial prior work may exist beyond the top-30 semantic matches.
Based on the limited literature search (30 candidates from semantic retrieval), the work appears to occupy a sparsely populated niche among evaluation frameworks. The taxonomy structure shows that while task-specific benchmarks are common, comprehensive multi-domain testbeds with explainable error attribution are less prevalent. However, the analysis does not exhaustively cover citation networks or domain-specific venues, so definitive claims about novelty require broader investigation beyond the top-K matches examined here.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors present ImagenWorld, a benchmark comprising 3.6K condition sets that systematically covers six representative task types (generation and editing with single or multiple references) and six topical domains (artworks, photorealistic images, information graphics, textual graphics, computer graphics, and screenshots), providing a unified testbed for evaluating image generation models.
The authors perform a comprehensive human evaluation study supported by 20K fine-grained annotations across 14 models, uncovering systematic failure patterns such as distinct editing biases, struggles with text-heavy domains, and performance gaps between closed-source and open-source systems.
The authors introduce a structured evaluation framework where human annotators not only provide scores but also tag specific failure modes with textual descriptions and localized masks at both object and segment levels, enabling interpretable diagnosis of model weaknesses beyond opaque numerical metrics.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[18] MMIG-Bench: Towards Comprehensive and Explainable Evaluation of Multi-Modal Image Generation Models
[22] OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing
Contribution Analysis
Detailed comparisons for each claimed contribution
ImagenWorld benchmark with diverse tasks and domains
The authors present ImagenWorld, a benchmark comprising 3.6K condition sets that systematically covers six representative task types (generation and editing with single or multiple references) and six topical domains (artworks, photorealistic images, information graphics, textual graphics, computer graphics, and screenshots), providing a unified testbed for evaluating image generation models.
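To make the benchmark's task-by-domain structure concrete, the sketch below shows one plausible way a single condition set could be represented. The task and domain labels and all field names are illustrative assumptions based on the description above, not the authors' released schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical task and domain vocabularies inferred from the paper's description;
# the exact labels in the released benchmark may differ.
TASKS = [
    "text-guided-generation", "single-reference-generation", "multi-reference-generation",
    "text-guided-editing", "single-reference-editing", "multi-reference-editing",
]
DOMAINS = [
    "artworks", "photorealistic-images", "information-graphics",
    "textual-graphics", "computer-graphics", "screenshots",
]

@dataclass
class ConditionSet:
    """One illustrative benchmark instance: an instruction plus optional reference images."""
    task: str                     # one of TASKS
    domain: str                   # one of DOMAINS
    instruction: str              # natural-language prompt or edit instruction
    reference_images: List[str] = field(default_factory=list)  # empty for text-only generation

example = ConditionSet(
    task="multi-reference-editing",
    domain="screenshots",
    instruction="Restyle the navigation bar of the first screenshot to match the second.",
    reference_images=["ui_before.png", "ui_style.png"],
)
assert example.task in TASKS and example.domain in DOMAINS
```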
[4] FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space
[51] OmniGen: Unified Image Generation
[52] DreamOmni: Unified Image Generation and Editing
[53] ICE-Bench: A Unified and Comprehensive Benchmark for Image Creating and Editing
[54] AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea
[55] UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing
[56] Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis
[57] UniReal: Universal Image Generation and Editing via Learning Real-World Dynamics
[58] Mix MSTAR: A Synthetic Benchmark Dataset for Multi-Class Rotation Vehicle Detection in Large-Scale SAR Images
[59] UPGPT: Universal Diffusion Model for Person Image Generation, Editing and Pose Transfer
Large-scale human study revealing model failure modes
The authors perform a comprehensive human evaluation study supported by 20K fine-grained annotations across 14 models, uncovering systematic failure patterns such as distinct editing biases, struggles with text-heavy domains, and performance gaps between closed-source and open-source systems.
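As a rough illustration of how 20K fine-grained annotations could surface systematic failure patterns, the sketch below aggregates hypothetical annotation records into per-model, per-domain failure rates. The record fields, model names, and tag labels are assumptions made for the example, not the paper's actual data format.

```python
from collections import defaultdict

# Hypothetical annotation records: each human judgment may tag a model output with a failure mode.
annotations = [
    {"model": "closed-model-a", "domain": "textual-graphics", "failure_mode": "text-rendering"},
    {"model": "open-model-b",   "domain": "textual-graphics", "failure_mode": "text-rendering"},
    {"model": "open-model-b",   "domain": "artworks",         "failure_mode": None},  # no failure tagged
    # ... roughly 20K records in the real study
]

def failure_rates(records):
    """Fraction of annotations per (model, domain) pair that carry any failure tag."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for r in records:
        key = (r["model"], r["domain"])
        totals[key] += 1
        if r["failure_mode"] is not None:
            failures[key] += 1
    return {key: failures[key] / totals[key] for key in totals}

for (model, domain), rate in sorted(failure_rates(annotations).items()):
    print(f"{model:15s} {domain:18s} failure rate: {rate:.2f}")
```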
[60] Rich Human Feedback for Text-to-Image Generation
[61] ImagenHub: Standardizing the Evaluation of Conditional Image Generation Models
[62] Rethinking FID: Towards a Better Evaluation Metric for Image Generation
[63] Re-Imagen: Retrieval-Augmented Text-to-Image Generator
[64] New Job, New Gender? Measuring the Social Bias in Image Generation Models
[65] Benchmarking Spatial Relationships in Text-to-Image Generation
[66] Quality Assessment for Text-to-Image Generation: A Survey
[67] Interactive Visual Assessment for Text-to-Image Generation Models
[68] Linguistic Binding in Diffusion Models: Enhancing Attribute Correspondence through Attention Map Alignment
[69] Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation
Explainable evaluation schema with localized error attribution
The authors introduce a structured evaluation framework where human annotators not only provide scores but also tag specific failure modes with textual descriptions and localized masks at both object and segment levels, enabling interpretable diagnosis of model weaknesses beyond opaque numerical metrics.
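This kind of schema lends itself to a structured record per annotation. The sketch below is one plausible encoding, assuming masks are stored as references to external binary-mask files; the field names and tag labels are illustrative and not taken from the paper's release.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ErrorRegion:
    """A localized error: a tag, a free-text description, and a mask at object or segment level."""
    level: str                       # "object" or "segment"
    failure_tag: str                 # e.g. "missing-object", "garbled-text" (illustrative labels)
    description: str                 # annotator's textual explanation
    mask_path: Optional[str] = None  # reference to a binary mask or polygon file

@dataclass
class ExplainableAnnotation:
    """One human judgment: an overall score plus zero or more localized error attributions."""
    sample_id: str
    model: str
    score: float                               # overall quality rating
    errors: List[ErrorRegion] = field(default_factory=list)

ann = ExplainableAnnotation(
    sample_id="screenshots/0421",
    model="open-model-b",
    score=2.0,
    errors=[
        ErrorRegion(level="segment", failure_tag="garbled-text",
                    description="Menu labels are illegible.", mask_path="masks/0421_text.png"),
        ErrorRegion(level="object", failure_tag="missing-object",
                    description="The search icon from the reference is absent.",
                    mask_path="masks/0421_icon.png"),
    ],
)
print(len(ann.errors), "localized errors tagged for", ann.model)
```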