ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks
Overview
Overall Novelty Assessment
The paper introduces ImagenWorld, a benchmark spanning six core tasks (generation and editing with single or multiple references) across six topical domains, supported by 3.6K condition sets and 20K human annotations. Within the taxonomy, it resides in the 'Multi-Task and Multi-Domain Benchmarks' leaf alongside two sibling papers. This leaf is relatively sparse, containing only three papers in total, which suggests that comprehensive multi-task, multi-domain evaluation frameworks remain underexplored relative to the broader field of 50 papers spread across 15 leaf nodes.
The taxonomy reveals that most evaluation work concentrates in adjacent leaves: 'Task-Specific Evaluation Benchmarks' contains eight papers focusing on narrow editing tasks, while 'Evaluation Metrics and Human Alignment' holds one paper on automated metrics. The sibling papers in ImagenWorld's leaf address multimodal instruction-guided generation and general multi-task assessment, but the scope notes clarify that ImagenWorld's simultaneous coverage of diverse visual domains (artworks, screenshots, information graphics) distinguishes it from single-domain or single-task approaches. Neighboring branches like 'Controllable Generation' and 'Model Architectures' focus on methods rather than evaluation infrastructure.
Among the 30 candidates examined, none clearly refutes any of the three contributions. For Contribution A (diverse tasks and domains), 10 candidates were examined and none was refutable; for Contribution B (large-scale human study revealing failure modes), 10 were examined and none was refutable; for Contribution C (explainable evaluation schema with localized error attribution), 10 were examined and none was refutable. These counts suggest that, within this limited search scope, the combination of multi-task coverage, domain diversity, and fine-grained error tagging is relatively novel, though the small candidate pool means substantial prior work may exist beyond the top-30 semantic matches.
Based on the limited literature search (30 candidates from semantic retrieval), the work appears to occupy a sparsely populated niche among evaluation frameworks. The taxonomy structure shows that while task-specific benchmarks are common, comprehensive multi-domain testbeds with explainable error attribution are less prevalent. However, the analysis does not exhaustively cover citation networks or domain-specific venues, so definitive claims about novelty require broader investigation beyond the top-K matches examined here.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors present ImagenWorld, a benchmark comprising 3.6K condition sets that systematically covers six representative task types (generation and editing with single or multiple references) and six topical domains (artworks, photorealistic images, information graphics, textual graphics, computer graphics, and screenshots), providing a unified testbed for evaluating image generation models.
The authors perform a comprehensive human evaluation study supported by 20K fine-grained annotations across 14 models, uncovering systematic failure patterns such as distinct editing biases, struggles with text-heavy domains, and performance gaps between closed-source and open-source systems.
The authors introduce a structured evaluation framework where human annotators not only provide scores but also tag specific failure modes with textual descriptions and localized masks at both object and segment levels, enabling interpretable diagnosis of model weaknesses beyond opaque numerical metrics.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[18] MMIG-Bench: Towards Comprehensive and Explainable Evaluation of Multi-Modal Image Generation Models
[22] OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing
Contribution Analysis
Detailed comparisons for each claimed contribution
ImagenWorld benchmark with diverse tasks and domains
The authors present ImagenWorld, a benchmark comprising 3.6K condition sets that systematically covers six representative task types (generation and editing with single or multiple references) and six topical domains (artworks, photorealistic images, information graphics, textual graphics, computer graphics, and screenshots), providing a unified testbed for evaluating image generation models.
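To make the benchmark's task-by-domain structure concrete, the sketch below shows one plausible way a single condition set could be represented. The task and domain labels and all field names are illustrative assumptions based on the description above, not the authors' released schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical task and domain vocabularies inferred from the paper's description;
# the exact labels in the released benchmark may differ.
TASKS = [
    "text-guided-generation", "single-reference-generation", "multi-reference-generation",
    "text-guided-editing", "single-reference-editing", "multi-reference-editing",
]
DOMAINS = [
    "artworks", "photorealistic-images", "information-graphics",
    "textual-graphics", "computer-graphics", "screenshots",
]

@dataclass
class ConditionSet:
    """One illustrative benchmark instance: an instruction plus optional reference images."""
    task: str                     # one of TASKS
    domain: str                   # one of DOMAINS
    instruction: str              # natural-language prompt or edit instruction
    reference_images: List[str] = field(default_factory=list)  # empty for text-only generation

example = ConditionSet(
    task="multi-reference-editing",
    domain="screenshots",
    instruction="Restyle the navigation bar of the first screenshot to match the second.",
    reference_images=["ui_before.png", "ui_style.png"],
)
assert example.task in TASKS and example.domain in DOMAINS
```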
[4] FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space
[51] OmniGen: Unified Image Generation
[52] DreamOmni: Unified Image Generation and Editing
[53] ICE-Bench: A Unified and Comprehensive Benchmark for Image Creating and Editing
[54] AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea
[55] UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing
[56] Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis
[57] UniReal: Universal Image Generation and Editing via Learning Real-World Dynamics
[58] Mix MSTAR: A Synthetic Benchmark Dataset for Multi-Class Rotation Vehicle Detection in Large-Scale SAR Images
[59] UPGPT: Universal Diffusion Model for Person Image Generation, Editing and Pose Transfer
Large-scale human study revealing model failure modes
The authors perform a comprehensive human evaluation study supported by 20K fine-grained annotations across 14 models, uncovering systematic failure patterns such as distinct editing biases, struggles with text-heavy domains, and performance gaps between closed-source and open-source systems.
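As a rough illustration of how 20K fine-grained annotations could surface systematic failure patterns, the sketch below aggregates hypothetical annotation records into per-model, per-domain failure rates. The record fields, model names, and tag labels are assumptions made for the example, not the paper's actual data format.

```python
from collections import defaultdict

# Hypothetical annotation records: each human judgment may tag a model output with a failure mode.
annotations = [
    {"model": "closed-model-a", "domain": "textual-graphics", "failure_mode": "text-rendering"},
    {"model": "open-model-b",   "domain": "textual-graphics", "failure_mode": "text-rendering"},
    {"model": "open-model-b",   "domain": "artworks",         "failure_mode": None},  # no failure tagged
    # ... roughly 20K records in the real study
]

def failure_rates(records):
    """Fraction of annotations per (model, domain) pair that carry any failure tag."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for r in records:
        key = (r["model"], r["domain"])
        totals[key] += 1
        if r["failure_mode"] is not None:
            failures[key] += 1
    return {key: failures[key] / totals[key] for key in totals}

for (model, domain), rate in sorted(failure_rates(annotations).items()):
    print(f"{model:15s} {domain:18s} failure rate: {rate:.2f}")
```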
[60] Rich Human Feedback for Text-to-Image Generation
[61] ImagenHub: Standardizing the Evaluation of Conditional Image Generation Models
[62] Rethinking FID: Towards a Better Evaluation Metric for Image Generation
[63] Re-Imagen: Retrieval-Augmented Text-to-Image Generator
[64] New Job, New Gender? Measuring the Social Bias in Image Generation Models
[65] Benchmarking Spatial Relationships in Text-to-Image Generation
[66] Quality Assessment for Text-to-Image Generation: A Survey
[67] Interactive Visual Assessment for Text-to-Image Generation Models
[68] Linguistic Binding in Diffusion Models: Enhancing Attribute Correspondence through Attention Map Alignment
[69] Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation
Explainable evaluation schema with localized error attribution
The authors introduce a structured evaluation framework where human annotators not only provide scores but also tag specific failure modes with textual descriptions and localized masks at both object and segment levels, enabling interpretable diagnosis of model weaknesses beyond opaque numerical metrics.
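This kind of schema lends itself to a structured record per annotation. The sketch below is one plausible encoding, assuming masks are stored as references to external binary-mask files; the field names and tag labels are illustrative and not taken from the paper's release.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ErrorRegion:
    """A localized error: a tag, a free-text description, and a mask at object or segment level."""
    level: str                       # "object" or "segment"
    failure_tag: str                 # e.g. "missing-object", "garbled-text" (illustrative labels)
    description: str                 # annotator's textual explanation
    mask_path: Optional[str] = None  # reference to a binary mask or polygon file

@dataclass
class ExplainableAnnotation:
    """One human judgment: an overall score plus zero or more localized error attributions."""
    sample_id: str
    model: str
    score: float                               # overall quality rating
    errors: List[ErrorRegion] = field(default_factory=list)

ann = ExplainableAnnotation(
    sample_id="screenshots/0421",
    model="open-model-b",
    score=2.0,
    errors=[
        ErrorRegion(level="segment", failure_tag="garbled-text",
                    description="Menu labels are illegible.", mask_path="masks/0421_text.png"),
        ErrorRegion(level="object", failure_tag="missing-object",
                    description="The search icon from the reference is absent.",
                    mask_path="masks/0421_icon.png"),
    ],
)
print(len(ann.errors), "localized errors tagged for", ann.model)
```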