Abstract:

Advances in diffusion, autoregressive, and hybrid models have enabled high-quality image synthesis for tasks such as text-to-image generation, editing, and reference-guided composition. Yet existing benchmarks remain limited, either focusing on isolated tasks, covering only narrow domains, or providing opaque scores without explaining failure modes. We introduce ImagenWorld, a benchmark of 3.6K condition sets spanning six core tasks (generation and editing, with single or multiple references) and six topical domains (artworks, photorealistic images, information graphics, textual graphics, computer graphics, and screenshots). The benchmark is supported by 20K fine-grained human annotations and an explainable evaluation schema that tags localized object-level and segment-level errors, complementing automated VLM-based metrics. Our large-scale evaluation of 14 models yields several insights: (1) Models typically struggle more on editing tasks than on generation tasks, especially local edits. (2) Models excel in artistic and photorealistic settings but struggle with symbolic and text-heavy domains such as screenshots and information graphics. (3) Closed-source systems lead overall, while targeted data curation (e.g., Qwen-Image) narrows the gap in text-heavy cases. (4) Modern VLM-based metrics achieve Kendall accuracies up to 0.79, approximating human rankings, but fall short of fine-grained, explainable error attribution. ImagenWorld provides both a rigorous benchmark and a diagnostic tool to advance robust image generation.
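The "Kendall accuracies up to 0.79" figure refers to rank agreement between automated VLM-based metric scores and human judgments. As a rough illustration of how such an agreement number can be computed, here is a minimal Python sketch; the model names and scores are invented placeholders, and the use of scipy.stats.kendalltau plus the (tau + 1) / 2 conversion to pairwise accuracy are assumptions about the protocol, not the paper's exact procedure.

```python
# Minimal sketch: rank agreement between a VLM-based metric and human scores.
# All values below are invented placeholders; the benchmark's actual pairing
# and aggregation protocol may differ.
from scipy.stats import kendalltau

# Hypothetical per-model mean quality scores over the same set of outputs.
human_scores  = {"model_a": 4.1, "model_b": 3.6, "model_c": 2.9, "model_d": 2.2}
metric_scores = {"model_a": 0.82, "model_b": 0.75, "model_c": 0.61, "model_d": 0.67}

models = sorted(human_scores)
human  = [human_scores[m] for m in models]
metric = [metric_scores[m] for m in models]

tau, p_value = kendalltau(human, metric)   # Kendall's tau in [-1, 1]
accuracy = (tau + 1) / 2                   # fraction of concordant pairs (ignoring ties)
print(f"tau = {tau:.2f}, pairwise ranking accuracy ~ {accuracy:.2f}")
```

With no ties, pairwise ranking accuracy carries the same information as Kendall's tau; it simply rescales the statistic to [0, 1].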

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces ImagenWorld, a benchmark spanning six core tasks (generation and editing with single or multiple references) across six topical domains, supported by 3.6K condition sets and 20K human annotations. Within the taxonomy, it resides in the 'Multi-Task and Multi-Domain Benchmarks' leaf alongside two sibling papers. This leaf is relatively sparse, containing only three papers total, suggesting that comprehensive multi-task, multi-domain evaluation frameworks remain an underexplored area compared to the broader field of 50 papers across 15 leaf nodes.

The taxonomy reveals that most evaluation work concentrates in adjacent leaves: 'Task-Specific Evaluation Benchmarks' contains eight papers focusing on narrow editing tasks, while 'Evaluation Metrics and Human Alignment' holds one paper on automated metrics. The sibling papers in ImagenWorld's leaf address multimodal instruction-guided generation and general multi-task assessment, but the scope notes clarify that ImagenWorld's simultaneous coverage of diverse visual domains (artworks, screenshots, information graphics) distinguishes it from single-domain or single-task approaches. Neighboring branches like 'Controllable Generation' and 'Model Architectures' focus on methods rather than evaluation infrastructure.

Among the 30 candidates examined, none clearly refutes any of the three contributions. For Contribution A (diverse tasks and domains), 10 candidates were examined and none was refutable; for Contribution B (large-scale human study revealing failure modes), 10 were examined and none was refutable; for Contribution C (explainable evaluation schema with localized error attribution), 10 were examined and none was refutable. These statistics suggest that, within this limited search scope, the combination of multi-task coverage, domain diversity, and fine-grained error tagging appears relatively novel, though the small candidate pool means substantial prior work may exist beyond the top-30 semantic matches.

Based on the limited literature search (30 candidates from semantic retrieval), the work appears to occupy a sparsely populated niche within evaluation frameworks. The taxonomy structure shows that while task-specific benchmarks are common, comprehensive multi-domain testbeds with explainable error attribution are less prevalent. However, the analysis does not cover exhaustive citation networks or domain-specific venues, so definitive claims about novelty require broader investigation beyond the top-K matches examined here.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Benchmarking image generation and editing models across diverse tasks and domains.

The field has evolved into a rich ecosystem organized around six major branches. Evaluation Frameworks and Benchmarks establish standardized testbeds for assessing model capabilities, ranging from single-task metrics to comprehensive multi-domain suites like ImagenWorld[0] and MMIG Bench[18]. Model Architectures and Training Approaches encompass foundational techniques from early GANs such as StarGAN[17] and StarGAN v2[5] to modern diffusion-based systems. Controllable Generation and Semantic Manipulation focuses on methods that enable fine-grained steering of outputs through text, layout, or semantic guidance, exemplified by works like StyleCLIP[3] and SDEdit[2]. Application Domains and Specialized Tasks addresses domain-specific challenges in areas such as medical imaging, fashion, and video synthesis. Survey and Taxonomic Studies, including Multimodal Synthesis Survey[1] and Generative Vision Survey[26], provide meta-level perspectives on the landscape. Supporting Techniques and Infrastructure covers auxiliary components like the data augmentation strategies reviewed in Diffusion Augmentation Review[7] and Image Augmentation Survey[28].

A particularly active tension exists between comprehensive multi-task benchmarks and specialized evaluation frameworks. While some efforts pursue breadth (testing models on numerous generation and editing operations simultaneously), others prioritize depth in specific modalities or interaction paradigms, as seen in OpenGPT Image[22] and EditInspector[27]. ImagenWorld[0] sits within the Multi-Task and Multi-Domain Benchmarks cluster, emphasizing holistic assessment across varied scenarios rather than narrow task optimization. This contrasts with more focused benchmarks like MMIG Bench[18], which targets multimodal instruction-guided generation, and specialized editing evaluations such as Complex Edit[33].

The central challenge remains balancing coverage against evaluation granularity: broad benchmarks risk superficial assessment, while narrow ones may miss emergent capabilities. ImagenWorld[0] addresses this by spanning multiple domains and task types, positioning itself as a unifying testbed that captures both generation fidelity and editing versatility across the spectrum of contemporary model capabilities.

Claimed Contributions

ImagenWorld benchmark with diverse tasks and domains

The authors present ImagenWorld, a benchmark comprising 3.6K condition sets that systematically covers six representative task types (generation and editing with single or multiple references) and six topical domains (artworks, photorealistic images, information graphics, textual graphics, computer graphics, and screenshots), providing a unified testbed for evaluating image generation models.

10 retrieved papers

Large-scale human study revealing model failure modes

The authors perform a comprehensive human evaluation study supported by 20K fine-grained annotations across 14 models, uncovering systematic failure patterns such as distinct editing biases, struggles with text-heavy domains, and performance gaps between closed-source and open-source systems.

10 retrieved papers
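As a rough sketch of how 20K fine-grained annotations could be rolled up into the per-model, per-task, per-domain comparisons described above, the snippet below groups annotation rows and averages scores. The column names, task and domain labels, and the simple mean aggregation are assumptions made for illustration; the authors' actual analysis pipeline is not specified here.

```python
# Illustrative roll-up of fine-grained annotations into model comparisons.
# Column names, task/domain labels, and mean aggregation are assumptions for
# this sketch; the paper's actual pipeline may differ.
import pandas as pd

annotations = pd.DataFrame([
    # one row per human annotation of a generated or edited image
    {"model": "model_a", "task": "editing_single_ref",   "domain": "screenshots", "score": 2},
    {"model": "model_a", "task": "generation_text_only", "domain": "artworks",    "score": 5},
    {"model": "model_b", "task": "editing_single_ref",   "domain": "screenshots", "score": 4},
    # ... in practice, roughly 20K such rows across 14 models
])

# Mean score per (model, task, domain) cell, then an overall ranking per model.
per_cell = annotations.groupby(["model", "task", "domain"])["score"].mean()
overall  = per_cell.groupby(level="model").mean().sort_values(ascending=False)
print(per_cell)
print(overall)
```

Slicing per_cell by task or domain is what surfaces patterns of the kind reported above, such as weaker scores on editing tasks and on text-heavy domains.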
Explainable evaluation schema with localized error attribution

The authors introduce a structured evaluation framework where human annotators not only provide scores but also tag specific failure modes with textual descriptions and localized masks at both object and segment levels, enabling interpretable diagnosis of model weaknesses beyond opaque numerical metrics.

10 retrieved papers
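To make the localized-error idea in the third contribution concrete, one hypothetical shape for a single annotation record is sketched below. The field names, error-tag vocabulary, and run-length mask encoding are illustrative assumptions, not the benchmark's published schema.

```python
# Hypothetical structure of one explainable annotation record.
# Field names and the error-tag vocabulary are assumptions for illustration;
# the benchmark's actual schema may differ.
from dataclasses import dataclass, field

@dataclass
class ErrorTag:
    level: str        # "object" or "segment"
    category: str     # e.g. "missing_object", "garbled_text", "wrong_attribute"
    description: str  # free-text explanation written by the annotator
    mask_rle: str     # localized mask over the output image, e.g. run-length encoded

@dataclass
class AnnotationRecord:
    condition_set_id: str               # which prompt / reference set was used
    model: str                          # which model produced the output
    overall_score: int                  # coarse quality rating, e.g. 1-5
    error_tags: list = field(default_factory=list)

record = AnnotationRecord(
    condition_set_id="infographic_0421",
    model="model_a",
    overall_score=2,
    error_tags=[
        ErrorTag("segment", "garbled_text", "axis labels are illegible", mask_rle="..."),
        ErrorTag("object", "missing_object", "the legend box was dropped", mask_rle="..."),
    ],
)
print(len(record.error_tags), "tagged errors for", record.model)
```

A record of this shape supports both numeric aggregation (via the overall score) and the interpretable diagnosis described above, since each error carries a type, a textual reason, and a spatial location.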

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

ImagenWorld benchmark with diverse tasks and domains

Ten candidate papers were retrieved and compared for this contribution; none was found to refute it.

Contribution

Large-scale human study revealing model failure modes

Ten candidate papers were retrieved and compared for this contribution; none was found to refute it.

Contribution

Explainable evaluation schema with localized error attribution

Ten candidate papers were retrieved and compared for this contribution; none was found to refute it.