Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?

ICLR 2026 Conference Submission · Anonymous Authors
Text-to-Image Generation · Reasoning · Benchmark
Abstract:

Text-to-image (T2I) generation aims to synthesize images from textual prompts, which jointly specify what must be shown and imply what can be inferred, corresponding to two core capabilities: composition and reasoning. Despite recent advances of T2I models in both composition and reasoning, existing benchmarks remain limited in evaluation: they neither provide comprehensive coverage across and within the two capabilities, nor move beyond low scene density and simple one-to-one reasoning. To address these limitations, we propose T2I-CoReBench, a comprehensive and complex benchmark that evaluates both the composition and reasoning capabilities of T2I models. To ensure comprehensiveness, we structure composition around scene graph elements (instance, attribute, and relation) and reasoning around the philosophical framework of inference (deductive, inductive, and abductive), formulating a 12-dimensional evaluation taxonomy. To increase complexity, driven by inherent real-world complexities, we curate each prompt with higher compositional density for composition and greater reasoning intensity for reasoning. To facilitate fine-grained and reliable evaluation, we pair each evaluation prompt with a checklist of individual yes/no questions that assess each intended element independently. In total, our benchmark comprises 1,080 challenging prompts and around 13,500 checklist questions. Experiments across 28 current T2I models reveal that their composition capability remains limited in high-density compositional scenarios, while their reasoning capability lags even further behind as a critical bottleneck, with all models struggling to infer implicit elements from prompts.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes T2I-CoReBench, a benchmark evaluating both composition and reasoning in text-to-image models through a 12-dimensional taxonomy. It resides in the Reasoning-Driven Generation Benchmarks leaf, which contains only three papers total. This leaf sits within the broader Reasoning Capability Evaluation branch, indicating a relatively sparse research direction compared to the more crowded Compositional Generation Evaluation branch with its eight-paper Comprehensive Multi-Dimensional Compositional Benchmarks cluster. The small sibling set suggests this combined composition-reasoning focus remains underexplored.

The taxonomy reveals neighboring work in Compositional Generation Evaluation, where benchmarks like T2I-CompBench and DALL-EVAL assess attribute binding and spatial relations without emphasizing reasoning. The Reasoning Capability Evaluation branch excludes purely compositional metrics, while the Visual Reasoning Skills Assessment leaf targets object recognition and counting rather than philosophical inference frameworks. T2I-CoReBench bridges these domains by structuring composition around scene graphs and reasoning around deductive, inductive, and abductive inference, occupying a distinct niche between compositional diagnostics and higher-order logical evaluation.

Among the thirty candidates examined (ten per contribution), neither the comprehensive benchmark contribution nor the 12-dimensional taxonomy was clearly refuted. The checklist-based evaluation protocol, however, encountered three refutable candidates among its ten, suggesting prior work has explored similar fine-grained assessment strategies. Given the limited search scope, these statistics reflect top-ranked semantic matches rather than exhaustive coverage. The first two contributions appear more distinctive within the examined literature, while the evaluation protocol overlaps with existing automated assessment methods in the Evaluation Metrics and Automated Assessment cluster.

Based on the thirty-candidate search, the work appears to occupy a meaningful gap between compositional and reasoning evaluation, though the checklist protocol shows some precedent. The taxonomy structure confirms this is a less saturated area compared to compositional benchmarking. Limitations include the restricted search scope and the possibility that related reasoning frameworks exist outside the top-ranked matches examined here.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: evaluating composition and reasoning capabilities of text-to-image models. The field has organized itself around six main branches that reflect both diagnostic and constructive perspectives. Compositional Generation Evaluation focuses on benchmarking how well models handle multi-object scenes, attribute binding, and spatial relationships, with works like DALL-EVAL[1] and T2I-CompBench[2] establishing foundational metrics. Reasoning Capability Evaluation targets higher-level cognitive demands such as logic, counting, and relational inference. Meanwhile, Compositional Generation Enhancement and Reasoning-Enhanced Generation Methods explore training-time and inference-time interventions (ranging from layout-driven approaches like LayoutGPT[8] to attention refinement techniques) that aim to close the gap between prompt complexity and output fidelity. Cross-Modal and Multimodal Capabilities examine how vision-language models integrate textual and visual reasoning, while Semantic and Controllability Analysis investigates the interpretability and fine-grained control of generation processes.

Recent efforts reveal a tension between holistic benchmarking and targeted diagnostic probes. On one hand, comprehensive suites like GenAI-Bench[12] and T2I-CompBench++[4] assess diverse compositional phenomena; on the other, specialized benchmarks such as R2I-Bench[17] and Textual-Visual Logic[37] isolate specific reasoning challenges like numerical constraints or logical consistency.

Easier Painting[0] sits within the Reasoning-Driven Generation Benchmarks cluster, emphasizing systematic evaluation of reasoning-dependent prompts. Compared to R2I-Bench[17], which probes relational and numerical reasoning broadly, and Textual-Visual Logic[37], which stresses logical entailment, Easier Painting[0] appears to prioritize accessible yet rigorous test cases that reveal when models struggle with compositional reasoning under varied prompt structures. This positioning highlights an ongoing question: whether future progress depends more on richer diagnostic frameworks or on scalable enhancement methods that directly improve model architectures.

Claimed Contributions

T2I-CoReBench: A comprehensive and complex benchmark for composition and reasoning

The authors introduce a new benchmark that systematically evaluates text-to-image models across 12 dimensions covering composition (instance, attribute, relation, text rendering) and reasoning (deductive, inductive, abductive). The benchmark comprises 1,080 challenging prompts and around 13,500 checklist questions designed to assess both explicit and implicit visual elements under real-world complexities; a schematic data layout for one benchmark entry is sketched after this block.

10 retrieved papers
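To make the benchmark's structure concrete, here is a minimal sketch of how one evaluation item could be represented, assuming a simple prompt-plus-checklist layout. The class and field names (BenchmarkEntry, ChecklistQuestion, dimension) are illustrative assumptions, not identifiers from the T2I-CoReBench release; only the counts (1,080 prompts and roughly 13,500 questions, i.e. about 12 to 13 questions per prompt) come from the report above.

```python
from dataclasses import dataclass, field

# Hypothetical schema: class and field names are illustrative, not taken
# from the T2I-CoReBench release.
@dataclass
class ChecklistQuestion:
    question: str          # atomic yes/no question about one intended element
    expected: bool = True  # a faithful image should yield "yes"

@dataclass
class BenchmarkEntry:
    prompt: str                        # one of the 1,080 evaluation prompts
    dimension: str                     # one of the 12 taxonomy dimensions
    checklist: list[ChecklistQuestion] = field(default_factory=list)

# ~13,500 questions over 1,080 prompts works out to roughly 12-13
# checklist questions per prompt on average.
entry = BenchmarkEntry(
    prompt="A red umbrella leaning against a wooden bench in the rain",
    dimension="attribute",
    checklist=[
        ChecklistQuestion("Is there an umbrella in the image?"),
        ChecklistQuestion("Is the umbrella red?"),
        ChecklistQuestion("Is the umbrella leaning against a bench?"),
    ],
)
```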
12-dimensional evaluation taxonomy structured around scene graphs and philosophical reasoning

The authors develop a comprehensive taxonomy that organizes composition evaluation using scene graph components and reasoning evaluation using a tripartite philosophical framework. This structured approach ensures systematic coverage of all relevant evaluation dimensions for text-to-image generation; the two axes are sketched in code after this block.

10 retrieved papers
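As a sketch of the taxonomy's two axes, the snippet below lists only the category names actually mentioned in this report (scene graph elements plus text rendering on the composition side, the three inference modes on the reasoning side). The exact partition into 12 dimensions is not enumerated here, so this mapping is an assumption, not the paper's full dimension list.

```python
# Two evaluation axes as described in this report. Only category names
# mentioned above are included; the exact 12-way split is an assumption.
TAXONOMY = {
    "composition": [        # structured around scene graph elements
        "instance",
        "attribute",
        "relation",
        "text rendering",   # named in the contribution above
    ],
    "reasoning": [          # structured around the framework of inference
        "deductive",
        "inductive",
        "abductive",
    ],
}

def axis_of(dimension: str) -> str:
    """Return the parent axis (composition or reasoning) of a dimension."""
    for axis, dims in TAXONOMY.items():
        if dimension in dims:
            return axis
    raise KeyError(f"unknown dimension: {dimension}")

assert axis_of("abductive") == "reasoning"
```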
Checklist-based fine-grained evaluation protocol with independent yes/no questions

The authors design an evaluation methodology where each prompt is accompanied by a checklist of atomic, independent yes/no questions. This enables fine-grained and reliable assessment of whether generated images faithfully capture both explicit compositional elements and implicit reasoning outcomes; a minimal scoring sketch follows this block.

10 retrieved papers
Can Refute
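Because every checklist question is atomic and independent, the protocol lends itself to a simple scoring rule: judge each question against the generated image and average the yes answers. The sketch below assumes per-question boolean judgments are already available (e.g., from a VQA or multimodal LLM judge; the judge's exact form is not specified in this report and is outside the sketch), and the function names are hypothetical.

```python
from collections import defaultdict

def score_prompt(answers: list[bool]) -> float:
    """Per-prompt score: fraction of checklist questions answered 'yes'.

    The booleans would come from an automated judge asked each atomic
    question against the generated image; the judge itself is assumed.
    """
    return sum(answers) / len(answers) if answers else 0.0

def aggregate_by_dimension(results: list[tuple[str, list[bool]]]) -> dict[str, float]:
    """Average per-prompt scores within each taxonomy dimension."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for dimension, answers in results:
        buckets[dimension].append(score_prompt(answers))
    return {dim: sum(scores) / len(scores) for dim, scores in buckets.items()}

# Example: two "attribute" prompts and one "deductive" prompt.
results = [
    ("attribute", [True, True, False]),  # two of three elements rendered
    ("attribute", [True, True, True]),
    ("deductive", [False, True]),        # the implicit element was missed
]
print(aggregate_by_dimension(results))
# {'attribute': 0.833..., 'deductive': 0.5}
```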

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: T2I-CoReBench, a comprehensive and complex benchmark for composition and reasoning

Contribution 2: 12-dimensional evaluation taxonomy structured around scene graphs and philosophical reasoning

Contribution 3: Checklist-based fine-grained evaluation protocol with independent yes/no questions

Each contribution is described in full under Claimed Contributions above.