GIR-Bench: Versatile Benchmark for Generating Images with Reasoning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Evaluation, Unified Multimodal Model, Visual Generation
Abstract:

Unified multimodal models integrate the reasoning capacity of large language models with both image understanding and generation, showing great promise for advanced multimodal intelligence. However, the community still lacks a rigorous reasoning-centric benchmark that systematically evaluates the alignment between understanding and generation and their generalization potential in complex visual tasks. To this end, we introduce GIR-Bench, a comprehensive benchmark that evaluates unified models from three complementary perspectives. First, we explore whether models can consistently leverage the same knowledge for both understanding and generation (GIR-Bench-Uni). Second, we investigate whether models can perform reasoning-centric text-to-image generation that requires applying logical constraints and implicit knowledge to produce faithful visual content (GIR-Bench-T2I). Third, we evaluate whether models can handle multi-step reasoning in editing (GIR-Bench-Edit). For each subset, we carefully design a task-specific evaluation pipeline. This enables fine-grained and interpretable evaluation while mitigating biases from the prevalent MLLM-as-a-Judge paradigm. Extensive evaluations of various unified models and generation-only systems show that, although unified models are more capable on reasoning-driven visual tasks, they still exhibit a persistent gap between understanding and generation. The data and code for GIR-Bench are available at https://anonymous.4open.science/r/GIR-Bench-7E40.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces GIR-Bench, a reasoning-centric benchmark for unified multimodal models that integrate understanding and generation. It resides in the 'Comprehensive Unified Model Evaluation' leaf, which contains four papers total (including this one). This leaf sits within the broader 'Evaluation Frameworks and Benchmarks' branch, indicating a moderately populated research direction focused on systematic assessment of unified models. The taxonomy reveals this is an active but not overcrowded area, with sibling papers like MME-Unify and Uni-MMMU pursuing related but distinct evaluation goals.

The taxonomy shows neighboring leaves addressing specialized evaluation tasks: 'Knowledge and Structured Visual Generation' (four papers), 'Reasoning-Based Editing Evaluation' (two papers), and 'Multi-Image and Multi-Modal Reasoning' (three papers). GIR-Bench bridges these specialized directions by incorporating editing and reasoning-driven generation within a unified framework. The 'Reasoning Paradigms for Multimodal Generation' branch (ten papers across three leaves) provides complementary context on how models perform reasoning, while GIR-Bench focuses on evaluating whether such reasoning translates into faithful generation. This positioning suggests the work connects evaluation methodology with reasoning paradigms.

Among thirty candidates examined, the contribution-level analysis reveals mixed novelty signals. The core benchmark contribution (GIR-Bench) shows no clear refutation across ten candidates, suggesting relative novelty in its comprehensive reasoning-centric design. However, the task-specific evaluation pipelines contribution encountered two refutable candidates among ten examined, indicating some overlap with prior evaluation methodologies. The three-perspective framework (Uni/T2I/Edit) also shows no refutation across ten candidates. These statistics reflect a limited search scope and suggest the benchmark's integrated approach may offer incremental advances over existing evaluation protocols.

Based on the thirty-candidate search, GIR-Bench appears to occupy a moderately novel position within comprehensive unified model evaluation. The taxonomy context indicates this is an evolving research direction with established neighbors but room for specialized contributions. The analysis does not cover exhaustive prior work in specialized evaluation domains or reasoning paradigms, leaving open questions about deeper connections to task-specific benchmarks outside the top-thirty semantic matches.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 2

Research Landscape Overview

Core task: Evaluating reasoning-driven image generation in unified multimodal models.

The field has evolved around four main branches that together capture the landscape of unified multimodal systems. Unified Multimodal Model Architectures and Training focuses on building end-to-end frameworks that seamlessly integrate vision and language modalities, exemplified by works like DreamLLM[25] and Emu[42]. Reasoning Paradigms for Multimodal Generation explores how models can leverage intermediate reasoning steps to improve generation quality, whether through chain-of-thought mechanisms as in Multimodal Chain-of-Thought[3], visual thinking processes like Visualization-of-Thought[9], or autonomous imagination strategies such as Autonomous Imagination[4]. Evaluation Frameworks and Benchmarks addresses the critical need for systematic assessment, with comprehensive suites like MME-Unify[17] and Uni-MMMU[27] measuring diverse capabilities. Finally, Specialized Generation and Understanding Tasks examines domain-specific challenges ranging from image editing to visual reasoning under constrained settings.

A particularly active tension emerges between holistic evaluation approaches and targeted reasoning assessments. While broad benchmarks such as MME-Unify[17] and Uni-MMMU[27] aim to capture general-purpose multimodal competence across understanding and generation, there is growing interest in probing whether models genuinely reason before generating images or merely pattern-match from training data. GIR-Bench[0] situates itself within this comprehensive evaluation cluster, emphasizing reasoning-driven generation specifically. Compared to neighbors like Uni-MMMU[27], which evaluates understanding and generation more broadly, and RealUnify[40], which stresses real-world applicability, GIR-Bench[0] zooms in on the interplay between explicit reasoning processes and image synthesis quality. This focus reflects an open question across the field: how to rigorously measure whether intermediate reasoning steps, visual or textual, actually enhance generation fidelity and controllability rather than serving as post-hoc rationalizations.

Claimed Contributions

GIR-Bench: A comprehensive reasoning-centric benchmark for unified multimodal models

The authors propose GIR-Bench, a new benchmark designed to systematically evaluate unified multimodal models' alignment between understanding and generation capabilities, and their generalization in complex visual tasks across three distinct evaluation perspectives.

10 retrieved papers

Task-specific evaluation pipelines beyond MLLM-as-a-Judge

The authors develop specialized evaluation pipelines tailored to each task, enabling fine-grained and interpretable assessment while mitigating biases inherent in the prevalent MLLM-as-a-Judge evaluation approach (a minimal illustrative sketch follows the contributions list below).

10 retrieved papers
Can Refute

Three-perspective evaluation framework: GIR-Bench-Uni, GIR-Bench-T2I, and GIR-Bench-Edit

The authors construct three complementary evaluation components: GIR-Bench-Uni evaluates knowledge consistency between understanding and generation, GIR-Bench-T2I assesses reasoning-centric text-to-image generation, and GIR-Bench-Edit measures multi-step reasoning in editing tasks.

10 retrieved papers
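
To make the contrast with MLLM-as-a-Judge concrete, here is a minimal, hypothetical sketch of a programmatic check for a counting-style reasoning prompt of the kind the T2I subset targets. The CountingTask fields, the injected detector callable, and the exact scoring rule are illustrative assumptions, not the benchmark's actual pipeline.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class CountingTask:
    prompt: str          # reasoning-centric T2I prompt given to the model
    target_object: str   # object whose instances must be counted in the output image
    expected_count: int  # count implied by the prompt's hidden reasoning step

def score_counting_task(
    task: CountingTask,
    image_path: str,
    count_objects: Callable[[str, str], int],
) -> float:
    """Programmatic, interpretable score: 1.0 if the generated image contains exactly
    the expected number of target objects, otherwise 0.0. `count_objects` would be an
    open-vocabulary detector in practice; it is injected here so the sketch stays
    runnable without committing to any particular model."""
    detected = count_objects(image_path, task.target_object)
    return 1.0 if detected == task.expected_count else 0.0

if __name__ == "__main__":
    task = CountingTask(
        prompt="Draw one apple for each leg of a spider.",  # implies 8 apples
        target_object="apple",
        expected_count=8,
    )
    dummy_detector = lambda image, label: 8  # stand-in for a real detector
    print(score_counting_task(task, "generated.png", dummy_detector))  # -> 1.0
```

An MLLM-as-a-Judge pipeline would instead prompt a multimodal LLM with the image and a rubric and parse a holistic rating; a programmatic check like the one above is narrower but directly interpretable, which is the trade-off the authors appeal to.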

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

GIR-Bench: A comprehensive reasoning-centric benchmark for unified multimodal models

The authors propose GIR-Bench, a new benchmark designed to systematically evaluate unified multimodal models' alignment between understanding and generation capabilities, and their generalization in complex visual tasks across three distinct evaluation perspectives.

Contribution

Task-specific evaluation pipelines beyond MLLM-as-a-Judge

The authors develop specialized evaluation pipelines tailored for each task that enable fine-grained and interpretable assessment while mitigating biases inherent in the prevalent MLLM-as-a-Judge evaluation approach.

Contribution

Three-perspective evaluation framework: GIR-Bench-Uni, GIR-Bench-T2I, and GIR-Bench-Edit

The authors construct three complementary evaluation components: GIR-Bench-Uni evaluates knowledge consistency between understanding and generation, GIR-Bench-T2I assesses reasoning-centric text-to-image generation, and GIR-Bench-Edit measures multi-step reasoning in editing tasks.
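
As a rough illustration of what the GIR-Bench-Uni consistency check might involve, the sketch below asks the same model to answer a knowledge question (understanding) and to render an image from a prompt requiring that same knowledge (generation), then verifies both against the ground-truth answer. All names (UniItem, answer_fn, generate_fn, image_depicts) and the scoring rule are assumptions for illustration, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class UniItem:
    question: str    # knowledge question, e.g. "What is the largest planet in the solar system?"
    answer: str      # ground-truth answer, e.g. "Jupiter"
    t2i_prompt: str  # generation prompt requiring the same knowledge,
                     # e.g. "Generate an image of the largest planet in the solar system."

def uni_consistency(
    item: UniItem,
    answer_fn: Callable[[str], str],            # unified model in understanding mode -> text answer
    generate_fn: Callable[[str], str],          # same model in generation mode -> path to an image
    image_depicts: Callable[[str, str], bool],  # external verifier: does the image show `answer`?
) -> Dict[str, bool]:
    """Checks whether knowledge the model can state in understanding mode also
    appears in what it generates. Averaging `understood` and `generated` over a
    dataset, and comparing the two rates, exposes the understanding-generation
    gap the benchmark is probing."""
    understood = answer_fn(item.question).strip().lower() == item.answer.strip().lower()
    generated = image_depicts(generate_fn(item.t2i_prompt), item.answer)
    return {"understood": understood, "generated": generated, "aligned": understood and generated}
```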