GIR-Bench: Versatile Benchmark for Generating Images with Reasoning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Evaluation, Unified Multimodal Model, Visual Generation
Abstract:

Unified multimodal models integrate the reasoning capacity of large language models with both image understanding and generation, showing great promise for advanced multimodal intelligence. However, the community still lacks a rigorous reasoning-centric benchmark that systematically evaluates the alignment between understanding and generation and their generalization potential in complex visual tasks. To this end, we introduce GIR-Bench, a comprehensive benchmark that evaluates unified models from three complementary perspectives. First, we explore whether models can consistently leverage the same knowledge for both understanding and generation (GIR-Bench-Uni). Second, we investigate whether models can perform reasoning-centric text-to-image generation that requires applying logical constraints and implicit knowledge to produce faithful visual content (GIR-Bench-T2I). Third, we evaluate whether models can handle multi-step reasoning in editing (GIR-Bench-Edit). For each subset, we carefully design a task-specific evaluation pipeline. This enables fine-grained and interpretable evaluation while mitigating biases from the prevalent MLLM-as-a-Judge paradigm. Extensive evaluations of various unified models and generation-only systems show that, although unified models are more capable on reasoning-driven visual tasks, they still exhibit a persistent gap between understanding and generation. The data and code for GIR-Bench are available at https://anonymous.4open.science/r/GIR-Bench-7E40.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces GIR-Bench, a reasoning-centric benchmark for unified multimodal models that integrate understanding and generation. It resides in the 'Comprehensive Unified Model Evaluation' leaf, which contains four papers total (including this one). This leaf sits within the broader 'Evaluation Frameworks and Benchmarks' branch, indicating a moderately populated research direction focused on systematic assessment of unified models. The taxonomy reveals this is an active but not overcrowded area, with sibling papers like MME-Unify and Uni-MMMU pursuing related but distinct evaluation goals.

The taxonomy shows neighboring leaves addressing specialized evaluation tasks: 'Knowledge and Structured Visual Generation' (four papers), 'Reasoning-Based Editing Evaluation' (two papers), and 'Multi-Image and Multi-Modal Reasoning' (three papers). GIR-Bench bridges these specialized directions by incorporating editing and reasoning-driven generation within a unified framework. The 'Reasoning Paradigms for Multimodal Generation' branch (ten papers across three leaves) provides complementary context on how models perform reasoning, while GIR-Bench focuses on evaluating whether such reasoning translates into faithful generation. This positioning suggests the work connects evaluation methodology with reasoning paradigms.

Among thirty candidates examined, the contribution-level analysis reveals mixed novelty signals. The core benchmark contribution (GIR-Bench) shows no clear refutation across ten candidates, suggesting relative novelty in its comprehensive reasoning-centric design. However, the task-specific evaluation pipelines contribution encountered two refutable candidates among ten examined, indicating some overlap with prior evaluation methodologies. The three-perspective framework (Uni/T2I/Edit) also shows no refutation across ten candidates. These statistics reflect a limited search scope and suggest the benchmark's integrated approach may offer incremental advances over existing evaluation protocols.

Based on the thirty-candidate search, GIR-Bench appears to occupy a moderately novel position within comprehensive unified model evaluation. The taxonomy context indicates this is an evolving research direction with established neighbors but room for specialized contributions. The analysis does not cover exhaustive prior work in specialized evaluation domains or reasoning paradigms, leaving open questions about deeper connections to task-specific benchmarks outside the top-thirty semantic matches.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 2

Research Landscape Overview

Core task: Evaluating reasoning-driven image generation in unified multimodal models.

The field has evolved around four main branches that together capture the landscape of unified multimodal systems. Unified Multimodal Model Architectures and Training focuses on building end-to-end frameworks that seamlessly integrate vision and language modalities, exemplified by works like DreamLLM[25] and Emu[42]. Reasoning Paradigms for Multimodal Generation explores how models can leverage intermediate reasoning steps to improve generation quality, whether through chain-of-thought mechanisms as in Multimodal Chain-of-Thought[3], visual thinking processes like Visualization-of-Thought[9], or autonomous imagination strategies such as Autonomous Imagination[4]. Evaluation Frameworks and Benchmarks addresses the critical need for systematic assessment, with comprehensive suites like MME-Unify[17] and Uni-MMMU[27] measuring diverse capabilities. Finally, Specialized Generation and Understanding Tasks examines domain-specific challenges ranging from image editing to visual reasoning under constrained settings.

A particularly active tension emerges between holistic evaluation approaches and targeted reasoning assessments. While broad benchmarks such as MME-Unify[17] and Uni-MMMU[27] aim to capture general-purpose multimodal competence across understanding and generation, there is growing interest in probing whether models genuinely reason before generating images or merely pattern-match from training data. GIR-Bench[0] situates itself within this comprehensive evaluation cluster, emphasizing reasoning-driven generation specifically. Compared to neighbors like Uni-MMMU[27], which evaluates understanding and generation more broadly, and RealUnify[40], which stresses real-world applicability, GIR-Bench[0] zooms in on the interplay between explicit reasoning processes and image synthesis quality. This focus reflects an open question across the field: how to rigorously measure whether intermediate reasoning steps, visual or textual, actually enhance generation fidelity and controllability rather than serving as post-hoc rationalizations.

Claimed Contributions

GIR-Bench: A comprehensive reasoning-centric benchmark for unified multimodal models

The authors propose GIR-Bench, a new benchmark designed to systematically evaluate unified multimodal models' alignment between understanding and generation capabilities, and their generalization in complex visual tasks across three distinct evaluation perspectives.

10 retrieved papers

Task-specific evaluation pipelines beyond MLLM-as-a-Judge

The authors develop specialized evaluation pipelines tailored to each task, enabling fine-grained and interpretable assessment while mitigating biases inherent in the prevalent MLLM-as-a-Judge evaluation approach (a minimal illustrative sketch follows the contributions list below).

10 retrieved papers
Can Refute

Three-perspective evaluation framework: GIR-Bench-Uni, GIR-Bench-T2I, and GIR-Bench-Edit

The authors construct three complementary evaluation components: GIR-Bench-Uni evaluates knowledge consistency between understanding and generation, GIR-Bench-T2I assesses reasoning-centric text-to-image generation, and GIR-Bench-Edit measures multi-step reasoning in editing tasks.

10 retrieved papers
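
To make the contrast with MLLM-as-a-Judge concrete, here is a minimal, hypothetical sketch of a programmatic check for a counting-style reasoning prompt of the kind the T2I subset targets. The CountingTask fields, the injected detector callable, and the exact scoring rule are illustrative assumptions, not the benchmark's actual pipeline.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class CountingTask:
    prompt: str          # reasoning-centric T2I prompt given to the model
    target_object: str   # object whose instances must be counted in the output image
    expected_count: int  # count implied by the prompt's hidden reasoning step

def score_counting_task(
    task: CountingTask,
    image_path: str,
    count_objects: Callable[[str, str], int],
) -> float:
    """Programmatic, interpretable score: 1.0 if the generated image contains exactly
    the expected number of target objects, otherwise 0.0. `count_objects` would be an
    open-vocabulary detector in practice; it is injected here so the sketch stays
    runnable without committing to any particular model."""
    detected = count_objects(image_path, task.target_object)
    return 1.0 if detected == task.expected_count else 0.0

if __name__ == "__main__":
    task = CountingTask(
        prompt="Draw one apple for each leg of a spider.",  # implies 8 apples
        target_object="apple",
        expected_count=8,
    )
    dummy_detector = lambda image, label: 8  # stand-in for a real detector
    print(score_counting_task(task, "generated.png", dummy_detector))  # -> 1.0
```

An MLLM-as-a-Judge pipeline would instead prompt a multimodal LLM with the image and a rubric and parse a holistic rating; a programmatic check like the one above is narrower but directly interpretable, which is the trade-off the authors appeal to.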

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

GIR-Bench: A comprehensive reasoning-centric benchmark for unified multimodal models

The authors propose GIR-Bench, a new benchmark designed to systematically evaluate unified multimodal models' alignment between understanding and generation capabilities, and their generalization in complex visual tasks across three distinct evaluation perspectives.

Contribution

Task-specific evaluation pipelines beyond MLLM-as-a-Judge

The authors develop specialized evaluation pipelines tailored for each task that enable fine-grained and interpretable assessment while mitigating biases inherent in the prevalent MLLM-as-a-Judge evaluation approach.

Contribution

Three-perspective evaluation framework: GIR-Bench-Uni, GIR-Bench-T2I, and GIR-Bench-Edit

The authors construct three complementary evaluation components: GIR-Bench-Uni evaluates knowledge consistency between understanding and generation, GIR-Bench-T2I assesses reasoning-centric text-to-image generation, and GIR-Bench-Edit measures multi-step reasoning in editing tasks.
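
As a rough illustration of what the GIR-Bench-Uni consistency check might involve, the sketch below asks the same model to answer a knowledge question (understanding) and to render an image from a prompt requiring that same knowledge (generation), then verifies both against the ground-truth answer. All names (UniItem, answer_fn, generate_fn, image_depicts) and the scoring rule are assumptions for illustration, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class UniItem:
    question: str    # knowledge question, e.g. "What is the largest planet in the solar system?"
    answer: str      # ground-truth answer, e.g. "Jupiter"
    t2i_prompt: str  # generation prompt requiring the same knowledge,
                     # e.g. "Generate an image of the largest planet in the solar system."

def uni_consistency(
    item: UniItem,
    answer_fn: Callable[[str], str],            # unified model in understanding mode -> text answer
    generate_fn: Callable[[str], str],          # same model in generation mode -> path to an image
    image_depicts: Callable[[str, str], bool],  # external verifier: does the image show `answer`?
) -> Dict[str, bool]:
    """Checks whether knowledge the model can state in understanding mode also
    appears in what it generates. Averaging `understood` and `generated` over a
    dataset, and comparing the two rates, exposes the understanding-generation
    gap the benchmark is probing."""
    understood = answer_fn(item.question).strip().lower() == item.answer.strip().lower()
    generated = image_depicts(generate_fn(item.t2i_prompt), item.answer)
    return {"understood": understood, "generated": generated, "aligned": understood and generated}
```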