Draw-In-Mind: Rebalancing Designer-Painter Roles in Unified Multimodal Models Benefits Image Editing

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Image Editing, Image Generation, Unified Multimodal Model, Multimodal
Abstract:

In recent years, integrating multimodal understanding and generation into a single unified model has emerged as a promising paradigm. While this approach achieves strong results in text-to-image (T2I) generation, it still struggles with precise image editing. We attribute this limitation to an imbalanced division of responsibilities. The understanding module primarily functions as a translator that encodes user instructions into semantic conditions, while the generation module must simultaneously act as designer and painter, inferring the original layout, identifying the target editing region, and rendering the new content. This imbalance is counterintuitive because the understanding module is typically trained with several times more data on complex reasoning tasks than the generation module. To address this issue, we introduce Draw-In-Mind (DIM), a dataset comprising two complementary subsets: (i) DIM-T2I, containing 14M long-context image–text pairs to enhance complex instruction comprehension; and (ii) DIM-Edit, consisting of 233K chain-of-thought imaginations generated by GPT-4o, serving as explicit design blueprints for image edits. We connect a frozen Qwen2.5-VL-3B with a trainable SANA1.5-1.6B via a lightweight two-layer MLP, and train it on the proposed DIM dataset, resulting in DIM-4.6B-T2I/Edit. Despite its modest parameter scale, DIM-4.6B-Edit achieves SOTA or competitive performance on the ImgEdit and GEdit-Bench benchmarks, outperforming much larger models such as UniWorld-V1 and Step1X-Edit. These findings demonstrate that explicitly assigning the design responsibility to the understanding module provides significant benefits for image editing. Our dataset and models will be publicly available.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note also that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: instruction-guided image editing. The field has evolved into a rich landscape organized around several major branches. Diffusion-Based Editing Methods form a dense cluster, encompassing training-free approaches like Plug-and-Play Diffusion[4], instruction-tuned models such as InstructPix2Pix[7], and increasingly sophisticated MLLM-enhanced systems that leverage multimodal large language models for better instruction understanding. GAN-Based Editing Methods include earlier works like TediGAN[6] and StyleCLIP[14], which manipulate latent spaces for text-driven edits. Autoregressive and Alternative Generative Paradigms explore non-diffusion architectures, while Specialized Editing Domains tackle specific applications like 3D scene editing (Instruct-NeRF2NeRF[31]) or human-centric generation (Text2Human[15]). Region-Based and Multi-Object Editing addresses localized modifications, and Training-Free and Optimization-Based Editing focuses on methods that avoid costly retraining. Datasets and Benchmarks such as MagicBrush[2], HQ-Edit[3], and I2EBench[21] provide evaluation standards, while Surveys and Reviews like Instruction Editing Review[1] and Instruction Editing Survey[38] synthesize progress across these branches. Recent work has increasingly integrated multimodal reasoning to bridge the gap between natural language instructions and precise visual edits. Draw-In-Mind[0] exemplifies this trend within the MLLM-Enhanced Diffusion Editing cluster, emphasizing how large language models can decompose complex instructions into actionable editing steps. It shares this focus with SmartEdit[11] and InsightEdit[16], which similarly exploit reasoning capabilities to improve instruction fidelity. In contrast, methods like Multimodal Guided Editing[5] and LLM On-the-Fly Editing[18] explore broader multimodal integration or dynamic planning strategies. 
A key tension across these branches involves balancing editability with preservation of unaffected regions, a challenge that training-free methods address differently than fine-tuned models like Emu Edit[36] or UltraEdit[9]. Draw-In-Mind[0] positions itself among works that prioritize interpretable, reasoning-driven pipelines, distinguishing its approach from purely end-to-end learned systems while addressing similar goals of high-fidelity, instruction-aligned editing.

Claimed Contributions

Draw-In-Mind (DIM) dataset with two complementary subsets

The authors introduce a unified dataset called Draw-In-Mind (DIM) that contains two parts: DIM-T2I with 14 million long-context image-text pairs for complex instruction comprehension, and DIM-Edit with 233K chain-of-thought imaginations that serve as explicit design blueprints for image editing tasks.

10 retrieved papers
Rebalanced designer-painter responsibility division paradigm

The authors identify and address an imbalanced division of responsibilities in existing image editing models, where the generation module is overburdened with both design and painting tasks. They propose shifting the design responsibility to the understanding module while allowing the generation module to focus exclusively on painting.

10 retrieved papers
Verdict: Can Refute
DIM-4.6B-T2I/Edit unified multimodal model

The authors develop DIM-4.6B-T2I/Edit by connecting a frozen Qwen2.5-VL-3B multimodal model with a trainable SANA1.5-1.6B diffusion decoder using a simple two-layer MLP connector, trained on the DIM dataset to achieve state-of-the-art image editing performance with significantly fewer parameters than existing models.
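The report describes the connector only as a "simple two-layer MLP" bridging the frozen understanding module and the trainable diffusion decoder. The sketch below illustrates what such a projection typically looks like; the hidden sizes (2048 for Qwen2.5-VL-3B, 2240 for SANA1.5-1.6B), the ReLU activation, and the class name are illustrative assumptions, not details confirmed by the paper.

```python
import numpy as np

class TwoLayerMLPConnector:
    """Minimal sketch: project frozen MLLM hidden states into the
    diffusion decoder's conditioning space. Dimensions and activation
    are assumptions for illustration only."""

    def __init__(self, in_dim=2048, hidden_dim=2048, out_dim=2240, seed=0):
        rng = np.random.default_rng(seed)
        # Scaled random init stands in for trained weights.
        self.w1 = rng.standard_normal((in_dim, hidden_dim)) * in_dim ** -0.5
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.standard_normal((hidden_dim, out_dim)) * hidden_dim ** -0.5
        self.b2 = np.zeros(out_dim)

    def __call__(self, h):
        # h: (seq_len, in_dim) hidden states from the frozen understanding module,
        # e.g. the tokens carrying the chain-of-thought "design blueprint".
        x = h @ self.w1 + self.b1
        x = np.where(x > 0.0, x, 0.0)  # ReLU; a GELU would be another common choice
        return x @ self.w2 + self.b2   # (seq_len, out_dim) conditions for the painter

connector = TwoLayerMLPConnector()
tokens = np.zeros((5, 2048))  # placeholder blueprint token states
cond = connector(tokens)
print(cond.shape)  # (5, 2240)
```

In this division of labor, only the connector and the diffusion decoder receive gradients; the understanding module stays frozen and supplies the design signal.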

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Draw-In-Mind (DIM) dataset with two complementary subsets

The authors introduce a unified dataset called Draw-In-Mind (DIM) that contains two parts: DIM-T2I with 14 million long-context image-text pairs for complex instruction comprehension, and DIM-Edit with 233K chain-of-thought imaginations that serve as explicit design blueprints for image editing tasks.

Contribution

Rebalanced designer-painter responsibility division paradigm

The authors identify and address an imbalanced division of responsibilities in existing image editing models, where the generation module is overburdened with both design and painting tasks. They propose shifting the design responsibility to the understanding module while allowing the generation module to focus exclusively on painting.

Contribution

DIM-4.6B-T2I/Edit unified multimodal model

The authors develop DIM-4.6B-T2I/Edit by connecting a frozen Qwen2.5-VL-3B multimodal model with a trainable SANA1.5-1.6B diffusion decoder using a simple two-layer MLP connector, trained on the DIM dataset to achieve state-of-the-art image editing performance with significantly fewer parameters than existing models.