Draw-In-Mind: Rebalancing Designer-Painter Roles in Unified Multimodal Models Benefits Image Editing

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Image Editing, Image Generation, Unified Multimodal Model, Multimodal
Abstract:

In recent years, integrating multimodal understanding and generation into a single unified model has emerged as a promising paradigm. While this approach achieves strong results in text-to-image (T2I) generation, it still struggles with precise image editing. We attribute this limitation to an imbalanced division of responsibilities. The understanding module primarily functions as a translator that encodes user instructions into semantic conditions, while the generation module must simultaneously act as designer and painter, inferring the original layout, identifying the target editing region, and rendering the new content. This imbalance is counterintuitive because the understanding module is typically trained with several times more data on complex reasoning tasks than the generation module. To address this issue, we introduce Draw-In-Mind (DIM), a dataset comprising two complementary subsets: (i) DIM-T2I, containing 14M long-context image–text pairs to enhance complex instruction comprehension; and (ii) DIM-Edit, consisting of 233K chain-of-thought imaginations generated by GPT-4o, serving as explicit design blueprints for image edits. We connect a frozen Qwen2.5-VL-3B with a trainable SANA1.5-1.6B via a lightweight two-layer MLP, and train it on the proposed DIM dataset, resulting in DIM-4.6B-T2I/Edit. Despite its modest parameter scale, DIM-4.6B-Edit achieves SOTA or competitive performance on the ImgEdit and GEdit-Bench benchmarks, outperforming much larger models such as UniWorld-V1 and Step1X-Edit. These findings demonstrate that explicitly assigning the design responsibility to the understanding module provides significant benefits for image editing. Our dataset and models will be publicly available.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note also that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: instruction-guided image editing. The field has evolved into a rich landscape organized around several major branches. Diffusion-Based Editing Methods form a dense cluster, encompassing training-free approaches like Plug-and-Play Diffusion[4], instruction-tuned models such as InstructPix2Pix[7], and increasingly sophisticated MLLM-enhanced systems that leverage multimodal large language models for better instruction understanding. GAN-Based Editing Methods include earlier works like TediGAN[6] and StyleCLIP[14], which manipulate latent spaces for text-driven edits. Autoregressive and Alternative Generative Paradigms explore non-diffusion architectures, while Specialized Editing Domains tackle specific applications like 3D scene editing (Instruct-NeRF2NeRF[31]) or human-centric generation (Text2Human[15]). Region-Based and Multi-Object Editing addresses localized modifications, and Training-Free and Optimization-Based Editing focuses on methods that avoid costly retraining. Datasets and Benchmarks such as MagicBrush[2], HQ-Edit[3], and I2EBench[21] provide evaluation standards, while Surveys and Reviews like Instruction Editing Review[1] and Instruction Editing Survey[38] synthesize progress across these branches. Recent work has increasingly integrated multimodal reasoning to bridge the gap between natural language instructions and precise visual edits. Draw-In-Mind[0] exemplifies this trend within the MLLM-Enhanced Diffusion Editing cluster, emphasizing how large language models can decompose complex instructions into actionable editing steps. It shares this focus with SmartEdit[11] and InsightEdit[16], which similarly exploit reasoning capabilities to improve instruction fidelity. In contrast, methods like Multimodal Guided Editing[5] and LLM On-the-Fly Editing[18] explore broader multimodal integration or dynamic planning strategies. 
A key tension across these branches involves balancing editability with preservation of unaffected regions, a challenge that training-free methods address differently than fine-tuned models like Emu Edit[36] or UltraEdit[9]. Draw-In-Mind[0] positions itself among works that prioritize interpretable, reasoning-driven pipelines, distinguishing its approach from purely end-to-end learned systems while addressing similar goals of high-fidelity, instruction-aligned editing.

Claimed Contributions

Draw-In-Mind (DIM) dataset with two complementary subsets

The authors introduce a unified dataset called Draw-In-Mind (DIM) that contains two parts: DIM-T2I with 14 million long-context image-text pairs for complex instruction comprehension, and DIM-Edit with 233K chain-of-thought imaginations that serve as explicit design blueprints for image editing tasks.

10 retrieved papers
Rebalanced designer-painter responsibility division paradigm

The authors identify and address an imbalanced division of responsibilities in existing image editing models, where the generation module is overburdened with both design and painting tasks. They propose shifting the design responsibility to the understanding module while allowing the generation module to focus exclusively on painting.

10 retrieved papers
Verdict: Can Refute
DIM-4.6B-T2I/Edit unified multimodal model

The authors develop DIM-4.6B-T2I/Edit by connecting a frozen Qwen2.5-VL-3B multimodal model with a trainable SANA1.5-1.6B diffusion decoder using a simple two-layer MLP connector, trained on the DIM dataset to achieve state-of-the-art image editing performance with significantly fewer parameters than existing models.
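The report describes the connector only as a "simple two-layer MLP" bridging the frozen understanding module and the trainable diffusion decoder. The sketch below illustrates what such a projection typically looks like; the hidden sizes (2048 for Qwen2.5-VL-3B, 2240 for SANA1.5-1.6B), the ReLU activation, and the class name are illustrative assumptions, not details confirmed by the paper.

```python
import numpy as np

class TwoLayerMLPConnector:
    """Minimal sketch: project frozen MLLM hidden states into the
    diffusion decoder's conditioning space. Dimensions and activation
    are assumptions for illustration only."""

    def __init__(self, in_dim=2048, hidden_dim=2048, out_dim=2240, seed=0):
        rng = np.random.default_rng(seed)
        # Scaled random init stands in for trained weights.
        self.w1 = rng.standard_normal((in_dim, hidden_dim)) * in_dim ** -0.5
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.standard_normal((hidden_dim, out_dim)) * hidden_dim ** -0.5
        self.b2 = np.zeros(out_dim)

    def __call__(self, h):
        # h: (seq_len, in_dim) hidden states from the frozen understanding module,
        # e.g. the tokens carrying the chain-of-thought "design blueprint".
        x = h @ self.w1 + self.b1
        x = np.where(x > 0.0, x, 0.0)  # ReLU; a GELU would be another common choice
        return x @ self.w2 + self.b2   # (seq_len, out_dim) conditions for the painter

connector = TwoLayerMLPConnector()
tokens = np.zeros((5, 2048))  # placeholder blueprint token states
cond = connector(tokens)
print(cond.shape)  # (5, 2240)
```

In this division of labor, only the connector and the diffusion decoder receive gradients; the understanding module stays frozen and supplies the design signal.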

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Draw-In-Mind (DIM) dataset with two complementary subsets

The authors introduce a unified dataset called Draw-In-Mind (DIM) that contains two parts: DIM-T2I with 14 million long-context image-text pairs for complex instruction comprehension, and DIM-Edit with 233K chain-of-thought imaginations that serve as explicit design blueprints for image editing tasks.

Contribution

Rebalanced designer-painter responsibility division paradigm

The authors identify and address an imbalanced division of responsibilities in existing image editing models, where the generation module is overburdened with both design and painting tasks. They propose shifting the design responsibility to the understanding module while allowing the generation module to focus exclusively on painting.

Contribution

DIM-4.6B-T2I/Edit unified multimodal model

The authors develop DIM-4.6B-T2I/Edit by connecting a frozen Qwen2.5-VL-3B multimodal model with a trainable SANA1.5-1.6B diffusion decoder using a simple two-layer MLP connector, trained on the DIM dataset to achieve state-of-the-art image editing performance with significantly fewer parameters than existing models.