Draw-In-Mind: Rebalancing Designer-Painter Roles in Unified Multimodal Models Benefits Image Editing
Research Landscape Overview
Claimed Contributions
The authors introduce Draw-In-Mind (DIM), a unified dataset with two complementary parts: DIM-T2I, 14 million long-context image-text pairs for complex instruction comprehension, and DIM-Edit, 233K chain-of-thought imaginations that serve as explicit design blueprints for image editing.
The authors identify an imbalanced division of responsibilities in existing image editing models: the generation module is burdened with both designing the edit and painting it. They propose shifting the design responsibility to the understanding module, leaving the generation module to focus exclusively on painting.
The authors develop DIM-4.6B-T2I/Edit by connecting a frozen Qwen2.5-VL-3B multimodal model to a trainable SANA1.5-1.6B diffusion decoder through a simple two-layer MLP connector. Trained on the DIM dataset, the model achieves state-of-the-art image editing performance with significantly fewer parameters than existing models.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[5] Guiding instruction-based image editing via multimodal large language models
[11] SmartEdit: Exploring complex instruction-based image editing with multimodal large language models
[16] InsightEdit: Towards better instruction following for image editing
[18] Leveraging LLMs for On-the-Fly Instruction Guided Image Editing
Contribution Analysis
Detailed comparisons for each claimed contribution
Draw-In-Mind (DIM) dataset with two complementary subsets
The authors introduce Draw-In-Mind (DIM), a unified dataset with two complementary parts: DIM-T2I, 14 million long-context image-text pairs for complex instruction comprehension, and DIM-Edit, 233K chain-of-thought imaginations that serve as explicit design blueprints for image editing.
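To make the role of a chain-of-thought imagination concrete, here is a minimal sketch of what a DIM-Edit training record and its prompt assembly could look like. The field names (`source_image`, `cot_imagination`, etc.) and the helper `build_designer_prompt` are illustrative assumptions, not the paper's actual schema or API.

```python
# Hypothetical sketch of a DIM-Edit record; field names are assumptions,
# not the dataset's actual schema.
dim_edit_sample = {
    "source_image": "images/000123_src.png",   # image to be edited
    "instruction": "Make the sky look like a sunset.",
    # Chain-of-thought "imagination": an explicit design blueprint that
    # spells out what the edited image should contain before any painting.
    "cot_imagination": (
        "The sky shifts to warm orange and pink gradients near the horizon; "
        "clouds pick up golden rims; the foreground keeps its original "
        "content but gains a warmer color cast."
    ),
    "target_image": "images/000123_tgt.png",   # ground-truth edited image
}

def build_designer_prompt(sample: dict) -> str:
    """Assemble the text a designer module could condition on (illustrative)."""
    return (
        f"Instruction: {sample['instruction']}\n"
        f"Blueprint: {sample['cot_imagination']}"
    )
```

The key idea the record illustrates is that the blueprint is stored explicitly, so the painting module never has to infer the design on its own.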
[51] Learning interleaved image-text comprehension in vision-language large models
[52] TheaterGen: Character Management with LLM for Consistent Multi-turn Image Generation
[53] Weaving context across images: Improving vision-language models through focus-centric visual chains
[54] MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer
[55] ImageChain: Advancing sequential image-to-text reasoning in multimodal large language models
[56] MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs
[57] Diffusion augmented retrieval: A training-free approach to interactive text-to-image retrieval
[58] DialogDraw: Image Generation and Editing System Based on Multi-Turn Dialogue
[59] FTII-Bench: A Comprehensive Multimodal Benchmark for Flow Text with Image Insertion
[60] Graph-based captioning: Enhancing visual descriptions by interconnecting region captions
Rebalanced designer-painter responsibility division paradigm
The authors identify an imbalanced division of responsibilities in existing image editing models: the generation module is burdened with both designing the edit and painting it. They propose shifting the design responsibility to the understanding module, leaving the generation module to focus exclusively on painting.
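The rebalanced pipeline can be sketched in a few lines. The callables `understanding_module` and `diffusion_decoder` below are hypothetical stand-ins for the real components (an MLLM designer such as Qwen2.5-VL and a diffusion painter such as SANA1.5); the point is only the control flow: design first, then paint from the blueprint.

```python
def edit_image(image, instruction, understanding_module, diffusion_decoder):
    # 1) Designer: the understanding module reasons about the edit and
    #    emits an explicit design blueprint (a chain-of-thought imagination).
    blueprint = understanding_module(image, instruction)
    # 2) Painter: the generation module only renders, conditioned on the
    #    blueprint, instead of having to both design and paint.
    return diffusion_decoder(image, blueprint)

# Toy usage with stand-in callables (illustrative, not the real models):
designer = lambda image, instruction: f"blueprint for: {instruction}"
painter = lambda image, condition: {"image": image, "painted_from": condition}

result = edit_image("photo.png", "add a red hat", designer, painter)
```

The design choice this sketch highlights is that the painter's conditioning signal is an already-worked-out blueprint rather than the raw instruction, which is the paper's proposed rebalancing.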
[62] Query-Kontext: An Unified Multimodal Model for Image Generation and Editing
[71] Ming-UniVision: Joint image understanding and generation with a unified continuous tokenizer
[72] UniWorld: High-resolution semantic encoders for unified visual understanding and generation
[73] Skywork UniPic: Unified autoregressive modeling for visual understanding and generation
[74] WiseEdit: Benchmarking Cognition- and Creativity-Informed Image Editing
[75] EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture
[76] Image Inpainting Models are Effective Tools for Instruction-guided Image Editing
[77] SeCo: Semantic-Guided Multimodal Color Splash Effects
[78] WSI-Agents: A Collaborative Multi-agent System for Multi-modal Whole Slide Image Analysis
[79] A visual editor for semantics specifications using the eclipse graphical modeling framework
DIM-4.6B-T2I/Edit unified multimodal model
The authors develop DIM-4.6B-T2I/Edit by connecting a frozen Qwen2.5-VL-3B multimodal model to a trainable SANA1.5-1.6B diffusion decoder through a simple two-layer MLP connector. Trained on the DIM dataset, the model achieves state-of-the-art image editing performance with significantly fewer parameters than existing models.
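The two-layer MLP connector described above can be sketched in PyTorch as follows. This is a minimal sketch under stated assumptions: the hidden sizes, the GELU activation, and the class name `Connector` are illustrative placeholders, not the actual Qwen2.5-VL-3B / SANA1.5-1.6B dimensions or the authors' implementation.

```python
import torch
import torch.nn as nn

class Connector(nn.Module):
    """Two-layer MLP bridging a frozen MLLM to a trainable diffusion decoder.

    Dimensions are illustrative, not the paper's actual values.
    """

    def __init__(self, mllm_dim: int, diffusion_dim: int, hidden: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(mllm_dim, hidden),
            nn.GELU(),  # activation choice is an assumption
            nn.Linear(hidden, diffusion_dim),
        )

    def forward(self, mllm_hidden_states: torch.Tensor) -> torch.Tensor:
        # Project understanding-module hidden states into the decoder's
        # conditioning space; only the connector and decoder are trained,
        # while the understanding model stays frozen.
        return self.mlp(mllm_hidden_states)

connector = Connector(mllm_dim=2048, diffusion_dim=2240)
tokens = torch.randn(1, 77, 2048)   # a batch of MLLM token hidden states
cond = connector(tokens)            # conditioning for the diffusion decoder
```

Because the connector is only a pair of linear layers, it adds a negligible parameter count relative to the 4.6B total, which is consistent with the paper's lightweight-bridging claim.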