Abstract:

The reliance on implicit point matching via attention has become a core bottleneck in drag-based editing, forcing a fundamental compromise: weakened inversion strength and costly test-time optimization (TTO). This compromise severely limits generative capability, suppressing high-fidelity inpainting and text-guided creation. In this paper, we introduce LazyDrag, the first drag-based image editing method for Multi-Modal Diffusion Transformers, which directly eliminates the reliance on implicit point matching. Concretely, our method generates an explicit correspondence map from user drag inputs and uses it as a reliable reference to strengthen attention control. This reliable reference opens the door to a stable full-strength inversion process, a first for the drag-based editing task. It obviates the need for TTO and unlocks the generative capability of models. LazyDrag therefore naturally unifies precise geometric control with text guidance, enabling complex edits that were previously out of reach: opening the mouth of a dog and inpainting its interior, generating new objects such as a "tennis ball", or, for ambiguous drags, making context-aware changes such as moving hands into pockets. Moreover, LazyDrag supports multi-round edits with simultaneous move and scale operations. Evaluated on DragBench, our method outperforms baselines in drag accuracy and perceptual quality, as validated by mean distance, VIEScore, and user studies. LazyDrag not only sets a new state of the art but also points toward a new editing paradigm. Code will be open-sourced.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces LazyDrag, which claims to be the first drag-based editing method specifically designed for Multi-Modal Diffusion Transformers. It sits in the 'Explicit Correspondence with Multi-Modal Transformers' leaf of the taxonomy, which currently contains only this work, indicating a sparse research direction. The broader parent category 'Multi-Modal and Text-Integrated Editing' includes three leaves with four papers total, suggesting that while multi-modal drag editing exists, the explicit correspondence approach for DiTs represents a relatively unexplored niche within the field.

The taxonomy reveals that neighboring work falls into two main clusters. The 'Joint Text-Drag Manipulation' leaf contains methods like CLIPDrag and DiffEditor that unify textual and spatial control but through different mechanisms. The 'Core Drag-Based Editing Frameworks' branch, housing foundational methods like DragDiffusion and RegionDrag, establishes point-based optimization paradigms that LazyDrag explicitly aims to move beyond. The taxonomy's scope notes clarify that 'Explicit Correspondence' methods are distinguished by generating correspondence maps rather than relying on implicit attention mechanisms, setting clear boundaries between LazyDrag and implicit attention-based approaches in sibling categories.

Among thirty candidates examined across three contributions, none were found to clearly refute the claimed novelty. The first contribution (drag editing for Multi-Modal DiTs) examined ten candidates with zero refutable matches, as did the second (explicit correspondence map generation) and third (full-strength inversion without test-time optimization). Given the limited search scope—thirty papers from semantic retrieval, not an exhaustive survey—these statistics suggest that within the examined literature, no prior work directly overlaps with LazyDrag's combination of explicit correspondence, multi-modal DiT architecture, and TTO-free inversion. The absence of refutable candidates across all contributions indicates potential novelty, though the search scale leaves open the possibility of undiscovered prior art.

Based on the limited examination of thirty candidates, LazyDrag appears to occupy a relatively novel position at the intersection of explicit correspondence mechanisms and multi-modal diffusion transformers. The taxonomy structure shows this is an emerging direction with sparse prior work in the specific leaf, though related ideas exist in neighboring branches. The analysis covers top-ranked semantic matches and does not claim exhaustive coverage of all drag-based editing literature or broader diffusion model research.

Taxonomy

Core-task Taxonomy Papers: 23
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: drag-based image editing with diffusion transformers. The field has organized itself around several complementary directions. Core Drag-Based Editing Frameworks establish foundational methods for point-based manipulation, with works like DragDiffusion[2] and RegionDrag[1] defining how user-specified handle and target points guide feature-space optimization. Optimization and Efficiency Enhancements address computational bottlenecks through techniques such as lazy evaluation (Lazy Diffusion Transformer[11]) and accelerated inference (InstantDrag[14], LightningDrag[20]). Multi-Modal and Text-Integrated Editing explores how textual prompts and semantic correspondences can augment spatial control, while Trajectory and Motion Control Extensions generalize dragging to temporal sequences and complex motion paths (DragFlow[5], MotionCanvas[6]). Finally, General Editing Frameworks and User Interfaces broaden the scope to include insertion, inpainting, and interactive tools (Magic Insert[9], CanFuUI[7]), situating drag-based methods within wider editing ecosystems.

A particularly active line of work focuses on balancing precision with computational cost: methods like StableDrag[12] and AdaptiveDrag[8] refine iterative optimization to maintain fidelity under large displacements, whereas InstantDrag[14] and LightningDrag[20] prioritize speed through distillation or single-step inference. Another contrast emerges between purely spatial approaches (DragDiffusion[2], Drag Your Noise[3]) and those integrating semantic or textual guidance (CLIPDrag[21], DiffEditor[22]).

LazyDrag[0] sits within the Multi-Modal and Text-Integrated Editing branch, specifically under Explicit Correspondence with Multi-Modal Transformers. Its emphasis on leveraging transformer-based correspondence mechanisms distinguishes it from optimization-heavy baselines like DragDiffusion[2] and positions it closer to works that exploit learned feature alignments.
Compared to purely spatial methods such as Drag Your Noise[3], LazyDrag[0] appears to prioritize semantic consistency through multi-modal integration, reflecting ongoing efforts to unify geometric control with high-level understanding.

Claimed Contributions

LazyDrag: first drag-based editing method for Multi-Modal Diffusion Transformers

The authors present LazyDrag as the first method to enable drag-based image editing specifically on Multi-Modal Diffusion Transformers (MM-DiTs), replacing implicit attention-based point matching with an explicit correspondence map to stabilize editing under full-strength inversion without test-time optimization.

10 retrieved papers
Explicit correspondence map generation from drag instructions

The method converts user drag instructions into an explicit correspondence map that provides deterministic, stable guidance for attention control during generation, eliminating the fragility of implicit attention-similarity matching used in prior work.

10 retrieved papers
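To make this contribution concrete, here is a minimal sketch of how sparse drag pairs could be densified into a per-pixel correspondence map. The Gaussian-weighted inverse-displacement scheme and the function name `correspondence_map` are illustrative assumptions for exposition, not the paper's actual construction:

```python
import numpy as np

def correspondence_map(handles, targets, h, w, sigma=8.0):
    """Densify sparse drag pairs into a per-pixel correspondence map.

    handles, targets : (N, 2) arrays of (y, x) points; each drag moves
    content from a handle point to a target point. Returns an (h, w, 2)
    map giving, for every output pixel, the source coordinate it should
    reference. Hypothetical scheme: interpolate the inverse displacement
    (handle - target) with unnormalized Gaussian weights, so pixels near
    a target look back toward its handle and distant pixels stay put.
    """
    handles = np.asarray(handles, dtype=float)
    targets = np.asarray(targets, dtype=float)
    ys, xs = np.mgrid[0:h, 0:w]
    grid = np.stack([ys, xs], axis=-1).reshape(-1, 2).astype(float)

    # Squared distance of every output pixel to every target point.
    d2 = ((grid[:, None, :] - targets[None, :, :]) ** 2).sum(axis=-1)
    wgt = np.exp(-d2 / (2.0 * sigma ** 2))   # ~1 near a target, ~0 far away
    disp = handles - targets                 # inverse displacement per drag
    src = grid + wgt @ disp                  # (h*w, 2) source coordinates
    return src.reshape(h, w, 2)
```

Under this sketch, the pixel at a target point references its handle exactly, while pixels far from all drags reference themselves, giving the deterministic per-pixel lookup that replaces implicit attention-similarity matching.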
Full-strength inversion without test-time optimization

By using the explicit correspondence map, LazyDrag achieves stable editing under full-strength inversion across all sampling steps without requiring per-image test-time optimization, unlocking generative capabilities such as high-fidelity inpainting and text-guided creation that were previously suppressed.

10 retrieved papers
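With an explicit per-pixel correspondence map in hand, attention control reduces to a direct lookup rather than a fragile similarity search. The sketch below is a hedged illustration, not the paper's operator: `gather_reference_features` and the nearest-neighbor gather are assumptions, showing how key/value features from an inverted (source) pass could be re-indexed so each edited position is matched against its designated source feature at every sampling step:

```python
import numpy as np

def gather_reference_features(feat, corr):
    """Re-index reference features along an explicit correspondence map.

    feat : (h, w, c) key/value features from the inverted source pass.
    corr : (h, w, 2) per-pixel (y, x) source coordinates.
    Returns an (h, w, c) array where each output position holds the
    source feature it should attend to. Illustrative nearest-neighbor
    gather with clamping at the image border.
    """
    h, w, _ = feat.shape
    ys = np.clip(np.rint(corr[..., 0]).astype(int), 0, h - 1)
    xs = np.clip(np.rint(corr[..., 1]).astype(int), 0, w - 1)
    return feat[ys, xs]
```

Because the lookup is deterministic and available at every denoising step, no per-image optimization is needed to keep attention locked onto the right source content, which is the mechanism this contribution credits for enabling full-strength inversion without TTO.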

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K retrieved core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

LazyDrag: first drag-based editing method for Multi-Modal Diffusion Transformers

The authors present LazyDrag as the first method to enable drag-based image editing specifically on Multi-Modal Diffusion Transformers (MM-DiTs), replacing implicit attention-based point matching with an explicit correspondence map to stabilize editing under full-strength inversion without test-time optimization.

Contribution

Explicit correspondence map generation from drag instructions

The method converts user drag instructions into an explicit correspondence map that provides deterministic, stable guidance for attention control during generation, eliminating the fragility of implicit attention-similarity matching used in prior work.

Contribution

Full-strength inversion without test-time optimization

By using the explicit correspondence map, LazyDrag achieves stable editing under full-strength inversion across all sampling steps without requiring per-image test-time optimization, unlocking generative capabilities such as high-fidelity inpainting and text-guided creation that were previously suppressed.