TRACE: Your Diffusion Model is Secretly an Instance Edge Detector

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: diffusion, unsupervised instance segmentation, weakly-supervised panoptic segmentation, inference dynamics, attention
Abstract:

High-quality instance and panoptic segmentation has traditionally relied on dense instance-level annotations such as masks, boxes, or points, which are costly, inconsistent, and difficult to scale. Unsupervised and weakly-supervised approaches reduce this burden but remain constrained by their semantic backbones and by human bias, often producing merged or fragmented outputs. We present TRACE (TRAnsforming diffusion Cues to instance Edges), showing that text-to-image diffusion models secretly function as instance edge annotators. TRACE identifies the Instance Emergence Point (IEP), where object boundaries first appear in self-attention maps, extracts boundaries through Attention Boundary Divergence (ABDiv), and distills them into a lightweight one-step edge decoder. This design removes the need for per-image diffusion inversion, achieving 81× faster inference while producing sharper and more connected boundaries. On the COCO benchmark, TRACE improves unsupervised instance segmentation by +5.1 AP, and in tag-supervised panoptic segmentation it outperforms point-supervised baselines by +1.7 PQ without using any instance-level labels. These results reveal that diffusion models encode hidden instance boundary priors, and that decoding these signals offers a practical and scalable alternative to costly manual annotation.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

TRACE contributes a framework for extracting instance boundaries from diffusion model self-attention without task-specific training or dense annotations. It resides in the 'Self-Attention Map Analysis for Instance Segmentation' leaf, which contains five papers including the original work. This leaf sits within the broader 'Unsupervised and Zero-Shot Instance Boundary Extraction' branch, indicating a moderately populated research direction focused on training-free boundary discovery. The taxonomy reveals this is an active but not overcrowded area, with sibling leaves exploring text-guided grounding and concept extraction as complementary approaches.

The taxonomy structure shows TRACE's leaf neighbors include 'Text-Guided and Phrase-Level Grounding' (three papers using cross-attention for localization) and 'Concept and Mask Extraction from Diffusion Features' (two papers clustering diffusion features). These adjacent directions share the goal of leveraging pretrained diffusion models but differ in mechanism: TRACE analyzes self-attention emergence points, while text-guided methods rely on prompt-driven cross-attention and concept extraction uses feature clustering. The broader 'Supervised and Training-Based Diffusion Segmentation' branch (fourteen papers across multiple leaves) represents an alternative paradigm requiring explicit training, highlighting TRACE's positioning in the training-free methodology space.

Among the thirty candidates examined, none clearly refutes any of TRACE's three contributions. Ten candidates were examined for the 'TRACE framework' contribution with zero refutable overlaps, and the same holds for 'Instance Emergence Point and Attention Boundary Divergence' and for 'one-step edge distillation'. This suggests that, within the limited search scope, the specific combination of identifying emergence points, computing attention boundary divergence, and distilling to a lightweight decoder appears distinct from prior work. The sibling papers in the same taxonomy leaf likely address related self-attention analysis but may differ in temporal dynamics, boundary extraction mechanisms, or distillation strategies.

Based on the top-thirty semantic matches and taxonomy context, TRACE appears to occupy a recognizable niche within unsupervised diffusion-based segmentation. The analysis covers a focused subset of the field rather than an exhaustive survey, so additional related work may exist beyond the examined candidates. The absence of refutable pairs among thirty candidates suggests novelty in the specific technical approach, though the broader idea of mining diffusion attention for boundaries is shared with the four sibling papers in this taxonomy leaf.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: extracting instance boundaries from diffusion model self-attention. The field has organized itself around several complementary directions. Unsupervised and Zero-Shot Instance Boundary Extraction explores methods that leverage pretrained diffusion models without additional training, often analyzing self-attention maps to discover object boundaries in a training-free manner. Supervised and Training-Based Diffusion Segmentation develops architectures that fine-tune or train diffusion components for segmentation tasks, while Diffusion Models for Data Augmentation and Synthesis focuses on generating synthetic training data. Attention Mechanism Control and Manipulation investigates techniques to steer or modify attention patterns during generation, and Attention Visualization and Interpretability aims to understand what diffusion attention captures. Specialized Attention Architectures and Efficiency addresses computational concerns, and Multimodal and Cross-Domain Applications extends these ideas to diverse settings such as medical imaging or text-guided segmentation.

Within the unsupervised branch, a handful of works have demonstrated that diffusion self-attention encodes rich spatial structure. Diffuse attend and segment[3] and Diffusion model is secretly[9] reveal that attention maps can be repurposed for segmentation without retraining, while Repurposing stable diffusion attention[10] similarly extracts semantic correspondences. TRACE[0] sits naturally in this cluster, emphasizing zero-shot instance boundary extraction by analyzing self-attention patterns. Compared to Diffuse attend and segment[3], which may focus on broader semantic regions, TRACE[0] targets finer instance-level boundaries. Image Diffusion Models Exhibit[33] provides complementary evidence that these attention structures emerge during training.
The main open question across these studies is how to best aggregate or threshold attention signals to achieve precise boundaries, and whether such unsupervised methods can match the accuracy of supervised approaches like Pixel_DiffusionSeg[5] or Seg4Diff[7] that incorporate explicit segmentation losses.

Claimed Contributions

TRACE framework for instance edge extraction from diffusion models

The authors introduce TRACE, a framework that extracts instance boundaries directly from pretrained text-to-image diffusion models without requiring any instance-level annotations such as masks, boxes, or points. This approach reveals that diffusion models encode hidden instance boundary priors that can be decoded for practical segmentation tasks.

10 retrieved papers
Instance Emergence Point and Attention Boundary Divergence

The authors propose two core technical components: the Instance Emergence Point (IEP) identifies the denoising timestep where instance structure first appears in self-attention maps, and Attention Boundary Divergence (ABDiv) converts criss-cross self-attention differences into boundary maps without clustering or annotations.
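The report gives no equations for IEP or ABDiv, so the following NumPy sketch shows only one plausible reading of the description: ABDiv is approximated as the KL divergence between the self-attention distributions of neighbouring pixels, and IEP as the earliest denoising step whose boundary response nears its peak. All function names, the KL choice, and the threshold `tau` are assumptions, not the paper's actual formulation.

```python
import numpy as np

def abdiv(attn):
    """Attention Boundary Divergence (sketch): for each pixel, measure how
    sharply its self-attention distribution differs from its right/down
    neighbours'. High divergence suggests an instance boundary.
    attn: (H, W, H*W) row-normalised self-attention per query pixel."""
    eps = 1e-8
    def kl(p, q):  # KL divergence between two attention distributions
        return np.sum(p * np.log((p + eps) / (q + eps)), axis=-1)
    dx = kl(attn[:, :-1], attn[:, 1:])   # horizontal neighbour divergence
    dy = kl(attn[:-1, :], attn[1:, :])   # vertical neighbour divergence
    edge = np.zeros(attn.shape[:2])
    edge[:, :-1] += dx; edge[:, 1:] += dx
    edge[:-1, :] += dy; edge[1:, :] += dy
    return edge

def instance_emergence_point(attn_per_step, tau=0.5):
    """IEP (sketch): earliest denoising step whose mean boundary response
    reaches a fraction tau of the strongest step's response."""
    scores = [abdiv(a).mean() for a in attn_per_step]
    peak = max(scores)
    for t, s in enumerate(scores):
        if s >= tau * peak:
            return t
    return len(scores) - 1
```

On a toy map where the left and right halves of a 4×4 grid attend only within themselves, `abdiv` peaks along the centre seam and is zero inside each half, which is the behaviour the contribution describes for boundary extraction without clustering.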

10 retrieved papers
One-step edge distillation for real-time inference

The authors develop a distillation method that compresses the multi-step diffusion process into a single-pass edge decoder, achieving 81× faster inference while producing sharper and more connected boundaries compared to per-image diffusion inversion.
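The distillation idea can be sketched as fitting a lightweight student decoder to reproduce the multi-step teacher's boundary maps in a single forward pass. The per-pixel logistic decoder, the binary cross-entropy objective, and all names below are illustrative assumptions standing in for the paper's (unspecified) architecture; the point is only that inference reduces to one matrix product instead of per-image diffusion inversion.

```python
import numpy as np

def distill_edge_decoder(feats, teacher_edges, lr=0.5, steps=300):
    """One-step edge distillation (sketch): fit a per-pixel logistic decoder
    (w, b) on frozen features so a single pass mimics the multi-step
    teacher's boundary maps.
    feats: (N, D) per-pixel features; teacher_edges: (N,) targets in [0, 1]."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=feats.shape[1])
    b = 0.0
    for _ in range(steps):
        pred = 1.0 / (1.0 + np.exp(-(feats @ w + b)))  # sigmoid output
        grad = pred - teacher_edges                    # BCE gradient
        w -= lr * feats.T @ grad / len(feats)          # full-batch update
        b -= lr * grad.mean()
    return w, b

def predict_edges(feats, w, b):
    """Single forward pass: no diffusion inversion at inference time."""
    return 1.0 / (1.0 + np.exp(-(feats @ w + b)))
```

The speed claim in the contribution (81× faster) comes from exactly this asymmetry: the teacher runs many denoising steps per image, while the distilled student runs once.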

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

TRACE framework for instance edge extraction from diffusion models

The authors introduce TRACE, a framework that extracts instance boundaries directly from pretrained text-to-image diffusion models without requiring any instance-level annotations such as masks, boxes, or points. This approach reveals that diffusion models encode hidden instance boundary priors that can be decoded for practical segmentation tasks.

Contribution

Instance Emergence Point and Attention Boundary Divergence

The authors propose two core technical components: the Instance Emergence Point (IEP) identifies the denoising timestep where instance structure first appears in self-attention maps, and Attention Boundary Divergence (ABDiv) converts criss-cross self-attention differences into boundary maps without clustering or annotations.

Contribution

One-step edge distillation for real-time inference

The authors develop a distillation method that compresses the multi-step diffusion process into a single-pass edge decoder, achieving 81× faster inference while producing sharper and more connected boundaries compared to per-image diffusion inversion.