TRACE: Your Diffusion Model is Secretly an Instance Edge Detector

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: diffusion, unsupervised instance segmentation, weakly-supervised panoptic segmentation, inference dynamics, attention
Abstract:

High-quality instance and panoptic segmentation has traditionally relied on dense instance-level annotations such as masks, boxes, or points, which are costly, inconsistent, and difficult to scale. Unsupervised and weakly-supervised approaches reduce this burden but remain constrained by their semantic backbones and by human bias, often producing merged or fragmented outputs. We present TRACE (TRAnsforming diffusion Cues to instance Edges), showing that text-to-image diffusion models secretly function as instance edge annotators. TRACE identifies the Instance Emergence Point (IEP), where object boundaries first appear in self-attention maps, extracts boundaries through Attention Boundary Divergence (ABDiv), and distills them into a lightweight one-step edge decoder. This design removes the need for per-image diffusion inversion, achieving 81× faster inference while producing sharper and more connected boundaries. On the COCO benchmark, TRACE improves unsupervised instance segmentation by +5.1 AP, and in tag-supervised panoptic segmentation it outperforms point-supervised baselines by +1.7 PQ without using any instance-level labels. These results reveal that diffusion models encode hidden instance boundary priors, and that decoding these signals offers a practical and scalable alternative to costly manual annotation.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

TRACE contributes a framework for extracting instance boundaries from diffusion model self-attention without task-specific training or dense annotations. It resides in the 'Self-Attention Map Analysis for Instance Segmentation' leaf, which contains five papers including the original work. This leaf sits within the broader 'Unsupervised and Zero-Shot Instance Boundary Extraction' branch, indicating a moderately populated research direction focused on training-free boundary discovery. The taxonomy reveals this is an active but not overcrowded area, with sibling leaves exploring text-guided grounding and concept extraction as complementary approaches.

The taxonomy structure shows TRACE's leaf neighbors include 'Text-Guided and Phrase-Level Grounding' (three papers using cross-attention for localization) and 'Concept and Mask Extraction from Diffusion Features' (two papers clustering diffusion features). These adjacent directions share the goal of leveraging pretrained diffusion models but differ in mechanism: TRACE analyzes self-attention emergence points, while text-guided methods rely on prompt-driven cross-attention and concept extraction uses feature clustering. The broader 'Supervised and Training-Based Diffusion Segmentation' branch (fourteen papers across multiple leaves) represents an alternative paradigm requiring explicit training, highlighting TRACE's positioning in the training-free methodology space.

Among the thirty candidates examined, none clearly refutes any of TRACE's three contributions. Ten candidates were examined for the 'TRACE framework' contribution with zero refutable overlaps, and the same holds for 'Instance Emergence Point and Attention Boundary Divergence' and for 'one-step edge distillation'. This suggests that, within the limited search scope, the specific combination of identifying emergence points, computing attention boundary divergence, and distilling to a lightweight decoder appears distinct from prior work. The sibling papers in the same taxonomy leaf likely address related self-attention analysis but may differ in temporal dynamics, boundary extraction mechanisms, or distillation strategies.

Based on the top-thirty semantic matches and taxonomy context, TRACE appears to occupy a recognizable niche within unsupervised diffusion-based segmentation. The analysis covers a focused subset of the field rather than an exhaustive survey, so additional related work may exist beyond the examined candidates. The absence of refutable pairs among thirty candidates suggests novelty in the specific technical approach, though the broader idea of mining diffusion attention for boundaries is shared with the four sibling papers in this taxonomy leaf.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: extracting instance boundaries from diffusion model self-attention. The field has organized itself around several complementary directions. Unsupervised and Zero-Shot Instance Boundary Extraction explores methods that leverage pretrained diffusion models without additional training, often analyzing self-attention maps to discover object boundaries in a training-free manner. Supervised and Training-Based Diffusion Segmentation develops architectures that fine-tune or train diffusion components for segmentation tasks, while Diffusion Models for Data Augmentation and Synthesis focuses on generating synthetic training data. Attention Mechanism Control and Manipulation investigates techniques to steer or modify attention patterns during generation, and Attention Visualization and Interpretability aims to understand what diffusion attention captures. Specialized Attention Architectures and Efficiency addresses computational concerns, and Multimodal and Cross-Domain Applications extends these ideas to diverse settings such as medical imaging or text-guided segmentation.

Within the unsupervised branch, a handful of works have demonstrated that diffusion self-attention encodes rich spatial structure. Diffuse attend and segment[3] and Diffusion model is secretly[9] reveal that attention maps can be repurposed for segmentation without retraining, while Repurposing stable diffusion attention[10] similarly extracts semantic correspondences. TRACE[0] sits naturally in this cluster, emphasizing zero-shot instance boundary extraction by analyzing self-attention patterns. Compared to Diffuse attend and segment[3], which may focus on broader semantic regions, TRACE[0] targets finer instance-level boundaries. Image Diffusion Models Exhibit[33] provides complementary evidence that these attention structures emerge during training.
The main open question across these studies is how to best aggregate or threshold attention signals to achieve precise boundaries, and whether such unsupervised methods can match the accuracy of supervised approaches like Pixel_DiffusionSeg[5] or Seg4Diff[7] that incorporate explicit segmentation losses.

Claimed Contributions

TRACE framework for instance edge extraction from diffusion models

The authors introduce TRACE, a framework that extracts instance boundaries directly from pretrained text-to-image diffusion models without requiring any instance-level annotations such as masks, boxes, or points. This approach reveals that diffusion models encode hidden instance boundary priors that can be decoded for practical segmentation tasks.

10 retrieved papers
Instance Emergence Point and Attention Boundary Divergence

The authors propose two core technical components: the Instance Emergence Point (IEP) identifies the denoising timestep where instance structure first appears in self-attention maps, and Attention Boundary Divergence (ABDiv) converts criss-cross self-attention differences into boundary maps without clustering or annotations.
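The report gives no equations for IEP or ABDiv, so the following NumPy sketch shows only one plausible reading of the description: ABDiv is approximated as the KL divergence between the self-attention distributions of neighbouring pixels, and IEP as the earliest denoising step whose boundary response nears its peak. All function names, the KL choice, and the threshold `tau` are assumptions, not the paper's actual formulation.

```python
import numpy as np

def abdiv(attn):
    """Attention Boundary Divergence (sketch): for each pixel, measure how
    sharply its self-attention distribution differs from its right/down
    neighbours'. High divergence suggests an instance boundary.
    attn: (H, W, H*W) row-normalised self-attention per query pixel."""
    eps = 1e-8
    def kl(p, q):  # KL divergence between two attention distributions
        return np.sum(p * np.log((p + eps) / (q + eps)), axis=-1)
    dx = kl(attn[:, :-1], attn[:, 1:])   # horizontal neighbour divergence
    dy = kl(attn[:-1, :], attn[1:, :])   # vertical neighbour divergence
    edge = np.zeros(attn.shape[:2])
    edge[:, :-1] += dx; edge[:, 1:] += dx
    edge[:-1, :] += dy; edge[1:, :] += dy
    return edge

def instance_emergence_point(attn_per_step, tau=0.5):
    """IEP (sketch): earliest denoising step whose mean boundary response
    reaches a fraction tau of the strongest step's response."""
    scores = [abdiv(a).mean() for a in attn_per_step]
    peak = max(scores)
    for t, s in enumerate(scores):
        if s >= tau * peak:
            return t
    return len(scores) - 1
```

On a toy map where the left and right halves of a 4×4 grid attend only within themselves, `abdiv` peaks along the centre seam and is zero inside each half, which is the behaviour the contribution describes for boundary extraction without clustering.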

10 retrieved papers
One-step edge distillation for real-time inference

The authors develop a distillation method that compresses the multi-step diffusion process into a single-pass edge decoder, achieving 81× faster inference while producing sharper and more connected boundaries compared to per-image diffusion inversion.
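The distillation idea can be sketched as fitting a lightweight student decoder to reproduce the multi-step teacher's boundary maps in a single forward pass. The per-pixel logistic decoder, the binary cross-entropy objective, and all names below are illustrative assumptions standing in for the paper's (unspecified) architecture; the point is only that inference reduces to one matrix product instead of per-image diffusion inversion.

```python
import numpy as np

def distill_edge_decoder(feats, teacher_edges, lr=0.5, steps=300):
    """One-step edge distillation (sketch): fit a per-pixel logistic decoder
    (w, b) on frozen features so a single pass mimics the multi-step
    teacher's boundary maps.
    feats: (N, D) per-pixel features; teacher_edges: (N,) targets in [0, 1]."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=feats.shape[1])
    b = 0.0
    for _ in range(steps):
        pred = 1.0 / (1.0 + np.exp(-(feats @ w + b)))  # sigmoid output
        grad = pred - teacher_edges                    # BCE gradient
        w -= lr * feats.T @ grad / len(feats)          # full-batch update
        b -= lr * grad.mean()
    return w, b

def predict_edges(feats, w, b):
    """Single forward pass: no diffusion inversion at inference time."""
    return 1.0 / (1.0 + np.exp(-(feats @ w + b)))
```

The speed claim in the contribution (81× faster) comes from exactly this asymmetry: the teacher runs many denoising steps per image, while the distilled student runs once.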

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

TRACE framework for instance edge extraction from diffusion models

The authors introduce TRACE, a framework that extracts instance boundaries directly from pretrained text-to-image diffusion models without requiring any instance-level annotations such as masks, boxes, or points. This approach reveals that diffusion models encode hidden instance boundary priors that can be decoded for practical segmentation tasks.

Contribution

Instance Emergence Point and Attention Boundary Divergence

The authors propose two core technical components: the Instance Emergence Point (IEP) identifies the denoising timestep where instance structure first appears in self-attention maps, and Attention Boundary Divergence (ABDiv) converts criss-cross self-attention differences into boundary maps without clustering or annotations.

Contribution

One-step edge distillation for real-time inference

The authors develop a distillation method that compresses the multi-step diffusion process into a single-pass edge decoder, achieving 81× faster inference while producing sharper and more connected boundaries compared to per-image diffusion inversion.