Improved Object-Centric Diffusion Learning with Registers and Contrastive Alignment

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: object-centric learning, diffusion models, contrastive learning, slot attention, compositionality
Abstract:

Slot Attention with pretrained diffusion models has recently shown promise for object-centric learning (OCL), but suffers from slot entanglement and weak alignment between object slots and image content. We propose Contrastive Object-centric Diffusion Alignment (CODA), a simple extension that (i) employs register slots to absorb residual attention and reduce interference between object slots, and (ii) applies a contrastive alignment loss to explicitly encourage slot–image correspondence. The resulting training objective serves as a tractable surrogate for maximizing mutual information (MI) between slots and inputs, strengthening slot representation quality. On both synthetic (MOVi-C/E) and real-world datasets (VOC, COCO), CODA improves object discovery (e.g., +6.1% FG-ARI on COCO), property prediction, and compositional image generation over strong baselines. Register slots add negligible overhead, keeping CODA efficient and scalable. These results position CODA as an effective framework for robust OCL in complex, real-world scenes. Code is available as supplementary material.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes CODA, which extends slot attention with pretrained diffusion models by introducing register slots and a contrastive alignment loss to improve object-centric learning. It sits within the 'Slot Attention with Diffusion Models' leaf, which contains eight papers including SlotDiffusion and Object-Centric Slot Diffusion. This is a moderately populated research direction within the broader Object-Centric Representation Learning branch, indicating active but not overcrowded exploration of slot-based decomposition methods combined with diffusion architectures.

The taxonomy reveals that neighboring leaves explore discrete tokenization (Discrete and Grouped Representations), multi-scale fusion approaches, and causal disentanglement, each containing one to two papers. These directions represent alternative strategies for structured scene understanding. CODA's focus on contrastive alignment and register mechanisms distinguishes it from these neighbors, which emphasize quantization schemes or interpretability through causal modeling. The broader Object-Centric Generation and Synthesis branch (six papers across three leaves) addresses compositional generation rather than representation quality, highlighting CODA's emphasis on learning robust slots before generation.

Among fifteen candidates examined, the contrastive alignment contribution shows no clear refutation (two candidates examined), while the finetuning cross-attention projections contribution appears more overlapping with prior work (ten candidates examined, two refutable). The register-augmented slot mechanism also shows no refutation among three candidates. These statistics suggest that within the limited search scope, the contrastive alignment and register slot ideas appear less directly anticipated by existing work, whereas adapting cross-attention layers to reduce text-conditioning bias may have more substantial precedent in the examined literature.

Based on the top-fifteen semantic matches and the moderately populated taxonomy leaf, CODA appears to introduce meaningful refinements to slot attention mechanisms, particularly through register slots and contrastive objectives. However, the limited search scope means this assessment covers a narrow slice of potentially relevant prior work. The analysis does not capture exhaustive coverage of alignment strategies or attention mechanism modifications across the broader diffusion and representation learning literature.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 15
Refutable Papers: 2

Research Landscape Overview

Core task: object-centric learning with diffusion models. This emerging field combines the compositional power of object-centric representations with the generative capabilities of diffusion models, yielding a taxonomy organized around six main branches. Object-Centric Representation Learning focuses on discovering and encoding individual entities from raw data, often using slot-based architectures that decompose scenes into interpretable parts. Object-Centric Generation and Synthesis leverages these representations to create or compose novel scenes, while Object-Centric Image and Video Editing applies diffusion models to manipulate specific objects within visual content. Object-Centric Detection and Segmentation addresses the identification and localization of entities, sometimes integrating diffusion-based refinement or proposal mechanisms such as DiffusionDet[5]. Object-Centric Robotic Manipulation and Planning exploits compositional scene understanding to guide action generation in physical environments, and Specialized Object-Centric Applications encompasses domain-specific tasks ranging from molecular design to temporal reasoning.

Within the representation learning branch, a particularly active line of work centers on slot attention mechanisms combined with diffusion models. SlotDiffusion[2] and Object-Centric Slot Diffusion[3] exemplify efforts to integrate slot-based decomposition with generative modeling, enabling unsupervised discovery of objects and their attributes. Guided Latent Slot[10] and Slot-Guided Adaptation[14] explore how to steer or refine slot representations using additional supervision or task-specific cues. Registers Contrastive Alignment[0] sits naturally within this cluster, emphasizing alignment strategies that leverage contrastive learning to improve the coherence and interpretability of slot-based representations.

Compared to SlotDiffusion[2], which primarily targets generative fidelity, and Slot-Guided Adaptation[14], which focuses on downstream task adaptation, Registers Contrastive Alignment[0] appears to prioritize the alignment and consistency of learned object-centric features across different views or modalities, addressing a complementary challenge in making slot representations more robust and semantically grounded.

Claimed Contributions

Register-augmented slot diffusion with input-independent register slots

The authors introduce input-independent register slots that act as attention sinks to absorb residual attention, reducing interference between object slots and mitigating slot entanglement. These register slots are obtained by encoding padding tokens through the frozen SD text encoder.

3 retrieved papers
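The register-slot mechanism described above can be illustrated with a minimal NumPy sketch. This is written under our own assumptions (random fixed registers, a single attention step, illustrative names like `slot_attention_step`), not the authors' implementation; in CODA the registers are obtained by encoding padding tokens with the frozen SD text encoder, whereas here they are simply fixed vectors.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention_step(inputs, object_slots, register_slots, scale=None):
    """One attention step in which input-independent register slots compete
    with object slots for input features, soaking up residual attention mass.

    inputs:         (n_tokens, d) image features
    object_slots:   (n_obj, d)    input-dependent object slots
    register_slots: (n_reg, d)    fixed "attention sink" slots
    """
    slots = np.concatenate([object_slots, register_slots], axis=0)
    d = inputs.shape[-1]
    if scale is None:
        scale = d ** -0.5
    # Slot Attention normalizes the softmax over the *slot* axis, so every
    # input token distributes its attention across all slots -- including
    # the registers, which absorb tokens that no object slot explains well.
    attn = softmax(inputs @ slots.T * scale, axis=-1)   # (n_tokens, n_slots)
    # Weighted-mean slot updates; only the object-slot updates are kept,
    # so the registers never feed information back into the representation.
    w = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
    updates = w.T @ inputs                              # (n_slots, d)
    n_obj = object_slots.shape[0]
    return updates[:n_obj], attn
```

Because the attention over each input token sums to one across object and register slots jointly, any mass a register captures is mass that can no longer entangle two object slots.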
Finetuning cross-attention projections to mitigate text-conditioning bias

The authors propose a lightweight adaptation strategy that finetunes only the key, value, and output projections in cross-attention layers of the pretrained diffusion model. This approach mitigates text-conditioning bias while preserving generative quality without requiring additional architectural layers.

10 retrieved papers
Can Refute
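As a concrete sketch of this selective finetuning, the snippet below filters parameter names assuming the diffusers-style UNet naming convention (`attn2` for cross-attention; `to_k`, `to_v`, `to_out` for its projections). The naming and the helper `is_trainable` are illustrative assumptions, not details taken from the paper.

```python
# Assumed diffusers-style parameter naming for an SD UNet: `attn1` is
# self-attention, `attn2` is cross-attention, and to_k/to_v/to_out are
# the key, value, and output projections of each attention module.
CROSS_ATTN_TARGETS = ("attn2.to_k", "attn2.to_v", "attn2.to_out")

def is_trainable(name: str) -> bool:
    """Finetune only cross-attention key/value/output projections;
    queries (to_q) and all self-attention (attn1) weights stay frozen."""
    return any(t in name for t in CROSS_ATTN_TARGETS)

# With a real model one would apply this as:
#   for name, p in unet.named_parameters():
#       p.requires_grad_(is_trainable(name))
example_names = [
    "down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_k.weight",
    "down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_q.weight",
    "down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_k.weight",
    "down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_v.weight",
    "down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_out.0.weight",
]
trainable = [n for n in example_names if is_trainable(n)]
```

Restricting updates to the projections that consume the conditioning (keys/values) and emit the attended result (output) is what makes the adaptation lightweight: the vast majority of UNet weights remain frozen.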
Contrastive alignment objective for slot-image correspondence

The authors introduce a contrastive alignment loss that explicitly encourages slot-image correspondence by maximizing likelihood under aligned slots while minimizing it under mismatched slots. This objective serves as a tractable surrogate for maximizing mutual information between slots and inputs.

2 retrieved papers
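The objective described above can be sketched with a generic InfoNCE-style loss: each image's own slots act as the positive pair, and other images' slots in the batch serve as mismatched negatives. The pooling into one slot embedding per image and the cosine-similarity/temperature form are our assumptions for illustration; the paper states the loss through the diffusion likelihood under aligned versus mismatched slots rather than this exact form.

```python
import numpy as np

def contrastive_alignment_loss(slot_emb, img_emb, tau=0.1):
    """InfoNCE-style surrogate for slot-image mutual information: each
    image should be best explained by its own slots (diagonal positives)
    and poorly by other images' slots (off-diagonal negatives).

    slot_emb: (B, d) pooled slot representation per image
    img_emb:  (B, d) image representation
    """
    def l2norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    s, v = l2norm(slot_emb), l2norm(img_emb)
    logits = s @ v.T / tau  # (B, B); diagonal entries are aligned pairs

    def xent(lg):
        # stable cross-entropy with targets on the diagonal
        m = lg.max(axis=1, keepdims=True)
        lse = np.log(np.exp(lg - m).sum(axis=1)) + m[:, 0]
        return (lse - np.diag(lg)).mean()

    # symmetric: slots -> images and images -> slots
    return 0.5 * (xent(logits) + xent(logits.T))
```

Maximizing the diagonal terms while the log-sum-exp penalizes mismatched pairs is what makes this a tractable lower-bound surrogate for slot-image MI.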

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Register-augmented slot diffusion with input-independent register slots


Contribution

Finetuning cross-attention projections to mitigate text-conditioning bias


Contribution

Contrastive alignment objective for slot-image correspondence
