Improved Object-Centric Diffusion Learning with Registers and Contrastive Alignment

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: object-centric learning, diffusion models, contrastive learning, slot attention, compositionality
Abstract:

Slot Attention with pretrained diffusion models has recently shown promise for object-centric learning (OCL), but suffers from slot entanglement and weak alignment between object slots and image content. We propose Contrastive Object-centric Diffusion Alignment (CODA), a simple extension that (i) employs register slots to absorb residual attention and reduce interference between object slots, and (ii) applies a contrastive alignment loss to explicitly encourage slot–image correspondence. The resulting training objective serves as a tractable surrogate for maximizing mutual information (MI) between slots and inputs, strengthening slot representation quality. On both synthetic (MOVi-C/E) and real-world datasets (VOC, COCO), CODA improves object discovery (e.g., +6.1% FG-ARI on COCO), property prediction, and compositional image generation over strong baselines. Register slots add negligible overhead, keeping CODA efficient and scalable. These results position CODA as an effective framework for robust OCL in complex, real-world scenes. Code is available as supplementary material.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes CODA, which extends slot attention with pretrained diffusion models by introducing register slots and a contrastive alignment loss to improve object-centric learning. It sits within the 'Slot Attention with Diffusion Models' leaf, which contains eight papers including SlotDiffusion and Object-Centric Slot Diffusion. This is a moderately populated research direction within the broader Object-Centric Representation Learning branch, indicating active but not overcrowded exploration of slot-based decomposition methods combined with diffusion architectures.

The taxonomy reveals that neighboring leaves explore discrete tokenization (Discrete and Grouped Representations), multi-scale fusion approaches, and causal disentanglement, each containing one to two papers. These directions represent alternative strategies for structured scene understanding. CODA's focus on contrastive alignment and register mechanisms distinguishes it from these neighbors, which emphasize quantization schemes or interpretability through causal modeling. The broader Object-Centric Generation and Synthesis branch (six papers across three leaves) addresses compositional generation rather than representation quality, highlighting CODA's emphasis on learning robust slots before generation.

Among fifteen candidates examined, the contrastive alignment contribution shows no clear refutation (two candidates examined), while the finetuning cross-attention projections contribution appears more overlapping with prior work (ten candidates examined, two refutable). The register-augmented slot mechanism also shows no refutation among three candidates. These statistics suggest that within the limited search scope, the contrastive alignment and register slot ideas appear less directly anticipated by existing work, whereas adapting cross-attention layers to reduce text-conditioning bias may have more substantial precedent in the examined literature.

Based on the top-fifteen semantic matches and the moderately populated taxonomy leaf, CODA appears to introduce meaningful refinements to slot attention mechanisms, particularly through register slots and contrastive objectives. However, the limited search scope means this assessment covers a narrow slice of potentially relevant prior work. The analysis does not capture exhaustive coverage of alignment strategies or attention mechanism modifications across the broader diffusion and representation learning literature.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 15
Refutable Papers: 2

Research Landscape Overview

Core task: object-centric learning with diffusion models. This emerging field combines the compositional power of object-centric representations with the generative capabilities of diffusion models, yielding a taxonomy organized around six main branches. Object-Centric Representation Learning focuses on discovering and encoding individual entities from raw data, often using slot-based architectures that decompose scenes into interpretable parts. Object-Centric Generation and Synthesis leverages these representations to create or compose novel scenes, while Object-Centric Image and Video Editing applies diffusion models to manipulate specific objects within visual content. Object-Centric Detection and Segmentation addresses the identification and localization of entities, sometimes integrating diffusion-based refinement or proposal mechanisms such as DiffusionDet[5]. Object-Centric Robotic Manipulation and Planning exploits compositional scene understanding to guide action generation in physical environments, and Specialized Object-Centric Applications encompasses domain-specific tasks ranging from molecular design to temporal reasoning.

Within the representation learning branch, a particularly active line of work centers on slot attention mechanisms combined with diffusion models. SlotDiffusion[2] and Object-Centric Slot Diffusion[3] exemplify efforts to integrate slot-based decomposition with generative modeling, enabling unsupervised discovery of objects and their attributes. Guided Latent Slot[10] and Slot-Guided Adaptation[14] explore how to steer or refine slot representations using additional supervision or task-specific cues. Registers Contrastive Alignment[0] sits naturally within this cluster, emphasizing alignment strategies that leverage contrastive learning to improve the coherence and interpretability of slot-based representations.

Compared to SlotDiffusion[2], which primarily targets generative fidelity, and Slot-Guided Adaptation[14], which focuses on downstream task adaptation, Registers Contrastive Alignment[0] appears to prioritize the alignment and consistency of learned object-centric features across different views or modalities, addressing a complementary challenge in making slot representations more robust and semantically grounded.

Claimed Contributions

Register-augmented slot diffusion with input-independent register slots

The authors introduce input-independent register slots that act as attention sinks to absorb residual attention, reducing interference between object slots and mitigating slot entanglement. These register slots are obtained by encoding padding tokens through the frozen SD text encoder.

3 retrieved papers
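The register-slot mechanism described above can be illustrated with a minimal NumPy sketch. This is written under our own assumptions (random fixed registers, a single attention step, illustrative names like `slot_attention_step`), not the authors' implementation; in CODA the registers are obtained by encoding padding tokens with the frozen SD text encoder, whereas here they are simply fixed vectors.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention_step(inputs, object_slots, register_slots, scale=None):
    """One attention step in which input-independent register slots compete
    with object slots for input features, soaking up residual attention mass.

    inputs:         (n_tokens, d) image features
    object_slots:   (n_obj, d)    input-dependent object slots
    register_slots: (n_reg, d)    fixed "attention sink" slots
    """
    slots = np.concatenate([object_slots, register_slots], axis=0)
    d = inputs.shape[-1]
    if scale is None:
        scale = d ** -0.5
    # Slot Attention normalizes the softmax over the *slot* axis, so every
    # input token distributes its attention across all slots -- including
    # the registers, which absorb tokens that no object slot explains well.
    attn = softmax(inputs @ slots.T * scale, axis=-1)   # (n_tokens, n_slots)
    # Weighted-mean slot updates; only the object-slot updates are kept,
    # so the registers never feed information back into the representation.
    w = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
    updates = w.T @ inputs                              # (n_slots, d)
    n_obj = object_slots.shape[0]
    return updates[:n_obj], attn
```

Because the attention over each input token sums to one across object and register slots jointly, any mass a register captures is mass that can no longer entangle two object slots.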
Finetuning cross-attention projections to mitigate text-conditioning bias

The authors propose a lightweight adaptation strategy that finetunes only the key, value, and output projections in cross-attention layers of the pretrained diffusion model. This approach mitigates text-conditioning bias while preserving generative quality without requiring additional architectural layers.

10 retrieved papers
Can Refute
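As a concrete sketch of this selective finetuning, the snippet below filters parameter names assuming the diffusers-style UNet naming convention (`attn2` for cross-attention; `to_k`, `to_v`, `to_out` for its projections). The naming and the helper `is_trainable` are illustrative assumptions, not details taken from the paper.

```python
# Assumed diffusers-style parameter naming for an SD UNet: `attn1` is
# self-attention, `attn2` is cross-attention, and to_k/to_v/to_out are
# the key, value, and output projections of each attention module.
CROSS_ATTN_TARGETS = ("attn2.to_k", "attn2.to_v", "attn2.to_out")

def is_trainable(name: str) -> bool:
    """Finetune only cross-attention key/value/output projections;
    queries (to_q) and all self-attention (attn1) weights stay frozen."""
    return any(t in name for t in CROSS_ATTN_TARGETS)

# With a real model one would apply this as:
#   for name, p in unet.named_parameters():
#       p.requires_grad_(is_trainable(name))
example_names = [
    "down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_k.weight",
    "down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_q.weight",
    "down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_k.weight",
    "down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_v.weight",
    "down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_out.0.weight",
]
trainable = [n for n in example_names if is_trainable(n)]
```

Restricting updates to the projections that consume the conditioning (keys/values) and emit the attended result (output) is what makes the adaptation lightweight: the vast majority of UNet weights remain frozen.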
Contrastive alignment objective for slot-image correspondence

The authors introduce a contrastive alignment loss that explicitly encourages slot-image correspondence by maximizing likelihood under aligned slots while minimizing it under mismatched slots. This objective serves as a tractable surrogate for maximizing mutual information between slots and inputs.

2 retrieved papers
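The objective described above can be sketched with a generic InfoNCE-style loss: each image's own slots act as the positive pair, and other images' slots in the batch serve as mismatched negatives. The pooling into one slot embedding per image and the cosine-similarity/temperature form are our assumptions for illustration; the paper states the loss through the diffusion likelihood under aligned versus mismatched slots rather than this exact form.

```python
import numpy as np

def contrastive_alignment_loss(slot_emb, img_emb, tau=0.1):
    """InfoNCE-style surrogate for slot-image mutual information: each
    image should be best explained by its own slots (diagonal positives)
    and poorly by other images' slots (off-diagonal negatives).

    slot_emb: (B, d) pooled slot representation per image
    img_emb:  (B, d) image representation
    """
    def l2norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    s, v = l2norm(slot_emb), l2norm(img_emb)
    logits = s @ v.T / tau  # (B, B); diagonal entries are aligned pairs

    def xent(lg):
        # stable cross-entropy with targets on the diagonal
        m = lg.max(axis=1, keepdims=True)
        lse = np.log(np.exp(lg - m).sum(axis=1)) + m[:, 0]
        return (lse - np.diag(lg)).mean()

    # symmetric: slots -> images and images -> slots
    return 0.5 * (xent(logits) + xent(logits.T))
```

Maximizing the diagonal terms while the log-sum-exp penalizes mismatched pairs is what makes this a tractable lower-bound surrogate for slot-image MI.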

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Register-augmented slot diffusion with input-independent register slots


Contribution

Finetuning cross-attention projections to mitigate text-conditioning bias


Contribution

Contrastive alignment objective for slot-image correspondence
