Improved Object-Centric Diffusion Learning with Registers and Contrastive Alignment
Overview
Overall Novelty Assessment
The paper proposes CODA, which combines slot attention with a pretrained diffusion model, introducing register slots and a contrastive alignment loss to improve object-centric learning. It sits within the 'Slot Attention with Diffusion Models' leaf, which contains eight papers including SlotDiffusion and Object-Centric Slot Diffusion. This is a moderately populated research direction within the broader Object-Centric Representation Learning branch, indicating active but not overcrowded exploration of slot-based decomposition methods combined with diffusion architectures.
The taxonomy reveals that neighboring leaves explore discrete tokenization (Discrete and Grouped Representations), multi-scale fusion approaches, and causal disentanglement, each containing one to two papers. These directions represent alternative strategies for structured scene understanding. CODA's focus on contrastive alignment and register mechanisms distinguishes it from these neighbors, which emphasize quantization schemes or interpretability through causal modeling. The broader Object-Centric Generation and Synthesis branch (six papers across three leaves) addresses compositional generation rather than representation quality, highlighting CODA's emphasis on learning robust slots before generation.
Of the fifteen candidates examined in total, two were compared against the contrastive alignment contribution (no clear refutation found), ten against the cross-attention finetuning contribution (two judged refutable), and three against the register-augmented slot mechanism (no refutation found). These statistics suggest that, within the limited search scope, the contrastive alignment and register slot ideas appear less directly anticipated by existing work, whereas adapting cross-attention layers to reduce text-conditioning bias may have more substantial precedent in the examined literature.
Based on the top-fifteen semantic matches and the moderately populated taxonomy leaf, CODA appears to introduce meaningful refinements to slot attention mechanisms, particularly through register slots and contrastive objectives. However, the limited search scope means this assessment covers a narrow slice of potentially relevant prior work. The analysis does not capture exhaustive coverage of alignment strategies or attention mechanism modifications across the broader diffusion and representation learning literature.
Claimed Contributions
The authors introduce input-independent register slots that act as attention sinks to absorb residual attention, reducing interference between object slots and mitigating slot entanglement. These register slots are obtained by encoding padding tokens through the frozen SD text encoder.
The authors propose a lightweight adaptation strategy that finetunes only the key, value, and output projections in cross-attention layers of the pretrained diffusion model. This approach mitigates text-conditioning bias while preserving generative quality without requiring additional architectural layers.
The authors introduce a contrastive alignment loss that explicitly encourages slot-image correspondence by maximizing likelihood under aligned slots while minimizing it under mismatched slots. This objective serves as a tractable surrogate for maximizing mutual information between slots and inputs.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[2] SlotDiffusion: Object-Centric Generative Modeling with Diffusion Models
[3] Object-Centric Slot Diffusion
[10] Guided Latent Slot Diffusion for Object-Centric Learning
[14] Slot-Guided Adaptation of Pre-trained Diffusion Models for Object-Centric Learning and Compositional Generation
[18] GLASS: Guided Latent Slot Diffusion for Object-Centric Learning
[19] Learning Object-Centric Representations Based on Slots in Real World Scenarios
[37] Advances in object-centric learning methods towards real world applications
Contribution Analysis
Detailed comparisons for each claimed contribution
Register-augmented slot diffusion with input-independent register slots
The authors introduce input-independent register slots that act as attention sinks to absorb residual attention, reducing interference between object slots and mitigating slot entanglement. These register slots are obtained by encoding padding tokens through the frozen SD text encoder.
[51] Action-slot: Visual action-centric representations for multi-label atomic activity recognition in traffic scenes
[52] CacheClip: Accelerating RAG with Effective KV Cache Reuse
[53] Plug-and-Play Global Memory via Test-Time Registers
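As a rough illustration of the attention-sink mechanism described above, the following numpy sketch runs one slot-attention-style update with extra register slots that compete for attention and are then discarded. The function name, shapes, and single-step form are illustrative assumptions, not the authors' implementation, which operates inside the diffusion model rather than in a standalone step.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention_step(inputs, object_slots, register_slots):
    """One illustrative attention update with register slots as attention sinks.

    inputs: (N, D) image features; object_slots: (K, D); register_slots: (R, D).
    """
    slots = np.concatenate([object_slots, register_slots], axis=0)  # (K+R, D)
    logits = inputs @ slots.T / np.sqrt(slots.shape[1])             # (N, K+R)
    # Slot attention normalizes over the slot axis, so slots compete for each
    # input token; registers can win tokens that fit no object slot, absorbing
    # residual attention instead of forcing it onto the object slots.
    attn = softmax(logits, axis=1)
    weights = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
    updates = weights.T @ inputs                                    # (K+R, D)
    k = object_slots.shape[0]
    return updates[:k]  # register updates are discarded downstream
```

Because the registers are input-independent and dropped after the update, they change only how attention mass is distributed, not the downstream slot interface.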
Finetuning cross-attention projections to mitigate text-conditioning bias
The authors propose a lightweight adaptation strategy that finetunes only the key, value, and output projections in cross-attention layers of the pretrained diffusion model. This approach mitigates text-conditioning bias while preserving generative quality without requiring additional architectural layers.
[56] MACE: Mass Concept Erasure in Diffusion Models
[62] Contextualized Diffusion Models for Text-Guided Image and Video Generation
[54] Editing implicit assumptions in text-to-image diffusion models
[55] Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models
[57] Aid: Attention interpolation of text-to-image diffusion
[58] Enhancing text-to-image diffusion transformer via split-text conditioning
[59] Exploring the role of large language models in prompt encoding for diffusion models
[60] Energy-Based Cross Attention for Bayesian Context Update in Text-to-Image Diffusion Models
[61] Fair Generation without Unfair Distortions: Debiasing Text-to-Image Generation with Entanglement-Free Attention
[63] HiGarment: Cross-modal Harmony Based Diffusion Model for Flat Sketch to Realistic Garment Image
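The selective-finetuning recipe above can be illustrated by filtering parameter names. The `attn2.to_k`/`to_v`/`to_out` naming follows the common Stable Diffusion (diffusers) convention, where `attn2` denotes cross-attention and `attn1` self-attention; the helper itself is a hypothetical sketch, not the paper's code, and notably leaves the query projection `to_q` frozen along with everything else.

```python
def crossattn_finetune_params(param_names):
    """Select parameters to finetune: only key, value, and output projections
    of cross-attention blocks. All other parameters stay frozen."""
    patterns = ("attn2.to_k", "attn2.to_v", "attn2.to_out")
    return [n for n in param_names if any(p in n for p in patterns)]
```

In a training loop, the returned names would receive `requires_grad=True` while the rest of the pretrained model is left untouched, which is why no additional architectural layers are needed.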
Contrastive alignment objective for slot-image correspondence
The authors introduce a contrastive alignment loss that explicitly encourages slot-image correspondence by maximizing likelihood under aligned slots while minimizing it under mismatched slots. This objective serves as a tractable surrogate for maximizing mutual information between slots and inputs.
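A minimal sketch of such an objective, assuming a matrix of pairwise slot-image likelihood scores: the InfoNCE-style cross-entropy below maximizes the score of each aligned (diagonal) pair relative to mismatched pairs, which is the standard tractable lower bound on slot-input mutual information. The function and score-matrix setup are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def contrastive_alignment_loss(scores):
    """scores[i, j]: likelihood score of image i decoded under the slots
    extracted from image j. Diagonal entries are aligned pairs; the loss
    pushes aligned likelihoods up and mismatched ones down."""
    m = scores.max(axis=1, keepdims=True)
    # numerically stable log-softmax over each row of candidate slot sets
    log_softmax = scores - m - np.log(np.exp(scores - m).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))
```

With uninformative (uniform) scores the loss equals log of the batch size, and it approaches zero as aligned pairs dominate their mismatched alternatives.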