SceneTransporter: Optimal Transport-Guided Compositional Latent Diffusion for Single-Image Structured 3D Scene Generation
Overview
Overall Novelty Assessment
SceneTransporter contributes an end-to-end framework for structured 3D scene generation that decomposes a scene into distinct object instances from a single image. It resides in the Instance-Level Compositional Generation leaf, which contains only three papers, this work included. The sparse population suggests that instance-level decomposition with optimal transport constraints is a relatively underexplored direction within the broader compositional generation landscape, where most methods address either holistic scene synthesis or part-based reconstruction of individual objects.
The taxonomy reveals that SceneTransporter sits at the intersection of compositional methods and diffusion-guided generation. Neighboring leaves include Part-Based Reconstruction, which decomposes single objects rather than multi-object scenes, and Component-Aligned Semantic Reconstruction, which uses segmentation but may lack the exclusive routing mechanisms proposed here. The broader Diffusion-Guided 3D Generation branch contains methods like Flash3D and WonderWorld that prioritize holistic synthesis over structured instance decomposition. SceneTransporter's use of optimal transport within a compositional DiT model bridges these areas by imposing structural constraints on generative processes.
Among the twenty-nine candidates examined through semantic search, none clearly refutes any of the three core contributions. Ten candidates were examined for the Debiased Clustering Probe, nine for the Optimal-Transport–Guided Correlation Assignment Framework, and ten for the SceneTransporter End-to-End Framework, with zero refutations in each case. Within this limited set of top-ranked semantically similar papers, the specific combination of debiased probing, entropic optimal transport for assignment, and exclusive patch-to-latent routing appears distinctive, though how well the search covers the broader literature remains uncertain.
Based on the limited search of twenty-nine candidates and the sparse taxonomy leaf containing only three papers, the work appears to occupy a relatively novel position within instance-level compositional generation. However, this assessment reflects only the scope of top-K semantic matching. It does not guarantee coverage of all relevant prior work in optimal transport for 3D generation, compositional diffusion models, or structured scene decomposition, some of which may lie outside the examined candidate set.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce a diagnostic tool using CCA-based debiased clustering to analyze compositional 3D generators. This probe reveals that existing part-level generators fail to establish explicit instance-level associations, despite containing the necessary information implicitly in their learned features.
The authors reformulate structured 3D scene generation as a global correlation assignment problem solved via entropic Optimal Transport. This formulation introduces two structural constraints: a gating mechanism that enforces one-to-one routing between image patches and part-level tokens, and an edge-regularized cost that encourages coherent object grouping while preventing fragmentation.
The authors present SceneTransporter, a complete system that integrates the OT-guided assignment mechanism into compositional latent diffusion models. The framework operates within the denoising loop to generate structured 3D scenes with explicit instance-level object separation from single images.
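The entropic Optimal Transport assignment named in the second contribution can be illustrated with a generic Sinkhorn iteration. The following is a minimal sketch under stated assumptions, not the authors' implementation: the squared-distance cost, the uniform marginals, the values of `eps` and `n_iters`, and the argmax-based `hard_route` gate are all hypothetical choices made for illustration, and the edge-regularized cost term is omitted because its form is not given here.

```python
import numpy as np

def sinkhorn(cost, row_mass, col_mass, eps=0.05, n_iters=200):
    """Entropic OT via Sinkhorn scaling.

    Returns a transport plan whose row sums approximate row_mass and whose
    column sums match col_mass. eps and n_iters are illustrative defaults.
    """
    kernel = np.exp(-cost / eps)          # Gibbs kernel of the cost matrix
    u = np.ones_like(row_mass)
    v = np.ones_like(col_mass)
    for _ in range(n_iters):
        u = row_mass / (kernel @ v)       # rescale rows toward row_mass
        v = col_mass / (kernel.T @ u)     # rescale columns toward col_mass
    return u[:, None] * kernel * v[None, :]

def hard_route(plan):
    """Hypothetical gate: send each patch (row) to its single best token (column)."""
    gate = np.zeros_like(plan)
    gate[np.arange(plan.shape[0]), plan.argmax(axis=1)] = 1.0
    return gate

# Toy data (assumed): 6 image-patch features and 2 part-level token features.
rng = np.random.default_rng(0)
patches = np.concatenate([
    1.0 + rng.normal(0.0, 0.1, size=(3, 4)),   # patches resembling token 0
    -1.0 + rng.normal(0.0, 0.1, size=(3, 4)),  # patches resembling token 1
])
tokens = np.stack([np.ones(4), -np.ones(4)])

# Squared Euclidean cost, scaled to [0, 1] so exp(-cost / eps) stays well-conditioned.
cost = ((patches[:, None, :] - tokens[None, :, :]) ** 2).sum(axis=-1)
cost = cost / cost.max()

row_mass = np.full(6, 1.0 / 6.0)   # each patch carries equal mass
col_mass = np.full(2, 1.0 / 2.0)   # each token receives equal mass

plan = sinkhorn(cost, row_mass, col_mass)
gate = hard_route(plan)
routing = gate.argmax(axis=1)      # index of the token chosen for each patch
```

The soft plan spreads each patch's mass across tokens subject to the marginal constraints, while the gate collapses it to an exclusive assignment. Solving for the plan globally, rather than normalizing each patch independently as attention does, is what lets the column marginals keep any single token from absorbing every patch.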
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[29] SceneGen: Single-Image 3D Scene Generation in One Feedforward Pass
[46] MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation
Contribution Analysis
Detailed comparisons for each claimed contribution
Debiased Clustering Probe for Latent Structure Investigation
The authors introduce a diagnostic tool using CCA-based debiased clustering to analyze compositional 3D generators. This probe reveals that existing part-level generators fail to establish explicit instance-level associations, despite containing the necessary information implicitly in their learned features.
[64] Deep Dynamic Probabilistic Canonical Correlation Analysis
[65] Variational interpretable deep canonical correlation analysis
[66] Deep Probabilistic Canonical Correlation Analysis
[67] Latent State Space Modeling of High-Dimensional Time Series With a Canonical Correlation Objective
[68] A Bayesian nonparametrics view into deep representations
[69] Variational inference for deep probabilistic canonical correlation analysis
[70] Subspace Clustering of Subspaces: Unifying Canonical Correlation Analysis and Subspace Clustering
[71] Biosignal Generation and Latent Variable Analysis with Recurrent Generative Adversarial Networks
[72] Finite-Sample Analysis of Deep CCA-Based Unsupervised Post-Nonlinear Multimodal Learning
[73] Multi-Way, Multi-View Learning
Optimal-Transport–Guided Correlation Assignment Framework
The authors reformulate structured 3D scene generation as a global correlation assignment problem solved via entropic Optimal Transport. This formulation introduces two structural constraints: a gating mechanism that enforces one-to-one routing between image patches and part-level tokens, and an edge-regularized cost that encourages coherent object grouping while preventing fragmentation.
[55] HOTS3D: Hyper-Spherical Optimal Transport for Semantic Alignment of Text-to-3D Generation
[56] Reparo: Compositional 3d assets generation with differentiable 3d layout alignment
[57] Gromov-Wasserstein and optimal transport: from assignment problems to probabilistic numeric
[58] Flot: Scene flow on point clouds guided by optimal transport
[59] Simultaneous multiple-prompt guided generation using differentiable optimal transport
[60] DEMO: Disentangled Motion Latent Flow Matching for Fine-Grained Controllable Talking Portrait Synthesis
[61] Hyper-Spherical Optimal Transport for Semantic Alignment in Text-to-3D End-to-End Generation
[62] AFESCMNet: lightweight feature matching with adaptive enhancement and confidence modulation
[63] Informative GANs via Structured Regularization of Optimal Transport
SceneTransporter End-to-End Framework
The authors present SceneTransporter, a complete system that integrates the OT-guided assignment mechanism into compositional latent diffusion models. The framework operates within the denoising loop to generate structured 3D scenes with explicit instance-level object separation from single images.