SceneTransporter: Optimal Transport-Guided Compositional Latent Diffusion for Single-Image Structured 3D Scene Generation

ICLR 2026 Conference Submission • Anonymous Authors

OpenReview Score: 6.0

3D Scene Generation • Part-aware 3D Generation

Abstract:

We introduce SceneTransporter, an end-to-end framework for structured 3D scene generation from a single image. While existing methods generate part-level 3D objects, they often fail to organize these parts into distinct instances in open-world scenes. Through a debiased clustering probe, we reveal a critical insight: this failure stems from the lack of structural constraints within the model's internal assignment mechanism. Based on this finding, we reframe the task of structured 3D scene generation as a global correlation assignment problem. To solve this, SceneTransporter formulates and solves an entropic Optimal Transport (OT) objective within the denoising loop of the compositional DiT model. This formulation imposes two powerful structural constraints. First, the resulting transport plan gates cross-attention to enforce an exclusive, one-to-one routing of image patches to part-level 3D latents, preventing entanglement. Second, the competitive nature of the transport encourages the grouping of similar patches, a process that is further regularized by an edge-based cost, to form coherent objects and prevent fragmentation. Extensive experiments show that SceneTransporter outperforms existing methods on open-world scene generation, significantly improving instance-level coherence and geometric fidelity. Code and models will be publicly available at https://scenetransporter.github.io/.

Disclaimer

This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.

NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.

If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

SceneTransporter contributes an end-to-end framework for structured 3D scene generation that decomposes scenes into distinct object instances from single images. It resides in the Instance-Level Compositional Generation leaf, which contains only three papers including the original work. This sparse population suggests the specific focus on instance-level decomposition with optimal transport constraints represents a relatively underexplored direction within the broader compositional generation landscape, where most methods address either holistic scene synthesis or part-based reconstruction of individual objects.

The taxonomy reveals that SceneTransporter sits at the intersection of compositional methods and diffusion-guided generation. Neighboring leaves include Part-Based Reconstruction, which decomposes single objects rather than multi-object scenes, and Component-Aligned Semantic Reconstruction, which uses segmentation but may lack the exclusive routing mechanisms proposed here. The broader Diffusion-Guided 3D Generation branch contains methods like Flash3D and WonderWorld that prioritize holistic synthesis over structured instance decomposition. SceneTransporter's use of optimal transport within a compositional DiT model bridges these areas by imposing structural constraints on generative processes.

Among the twenty-nine candidates examined through semantic search, none clearly refutes any of the three core contributions. Ten candidates were examined for the Debiased Clustering Probe, nine for the Optimal-Transport–Guided Correlation Assignment Framework, and ten for the SceneTransporter End-to-End Framework, with zero refutations in each case. Within this limited set of top-ranked, semantically similar papers, the specific combination of debiased probing, entropic optimal transport for assignment, and exclusive patch-to-latent routing appears distinctive, though coverage of the broader literature remains uncertain.

Based on the limited search of twenty-nine candidates and the sparse taxonomy leaf containing only three papers, the work appears to occupy a relatively novel position within instance-level compositional generation. However, this assessment reflects the scope of top-K semantic matching and does not guarantee comprehensive coverage of all relevant prior work in optimal transport for 3D generation, compositional diffusion models, or structured scene decomposition methods that may exist outside the examined candidate set.

Taxonomy

Research Landscape Overview

Core task: Structured 3D scene generation from single image. The field encompasses a diverse set of approaches for reconstructing or synthesizing three-dimensional scenes from monocular input, organized into several major branches. Feed-Forward Scene Reconstruction methods emphasize direct, single-pass inference to recover geometry and layout, while Compositional and Structured Scene Generation focuses on decomposing scenes into meaningful parts or instances that can be manipulated independently. Diffusion-Guided 3D Generation leverages generative models to produce plausible scene content, and Interactive and Iterative Scene Generation allows for user-driven refinement or progressive construction. Additional branches address Dynamic and Temporal Scene Reconstruction for moving elements, Implicit Representation and Neural Rendering for continuous volumetric modeling, Camera Control and View Synthesis for novel viewpoint generation, and Auxiliary Tasks and Multi-Modal Fusion that integrate semantic or depth cues. Together, these branches reflect a spectrum from purely geometric reconstruction to generative synthesis, and from holistic scene modeling to fine-grained compositional control.

Within this landscape, a particularly active line of work explores instance-level compositional generation, where the goal is to identify and reconstruct individual objects or regions as separate entities that compose the overall scene. SceneTransporter[0] falls squarely into this cluster, emphasizing structured decomposition and manipulation of scene components. Nearby efforts such as SceneGen[29] and MIDI[46] similarly pursue compositional strategies, though they may differ in how they handle object boundaries or leverage diffusion priors. In contrast, methods like Flash3D[3] or WonderWorld[1] lean more heavily on diffusion-guided synthesis to generate entire scenes in a holistic manner, trading fine-grained part-level control for broader generative flexibility.

The central tension across these branches involves balancing geometric accuracy and semantic interpretability with the creative freedom afforded by generative models, and SceneTransporter[0] addresses this by prioritizing explicit instance-level structure within its reconstruction pipeline.

Claimed Contributions

Debiased Clustering Probe for Latent Structure Investigation

10 retrieved papers

The authors introduce a diagnostic tool using CCA-based debiased clustering to analyze compositional 3D generators. This probe reveals that existing part-level generators fail to establish explicit instance-level associations, despite containing the necessary information implicitly in their learned features.

Optimal-Transport–Guided Correlation Assignment Framework

9 retrieved papers

The authors reformulate structured 3D scene generation as a global correlation assignment problem solved via entropic Optimal Transport. This formulation introduces two structural constraints: a gating mechanism that enforces one-to-one routing between image patches and part-level tokens, and an edge-regularized cost that encourages coherent object grouping while preventing fragmentation.
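The assignment idea summarized above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the cosine cost, the uniform marginals, the helper names (cosine_cost, sinkhorn_plan, gate_cross_attention), and the particular way the plan gates attention are all assumptions made for illustration, and the edge-based cost term is omitted because the report does not specify it. The sketch solves a standard entropic OT problem with Sinkhorn iterations and then uses the resulting plan to suppress attention from a part latent to patches routed elsewhere.

```python
import numpy as np


def cosine_cost(patches, parts):
    """Assumed cost: C[i, j] = 1 - cos(patch_i, part_j), shape (N, M)."""
    p = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    q = parts / np.linalg.norm(parts, axis=1, keepdims=True)
    return 1.0 - p @ q.T


def sinkhorn_plan(cost, a, b, eps=0.1, n_iters=200):
    """Entropic OT: minimise <T, cost> - eps * H(T) with row sums a, column sums b."""
    K = np.exp(-cost / eps)              # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):             # alternate marginal scalings
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]   # transport plan T, shape (N, M)


def gate_cross_attention(parts, patches, plan):
    """Hypothetical gating: weight each part's attention over patches by the
    share of each patch's mass that the plan sends to that part."""
    d = parts.shape[1]
    logits = parts @ patches.T / np.sqrt(d)             # (M, N)
    logits = logits - logits.max(axis=1, keepdims=True)
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)             # softmax over patches
    gate = (plan / plan.sum(axis=1, keepdims=True)).T   # (M, N)
    gated = attn * gate
    return gated / gated.sum(axis=1, keepdims=True)     # renormalise per part


# Toy data: six 2-D "patch" features in two groups, two "part" latents.
patches = np.array([[1.0, 0.1], [1.0, 0.0], [0.9, 0.1],
                    [0.0, 1.0], [0.1, 1.0], [0.1, 0.9]])
parts = np.array([[1.0, 0.0], [0.0, 1.0]])
a = np.full(len(patches), 1.0 / len(patches))   # uniform patch mass (assumed)
b = np.full(len(parts), 1.0 / len(parts))       # uniform part mass (assumed)

plan = sinkhorn_plan(cosine_cost(patches, parts), a, b)
print(plan.argmax(axis=1))                      # hard patch-to-part assignment
print(gate_cross_attention(parts, patches, plan).round(3))
```

A smaller eps pushes the plan toward a hard, near one-to-one routing, at the cost of slower and less numerically stable iterations; a log-domain Sinkhorn would be the usual remedy in that regime.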

SceneTransporter End-to-End Framework

10 retrieved papers

The authors present SceneTransporter, a complete system that integrates the OT-guided assignment mechanism into compositional latent diffusion models. The framework operates within the denoising loop to generate structured 3D scenes with explicit instance-level object separation from single images.

Core Task Comparisons

Comparisons with papers in the same taxonomy category

[29] SceneGen: Single-Image 3D Scene Generation in One Feedforward Pass

Yanxu Meng, Haoning Wu, Ya Zhang, Weidi Xie (2025) • arXiv.org

[46] MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation

Zehuan Huang, Yuan-Chen Guo, Xingqiao An, Yunhan Yang, Yangguang Li, Zi-Xin Zou, Ding Liang, Xihui Liu, Yan-Pei Cao, Lu Sheng (2025)

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Debiased Clustering Probe for Latent Structure Investigation

[64] Deep Dynamic Probabilistic Canonical Correlation Analysis

Cannot Refute

[65] Variational interpretable deep canonical correlation analysis

Cannot Refute

[66] Deep Probabilistic Canonical Correlation Analysis

Cannot Refute

[67] Latent State Space Modeling of High-Dimensional Time Series With a Canonical Correlation Objective

Cannot Refute

[68] A Bayesian nonparametrics view into deep representations

Cannot Refute

[69] Variational inference for deep probabilistic canonical correlation analysis

Cannot Refute

[70] Subspace Clustering of Subspaces: Unifying Canonical Correlation Analysis and Subspace Clustering

Cannot Refute

[71] Biosignal Generation and Latent Variable Analysis with Recurrent Generative Adversarial Networks

Cannot Refute

[72] Finite-Sample Analysis of Deep CCA-Based Unsupervised Post-Nonlinear Multimodal Learning

Cannot Refute

[73] Multi-Way, Multi-View Learning

Cannot Refute

Contribution

Optimal-Transport–Guided Correlation Assignment Framework

[55] HOTS3D: Hyper-Spherical Optimal Transport for Semantic Alignment of Text-to-3D Generation

Cannot Refute

[56] Reparo: Compositional 3d assets generation with differentiable 3d layout alignment

Cannot Refute

[57] Gromov-Wasserstein and optimal transport: from assignment problems to probabilistic numeric

Cannot Refute

[58] Flot: Scene flow on point clouds guided by optimal transport

Cannot Refute

[59] Simultaneous multiple-prompt guided generation using differentiable optimal transport

Cannot Refute

[60] DEMO: Disentangled Motion Latent Flow Matching for Fine-Grained Controllable Talking Portrait Synthesis

Cannot Refute

[61] Hyper-Spherical Optimal Transport for Semantic Alignment in Text-to-3D End-to-End Generation

Cannot Refute

[62] AFESCMNet: lightweight feature matching with adaptive enhancement and confidence modulation

Cannot Refute

[63] Informative GANs via Structured Regularization of Optimal Transport

Cannot Refute

Contribution

SceneTransporter End-to-End Framework

[1] WonderWorld: Interactive 3D Scene Generation from a Single Image

Cannot Refute

[3] Flash3D: Feed-Forward Generalisable 3D Scene Reconstruction from a Single Image

Cannot Refute

[8] A recipe for generating 3d worlds from a single image

Cannot Refute

[11] Learning to recover 3d scene shape from a single image

Cannot Refute

[18] Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction

Cannot Refute

[32] A point set generation network for 3d object reconstruction from a single image

Cannot Refute

[51] An end-to-end shape modeling framework for vectorized building outline generation from aerial images

Cannot Refute

[52] Text2nerf: Text-driven 3d scene generation with neural radiance fields

Cannot Refute

[53] PanoRecon: Real-time panoptic 3D reconstruction from monocular video

Cannot Refute

[54] Synsin: End-to-end view synthesis from a single image

Cannot Refute
