Abstract:

We introduce SceneTransporter, an end-to-end framework for structured 3D scene generation from a single image. While existing methods generate part-level 3D objects, they often fail to organize these parts into distinct instances in open-world scenes. Through a debiased clustering probe, we reveal a critical insight: this failure stems from the lack of structural constraints within the model's internal assignment mechanism. Based on this finding, we reframe the task of structured 3D scene generation as a global correlation assignment problem. To solve this, SceneTransporter formulates and solves an entropic Optimal Transport (OT) objective within the denoising loop of the compositional DiT model. This formulation imposes two powerful structural constraints. First, the resulting transport plan gates cross-attention to enforce an exclusive, one-to-one routing of image patches to part-level 3D latents, preventing entanglement. Second, the competitive nature of the transport encourages the grouping of similar patches, a process that is further regularized by an edge-based cost, to form coherent objects and prevent fragmentation. Extensive experiments show that SceneTransporter outperforms existing methods on open-world scene generation, significantly improving instance-level coherence and geometric fidelity. Code and models will be publicly available at https://scenetransporter.github.io/.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

SceneTransporter contributes an end-to-end framework for structured 3D scene generation that decomposes scenes into distinct object instances from single images. It resides in the Instance-Level Compositional Generation leaf, which contains only three papers including the original work. This sparse population suggests the specific focus on instance-level decomposition with optimal transport constraints represents a relatively underexplored direction within the broader compositional generation landscape, where most methods address either holistic scene synthesis or part-based reconstruction of individual objects.

The taxonomy reveals that SceneTransporter sits at the intersection of compositional methods and diffusion-guided generation. Neighboring leaves include Part-Based Reconstruction, which decomposes single objects rather than multi-object scenes, and Component-Aligned Semantic Reconstruction, which uses segmentation but may lack the exclusive routing mechanisms proposed here. The broader Diffusion-Guided 3D Generation branch contains methods like Flash3D and WonderWorld that prioritize holistic synthesis over structured instance decomposition. SceneTransporter's use of optimal transport within a compositional DiT model bridges these areas by imposing structural constraints on generative processes.

Among the twenty-nine candidates examined through semantic search, none clearly refutes any of the three core contributions. Ten candidates were examined for the Debiased Clustering Probe, nine for the Optimal-Transport–Guided Correlation Assignment Framework, and ten for the SceneTransporter End-to-End Framework, with zero refutations in each case. Within this limited search scope of top-ranked semantically similar papers, the specific combination of debiased probing, entropic optimal transport for assignment, and exclusive patch-to-latent routing appears distinctive, though coverage of the broader literature remains uncertain.

Based on the limited search of twenty-nine candidates and the sparse taxonomy leaf containing only three papers, the work appears to occupy a relatively novel position within instance-level compositional generation. However, this assessment reflects the scope of top-K semantic matching and does not guarantee comprehensive coverage of all relevant prior work in optimal transport for 3D generation, compositional diffusion models, or structured scene decomposition methods that may exist outside the examined candidate set.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 0

Research Landscape Overview

Core task: structured 3D scene generation from a single image. The field encompasses a diverse set of approaches for reconstructing or synthesizing three-dimensional scenes from monocular input, organized into several major branches. Feed-Forward Scene Reconstruction methods emphasize direct, single-pass inference to recover geometry and layout, while Compositional and Structured Scene Generation focuses on decomposing scenes into meaningful parts or instances that can be manipulated independently. Diffusion-Guided 3D Generation leverages generative models to produce plausible scene content, and Interactive and Iterative Scene Generation allows for user-driven refinement or progressive construction. Additional branches address Dynamic and Temporal Scene Reconstruction for moving elements, Implicit Representation and Neural Rendering for continuous volumetric modeling, Camera Control and View Synthesis for novel viewpoint generation, and Auxiliary Tasks and Multi-Modal Fusion that integrate semantic or depth cues. Together, these branches reflect a spectrum from purely geometric reconstruction to generative synthesis, and from holistic scene modeling to fine-grained compositional control.

Within this landscape, a particularly active line of work explores instance-level compositional generation, where the goal is to identify and reconstruct individual objects or regions as separate entities that compose the overall scene. SceneTransporter[0] falls squarely into this cluster, emphasizing structured decomposition and manipulation of scene components. Nearby efforts such as SceneGen[29] and MIDI[46] similarly pursue compositional strategies, though they may differ in how they handle object boundaries or leverage diffusion priors. In contrast, methods like Flash3D[3] or WonderWorld[1] lean more heavily on diffusion-guided synthesis to generate entire scenes in a holistic manner, trading fine-grained part-level control for broader generative flexibility. The central tension across these branches involves balancing geometric accuracy and semantic interpretability with the creative freedom afforded by generative models, and SceneTransporter[0] addresses this by prioritizing explicit instance-level structure within its reconstruction pipeline.

Claimed Contributions

Debiased Clustering Probe for Latent Structure Investigation

The authors introduce a diagnostic tool using CCA-based debiased clustering to analyze compositional 3D generators. This probe reveals that existing part-level generators fail to establish explicit instance-level associations, despite containing the necessary information implicitly in their learned features.

10 retrieved papers

Optimal-Transport–Guided Correlation Assignment Framework

The authors reformulate structured 3D scene generation as a global correlation assignment problem solved via entropic Optimal Transport. This formulation introduces two structural constraints: a gating mechanism that enforces one-to-one routing between image patches and part-level tokens, and an edge-regularized cost that encourages coherent object grouping while preventing fragmentation.
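To make the entropic Optimal Transport step concrete, the following is a minimal sketch of Sinkhorn iterations that compute a transport plan between image patches and part-level tokens. It is not the authors' implementation: the function name `sinkhorn_plan`, the regularization strength `eps`, the uniform marginals, and the toy cost matrix are all assumptions chosen for illustration, and the edge-regularized cost described above is not modeled here.

```python
import math

def sinkhorn_plan(cost, row_mass, col_mass, eps=0.2, n_iters=500):
    """Entropic OT via Sinkhorn iterations (illustrative, hypothetical helper).

    cost[i][j]: cost of sending patch i to part token j.
    row_mass:   mass available at each patch (sums to 1).
    col_mass:   mass required by each part token (sums to 1).
    Returns the transport plan as a list of rows.
    """
    n, m = len(cost), len(cost[0])
    # Gibbs kernel: low cost -> large kernel entry.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [row_mass[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy example (assumed values): 4 image patches, 2 part tokens.
cost = [
    [0.1, 0.9],
    [0.2, 0.8],
    [0.9, 0.1],
    [0.7, 0.3],
]
row_mass = [0.25] * 4   # each patch carries equal mass
col_mass = [0.5] * 2    # each part token receives equal mass
plan = sinkhorn_plan(cost, row_mass, col_mass)

# Hard routing: each patch goes to the part token with the most mass.
assignment = [max(range(len(row)), key=lambda j: row[j]) for row in plan]
```

The column marginal is what makes the assignment competitive in this sketch: because every part token must receive a fixed share of the total mass, patches cannot all collapse onto a single token.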

9 retrieved papers

SceneTransporter End-to-End Framework

The authors present SceneTransporter, a complete system that integrates the OT-guided assignment mechanism into compositional latent diffusion models. The framework operates within the denoising loop to generate structured 3D scenes with explicit instance-level object separation from single images.
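One way to picture how a transport plan could gate cross-attention inside a denoising step is sketched below. This is an assumption-laden illustration, not the paper's mechanism: it assumes a hard gate in which each image patch is visible only to the part latent that receives most of its transport mass, and the names `route_patches` and `gated_attention`, along with the toy `plan` and `logits` values, are hypothetical.

```python
import math

def route_patches(plan):
    """Assign each patch to the part latent with the largest transport mass."""
    return [max(range(len(row)), key=lambda k: row[k]) for row in plan]

def gated_attention(logits, plan):
    """Masked softmax over patches for each part latent (illustrative only).

    logits[k][i]: raw attention score of part latent k over image patch i.
    plan[i][k]:   transport mass from patch i to part latent k.
    A part latent attends only to patches routed to it; a latent that owns
    no patches gets an all-zero weight row.
    """
    owner = route_patches(plan)
    weights = []
    for k, row in enumerate(logits):
        kept = [i for i in range(len(row)) if owner[i] == k]
        if not kept:
            weights.append([0.0] * len(row))
            continue
        peak = max(row[i] for i in kept)  # subtract max for numerical stability
        exps = {i: math.exp(row[i] - peak) for i in kept}
        total = sum(exps.values())
        weights.append([exps.get(i, 0.0) / total for i in range(len(row))])
    return weights

# Toy values (assumed): 4 patches, 2 part latents.
plan = [[0.24, 0.01], [0.23, 0.02], [0.01, 0.24], [0.02, 0.23]]
logits = [[1.0, 0.5, 0.2, 0.1], [0.3, 0.4, 2.0, 1.5]]
weights = gated_attention(logits, plan)
```

In this sketch every patch contributes to exactly one part latent, which is the exclusivity property the report describes; a soft gate that multiplies attention weights by the plan entries would be an equally plausible alternative.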

10 retrieved papers

SceneTransporter: Optimal Transport-Guided Compositional Latent Diffusion for Single-Image Structured 3D Scene Generation | Novelty Validation