Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: 3D, video diffusion, Gaussian splatting
Abstract:

The ability to generate virtual environments is crucial for applications ranging from gaming to physical AI domains such as robotics, autonomous driving, and industrial AI. Current learning-based 3D reconstruction methods depend on captured real-world multi-view data, which is not always readily available. Recent video diffusion models have shown remarkable imaginative capabilities, yet their 2D nature limits their application in simulation, where a robot must navigate and interact with its environment. In this paper, we propose a self-distillation framework that distills the implicit 3D knowledge in video diffusion models into an explicit 3D Gaussian Splatting (3DGS) representation, eliminating the need for multi-view training data. Specifically, we augment the typical RGB decoder with a 3DGS decoder, which is supervised by the output of the RGB decoder. With this approach, the 3DGS decoder can be trained purely on synthetic data generated by video diffusion models. At inference time, our model can synthesize 3D scenes from either a text prompt or a single image for real-time rendering. Our framework further extends to dynamic 3D scene generation from a monocular input video. Experimental results show that our framework achieves state-of-the-art performance in static and dynamic 3D scene generation. Video results: https://anonlyra.github.io/anonlyra

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: Generative 3D scene reconstruction from single images or videos. The field divides into several complementary branches that reflect different input modalities and reconstruction goals. Single-Image 3D Scene Reconstruction focuses on inferring complete geometry and appearance from a single view, often leveraging learned priors to hallucinate occluded regions. Video-Based 3D Scene Reconstruction exploits temporal cues and multi-view consistency across frames to build richer representations. Generative Model-Based 3D Synthesis emphasizes learning-driven approaches that can synthesize plausible scenes from minimal input, while Novel View Synthesis and 4D Scene Generation extends these ideas to produce dynamic content and novel camera trajectories. Specialized and Application-Driven Reconstruction targets domain-specific challenges such as autonomous driving or medical imaging, and Methodological Foundations and Surveys provide theoretical underpinnings and comparative analyses.

Representative works like Gen3dsr[3] and Gaussian Splatting Reconstruction[5] illustrate how different branches balance geometric fidelity with generative flexibility. Recent activity has concentrated on bridging static and dynamic reconstruction, with many studies exploring how video diffusion models can generate temporally coherent 4D scenes. Trade-offs between geometric accuracy and visual plausibility remain central: some methods prioritize photorealistic synthesis at the cost of precise depth, while others enforce stricter geometric constraints.

Within this landscape, Lyra[0] sits in the 4D Scene Generation from Video Diffusion Models cluster, alongside works such as 4dnex[34], Geo4d[36], and Videoscene[37]. Compared to 4real[44] and Holotime[47], which also tackle dynamic content, Lyra[0] emphasizes leveraging diffusion priors to generate novel viewpoints and temporal evolution from video input.
This positioning reflects a broader trend toward integrating generative models with explicit scene representations, balancing the need for high-quality synthesis with the demand for controllable, geometrically consistent outputs.

Claimed Contributions

Self-distillation framework for 3D scene reconstruction without multi-view data

The authors propose a teacher-student framework where a camera-controlled video diffusion model (teacher) supervises a 3D Gaussian Splatting decoder (student) operating in latent space. This approach removes the requirement for real-world multi-view training datasets by generating synthetic supervision through the video model.

10 retrieved papers
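The claimed teacher-student setup can be illustrated with a minimal sketch: a frozen "teacher" RGB decoder maps video latents to frames, and a "student" branch is fit so its renders match the teacher's output, with no multi-view ground truth involved. All shapes, function names, and the linear stand-in decoders below are hypothetical; the actual method predicts 3D Gaussian parameters and rasterizes them rather than applying a linear map.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical video latents: T frames, C latent channels, H x W spatial grid.
latents = rng.standard_normal((4, 8, 16, 16))

def rgb_decoder(z):
    """Stand-in for the frozen teacher: decodes latents to RGB frames."""
    w = np.ones((8, 3)) / 8.0  # fixed placeholder weights, not the real decoder
    return np.einsum("tchw,cd->tdhw", z, w)

def gs_decoder_render(z, params):
    """Stand-in for the student 3DGS branch. In the real method this would
    predict Gaussians and rasterize them; here it is a learnable linear map
    so the sketch stays runnable."""
    return np.einsum("tchw,cd->tdhw", z, params)

# Student parameters (standing in for the 3DGS decoder's weights).
params = rng.standard_normal((8, 3)) * 0.01

# Self-distillation objective: student renders should match teacher frames.
teacher_frames = rgb_decoder(latents)

for _ in range(200):  # plain gradient descent on the MSE distillation loss
    residual = gs_decoder_render(latents, params) - teacher_frames
    grad = np.einsum("tchw,tdhw->cd", latents, residual) / residual.size
    params -= 0.5 * grad

final_loss = float(np.mean((gs_decoder_render(latents, params) - teacher_frames) ** 2))
```

The point of the sketch is the supervision pattern, not the decoders themselves: the only training signal is the teacher's synthetic output, which is why the claim drops the need for captured multi-view data.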
Extension to dynamic 4D scene generation from monocular video

The method is extended to handle time-varying scenes by introducing time conditioning in the 3DGS decoder, enabling generation of dynamic 3D Gaussian representations from single-view video inputs with novel-view synthesis capabilities.

10 retrieved papers
Can Refute
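The time-conditioning claim amounts to giving the 3DGS decoder a timestamp alongside the latent so it can emit time-varying Gaussians. A common way to do this (an assumption here; the report does not specify the mechanism) is to broadcast a sinusoidal time embedding over the spatial grid and concatenate it to the latent channels. All names and dimensions below are hypothetical.

```python
import numpy as np

def time_embedding(t, dim=4):
    """Sinusoidal embedding of a normalized timestamp, a common conditioning choice."""
    freqs = 2.0 ** np.arange(dim // 2)
    return np.concatenate([np.sin(freqs * t), np.cos(freqs * t)])

def condition_latent_on_time(latent, t):
    """Broadcast the time embedding over the spatial grid and concatenate it
    to the latent channels, so the decoder sees (C + dim) input channels."""
    c, h, w = latent.shape
    emb = time_embedding(t)                                   # (dim,)
    emb_map = np.broadcast_to(emb[:, None, None], (emb.size, h, w))
    return np.concatenate([latent, emb_map], axis=0)

latent = np.zeros((8, 16, 16))          # hypothetical per-frame latent
conditioned = condition_latent_on_time(latent, t=0.25)
```

With this kind of conditioning, the same decoder weights can produce a different set of Gaussians per timestamp, which is what enables novel-view synthesis of dynamic content from a single monocular video.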
Latent-space 3DGS decoder for efficient multi-view processing

The authors design a 3DGS decoder that operates directly in the compressed latent space of the video diffusion model rather than pixel space, enabling efficient fusion of hundreds of input views (726 frames) that would otherwise exceed GPU memory limits in existing pixel-based approaches.

10 retrieved papers
Can Refute
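The memory argument behind the latent-space design can be checked with rough token-count arithmetic. The frame resolution and the VAE compression factors below are assumptions (8x per spatial axis and 4x temporal are typical for video diffusion VAEs, but the report does not state Lyra's); only the 726-frame count comes from the claim itself.

```python
# Hypothetical per-frame resolution; only the frame count is from the claim.
frames, h, w = 726, 704, 1280

# Pixel-space fusion would process every pixel of every view.
pixel_tokens = frames * h * w

# Assumed VAE compression: 8x per spatial axis, 4x temporal (typical values).
lat_frames = frames // 4
lat_h, lat_w = h // 8, w // 8
latent_tokens = lat_frames * lat_h * lat_w

# Ratio of pixel-space to latent-space workload.
ratio = pixel_tokens / latent_tokens
```

Under these assumed factors the latent decoder touches roughly 250x fewer elements than a pixel-space one, which is the kind of margin that would let hundreds of views fit where pixel-based fusion runs out of GPU memory.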

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Self-distillation framework for 3D scene reconstruction without multi-view data

The authors propose a teacher-student framework where a camera-controlled video diffusion model (teacher) supervises a 3D Gaussian Splatting decoder (student) operating in latent space. This approach removes the requirement for real-world multi-view training datasets by generating synthetic supervision through the video model.

Contribution

Extension to dynamic 4D scene generation from monocular video

The method is extended to handle time-varying scenes by introducing time conditioning in the 3DGS decoder, enabling generation of dynamic 3D Gaussian representations from single-view video inputs with novel-view synthesis capabilities.

Contribution

Latent-space 3DGS decoder for efficient multi-view processing

The authors design a 3DGS decoder that operates directly in the compressed latent space of the video diffusion model rather than pixel space, enabling efficient fusion of hundreds of input views (726 frames) that would otherwise exceed GPU memory limits in existing pixel-based approaches.
