Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: 3D, video diffusion, Gaussian splatting
Abstract:

The ability to generate virtual environments is crucial for applications ranging from gaming to physical AI domains such as robotics, autonomous driving, and industrial AI. Current learning-based 3D reconstruction methods depend on captured real-world multi-view data, which is not always readily available. Recent video diffusion models have shown remarkable imaginative capabilities, yet their 2D nature limits their application in simulation, where a robot must navigate and interact with its environment. In this paper, we propose a self-distillation framework that distills the implicit 3D knowledge in video diffusion models into an explicit 3D Gaussian Splatting (3DGS) representation, eliminating the need for multi-view training data. Specifically, we augment the typical RGB decoder with a 3DGS decoder, which is supervised by the output of the RGB decoder. With this approach, the 3DGS decoder can be trained purely on synthetic data generated by video diffusion models. At inference time, our model can synthesize 3D scenes from either a text prompt or a single image for real-time rendering. Our framework further extends to dynamic 3D scene generation from a monocular input video. Experimental results show that our framework achieves state-of-the-art performance in static and dynamic 3D scene generation. Video results: https://anonlyra.github.io/anonlyra

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: Generative 3D scene reconstruction from single images or videos. The field divides into several complementary branches that reflect different input modalities and reconstruction goals. Single-Image 3D Scene Reconstruction focuses on inferring complete geometry and appearance from a single view, often leveraging learned priors to hallucinate occluded regions. Video-Based 3D Scene Reconstruction exploits temporal cues and multi-view consistency across frames to build richer representations. Generative Model-Based 3D Synthesis emphasizes learning-driven approaches that can synthesize plausible scenes from minimal input, while Novel View Synthesis and 4D Scene Generation extends these ideas to produce dynamic content and novel camera trajectories. Specialized and Application-Driven Reconstruction targets domain-specific challenges such as autonomous driving or medical imaging, and Methodological Foundations and Surveys provide theoretical underpinnings and comparative analyses.

Representative works like Gen3dsr[3] and Gaussian Splatting Reconstruction[5] illustrate how different branches balance geometric fidelity with generative flexibility. Recent activity has concentrated on bridging static and dynamic reconstruction, with many studies exploring how video diffusion models can generate temporally coherent 4D scenes. Trade-offs between geometric accuracy and visual plausibility remain central: some methods prioritize photorealistic synthesis at the cost of precise depth, while others enforce stricter geometric constraints.

Within this landscape, Lyra[0] sits in the 4D Scene Generation from Video Diffusion Models cluster, alongside works such as 4dnex[34], Geo4d[36], and Videoscene[37]. Compared to 4real[44] and Holotime[47], which also tackle dynamic content, Lyra[0] emphasizes leveraging diffusion priors to generate novel viewpoints and temporal evolution from video input.
This positioning reflects a broader trend toward integrating generative models with explicit scene representations, balancing the need for high-quality synthesis with the demand for controllable, geometrically consistent outputs.

Claimed Contributions

Self-distillation framework for 3D scene reconstruction without multi-view data

The authors propose a teacher-student framework where a camera-controlled video diffusion model (teacher) supervises a 3D Gaussian Splatting decoder (student) operating in latent space. This approach removes the requirement for real-world multi-view training datasets by generating synthetic supervision through the video model.

10 retrieved papers
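The claimed teacher-student setup can be illustrated with a minimal sketch: a frozen "teacher" RGB decoder maps video latents to frames, and a "student" branch is fit so its renders match the teacher's output, with no multi-view ground truth involved. All shapes, function names, and the linear stand-in decoders below are hypothetical; the actual method predicts 3D Gaussian parameters and rasterizes them rather than applying a linear map.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical video latents: T frames, C latent channels, H x W spatial grid.
latents = rng.standard_normal((4, 8, 16, 16))

def rgb_decoder(z):
    """Stand-in for the frozen teacher: decodes latents to RGB frames."""
    w = np.ones((8, 3)) / 8.0  # fixed placeholder weights, not the real decoder
    return np.einsum("tchw,cd->tdhw", z, w)

def gs_decoder_render(z, params):
    """Stand-in for the student 3DGS branch. In the real method this would
    predict Gaussians and rasterize them; here it is a learnable linear map
    so the sketch stays runnable."""
    return np.einsum("tchw,cd->tdhw", z, params)

# Student parameters (standing in for the 3DGS decoder's weights).
params = rng.standard_normal((8, 3)) * 0.01

# Self-distillation objective: student renders should match teacher frames.
teacher_frames = rgb_decoder(latents)

for _ in range(200):  # plain gradient descent on the MSE distillation loss
    residual = gs_decoder_render(latents, params) - teacher_frames
    grad = np.einsum("tchw,tdhw->cd", latents, residual) / residual.size
    params -= 0.5 * grad

final_loss = float(np.mean((gs_decoder_render(latents, params) - teacher_frames) ** 2))
```

The point of the sketch is the supervision pattern, not the decoders themselves: the only training signal is the teacher's synthetic output, which is why the claim drops the need for captured multi-view data.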
Extension to dynamic 4D scene generation from monocular video

The method is extended to handle time-varying scenes by introducing time conditioning in the 3DGS decoder, enabling generation of dynamic 3D Gaussian representations from single-view video inputs with novel-view synthesis capabilities.

10 retrieved papers
Can Refute
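The time-conditioning claim amounts to giving the 3DGS decoder a timestamp alongside the latent so it can emit time-varying Gaussians. A common way to do this (an assumption here; the report does not specify the mechanism) is to broadcast a sinusoidal time embedding over the spatial grid and concatenate it to the latent channels. All names and dimensions below are hypothetical.

```python
import numpy as np

def time_embedding(t, dim=4):
    """Sinusoidal embedding of a normalized timestamp, a common conditioning choice."""
    freqs = 2.0 ** np.arange(dim // 2)
    return np.concatenate([np.sin(freqs * t), np.cos(freqs * t)])

def condition_latent_on_time(latent, t):
    """Broadcast the time embedding over the spatial grid and concatenate it
    to the latent channels, so the decoder sees (C + dim) input channels."""
    c, h, w = latent.shape
    emb = time_embedding(t)                                   # (dim,)
    emb_map = np.broadcast_to(emb[:, None, None], (emb.size, h, w))
    return np.concatenate([latent, emb_map], axis=0)

latent = np.zeros((8, 16, 16))          # hypothetical per-frame latent
conditioned = condition_latent_on_time(latent, t=0.25)
```

With this kind of conditioning, the same decoder weights can produce a different set of Gaussians per timestamp, which is what enables novel-view synthesis of dynamic content from a single monocular video.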
Latent-space 3DGS decoder for efficient multi-view processing

The authors design a 3DGS decoder that operates directly in the compressed latent space of the video diffusion model rather than pixel space, enabling efficient fusion of hundreds of input views (726 frames) that would otherwise exceed GPU memory limits in existing pixel-based approaches.

10 retrieved papers
Can Refute
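The memory argument behind the latent-space design can be checked with rough token-count arithmetic. The frame resolution and the VAE compression factors below are assumptions (8x per spatial axis and 4x temporal are typical for video diffusion VAEs, but the report does not state Lyra's); only the 726-frame count comes from the claim itself.

```python
# Hypothetical per-frame resolution; only the frame count is from the claim.
frames, h, w = 726, 704, 1280

# Pixel-space fusion would process every pixel of every view.
pixel_tokens = frames * h * w

# Assumed VAE compression: 8x per spatial axis, 4x temporal (typical values).
lat_frames = frames // 4
lat_h, lat_w = h // 8, w // 8
latent_tokens = lat_frames * lat_h * lat_w

# Ratio of pixel-space to latent-space workload.
ratio = pixel_tokens / latent_tokens
```

Under these assumed factors the latent decoder touches roughly 250x fewer elements than a pixel-space one, which is the kind of margin that would let hundreds of views fit where pixel-based fusion runs out of GPU memory.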

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Self-distillation framework for 3D scene reconstruction without multi-view data

The authors propose a teacher-student framework where a camera-controlled video diffusion model (teacher) supervises a 3D Gaussian Splatting decoder (student) operating in latent space. This approach removes the requirement for real-world multi-view training datasets by generating synthetic supervision through the video model.

Contribution

Extension to dynamic 4D scene generation from monocular video

The method is extended to handle time-varying scenes by introducing time conditioning in the 3DGS decoder, enabling generation of dynamic 3D Gaussian representations from single-view video inputs with novel-view synthesis capabilities.

Contribution

Latent-space 3DGS decoder for efficient multi-view processing

The authors design a 3DGS decoder that operates directly in the compressed latent space of the video diffusion model rather than pixel space, enabling efficient fusion of hundreds of input views (726 frames) that would otherwise exceed GPU memory limits in existing pixel-based approaches.
