Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation
Overview
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose a teacher-student framework in which a camera-controlled video diffusion model (teacher) supervises a 3D Gaussian Splatting decoder (student) operating in latent space. This removes the requirement for real-world multi-view training datasets: the video model is used to synthesize the multi-view supervision itself.
The method is extended to time-varying scenes by adding time conditioning to the 3DGS decoder, so that dynamic 3D Gaussian representations can be generated from a single monocular video, with support for novel-view synthesis.
The authors design a 3DGS decoder that operates directly in the compressed latent space of the video diffusion model rather than in pixel space, enabling efficient fusion of hundreds of input views (726 frames), a view count that would exceed GPU memory limits in existing pixel-space approaches.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[34] 4DNeX: Feed-forward 4D generative modeling made easy PDF
[36] Geo4D: Leveraging video generators for geometric 4D scene reconstruction PDF
[44] 4Real: Towards photorealistic 4D scene generation via video diffusion models PDF
Contribution Analysis
Detailed comparisons for each claimed contribution
Self-distillation framework for 3D scene reconstruction without multi-view data
The authors propose a teacher-student framework in which a camera-controlled video diffusion model (teacher) supervises a 3D Gaussian Splatting decoder (student) operating in latent space. This removes the requirement for real-world multi-view training datasets: the video model is used to synthesize the multi-view supervision itself.
[69] Self-supervised reflectance-guided 3d shape reconstruction from single-view images PDF
[70] Weakly supervised monocular 3d detection with a single-view image PDF
[71] 3D Feature Distillation with Object-Centric Priors PDF
[72] Pre-train, self-train, distill: A simple recipe for supersizing 3d reconstruction PDF
[73] Model-based 3d hand reconstruction via self-supervised learning PDF
[74] RAFT-MSF: Self-supervised monocular scene flow using recurrent optimizer PDF
[75] Consistent 3d hand reconstruction in video via self-supervised learning PDF
[76] Exploiting the Potential of Self-Supervised Monocular Depth Estimation via Patch-Based Self-Distillation PDF
[77] Visual reinforcement learning with self-supervised 3d representations PDF
[78] Reversing the cycle: self-supervised deep stereo through enhanced monocular distillation PDF
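The self-distillation idea above can be sketched as a loop: the teacher generates views of the scene along a sampled camera trajectory, and the student is trained to reproduce them with a photometric loss. The sketch below is a toy stand-in, not the authors' implementation; `teacher_generate_views` and `student_render` are hypothetical placeholders for the diffusion teacher and the 3DGS decoder plus splatting renderer.

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher_generate_views(image, cameras):
    # Hypothetical stand-in for the camera-controlled video diffusion teacher:
    # here it just emits one slightly perturbed frame per camera pose.
    return np.stack([image + 0.01 * rng.standard_normal(image.shape)
                     for _ in cameras])

def student_render(params, cameras):
    # Hypothetical stand-in for the latent 3DGS decoder + renderer:
    # renders the student's current scene estimate from each pose.
    return np.stack([params for _ in cameras])

image = rng.random((8, 8, 3))               # single input view
cameras = [f"pose_{i}" for i in range(4)]   # sampled camera trajectory

# Teacher synthesizes the multi-view supervision (no real multi-view data).
targets = teacher_generate_views(image, cameras)

# One distillation step: photometric loss between student renders and
# the teacher's synthetic views.
params = image.copy()                       # student's learnable state (toy)
preds = student_render(params, cameras)
loss = np.mean((preds - targets) ** 2)
```

The key point the sketch illustrates is that the loss is computed entirely against teacher-generated views, which is what removes the dependence on captured multi-view datasets.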
Extension to dynamic 4D scene generation from monocular video
The method is extended to time-varying scenes by adding time conditioning to the 3DGS decoder, so that dynamic 3D Gaussian representations can be generated from a single monocular video, with support for novel-view synthesis.
[51] Shape of motion: 4d reconstruction from a single video PDF
[8] DreamScene4D: Dynamic multi-object scene generation from monocular videos PDF
[52] MonoFusion: Sparse-View 4D Reconstruction via Monocular Fusion PDF
[53] Diffuman4D: 4D consistent human view synthesis from sparse-view videos with spatio-temporal diffusion models PDF
[54] Vivid4D: Improving 4D Reconstruction from Monocular Video by Video Inpainting PDF
[55] HSR: holistic 3d human-scene reconstruction from monocular videos PDF
[56] Diffusion priors for dynamic view synthesis from monocular videos PDF
[57] Feature4X: Bridging any monocular video to 4D agentic AI with versatile Gaussian feature fields PDF
[58] DrivingRecon: Large 4D Gaussian Reconstruction Model For Autonomous Driving PDF
[59] Neural radiance flow for 4d view synthesis and video processing PDF
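Time conditioning of the kind described above can be illustrated with a minimal sketch: a shared latent code is combined with a time embedding so the same code decodes to different geometry at different timestamps. Everything here is illustrative (the sinusoidal embedding, the 5-channel latent, and `decode_dynamic_gaussians` are assumptions, not the paper's architecture).

```python
import numpy as np

def decode_dynamic_gaussians(latent, t):
    # Hypothetical time-conditioned decoder: a sinusoidal embedding of the
    # timestamp modulates a time-varying offset on each Gaussian center.
    t_embed = np.array([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])
    base = latent[:, :3]               # static component of each center
    motion = latent[:, 3:5] @ t_embed  # scalar time-dependent offset per Gaussian
    return base + motion[:, None]      # Gaussian centers at time t

rng = np.random.default_rng(1)
latent = rng.standard_normal((16, 5))  # 16 Gaussians, 5 latent channels

# The same latent yields different geometry at different times,
# while t = 0 and t = 1 coincide because the toy embedding is periodic.
g0 = decode_dynamic_gaussians(latent, 0.0)
g_half = decode_dynamic_gaussians(latent, 0.5)
```

The design choice this mirrors is that dynamics live in the conditioning, not in per-frame parameters: one decoder pass per queried timestamp produces the Gaussians for that time.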
Latent-space 3DGS decoder for efficient multi-view processing
The authors design a 3DGS decoder that operates directly in the compressed latent space of the video diffusion model rather than in pixel space, enabling efficient fusion of hundreds of input views (726 frames), a view count that would exceed GPU memory limits in existing pixel-space approaches.
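A back-of-the-envelope calculation shows why decoding in latent space matters at this scale. The compression factors and resolution below are assumptions for illustration (typical video-VAE settings: 8x spatial, 4x temporal, 16 latent channels), not numbers from the paper; only the 726-frame count comes from the source.

```python
# Element counts for 726 frames, pixel space vs. an assumed video-VAE latent space.
frames, h, w = 726, 704, 1280   # assumed resolution, not from the source

# Raw RGB frames fused in pixel space.
pixel_elems = frames * h * w * 3

# Latents with assumed 4x temporal, 8x spatial compression, 16 channels.
latent_elems = (frames // 4) * (h // 8) * (w // 8) * 16

# How many times fewer elements the decoder must fuse in latent space.
ratio = pixel_elems / latent_elems
```

Under these assumed factors the latent representation is roughly 50x smaller, which is the headroom that lets hundreds of views fit in GPU memory at once.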