Skyfall-GS: Synthesizing Immersive 3D Urban Scenes from Satellite Imagery

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: City generation, View generation, 3DGS, Satellite imagery, Diffusion models
Abstract:

Synthesizing large-scale, explorable, and geometrically accurate 3D urban scenes is a challenging yet valuable task for immersive and embodied applications. The challenge lies in the scarcity of large-scale, high-quality real-world 3D scans for training generalizable generative models. In this paper, we take an alternative route to creating large-scale 3D scenes by synergizing readily available satellite imagery, which supplies realistic coarse geometry, with open-domain diffusion models, which create high-quality close-up appearances. We propose Skyfall-GS, a novel hybrid framework that synthesizes immersive, city-block-scale 3D urban scenes by combining satellite reconstruction with diffusion-based refinement, eliminating the need for costly 3D annotations while enabling real-time, immersive 3D exploration. We tailor a curriculum-driven iterative refinement strategy to progressively enhance geometric completeness and photorealistic textures. Extensive experiments demonstrate that Skyfall-GS provides improved cross-view-consistent geometry and more realistic textures compared to state-of-the-art approaches.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Skyfall-GS, a hybrid framework that synthesizes city-block scale 3D urban scenes by combining satellite reconstruction with diffusion-based refinement. It resides in the 'Diffusion and GAN-Based 3D Urban Scene Generation' leaf, which contains eight papers including the original work. This leaf represents a moderately active research direction within the broader Neural Rendering and Generative 3D Synthesis branch, focusing on generative models that produce photorealistic or controllable 3D content from satellite imagery rather than classical photogrammetric pipelines.

The taxonomy tree reveals that the paper's leaf sits alongside two other neural rendering approaches: NeRF-based methods and cross-view synthesis techniques that generate street-level views from overhead imagery. Neighboring branches include geometric reconstruction methods (stereo and multi-view) and learning-based monocular depth estimation. Skyfall-GS bridges generative synthesis with geometric reconstruction by leveraging satellite-derived coarse geometry, distinguishing it from purely generative approaches like Sat2Scene or procedural methods like MagicCity that prioritize controllability over geometric fidelity.

Of the thirty candidates examined in total, ten were compared against the core Skyfall-GS framework contribution, and one was judged refutable, suggesting some overlap with prior generative synthesis work. The open-domain diffusion refinement and the curriculum-learning strategy were each compared against ten candidates with zero refutations, indicating that these methodological choices appear less directly anticipated within the limited search scope. These statistics reflect a focused rather than exhaustive literature review, leaving open the possibility of additional relevant work beyond the top thirty semantic matches.

Based on the limited search scope, the work appears to occupy a moderately explored niche within generative 3D urban synthesis, with the framework-level contribution showing some prior overlap while the refinement and curriculum strategies exhibit less direct precedent among examined candidates. The analysis covers top-thirty semantic matches and does not claim exhaustive coverage of the broader generative modeling or satellite reconstruction literature.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: synthesizing large-scale 3D urban scenes from satellite imagery. The field is organized around five main branches that reflect distinct methodological emphases and problem settings:

- Geometric 3D Reconstruction from Satellite Stereo and Multi-View Imagery: classical photogrammetric pipelines that exploit multi-date or multi-angle observations to recover depth and surface models, often relying on stereo matching and structure-from-motion techniques.
- Learning-Based Monocular and Semantic 3D Reconstruction: deep networks that infer height, building footprints, or semantic labels from single overhead images, addressing scenarios where stereo pairs are unavailable.
- Neural Rendering and Generative 3D Synthesis from Satellite Imagery: neural radiance fields, diffusion models, and GANs that produce photorealistic or controllable 3D content directly from satellite data.
- 3D City Modeling Pipelines and Applications: end-to-end workflows that integrate reconstruction, vectorization, and level-of-detail generation for urban planning and simulation.
- Supporting Technologies and Datasets for Satellite-Based 3D Reconstruction: the foundational benchmarks, open-source tools, and sensor-fusion strategies that enable progress across the other branches.

Recent work has seen a surge in generative and neural rendering approaches that move beyond traditional geometry-first pipelines. Within the Neural Rendering and Generative 3D Synthesis branch, diffusion- and GAN-based methods such as Sat2Scene[11], Sat2city[10], and Sat2RealCity[22] explore how to hallucinate plausible street-level views or volumetric representations from overhead imagery, trading geometric precision for visual realism and controllability. Skyfall-GS[0] sits squarely in this cluster, emphasizing generative synthesis of urban scenes via Gaussian splatting conditioned on satellite inputs.

Compared to Sat2Scene[11], which focuses on diffusion-driven view synthesis, Skyfall-GS[0] adopts an explicit 3D representation that may offer faster rendering and more direct geometric control. Meanwhile, works like MagicCity[36] and Citycraft[21] push generative modeling toward interactive urban design and procedural content creation, highlighting an ongoing tension between fidelity to real-world geometry and the flexibility needed for creative or planning applications.

Claimed Contributions

Skyfall-GS framework for synthesizing immersive 3D urban scenes from satellite imagery

The authors propose Skyfall-GS, a novel hybrid framework that combines satellite-based 3D Gaussian Splatting reconstruction with diffusion model refinement to generate city-block scale 3D urban scenes. This approach eliminates the need for costly 3D annotations or street-level training data while enabling real-time interactive rendering.

10 retrieved papers compared; verdict: Can Refute
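The report describes the framework only at a high level; the authors' actual implementation is not shown here. As a rough illustration of the data flow claimed above, the sketch below stubs out the two stages with hypothetical placeholder functions (`reconstruct_3dgs`, `render_views`, and `diffusion_refine` are illustrative names, not the paper's API):

```python
# Hypothetical sketch of the hybrid reconstruct-then-refine loop.
# All function bodies are stubs standing in for the real stages.

def reconstruct_3dgs(satellite_images):
    """Stage 1: fit a 3D Gaussian Splatting scene to satellite views (stub)."""
    return {"gaussians": list(satellite_images)}

def render_views(scene, cameras):
    """Render the current scene from a set of virtual cameras (stub)."""
    return [f"render@{c}" for c in cameras]

def diffusion_refine(images):
    """Stage 2: refine renders with a pre-trained diffusion model (stub)."""
    return [img + "+refined" for img in images]

def skyfall_gs(satellite_images, cameras, episodes=3):
    scene = reconstruct_3dgs(satellite_images)
    for _ in range(episodes):
        renders = render_views(scene, cameras)
        refined = diffusion_refine(renders)
        # In the real method the refined images form an updated training
        # set used to re-optimize the Gaussians; here we just record them.
        scene["training_set"] = refined
    return scene

scene = skyfall_gs(["sat_0", "sat_1"], ["cam_a", "cam_b"])
```

The point of the sketch is only the loop structure: reconstruction provides geometry once, while rendering and diffusion refinement alternate across episodes to update the training set, with no 3D annotations or street-level data involved.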
Open-domain refinement approach using pre-trained text-to-image diffusion models

The method leverages pre-trained text-to-image diffusion models to hallucinate realistic appearances and complete occluded regions (such as building facades) without requiring training on domain-specific 3D datasets, providing better generalization compared to existing city generation methods.

10 retrieved papers compared; verdict: no refutations
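To make the "complete occluded regions" idea concrete, here is a deliberately simplified sketch in which a trivial mean-fill stands in for the diffusion prior; the real method would instead hallucinate plausible texture with a pre-trained text-to-image diffusion model. The function name and data are hypothetical:

```python
def refine_occluded(render, mask):
    """Toy stand-in for diffusion-based completion: fill pixels marked
    occluded (mask True) with the mean of the visible pixels. A real
    refinement step would synthesize plausible appearance instead."""
    visible = [v for row, mrow in zip(render, mask)
               for v, m in zip(row, mrow) if not m]
    fill = sum(visible) / len(visible)
    return [[fill if m else v for v, m in zip(row, mrow)]
            for row, mrow in zip(render, mask)]

# A 2x2 "render" whose bottom-right pixel was never seen from above,
# e.g. a building facade occluded in all satellite views.
render = [[0.2, 0.8], [0.4, 0.0]]
mask = [[False, False], [False, True]]
refined = refine_occluded(render, mask)
```

Visible pixels pass through unchanged; only the occluded pixel is rewritten, mirroring how the diffusion model is used to fill in what the satellite never observed.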
Curriculum-learning-based iterative refinement strategy

The authors introduce a curriculum-driven iterative dataset update technique that progressively lowers camera viewpoints from sky to ground across optimization episodes. This strategy gradually reveals and refines previously occluded regions, improving geometric completeness and photorealistic textures.

10 retrieved papers compared; verdict: no refutations
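The curriculum described above, progressively lowering camera viewpoints from sky to ground across optimization episodes, can be sketched as a simple pitch schedule. Linear interpolation and the endpoint angles here are illustrative assumptions, not the paper's actual values:

```python
def curriculum_pitch(episode, total_episodes, start_deg=90.0, end_deg=10.0):
    """Camera pitch in degrees below horizontal for a given episode.
    90 degrees is a straight-down satellite view; small angles approach
    street level. Linearly interpolates from start_deg to end_deg."""
    t = episode / max(total_episodes - 1, 1)
    return start_deg + t * (end_deg - start_deg)

# Five-episode curriculum: top-down first, near ground level last.
schedule = [curriculum_pitch(e, 5) for e in range(5)]
```

Each episode's renders come from lower cameras than the last, so regions occluded in earlier episodes (facades, street canyons) are gradually exposed to the refinement stage.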

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
