Scalable and Generalizable Autonomous Driving Scene Synthesis

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: autonomous driving, multi-view synthesis, BEV representation
Abstract:

Generative modeling has shown remarkable success in vision and language, inspiring research on synthesizing driving scenes. Existing multi-view synthesis approaches typically operate in image latent spaces with cross-attention to enforce spatial consistency, but they are tightly bound to camera configurations, which limits model generalization. We propose BEV-VAE, a variational autoencoder that learns a unified Bird’s-Eye-View (BEV) representation from multi-view images, enabling encoding from arbitrary camera layouts and decoding to any desired viewpoint. Through multi-view image reconstruction and novel view synthesis, we show that BEV-VAE effectively fuses multi-view information and accurately models spatial structure. This capability allows it to generalize across camera configurations and facilitates scalable training on diverse datasets. Within the latent space of BEV-VAE, a Diffusion Transformer (DiT) generates BEV representations conditioned on 3D object layouts, enabling multi-view image synthesis with enhanced spatial consistency on nuScenes and achieving the first complete seven-view synthesis on AV2. Compared with training generative models in image latent spaces, BEV-VAE achieves superior computational efficiency. Finally, synthesized imagery significantly improves the perception performance of BEVFormer, highlighting the utility of generalizable scene synthesis for autonomous driving.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Papers: 3

Research Landscape Overview

Core task: multi-view image synthesis for autonomous driving scenes. The field has evolved into several distinct branches that address different facets of generating realistic driving imagery. Generative World Models and Video Forecasting emphasize temporal consistency and future prediction, often leveraging large-scale diffusion or autoregressive architectures to simulate dynamic traffic scenarios. Novel View Synthesis via 3D Reconstruction relies on neural radiance fields and 3D Gaussian splatting to render photorealistic views from sparse camera inputs, while Controllable and Editable Scene Generation focuses on user-driven manipulation of scene layouts, object placements, and environmental conditions through bird's-eye-view (BEV) or semantic maps. Meanwhile, Perception and Representation Learning explores how synthesized data can improve downstream tasks like detection and segmentation, and Data Augmentation and Simulation targets the creation of diverse training samples to enhance model robustness. Benchmarking and Evaluation Datasets provide standardized testbeds, and Specialized Applications and Modalities extend synthesis to LiDAR, radar, or safety-critical edge cases.

Within Controllable and Editable Scene Generation, a particularly active line of work centers on BEV-conditioned multi-view generation, where layout maps guide the synthesis of consistent surround-view imagery. Scalable and Generalizable Autonomous Driving Scene Synthesis[0] sits squarely in this cluster, emphasizing scalability and generalization across diverse driving contexts. Nearby efforts such as Street-view image generation from[5] and BEV-VAE[34] similarly exploit BEV representations but differ in their architectural choices and in the granularity of control they offer over scene elements.

A key trade-off across these methods is balancing fine-grained editability with computational efficiency and cross-dataset transferability. While some approaches prioritize high-fidelity rendering for specific sensor configurations, others aim for broader applicability to varied camera setups and environmental conditions, reflecting ongoing questions about how best to unify controllability, realism, and practical deployment in autonomous driving pipelines.

Claimed Contributions

BEV-VAE: a unified BEV representation for multi-view driving scenes

The authors introduce BEV-VAE, a variational autoencoder architecture that learns a unified Bird's-Eye-View representation from multi-view images. This representation enables encoding from arbitrary camera configurations and decoding to any desired viewpoint, overcoming the camera-configuration dependency of prior methods.

10 retrieved papers
Can Refute
Scalable training on diverse datasets with varying camera configurations

The authors demonstrate that BEV-VAE can be trained on multiple autonomous driving datasets with different camera setups (nuScenes, AV2, nuPlan), overcoming data isolation limitations and enabling cross-dataset generalization. This scalability is achieved through the unified BEV representation that decouples scene modeling from specific camera configurations.

10 retrieved papers
Diffusion Transformer in BEV latent space for controllable scene synthesis

The authors train a Diffusion Transformer within the BEV-VAE latent space to generate BEV representations conditioned on 3D object layouts encoded as occupancy grids. This approach enables controllable multi-view image synthesis with improved spatial consistency and computational efficiency compared to operating in image latent spaces.

5 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1

BEV-VAE: a unified BEV representation for multi-view driving scenes

The authors introduce BEV-VAE, a variational autoencoder architecture that learns a unified Bird's-Eye-View representation from multi-view images. This representation enables encoding from arbitrary camera configurations and decoding to any desired viewpoint, overcoming the camera-configuration dependency of prior methods.
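The claimed decoupling from camera configuration can be illustrated with a toy sketch. This is not the paper's implementation: the grid sizes, the per-view pooling, and the "splat everywhere" lifting are all placeholder assumptions; the point is only that the latent has one fixed shape regardless of how many cameras feed the encoder, and that decoding takes an arbitrary pose.

```python
import numpy as np

BEV_H, BEV_W, C = 32, 32, 8  # toy BEV grid resolution and channel count

def encode(images, extrinsics):
    """Fuse an arbitrary number of camera views into one fixed-size BEV latent.
    images: list of (H, W, C) arrays; extrinsics: list of (4, 4) camera poses."""
    bev = np.zeros((BEV_H, BEV_W, C))
    for img, pose in zip(images, extrinsics):
        # Toy "lifting": pool each view to a global feature and spread it over
        # the whole grid. A real model would splat features along camera rays
        # into the BEV cells that the pose actually observes.
        feat = img.mean(axis=(0, 1))  # (C,)
        bev += feat
    return bev / max(len(images), 1)

def decode(bev, extrinsic, out_hw=(16, 16)):
    """Render one view from the BEV latent for an arbitrary camera pose."""
    feat = bev.mean(axis=(0, 1))  # (C,)
    return np.broadcast_to(feat, (*out_hw, C)).copy()
```

Because `encode` always returns a `(BEV_H, BEV_W, C)` array, a six-camera rig and a seven-camera rig produce interchangeable latents, which is the property the contribution rests on.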

Contribution 2

Scalable training on diverse datasets with varying camera configurations

The authors demonstrate that BEV-VAE can be trained on multiple autonomous driving datasets with different camera setups (nuScenes, AV2, nuPlan), overcoming data isolation limitations and enabling cross-dataset generalization. This scalability is achieved through the unified BEV representation that decouples scene modeling from specific camera configurations.

Contribution 3

Diffusion Transformer in BEV latent space for controllable scene synthesis

The authors train a Diffusion Transformer within the BEV-VAE latent space to generate BEV representations conditioned on 3D object layouts encoded as occupancy grids. This approach enables controllable multi-view image synthesis with improved spatial consistency and computational efficiency compared to operating in image latent spaces.
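The conditioning signal described here, 3D object layouts encoded as occupancy grids, can be sketched as a simple rasterizer. Grid resolution, metric ranges, and the axis-aligned box treatment (no yaw) are all simplifying assumptions for illustration; the output is the kind of binary volume a DiT could be conditioned on.

```python
import numpy as np

GRID = (16, 16, 4)             # BEV-aligned (x, y, z) occupancy resolution (toy)
XY_RANGE, Z_RANGE = 40.0, 8.0  # metres covered by the grid (illustrative)

def _idx(v, n, extent):
    """Map an ego-centred metric coordinate to a clipped voxel index."""
    return int(np.clip((v / extent + 0.5) * n, 0, n - 1))

def layout_to_occupancy(boxes):
    """Rasterise 3D boxes (cx, cy, cz, w, l, h in metres) into a binary grid.
    Yaw is ignored here for brevity; boxes are treated as axis-aligned."""
    occ = np.zeros(GRID, dtype=np.float32)
    for cx, cy, cz, w, l, h in boxes:
        x0, x1 = _idx(cx - w / 2, GRID[0], XY_RANGE), _idx(cx + w / 2, GRID[0], XY_RANGE)
        y0, y1 = _idx(cy - l / 2, GRID[1], XY_RANGE), _idx(cy + l / 2, GRID[1], XY_RANGE)
        z0, z1 = _idx(cz - h / 2, GRID[2], Z_RANGE), _idx(cz + h / 2, GRID[2], Z_RANGE)
        occ[x0:x1 + 1, y0:y1 + 1, z0:z1 + 1] = 1.0
    return occ

# e.g. one roughly car-sized box two metres ahead of the ego vehicle
occ = layout_to_occupancy([(2.0, 0.0, 0.0, 2.0, 4.5, 1.6)])
```

Editing the layout (moving, adding, or removing boxes) changes only this grid, which is what makes the synthesis controllable at the object level.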