Scalable and Generalizable Autonomous Driving Scene Synthesis

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: autonomous driving, multi-view synthesis, BEV representation
Abstract:

Generative modeling has shown remarkable success in vision and language, inspiring research on synthesizing driving scenes. Existing multi-view synthesis approaches typically operate in image latent spaces with cross-attention to enforce spatial consistency, but they are tightly bound to camera configurations, which limits model generalization. We propose BEV-VAE, a variational autoencoder that learns a unified Bird’s-Eye-View (BEV) representation from multi-view images, enabling encoding from arbitrary camera layouts and decoding to any desired viewpoint. Through multi-view image reconstruction and novel view synthesis, we show that BEV-VAE effectively fuses multi-view information and accurately models spatial structure. This capability allows it to generalize across camera configurations and facilitates scalable training on diverse datasets. Within the latent space of BEV-VAE, a Diffusion Transformer (DiT) generates BEV representations conditioned on 3D object layouts, enabling multi-view image synthesis with enhanced spatial consistency on nuScenes and achieving the first complete seven-view synthesis on AV2. Compared with training generative models in image latent spaces, BEV-VAE achieves superior computational efficiency. Finally, synthesized imagery significantly improves the perception performance of BEVFormer, highlighting the utility of generalizable scene synthesis for autonomous driving.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Papers: 3

Research Landscape Overview

Core task: multi-view image synthesis for autonomous driving scenes. The field has evolved into several distinct branches that address different facets of generating realistic driving imagery. Generative World Models and Video Forecasting emphasize temporal consistency and future prediction, often leveraging large-scale diffusion or autoregressive architectures to simulate dynamic traffic scenarios. Novel View Synthesis via 3D Reconstruction relies on neural radiance fields and 3D Gaussian splatting to render photorealistic views from sparse camera inputs, while Controllable and Editable Scene Generation focuses on user-driven manipulation of scene layouts, object placements, and environmental conditions through bird's-eye-view (BEV) or semantic maps. Meanwhile, Perception and Representation Learning explores how synthesized data can improve downstream tasks like detection and segmentation, and Data Augmentation and Simulation targets the creation of diverse training samples to enhance model robustness. Benchmarking and Evaluation Datasets provide standardized testbeds, and Specialized Applications and Modalities extend synthesis to LiDAR, radar, or safety-critical edge cases.

Within Controllable and Editable Scene Generation, a particularly active line of work centers on BEV-conditioned multi-view generation, where layout maps guide the synthesis of consistent surround-view imagery. Scalable and Generalizable Autonomous Driving Scene Synthesis[0] sits squarely in this cluster, emphasizing scalability and generalization across diverse driving contexts. Nearby efforts such as Street-view image generation from[5] and BEV-VAE[34] similarly exploit BEV representations but differ in their architectural choices and in the granularity of control they offer over scene elements.

A key trade-off across these methods is balancing fine-grained editability with computational efficiency and cross-dataset transferability. While some approaches prioritize high-fidelity rendering for specific sensor configurations, others aim for broader applicability to varied camera setups and environmental conditions, reflecting ongoing questions about how best to unify controllability, realism, and practical deployment in autonomous driving pipelines.

Claimed Contributions

BEV-VAE: a unified BEV representation for multi-view driving scenes

The authors introduce BEV-VAE, a variational autoencoder architecture that learns a unified Bird's-Eye-View representation from multi-view images. This representation enables encoding from arbitrary camera configurations and decoding to any desired viewpoint, overcoming the camera-configuration dependency of prior methods.

10 retrieved papers
Can Refute
Scalable training on diverse datasets with varying camera configurations

The authors demonstrate that BEV-VAE can be trained on multiple autonomous driving datasets with different camera setups (nuScenes, AV2, nuPlan), overcoming data isolation limitations and enabling cross-dataset generalization. This scalability is achieved through the unified BEV representation that decouples scene modeling from specific camera configurations.

10 retrieved papers
Diffusion Transformer in BEV latent space for controllable scene synthesis

The authors train a Diffusion Transformer within the BEV-VAE latent space to generate BEV representations conditioned on 3D object layouts encoded as occupancy grids. This approach enables controllable multi-view image synthesis with improved spatial consistency and computational efficiency compared to operating in image latent spaces.

5 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1

BEV-VAE: a unified BEV representation for multi-view driving scenes

The authors introduce BEV-VAE, a variational autoencoder architecture that learns a unified Bird's-Eye-View representation from multi-view images. This representation enables encoding from arbitrary camera configurations and decoding to any desired viewpoint, overcoming the camera-configuration dependency of prior methods.
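The claimed decoupling from camera configuration can be illustrated with a toy sketch. This is not the paper's implementation: the grid sizes, the per-view pooling, and the "splat everywhere" lifting are all placeholder assumptions; the point is only that the latent has one fixed shape regardless of how many cameras feed the encoder, and that decoding takes an arbitrary pose.

```python
import numpy as np

BEV_H, BEV_W, C = 32, 32, 8  # toy BEV grid resolution and channel count

def encode(images, extrinsics):
    """Fuse an arbitrary number of camera views into one fixed-size BEV latent.
    images: list of (H, W, C) arrays; extrinsics: list of (4, 4) camera poses."""
    bev = np.zeros((BEV_H, BEV_W, C))
    for img, pose in zip(images, extrinsics):
        # Toy "lifting": pool each view to a global feature and spread it over
        # the whole grid. A real model would splat features along camera rays
        # into the BEV cells that the pose actually observes.
        feat = img.mean(axis=(0, 1))  # (C,)
        bev += feat
    return bev / max(len(images), 1)

def decode(bev, extrinsic, out_hw=(16, 16)):
    """Render one view from the BEV latent for an arbitrary camera pose."""
    feat = bev.mean(axis=(0, 1))  # (C,)
    return np.broadcast_to(feat, (*out_hw, C)).copy()
```

Because `encode` always returns a `(BEV_H, BEV_W, C)` array, a six-camera rig and a seven-camera rig produce interchangeable latents, which is the property the contribution rests on.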

Contribution 2

Scalable training on diverse datasets with varying camera configurations

The authors demonstrate that BEV-VAE can be trained on multiple autonomous driving datasets with different camera setups (nuScenes, AV2, nuPlan), overcoming data isolation limitations and enabling cross-dataset generalization. This scalability is achieved through the unified BEV representation that decouples scene modeling from specific camera configurations.

Contribution 3

Diffusion Transformer in BEV latent space for controllable scene synthesis

The authors train a Diffusion Transformer within the BEV-VAE latent space to generate BEV representations conditioned on 3D object layouts encoded as occupancy grids. This approach enables controllable multi-view image synthesis with improved spatial consistency and computational efficiency compared to operating in image latent spaces.
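The conditioning signal described here, 3D object layouts encoded as occupancy grids, can be sketched as a simple rasterizer. Grid resolution, metric ranges, and the axis-aligned box treatment (no yaw) are all simplifying assumptions for illustration; the output is the kind of binary volume a DiT could be conditioned on.

```python
import numpy as np

GRID = (16, 16, 4)             # BEV-aligned (x, y, z) occupancy resolution (toy)
XY_RANGE, Z_RANGE = 40.0, 8.0  # metres covered by the grid (illustrative)

def _idx(v, n, extent):
    """Map an ego-centred metric coordinate to a clipped voxel index."""
    return int(np.clip((v / extent + 0.5) * n, 0, n - 1))

def layout_to_occupancy(boxes):
    """Rasterise 3D boxes (cx, cy, cz, w, l, h in metres) into a binary grid.
    Yaw is ignored here for brevity; boxes are treated as axis-aligned."""
    occ = np.zeros(GRID, dtype=np.float32)
    for cx, cy, cz, w, l, h in boxes:
        x0, x1 = _idx(cx - w / 2, GRID[0], XY_RANGE), _idx(cx + w / 2, GRID[0], XY_RANGE)
        y0, y1 = _idx(cy - l / 2, GRID[1], XY_RANGE), _idx(cy + l / 2, GRID[1], XY_RANGE)
        z0, z1 = _idx(cz - h / 2, GRID[2], Z_RANGE), _idx(cz + h / 2, GRID[2], Z_RANGE)
        occ[x0:x1 + 1, y0:y1 + 1, z0:z1 + 1] = 1.0
    return occ

# e.g. one roughly car-sized box two metres ahead of the ego vehicle
occ = layout_to_occupancy([(2.0, 0.0, 0.0, 2.0, 4.5, 1.6)])
```

Editing the layout (moving, adding, or removing boxes) changes only this grid, which is what makes the synthesis controllable at the object level.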