Scalable and Generalizable Autonomous Driving Scene Synthesis
Research Landscape Overview
Claimed Contributions
The authors introduce BEV-VAE, a variational autoencoder architecture that learns a unified Bird's-Eye-View (BEV) representation from multi-view images. This representation enables encoding from arbitrary camera configurations and decoding to any desired viewpoint, overcoming the camera-configuration dependency of prior methods.
The authors demonstrate that BEV-VAE can be trained on multiple autonomous driving datasets with different camera setups (nuScenes, AV2, nuPlan), overcoming data isolation limitations and enabling cross-dataset generalization. This scalability is achieved through the unified BEV representation that decouples scene modeling from specific camera configurations.
The authors train a Diffusion Transformer within the BEV-VAE latent space to generate BEV representations conditioned on 3D object layouts encoded as occupancy grids. This approach enables controllable multi-view image synthesis with improved spatial consistency and computational efficiency compared to operating in image latent spaces.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
Contribution Analysis
Detailed comparisons for each claimed contribution
BEV-VAE: a unified BEV representation for multi-view driving scenes
The authors introduce BEV-VAE, a variational autoencoder architecture that learns a unified Bird's-Eye-View representation from multi-view images. This representation enables encoding from arbitrary camera configurations and decoding to any desired viewpoint, overcoming the camera-configuration dependency of prior methods.
[67] BEVWorld: A Multimodal World Model for Autonomous Driving via Unified BEV Latent Space
[11] DrivingDiffusion: Layout-Guided Multi-View Driving Scenarios Video Generation with Latent Diffusion Model
[63] Panacea: Panoramic and Controllable Video Generation for Autonomous Driving
[64] V2V Cooperative Perception with Adaptive Communication Loss for Autonomous Driving
[65] Adaptive Bidirectional Planning Framework for Enhanced Safety and Robust Decision-Making in Autonomous Navigation Systems (D. Yu et al.)
[66] MapPrior: Bird's-Eye View Perception with Generative Models
[68] Genesis: Multimodal Driving Scene Generation with Spatio-Temporal and Cross-Modal Consistency
[69] VQ-Map: Bird's-Eye-View Map Layout Estimation in Tokenized Discrete Space via Vector Quantization
[70] CalibRBEV: Multi-Camera Calibration via Reversed Bird's-Eye-View Representations for Autonomous Driving
[71] Homography VAE: Automatic Bird's Eye View Image Reconstruction from Multi-Perspective Views
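The camera-agnostic encoding claimed above rests on lifting per-camera image features into a shared ground-plane grid via each camera's calibration, so the latent never commits to a particular rig. The paper's actual encoder is not reproduced here; the following is a minimal NumPy sketch of such a lift step, where the function names, the nearest-neighbor sampling, and the ground-plane-only grid are all illustrative assumptions:

```python
import numpy as np

def project_points(pts_world, extrinsic, intrinsic):
    """Project (N, 3) world points into pixels; returns (uv, depth)."""
    pts_h = np.concatenate([pts_world, np.ones((len(pts_world), 1))], axis=1)
    cam = (extrinsic @ pts_h.T).T[:, :3]              # world -> camera frame
    uv = (intrinsic @ cam.T).T
    return uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None), cam[:, 2]

def lift_to_bev(feats, extrinsics, intrinsics, bev_xy, height=0.0):
    """Pool per-camera feature maps onto a shared BEV grid.

    feats: list of (H, W, C) feature maps, one entry per camera
    extrinsics: list of 4x4 world-to-camera matrices
    intrinsics: list of 3x3 camera matrices
    bev_xy: (N, 2) ground-plane cell centers in metres
    """
    n, c = len(bev_xy), feats[0].shape[-1]
    acc, cnt = np.zeros((n, c)), np.zeros((n, 1))
    pts = np.concatenate([bev_xy, np.full((n, 1), height)], axis=1)
    for f, E, K in zip(feats, extrinsics, intrinsics):
        uv, depth = project_points(pts, E, K)
        h, w = f.shape[:2]
        u, v = np.round(uv[:, 0]).astype(int), np.round(uv[:, 1]).astype(int)
        ok = (depth > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        acc[ok] += f[v[ok], u[ok]]
        cnt[ok] += 1
    # Averaging over however many cameras saw each cell makes the
    # pooled grid independent of the rig's camera count.
    return acc / np.clip(cnt, 1, None)
```

Decoding to an arbitrary viewpoint runs the same geometry in reverse, sampling the BEV grid along the rays of a target camera, which is why any desired viewpoint can be rendered from one latent.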
Scalable training on diverse datasets with varying camera configurations
The authors demonstrate that BEV-VAE can be trained on multiple autonomous driving datasets with different camera setups (nuScenes, AV2, nuPlan), overcoming data isolation limitations and enabling cross-dataset generalization. This scalability is achieved through the unified BEV representation that decouples scene modeling from specific camera configurations.
[51] BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers
[52] MVImgNet: A Large-Scale Dataset of Multi-View Images
[53] BEHAVE: Dataset and Method for Tracking Human Object Interactions
[54] Adaptive Camera Sensor for Vision Models
[55] Vision-Based Manipulation from Single Human Video with Open-World Object Graphs
[56] Rig3R: Rig-Aware Conditioning and Discovery for 3D Reconstruction
[57] Neural Rendering for Sensor Adaptation in 3D Object Detection
[58] Rig3R: Rig-Aware Conditioning for Learned 3D Reconstruction
[59] Charge: A Comprehensive Novel View Synthesis Benchmark and Dataset to Bind Them All
[60] Learning Generalizable Manipulation Policies with Object-Centric 3D Representations
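The cross-dataset claim hinges on every rig mapping into the same fixed-size latent, so batches from datasets with different camera counts can train one model. A toy illustration of that invariance, where global mean pooling stands in for the paper's learned encoder (the dataset names come from the text; the shapes and pooling are made-up placeholders):

```python
import numpy as np

def encode_to_bev(views, grid=(8, 8)):
    """Stand-in encoder: any number of (H, W, C) views -> one fixed-size BEV latent."""
    pooled = np.mean([v.mean(axis=(0, 1)) for v in views], axis=0)  # (C,)
    return np.broadcast_to(pooled, grid + pooled.shape).copy()      # (8, 8, C)

# Rigs with 6, 7, and 8 cameras all land in the same latent space,
# so one generative model can be trained on the union of the datasets.
rigs = {"nuScenes": 6, "AV2": 7, "nuPlan": 8}
latents = {name: encode_to_bev([np.random.rand(4, 4, 16) for _ in range(n)])
           for name, n in rigs.items()}
```

The fixed latent shape, not the pooling itself, is the point: it is what removes the data-isolation barrier between differently instrumented datasets.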
Diffusion Transformer in BEV latent space for controllable scene synthesis
The authors train a Diffusion Transformer within the BEV-VAE latent space to generate BEV representations conditioned on 3D object layouts encoded as occupancy grids. This approach enables controllable multi-view image synthesis with improved spatial consistency and computational efficiency compared to operating in image latent spaces.
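The conditioning path can be sketched concretely even without the model itself: 3D object boxes are rasterized into a binary occupancy grid over the BEV extent, and that grid (flattened into tokens) is fed to the Diffusion Transformer as the layout condition. A minimal sketch assuming axis-aligned boxes and illustrative grid bounds (the resolution and extents below are placeholders, not the paper's values):

```python
import numpy as np

def boxes_to_occupancy(boxes, extent=50.0, z_range=(0.0, 8.0), res=(32, 32, 4)):
    """Rasterize axis-aligned 3D boxes (x0, y0, z0, x1, y1, z1) into a binary
    occupancy grid covering [-extent, extent] m in x/y and z_range in z."""
    xs = np.linspace(-extent, extent, res[0])
    ys = np.linspace(-extent, extent, res[1])
    zs = np.linspace(z_range[0], z_range[1], res[2])
    gx, gy, gz = np.meshgrid(xs, ys, zs, indexing="ij")  # cell centers
    occ = np.zeros(res, dtype=np.float32)
    for x0, y0, z0, x1, y1, z1 in boxes:
        inside = ((gx >= x0) & (gx <= x1) &
                  (gy >= y0) & (gy <= y1) &
                  (gz >= z0) & (gz <= z1))
        occ[inside] = 1.0
    return occ  # flatten to tokens for the diffusion model's conditioning
```

Editing the scene then amounts to editing the box list and re-rasterizing, which is what makes the synthesis controllable at the object-layout level.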