DrivingGen: A Comprehensive Benchmark for Generative Video World Models in Autonomous Driving

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Benchmark, Autonomous Driving, Generative World Model
Abstract:

Video generation models, as one form of world model, have emerged as one of the most exciting frontiers in AI, promising agents the ability to imagine the future by modeling the temporal evolution of complex scenes. In autonomous driving, this vision gives rise to driving world models: generative simulators that imagine ego and agent futures, enabling scalable simulation, safe testing of corner cases, and rich synthetic data generation. Yet, despite fast-growing research activity, the field lacks a rigorous benchmark to measure progress and guide priorities. Existing evaluations remain limited: generic video metrics overlook safety-critical imaging factors; trajectory plausibility is rarely quantified; temporal and agent-level consistency is neglected; and controllability with respect to ego conditioning is ignored. Moreover, current datasets fail to cover the diversity of conditions required for real-world deployment. To address these gaps, we present DrivingGen, the first comprehensive benchmark for generative driving world models. DrivingGen combines a diverse evaluation dataset, curated from both driving datasets and internet-scale video sources and spanning varied weather, times of day, geographic regions, and complex maneuvers, with a suite of new metrics that jointly assess visual realism, trajectory plausibility, temporal coherence, and controllability. Benchmarking 14 state-of-the-art models reveals clear trade-offs: general models look better but break physics, while driving-specific ones capture motion realistically but lag in visual quality. DrivingGen offers a unified evaluation framework to foster reliable, controllable, and deployable driving world models, enabling scalable simulation, planning, and data-driven decision-making.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes DrivingGen, a comprehensive benchmark for evaluating generative driving world models across multiple dimensions including visual quality, trajectory plausibility, temporal consistency, and controllability. Within the taxonomy, it resides in the 'Comprehensive Benchmarking Frameworks' leaf under 'Evaluation Frameworks and Benchmarking'. This leaf contains only two papers total (including DrivingGen), indicating a relatively sparse research direction. The sibling paper is WorldSimBench, suggesting that holistic, multi-dimensional evaluation frameworks for driving world models represent an emerging but not yet crowded area of investigation.

The taxonomy reveals that most research activity concentrates on model architectures and generation mechanisms, with substantial work in diffusion-based models (11 papers across three sub-leaves) and data generation for downstream tasks (10 papers across four sub-leaves). The evaluation branch sits somewhat apart from these technical development efforts. Neighboring leaves include 'Survey and Taxonomic Reviews' (4 papers) which provide broader field overviews, and the various architecture categories which propose the models that benchmarks like DrivingGen aim to assess. The scope_note for this leaf explicitly excludes papers proposing models without comprehensive evaluation frameworks, clarifying that DrivingGen's focus on systematic assessment distinguishes it from generation-focused work.

Among the three contributions analyzed, the benchmark-dataset contribution was compared against 10 candidates, of which 1 was found potentially refutable, suggesting some overlap with existing evaluation datasets. The novel-metrics contribution was compared against 10 candidates with none clearly refuting it, indicating this aspect may be more distinctive. The comprehensive model-evaluation contribution was compared against only 2 candidates, with no refutations found. Given the limited search scope of 22 total candidates, these statistics suggest moderate novelty for the metrics and evaluation methodology, while the dataset contribution faces more substantial prior work within the examined literature.

Based on the limited top-22 semantic search results, DrivingGen appears to occupy a relatively underexplored niche focused on holistic benchmarking rather than model development. The sparse population of its taxonomy leaf and the moderate refutation rates suggest the work addresses a recognized gap, though the small candidate pool means potentially relevant evaluation frameworks outside the search scope remain unexamined. The analysis captures the paper's positioning within known benchmarking efforts but cannot assess novelty against the broader evaluation literature.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 1

Research Landscape Overview

Core task: benchmarking generative video world models for autonomous driving. The field has evolved around several interconnected branches that reflect both the technical challenges of building realistic simulators and the practical demands of downstream autonomy tasks. World Model Architecture and Generation Mechanisms explores the foundational neural architectures, ranging from diffusion-based approaches like DriveDreamer[4] and DriveDreamer-2[3] to transformer- and occupancy-based representations such as OccWorld[9], that enable photorealistic or semantically rich video synthesis. Controllability and Conditioning Mechanisms addresses how these models incorporate diverse inputs (e.g., trajectories, maps, textual commands) to steer generated scenarios, while Closed-Loop Simulation and Interactive Environments focuses on enabling agents to interact with the world model over multiple timesteps, as seen in DriveArena[1] and MUVO[2]. Data Generation and Augmentation for Downstream Tasks examines how synthetic rollouts can improve perception or planning modules, and Specialized Applications and Domain-Specific Adaptations covers tailored solutions for safety-critical or rare-event scenarios. Finally, Evaluation Frameworks and Benchmarking consolidates methods for systematically assessing realism, controllability, and utility across these diverse models.

Within this landscape, a particularly active line of work centers on comprehensive benchmarking frameworks that go beyond isolated metrics to evaluate multiple facets, such as visual fidelity, physical plausibility, and downstream task performance, in a unified manner. DrivingGen[0] exemplifies this direction by proposing a holistic suite of tests that measure not only perceptual quality but also how well generated videos support planning and control algorithms. It sits closely alongside WorldSimBench[37], which similarly emphasizes multi-dimensional evaluation, and contrasts with earlier efforts like DriveDreamer[4] that primarily targeted generation quality without extensive closed-loop or task-oriented benchmarks. By integrating diverse evaluation axes, DrivingGen[0] addresses a key open question: whether improvements in generative realism translate into tangible gains for end-to-end autonomy, thereby bridging the gap between pure synthesis research and practical deployment considerations.

Claimed Contributions

DrivingGen benchmark with diverse evaluation dataset

The authors introduce DrivingGen, a comprehensive benchmark that includes a carefully curated evaluation dataset covering diverse driving conditions such as varied weather (rain, snow, fog), times of day (dawn, day, night), global geographic regions, and complex driving maneuvers. This dataset addresses the limited diversity in existing benchmarks like nuScenes and OpenDV.

10 retrieved papers · Can Refute
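
The report does not describe the curation pipeline itself. As a minimal sketch, assuming each candidate clip carries condition tags along the axes named above (weather, time of day, region, maneuver), a stratified sampler that keeps the evaluation set balanced might look like the following; the `Clip` schema, tag values, and `stratified_sample` helper are hypothetical, not DrivingGen's actual tooling:

```python
import random
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical metadata schema; the actual DrivingGen tags are not public.
@dataclass(frozen=True)
class Clip:
    path: str
    weather: str       # e.g. "rain", "snow", "fog", "clear"
    time_of_day: str   # e.g. "dawn", "day", "night"
    region: str        # coarse geographic bucket
    maneuver: str      # e.g. "u_turn", "lane_change", "straight"

def stratified_sample(clips, per_bucket=5, seed=0):
    """Draw up to `per_bucket` clips from every (weather, time-of-day,
    region, maneuver) combination so no single condition dominates the
    evaluation set."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for c in clips:
        buckets[(c.weather, c.time_of_day, c.region, c.maneuver)].append(c)
    sample = []
    for key in sorted(buckets):  # sorted keys keep the draw reproducible
        group = buckets[key]
        rng.shuffle(group)
        sample.extend(group[:per_bucket])
    return sample
```
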
Novel multifaceted metrics for driving world models

The authors propose a novel suite of evaluation metrics specifically designed for driving scenarios. These metrics comprehensively evaluate four dimensions: distribution-level measures for videos and trajectories, quality metrics accounting for perceptual and driving-specific factors, temporal consistency at scene and agent levels, and trajectory alignment measuring controllability.

10 retrieved papers
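
The exact metric formulations are not reproduced in this report. As one hedged illustration of the controllability axis, trajectory alignment is commonly measured with average and final displacement errors (ADE/FDE) between the conditioning ego trajectory and the trajectory recovered from the generated video; this generic formulation is an assumption, not necessarily DrivingGen's definition:

```python
import numpy as np

def average_displacement_error(cond_traj, gen_traj):
    """ADE between a conditioning ego trajectory and the trajectory
    recovered from the generated video, both given as (T, 2) arrays of
    x, y positions in a shared frame. Lower means better-aligned, i.e.
    more controllable. Generic formulation, not necessarily the exact
    DrivingGen metric."""
    cond = np.asarray(cond_traj, dtype=float)
    gen = np.asarray(gen_traj, dtype=float)
    T = min(len(cond), len(gen))  # compare over the overlapping horizon
    return float(np.linalg.norm(cond[:T] - gen[:T], axis=1).mean())

def final_displacement_error(cond_traj, gen_traj):
    """FDE: displacement at the last shared timestep, emphasizing
    long-horizon drift rather than average deviation."""
    cond = np.asarray(cond_traj, dtype=float)
    gen = np.asarray(gen_traj, dtype=float)
    T = min(len(cond), len(gen))
    return float(np.linalg.norm(cond[T - 1] - gen[T - 1]))
```
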
Comprehensive evaluation of 14 state-of-the-art models

The authors conduct extensive benchmarking of 14 generative world models spanning general video models, physics-based models, and driving-specific models. This evaluation reveals important insights about trade-offs between visual quality and physical consistency, providing the first comprehensive comparison in the driving domain.

2 retrieved papers
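
How the 14 models are driven through the metric suite is not spelled out here. A minimal harness sketch follows, assuming a hypothetical `model.generate(conditioning)` interface and metric callables; the real benchmark API may differ:

```python
# Hypothetical harness; the model and metric interfaces are assumptions,
# not the DrivingGen API.
def run_benchmark(models, clips, metrics):
    """Score every model on every clip with every metric, then average
    per (model, metric) so trade-offs (e.g. visual quality vs. physical
    consistency) can be read off a single table."""
    results = {}
    for name, model in models.items():
        per_metric = {m_name: [] for m_name in metrics}
        for clip in clips:
            video = model.generate(clip.conditioning)  # assumed interface
            for m_name, metric_fn in metrics.items():
                per_metric[m_name].append(metric_fn(video, clip))
        results[name] = {
            m_name: sum(scores) / len(scores)
            for m_name, scores in per_metric.items()
        }
    return results
```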

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: DrivingGen benchmark with diverse evaluation dataset (10 candidates compared; 1 potentially refutable)

Contribution: Novel multifaceted metrics for driving world models (10 candidates compared; none refuting)

Contribution: Comprehensive evaluation of 14 state-of-the-art models (2 candidates compared; none refuting)