DrivingGen: A Comprehensive Benchmark for Generative Video World Models in Autonomous Driving

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Benchmark, Autonomous Driving, Generative World Model
Abstract:

Video generation models, as one form of world model, have emerged as one of the most exciting frontiers in AI, promising agents the ability to imagine the future by modeling the temporal evolution of complex scenes. In autonomous driving, this vision gives rise to driving world models: generative simulators that imagine ego and agent futures, enabling scalable simulation, safe testing of corner cases, and rich synthetic data generation. Yet, despite fast-growing research activity, the field lacks a rigorous benchmark to measure progress and guide priorities. Existing evaluations remain limited: generic video metrics overlook safety-critical imaging factors; trajectory plausibility is rarely quantified; temporal and agent-level consistency is neglected; and controllability with respect to ego conditioning is ignored. Moreover, current datasets fail to cover the diversity of conditions required for real-world deployment. To address these gaps, we present DrivingGen, the first comprehensive benchmark for generative driving world models. DrivingGen combines a diverse evaluation dataset, curated from both driving datasets and internet-scale video sources and spanning varied weather, times of day, geographic regions, and complex maneuvers, with a suite of new metrics that jointly assess visual realism, trajectory plausibility, temporal coherence, and controllability. Benchmarking 14 state-of-the-art models reveals clear trade-offs: general models look better but break physics, while driving-specific ones capture motion realistically but lag in visual quality. DrivingGen offers a unified evaluation framework to foster reliable, controllable, and deployable driving world models, enabling scalable simulation, planning, and data-driven decision-making.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes DrivingGen, a comprehensive benchmark for evaluating generative driving world models across multiple dimensions including visual quality, trajectory plausibility, temporal consistency, and controllability. Within the taxonomy, it resides in the 'Comprehensive Benchmarking Frameworks' leaf under 'Evaluation Frameworks and Benchmarking'. This leaf contains only two papers total (including DrivingGen), indicating a relatively sparse research direction. The sibling paper is WorldSimBench, suggesting that holistic, multi-dimensional evaluation frameworks for driving world models represent an emerging but not yet crowded area of investigation.

The taxonomy reveals that most research activity concentrates on model architectures and generation mechanisms, with substantial work in diffusion-based models (11 papers across three sub-leaves) and data generation for downstream tasks (10 papers across four sub-leaves). The evaluation branch sits somewhat apart from these technical development efforts. Neighboring leaves include 'Survey and Taxonomic Reviews' (4 papers) which provide broader field overviews, and the various architecture categories which propose the models that benchmarks like DrivingGen aim to assess. The scope_note for this leaf explicitly excludes papers proposing models without comprehensive evaluation frameworks, clarifying that DrivingGen's focus on systematic assessment distinguishes it from generation-focused work.

Among the three contributions analyzed, the benchmark-dataset contribution was compared against 10 candidates, of which 1 was found potentially refutable, suggesting some overlap with existing evaluation datasets. The novel-metrics contribution was compared against 10 candidates with none clearly refuting it, indicating this aspect may be more distinctive. The comprehensive model-evaluation contribution was compared against only 2 candidates, with no refutations found. Given the limited search scope of 22 total candidates, these statistics suggest moderate novelty for the metrics and evaluation methodology, while the dataset contribution faces more substantial prior work within the examined literature.

Based on the limited top-22 semantic search results, DrivingGen appears to occupy a relatively underexplored niche focused on holistic benchmarking rather than model development. The sparse population of its taxonomy leaf and the moderate refutation rates suggest the work addresses a recognized gap, though the small candidate pool means potentially relevant evaluation frameworks outside the search scope remain unexamined. The analysis captures the paper's positioning within known benchmarking efforts but cannot assess novelty against the broader evaluation literature.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 1

Research Landscape Overview

Core task: benchmarking generative video world models for autonomous driving. The field has evolved around several interconnected branches that reflect both the technical challenges of building realistic simulators and the practical demands of downstream autonomy tasks. World Model Architecture and Generation Mechanisms explores the foundational neural architectures, ranging from diffusion-based approaches like DriveDreamer[4] and DriveDreamer-2[3] to transformer- and occupancy-based representations such as OccWorld[9], that enable photorealistic or semantically rich video synthesis. Controllability and Conditioning Mechanisms addresses how these models incorporate diverse inputs (e.g., trajectories, maps, textual commands) to steer generated scenarios, while Closed-Loop Simulation and Interactive Environments focuses on enabling agents to interact with the world model over multiple timesteps, as seen in DriveArena[1] and MUVO[2]. Data Generation and Augmentation for Downstream Tasks examines how synthetic rollouts can improve perception or planning modules, and Specialized Applications and Domain-Specific Adaptations covers tailored solutions for safety-critical or rare-event scenarios. Finally, Evaluation Frameworks and Benchmarking consolidates methods for systematically assessing realism, controllability, and utility across these diverse models.

Within this landscape, a particularly active line of work centers on comprehensive benchmarking frameworks that go beyond isolated metrics to evaluate multiple facets, such as visual fidelity, physical plausibility, and downstream task performance, in a unified manner. DrivingGen[0] exemplifies this direction by proposing a holistic suite of tests that measure not only perceptual quality but also how well generated videos support planning and control algorithms. It sits closely alongside WorldSimBench[37], which similarly emphasizes multi-dimensional evaluation, and contrasts with earlier efforts like DriveDreamer[4] that primarily targeted generation quality without extensive closed-loop or task-oriented benchmarks. By integrating diverse evaluation axes, DrivingGen[0] addresses a key open question: whether improvements in generative realism translate into tangible gains for end-to-end autonomy, thereby bridging the gap between pure synthesis research and practical deployment considerations.

Claimed Contributions

DrivingGen benchmark with diverse evaluation dataset

The authors introduce DrivingGen, a comprehensive benchmark that includes a carefully curated evaluation dataset covering diverse driving conditions such as varied weather (rain, snow, fog), times of day (dawn, day, night), global geographic regions, and complex driving maneuvers. This dataset addresses the limited diversity in existing benchmarks like nuScenes and OpenDV.

10 retrieved papers · Can Refute
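
The report does not describe the curation pipeline itself. As a minimal sketch, assuming each candidate clip carries condition tags along the axes named above (weather, time of day, region, maneuver), a stratified sampler that keeps the evaluation set balanced might look like the following; the `Clip` schema, tag values, and `stratified_sample` helper are hypothetical, not DrivingGen's actual tooling:

```python
import random
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical metadata schema; the actual DrivingGen tags are not public.
@dataclass(frozen=True)
class Clip:
    path: str
    weather: str       # e.g. "rain", "snow", "fog", "clear"
    time_of_day: str   # e.g. "dawn", "day", "night"
    region: str        # coarse geographic bucket
    maneuver: str      # e.g. "u_turn", "lane_change", "straight"

def stratified_sample(clips, per_bucket=5, seed=0):
    """Draw up to `per_bucket` clips from every (weather, time-of-day,
    region, maneuver) combination so no single condition dominates the
    evaluation set."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for c in clips:
        buckets[(c.weather, c.time_of_day, c.region, c.maneuver)].append(c)
    sample = []
    for key in sorted(buckets):  # sorted keys keep the draw reproducible
        group = buckets[key]
        rng.shuffle(group)
        sample.extend(group[:per_bucket])
    return sample
```
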
Novel multifaceted metrics for driving world models

The authors propose a novel suite of evaluation metrics specifically designed for driving scenarios. These metrics comprehensively evaluate four dimensions: distribution-level measures for videos and trajectories, quality metrics accounting for perceptual and driving-specific factors, temporal consistency at scene and agent levels, and trajectory alignment measuring controllability.

10 retrieved papers
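
The exact metric formulations are not reproduced in this report. As one hedged illustration of the controllability axis, trajectory alignment is commonly measured with average and final displacement errors (ADE/FDE) between the conditioning ego trajectory and the trajectory recovered from the generated video; this generic formulation is an assumption, not necessarily DrivingGen's definition:

```python
import numpy as np

def average_displacement_error(cond_traj, gen_traj):
    """ADE between a conditioning ego trajectory and the trajectory
    recovered from the generated video, both given as (T, 2) arrays of
    x, y positions in a shared frame. Lower means better-aligned, i.e.
    more controllable. Generic formulation, not necessarily the exact
    DrivingGen metric."""
    cond = np.asarray(cond_traj, dtype=float)
    gen = np.asarray(gen_traj, dtype=float)
    T = min(len(cond), len(gen))  # compare over the overlapping horizon
    return float(np.linalg.norm(cond[:T] - gen[:T], axis=1).mean())

def final_displacement_error(cond_traj, gen_traj):
    """FDE: displacement at the last shared timestep, emphasizing
    long-horizon drift rather than average deviation."""
    cond = np.asarray(cond_traj, dtype=float)
    gen = np.asarray(gen_traj, dtype=float)
    T = min(len(cond), len(gen))
    return float(np.linalg.norm(cond[T - 1] - gen[T - 1]))
```
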
Comprehensive evaluation of 14 state-of-the-art models

The authors conduct extensive benchmarking of 14 generative world models spanning general video models, physics-based models, and driving-specific models. This evaluation reveals important insights about trade-offs between visual quality and physical consistency, providing the first comprehensive comparison in the driving domain.

2 retrieved papers
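
How the 14 models are driven through the metric suite is not spelled out here. A minimal harness sketch follows, assuming a hypothetical `model.generate(conditioning)` interface and metric callables; the real benchmark API may differ:

```python
# Hypothetical harness; the model and metric interfaces are assumptions,
# not the DrivingGen API.
def run_benchmark(models, clips, metrics):
    """Score every model on every clip with every metric, then average
    per (model, metric) so trade-offs (e.g. visual quality vs. physical
    consistency) can be read off a single table."""
    results = {}
    for name, model in models.items():
        per_metric = {m_name: [] for m_name in metrics}
        for clip in clips:
            video = model.generate(clip.conditioning)  # assumed interface
            for m_name, metric_fn in metrics.items():
                per_metric[m_name].append(metric_fn(video, clip))
        results[name] = {
            m_name: sum(scores) / len(scores)
            for m_name, scores in per_metric.items()
        }
    return results
```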

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: DrivingGen benchmark with diverse evaluation dataset (10 candidates compared; 1 potentially refutable)

Contribution: Novel multifaceted metrics for driving world models (10 candidates compared; none refuting)

Contribution: Comprehensive evaluation of 14 state-of-the-art models (2 candidates compared; none refuting)