SafeMVDrive: Multi-view Safety-Critical Driving Video Generation in the Real World Domain

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: autonomous driving testing, safety-critical scenario, video generation, safety
Abstract:

Safety-critical scenarios are essential for evaluating autonomous driving (AD) systems, yet they are rare in practice. Existing generators produce trajectories, simulations, or single-view videos, but they do not produce what modern AD systems actually consume: realistic multi-view video. We present SafeMVDrive, the first framework for generating multi-view safety-critical driving videos in the real-world domain. SafeMVDrive couples a safety-critical trajectory engine with a diffusion-based multi-view video generator through three design choices. First, we pick the right adversary: a GRPO-fine-tuned vision-language model (VLM) that understands multi-camera context and selects the vehicles most likely to induce hazards. Second, we generate the right motion: a two-stage trajectory process that (i) produces collision trajectories and then (ii) transforms them into natural evasion trajectories, preserving risk while staying within what current video generators can faithfully render. Third, we synthesize the right data: a diffusion model turns these trajectories into multi-view videos suitable for end-to-end planners. Evaluated against a strong end-to-end planner, our videos substantially increase the collision rate, exposing brittle behavior and providing targeted stress tests for planning modules. Our code and video examples are available at: https://iclr-1.github.io/SMD/.

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and claimed contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

SafeMVDrive introduces a framework coupling adversarial trajectory generation with diffusion-based multi-view video synthesis for safety-critical driving scenarios. It resides in the Safety-Critical Scenario Video Synthesis leaf, which contains four papers including the original work. This leaf represents a focused research direction within the broader Multi-View Driving Video Generation Frameworks branch. The taxonomy reveals this is a moderately populated niche: while general multi-view generation has multiple sub-directions (layout-guided, world-model-based, instance-aware), the safety-critical video synthesis cluster remains relatively compact, suggesting an emerging rather than saturated area.

The taxonomy positions SafeMVDrive adjacent to General Multi-View Video Generation methods emphasizing layout control and world modeling, plus Safety-Critical Data Synthesis approaches using simulation pipelines or LLM-based scripting. The scope notes clarify boundaries: SafeMVDrive differs from general multi-view generators by explicitly targeting adversarial scenarios, and from simulation-based methods by producing real-world-domain videos rather than synthetic simulator outputs. Neighboring leaves like Controllable Multi-View Video Generation share controllability goals but lack the safety-critical focus. This structural context suggests SafeMVDrive bridges video synthesis realism with hazard-injection control, occupying a distinct position between photorealistic generation and safety-oriented data augmentation.

Among the seventeen candidates examined, no contribution was clearly refuted. For the core SafeMVDrive framework, ten candidates were examined with zero refutations; for the VLM-based adversarial selector, two candidates, none overlapping; for the two-stage trajectory generator, five candidates, none constituting prior work. These statistics reflect a limited semantic-search scope, not exhaustive coverage. The absence of refutations across all contributions suggests either genuine novelty within the examined set or that closely related prior work lies outside the top seventeen semantic matches. The VLM-based selector and the trajectory generator appear particularly under-explored in the candidate pool, though this may reflect search limitations rather than absolute novelty.

Given the limited search scope of seventeen candidates, the analysis indicates SafeMVDrive occupies a sparsely populated intersection of multi-view video generation and safety-critical scenario synthesis. The taxonomy structure confirms this is an emerging direction with few direct competitors in the examined literature. However, the small candidate pool and absence of refutations warrant caution: a broader search might reveal closer prior work, particularly in adjacent simulation-based or controllability-focused methods not captured by semantic similarity.

Taxonomy

Core-task taxonomy papers: 27
Claimed contributions: 3
Contribution candidate papers compared: 17
Refutable papers: 0

Research Landscape Overview

Core task: multi-view safety-critical driving video generation. The field has evolved around several complementary branches. Multi-View Driving Video Generation Frameworks encompass methods that synthesize realistic driving scenes from multiple camera perspectives, often leveraging diffusion models or neural rendering techniques to produce temporally coherent outputs. Safety-Critical Data Synthesis and Simulation focuses on generating rare or dangerous scenarios, such as near-collisions or adverse weather, that are underrepresented in real-world datasets, enabling robust testing of autonomous systems. Perception and Reasoning Systems for Autonomous Driving address how models interpret and reason about generated or real video streams, while Specialized Applications and Adaptation explores domain-specific tasks like pedestrian editing or traffic-light recognition. Survey and Review Literature provides broader context on generalization challenges and physical-world deployment.

Representative works include DrivingDiffusion[8] and InstaDrive[3] for general multi-view synthesis, and SafeMVDrive Synthesis[1] for safety-focused generation. Recent efforts reveal a tension between photorealism and controllability: some approaches prioritize high-fidelity rendering using world models like Gaia[6] or Cosmos Drive Dreams[5], while others such as DrivingGen[9] and Challenger[10] emphasize structured control over safety-critical events. SafeMVDrive[0] sits within the Safety-Critical Scenario Video Synthesis cluster, closely aligned with SafeMVDrive Synthesis[1] and Challenger[10], which similarly target rare-event generation. Compared to InstaDrive[3], which focuses on instant multi-view synthesis without explicit safety constraints, SafeMVDrive[0] places greater emphasis on controllable hazard injection and multi-camera consistency under critical conditions.

Open questions remain around balancing realism with the diversity of edge cases, and around how best to integrate these synthetic scenarios into end-to-end training pipelines for perception and planning.

Claimed Contributions

SafeMVDrive framework for multi-view safety-critical video generation

The authors introduce SafeMVDrive, a novel framework that couples a safety-critical trajectory engine with a diffusion-based multi-view video generator to produce realistic multi-view safety-critical driving videos suitable for evaluating end-to-end autonomous driving systems.

10 retrieved papers
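The coupling described in this contribution can be pictured as a three-step pipeline: select an adversary, generate its safety-critical trajectory, then render multi-view video. The following is a minimal illustrative sketch only; every name here (`select_adversary`, `render_multiview_video`, etc.) is hypothetical and stands in for the paper's learned components:

```python
"""Sketch of a trajectory-to-video pipeline in the spirit of SafeMVDrive.
All classes and functions are hypothetical stand-ins, not the authors' API."""

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Trajectory:
    """A sequence of (x, y, heading) states for one vehicle."""
    states: List[Tuple[float, float, float]]


def select_adversary(scene_agents: List[str]) -> str:
    """Stand-in for the VLM-based selector: pick the agent most likely
    to induce a hazard (here, trivially the first agent)."""
    return scene_agents[0]


def generate_safety_critical_trajectory(agent: str) -> Trajectory:
    """Stand-in for the two-stage trajectory engine."""
    return Trajectory(states=[(0.0, 0.0, 0.0), (1.0, 0.2, 0.05)])


def render_multiview_video(traj: Trajectory, num_cameras: int = 6) -> list:
    """Stand-in for the diffusion video generator: one frame list per camera."""
    return [[f"cam{c}_frame{i}" for i in range(len(traj.states))]
            for c in range(num_cameras)]


def pipeline(scene_agents: List[str]) -> list:
    adversary = select_adversary(scene_agents)
    traj = generate_safety_critical_trajectory(adversary)
    return render_multiview_video(traj)


views = pipeline(["car_12", "car_7"])  # 6 camera views, one frame per state
```

The point of the sketch is the data flow, not the components: each stage's output (agent id, trajectory, frames) is the next stage's input, which is what lets a trajectory engine and a video generator be developed and swapped independently.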
VLM-based adversarial vehicle selector with visual context

The authors propose a GRPO-fine-tuned vision-language model that leverages multi-view camera images to identify adversarial vehicles most likely to induce safety-critical scenarios, addressing limitations of prior non-visual heuristic selection methods.

2 retrieved papers
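GRPO fine-tuning, mentioned in this contribution, scores a group of sampled responses and normalizes each reward against the group rather than using a learned critic. A minimal sketch of that group-relative advantage computation, with made-up reward values (the paper's actual reward design is not shown here):

```python
"""Illustrative sketch of GRPO's group-relative advantage normalization.
Reward values are invented for the example; e.g. imagine reward 1.0 when a
chosen adversary vehicle later induces a near-collision, else 0.0."""

from statistics import mean, pstdev


def group_relative_advantages(rewards):
    """Normalize each sampled response's reward against the group mean
    and standard deviation, as GRPO does in place of a value network."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mu) / sigma for r in rewards]


# Hypothetical rewards for four sampled adversary selections in one group.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)  # [1.0, -1.0, -1.0, 1.0]
```

Responses whose reward beats the group mean get positive advantage and are reinforced; the rest are penalized, which is what steers the VLM toward selecting vehicles that actually lead to hazardous outcomes.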
Two-stage collision-evasion trajectory generator

The authors develop a two-stage trajectory generation process that first creates collision trajectories and then refines them into natural evasion trajectories, preserving safety-critical characteristics while remaining compatible with current video generation models that cannot realistically render collisions.

5 retrieved papers
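The two-stage idea, risk-preserving approach followed by a rendered evasion, can be illustrated with a toy geometric example. This is a hand-written numeric sketch of the concept only; the paper's generator is a learned model, and all parameters below (`swerve_from`, `lateral_gain`) are invented:

```python
"""Toy sketch of a two-stage collision-then-evasion trajectory:
stage 1 aims straight at the ego vehicle, stage 2 keeps the risky
approach but swerves laterally at the end so no collision is rendered."""


def collision_trajectory(start_x, ego_x, steps):
    """Stage 1: a straight-line approach ending at the ego position."""
    dx = (ego_x - start_x) / steps
    return [(start_x + i * dx, 0.0) for i in range(steps + 1)]


def to_evasion(traj, swerve_from=0.6, lateral_gain=2.0):
    """Stage 2: leave the first part of the path (the safety-critical
    approach) untouched and add a growing lateral offset after the
    swerve point, turning the collision into a near-miss."""
    n = len(traj)
    out = []
    for i, (x, y) in enumerate(traj):
        frac = i / (n - 1)
        if frac > swerve_from:
            y += lateral_gain * (frac - swerve_from)
        out.append((x, y))
    return out


risky = collision_trajectory(start_x=-10.0, ego_x=0.0, steps=10)
evasive = to_evasion(risky)  # same approach, nonzero lateral offset at the end
```

The design choice this illustrates is the one claimed in the contribution: the hazard lives in the approach, so editing only the final segment preserves the safety-critical character while keeping the scene within what current video generators can render faithfully.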

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

The per-contribution comparison details restate the three claimed contributions listed above.