AutoDrive-R²: Incentivizing Reasoning and Self-Reflection Capacity for VLA Model in Autonomous Driving

ICLR 2026 Conference Submission · Anonymous Authors
Applications · Robots · Vision–Language–Action Models
Abstract:

Vision–Language–Action (VLA) models in autonomous driving systems have recently demonstrated transformative potential by integrating multimodal perception with decision-making capabilities. However, the interpretability and coherence of the decision process and the plausibility of action sequences remain largely underexplored. To address these issues, we propose AutoDrive-R², a novel VLA framework that enhances both the reasoning and self-reflection capabilities of autonomous driving systems through chain-of-thought (CoT) processing and reinforcement learning (RL). Specifically, we first propose an innovative CoT dataset named nuScenesR²-6K for supervised fine-tuning, which effectively builds cognitive bridges between input information and output trajectories through a four-step logical chain with self-reflection for validation. Moreover, to maximize both reasoning and self-reflection during the RL stage, we further employ the Group Relative Policy Optimization (GRPO) algorithm within a physics-grounded reward framework that incorporates spatial alignment, vehicle dynamics, and temporal smoothness criteria to ensure reliable and realistic trajectory planning. Extensive evaluation results on both the nuScenes and Waymo datasets demonstrate the state-of-the-art performance and robust generalization capacity of our proposed method.
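For readers unfamiliar with GRPO, the sketch below illustrates the core idea the abstract refers to: rewards for a group of rollouts sampled for the same scene are normalized against that group's own statistics, removing the need for a learned value critic. The function and the example reward values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def group_relative_advantages(group_rewards, eps=1e-8):
    """GRPO-style advantage estimate: normalize each rollout's scalar
    reward by the mean and standard deviation of its sampling group,
    so no separate value critic is required."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical usage: several candidate trajectories are sampled for the
# same driving scene, each scored by the physics-grounded reward, and the
# policy gradient is weighted by these group-relative advantages.
rewards = [0.62, 0.48, 0.91, 0.55]           # illustrative reward values
print(group_relative_advantages(rewards))    # above-average rollouts > 0
```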

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes AutoDrive-R², a VLA framework combining chain-of-thought reasoning with self-reflection for autonomous driving. It resides in the Chain-of-Thought Reasoning leaf, which contains four papers total, indicating a moderately populated research direction. This leaf sits within the broader Reasoning Enhancement Mechanisms branch, which encompasses multiple reasoning paradigms including counterfactual analysis and adaptive strategies. The framework's dual emphasis on reasoning and self-reflection positions it at the intersection of structured cognitive processing and validation mechanisms.

The taxonomy reveals that Chain-of-Thought Reasoning neighbors Counterfactual and Self-Reflective Reasoning (two papers) and Adaptive Reasoning Strategies (two papers), suggesting the field is exploring diverse approaches to interpretable decision-making. The broader Multimodal Integration Architectures branch addresses complementary challenges like spatial awareness and unified perception-action frameworks. AutoDrive-R²'s physics-grounded reward framework connects it to Training Paradigms and Optimization, particularly Reinforcement Learning and Online Optimization, indicating cross-cutting methodological contributions beyond pure reasoning architecture.

Among the thirty candidates examined, the core VLA framework contribution shows potential overlap with two prior works, while the nuScenesR²-6K dataset and the physics-grounded GRPO method were each compared against ten candidates with no clear refutations. The dataset contribution appears more distinctive, as no examined work provides a comparable four-step logical chain with self-reflection annotations. The GRPO method's novelty is less clear given the limited search scope, though the specific combination of spatial alignment, vehicle dynamics, and temporal smoothness criteria may differentiate it from existing RL approaches in this domain.

Based on top-thirty semantic matches, the framework's reasoning architecture faces some prior work overlap, while the dataset and training methodology appear more novel within the examined scope. The analysis does not cover exhaustive literature on general VLA models or broader autonomous driving systems, focusing specifically on reasoning-enhanced approaches. The taxonomy structure suggests this is an active but not overcrowded research area, with room for contributions that meaningfully advance interpretability and self-correction capabilities.

Taxonomy

Core-task Taxonomy Papers: 45
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 2

Research Landscape Overview

Core task: Enhancing reasoning and self-reflection in vision-language-action models for autonomous driving.

The field has evolved into several distinct branches that address different facets of building intelligent driving agents. Reasoning Enhancement Mechanisms explore how to inject structured thought processes—such as chain-of-thought or reflective loops—into model predictions, enabling systems to articulate intermediate steps before committing to actions. Multimodal Integration Architectures focus on fusing visual, linguistic, and sometimes spatial or temporal cues within unified frameworks, as seen in works like CoVLA[3] and OmniReason[4]. Training Paradigms and Optimization investigate learning strategies, from imitation and reinforcement learning to self-supervised techniques that leverage large-scale driving data. Datasets and Benchmarks provide standardized evaluation protocols, while Specialized Applications and Extensions tackle domain-specific challenges like safety-critical scenarios or real-time deployment. Surveys and Conceptual Frameworks, including A Survey on Vision-Language-Action[1], offer high-level perspectives on the landscape, and Direct Vision-Action Mapping examines end-to-end approaches that bypass explicit linguistic reasoning.

Within Reasoning Enhancement Mechanisms, a particularly active line of work centers on chain-of-thought reasoning, where models generate intermediate rationales to improve decision transparency and robustness. AutoDrive-R²[0] exemplifies this direction by emphasizing both reasoning and self-reflection, aiming to produce interpretable driving decisions that can be scrutinized and refined. Nearby efforts such as CoT4AD[35] and CoC-VLA[40] similarly adopt structured reasoning chains but may differ in how they balance computational overhead against interpretability gains. A key trade-off across these methods is whether to prioritize explicit linguistic explanations—which enhance human trust and debugging—or to streamline inference for real-time performance. AutoDrive-R²[0] sits squarely in the chain-of-thought cluster, sharing the goal of transparent reasoning with CoT4AD[35] while potentially exploring deeper self-correction loops that distinguish it from simpler one-pass chain-of-thought approaches. This positioning highlights ongoing questions about how much reasoning depth is necessary for safe, reliable autonomous driving.

Claimed Contributions

AutoDrive-R² VLA framework with reasoning and self-reflection

The authors propose a Vision-Language-Action framework that enhances autonomous driving by incorporating chain-of-thought reasoning and self-reflection capabilities, enabling the system to generate physically feasible trajectories while providing interpretable decision-making processes.

10 retrieved papers
Can Refute
nuScenesR²-6K chain-of-thought dataset

The authors introduce the first autonomous driving dataset that includes not only ground-truth trajectories but also structured reasoning steps through a four-step logical chain (observation, calculation, logical deductions, reflection) to train models with both reasoning and self-reflection capabilities.

10 retrieved papers
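To make the four-step logical chain described above concrete, one plausible layout for a single training sample is sketched below; the field names and example values are assumptions for illustration and do not reflect the actual nuScenesR²-6K schema.

```python
# Hypothetical structure of one chain-of-thought planning sample
# (keys and values are illustrative, not the released annotation format).
sample = {
    "scene_inputs": {
        "camera_views": ["CAM_FRONT", "CAM_FRONT_LEFT", "CAM_FRONT_RIGHT"],
        "ego_state": {"speed_mps": 7.2, "heading_deg": 3.5},
    },
    "reasoning_chain": {
        "observation": "A pedestrian is entering the crosswalk about 15 m ahead.",
        "calculation": "At 7.2 m/s the gap closes in roughly 2 s; braking at 2 m/s^2 keeps a >5 m margin.",
        "logical_deduction": "Decelerate to about 4 m/s and hold the current lane until the crosswalk clears.",
        "reflection": "The planned deceleration stays within comfort limits, so the trajectory is accepted.",
    },
    "ground_truth_trajectory": [(0.0, 0.0), (3.4, 0.1), (6.2, 0.3), (8.5, 0.6)],  # (x, y) waypoints in metres
}
```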
Physics-grounded GRPO reinforcement learning method

The authors develop a reinforcement learning approach using Group Relative Policy Optimization with a physics-grounded reward framework that incorporates spatial alignment, vehicle dynamics, and temporal smoothness constraints to ensure physically feasible and realistic trajectory planning.

10 retrieved papers
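The sketch below shows one way a composite reward of this kind could be assembled from spatial-alignment, vehicle-dynamics, and temporal-smoothness terms; the particular error measures, bounds, and weights are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def physics_grounded_reward(pred, gt, dt=0.5,
                            w_spatial=1.0, w_dyn=0.5, w_smooth=0.5,
                            a_max=3.0):
    """Composite trajectory reward (illustrative weights and bounds).

    pred, gt: (T, 2) arrays of predicted / ground-truth waypoints in metres,
    sampled every dt seconds.
    """
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)

    # Spatial alignment: reward shrinks with the average displacement error.
    ade = np.linalg.norm(pred - gt, axis=1).mean()
    r_spatial = np.exp(-ade)

    # Vehicle dynamics: penalize accelerations beyond a plausible bound.
    vel = np.diff(pred, axis=0) / dt
    acc = np.diff(vel, axis=0) / dt
    excess = np.clip(np.linalg.norm(acc, axis=1) - a_max, 0.0, None)
    r_dynamics = np.exp(-excess.mean()) if len(excess) else 1.0

    # Temporal smoothness: penalize jerk, the change in acceleration.
    jerk = np.diff(acc, axis=0) / dt
    r_smooth = np.exp(-np.linalg.norm(jerk, axis=1).mean()) if len(jerk) else 1.0

    return w_spatial * r_spatial + w_dyn * r_dynamics + w_smooth * r_smooth

# Hypothetical usage with short 4-point trajectories (x, y in metres).
pred = [(0.0, 0.0), (3.2, 0.1), (6.0, 0.4), (8.3, 0.7)]
gt   = [(0.0, 0.0), (3.4, 0.1), (6.2, 0.3), (8.5, 0.6)]
print(physics_grounded_reward(pred, gt))
```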

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

AutoDrive-R² VLA framework with reasoning and self-reflection

The authors propose a Vision-Language-Action framework that enhances autonomous driving by incorporating chain-of-thought reasoning and self-reflection capabilities, enabling the system to generate physically feasible trajectories while providing interpretable decision-making processes.

Contribution

nuScenesR²-6K chain-of-thought dataset

The authors introduce the first autonomous driving dataset that includes not only ground-truth trajectories but also structured reasoning steps through a four-step logical chain (observation, calculation, logical deductions, reflection) to train models with both reasoning and self-reflection capabilities.

Contribution

Physics-grounded GRPO reinforcement learning method

The authors develop a reinforcement learning approach using Group Relative Policy Optimization with a physics-grounded reward framework that incorporates spatial alignment, vehicle dynamics, and temporal smoothness constraints to ensure physically feasible and realistic trajectory planning.