Plan-R1: Safe and Feasible Trajectory Planning as Language Modeling

ICLR 2026 Conference Submission
Anonymous Authors
Trajectory Planning · Reinforcement Learning · Autonomous Driving
Abstract:

Safe and feasible trajectory planning is critical for real-world autonomous driving systems. However, existing learning-based planners rely heavily on expert demonstrations, which not only lack explicit safety awareness but also risk inheriting undesirable behaviors such as speeding from suboptimal human driving data. Inspired by the success of large language models, we propose Plan-R1, a two-stage trajectory planning framework that decouples principle alignment from behavior learning. In the first stage, a general trajectory predictor is pre-trained on expert data to capture diverse, human-like driving behaviors. In the second stage, the model is fine-tuned with rule-based rewards using Group Relative Policy Optimization (GRPO), explicitly aligning ego planning with principles such as safety, comfort, and traffic rule compliance. This two-stage paradigm retains human-like behaviors while enhancing safety awareness and discarding undesirable patterns from demonstrations. Furthermore, we identify a key limitation of directly applying GRPO to planning: group-wise normalization erases cross-group scale differences, causing rare, high-variance safety-violation groups to receive advantages of similar magnitude to abundant, low-variance safe groups, thereby suppressing optimization for safety-critical objectives. To address this, we propose Variance-Decoupled GRPO (VD-GRPO), which replaces normalization with centering and fixed scaling to preserve absolute reward magnitudes, ensuring that safety-critical objectives remain dominant throughout training. Experiments on the nuPlan benchmark demonstrate that Plan-R1 significantly improves planning safety and feasibility, achieving state-of-the-art performance, particularly in realistic reactive settings.
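The normalization pitfall the abstract describes can be illustrated with a toy numerical sketch (not the paper's code): per-group standardization maps a rare, high-variance safety-violation group and an abundant, low-variance safe group to advantages of the same order, erasing the difference in absolute reward scale. The reward values below are invented for illustration.

```python
# Toy sketch: why per-group advantage normalization in standard GRPO can
# suppress safety-critical learning signals. The reward values are invented.
import numpy as np

def grpo_advantages(rewards):
    """Standard GRPO: standardize rewards within each sampled group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

unsafe_group = [0.0, -10.0, 0.0, -10.0]   # collisions: huge reward gaps
safe_group   = [0.98, 1.00, 0.99, 1.01]   # only minor comfort differences

a_unsafe = grpo_advantages(unsafe_group)
a_safe = grpo_advantages(safe_group)

# Both groups come out with order-one advantage magnitudes (~1.0 and ~1.3):
# the 10x difference in absolute reward scale has been normalized away.
print(np.abs(a_unsafe).max(), np.abs(a_safe).max())
```

Both groups yield advantages of comparable magnitude, so the safety-violation group no longer dominates the gradient despite its far larger raw penalties.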

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

Plan-R1 proposes a two-stage trajectory planning framework that decouples principle alignment from behavior learning, first pre-training on expert demonstrations and then fine-tuning with rule-based rewards via Group Relative Policy Optimization. The paper resides in the Social-Aware Trajectory Prediction and Planning leaf, which contains only three papers including this one. This is a relatively sparse research direction within the broader taxonomy of fifty papers, suggesting that the specific combination of social-aware planning with explicit principle alignment through reinforcement learning remains underexplored compared to more crowded areas like Model Predictive Control or Reinforcement Learning for Sequential Decision-Making.

The taxonomy reveals that Plan-R1's leaf sits within the Social Interaction and Risk-Aware Planning branch, which emphasizes modeling human-driven vehicle interactions and dynamic risk assessment. Neighboring branches include Learning-Based Planning Approaches—particularly Reinforcement Learning for Sequential Decision-Making and Imitation Learning categories—which share the data-driven ethos but typically lack the explicit social modeling focus. The Optimization-Based Planning Methods branch, containing Model Predictive Control and Quadratic Programming clusters, represents an alternative paradigm prioritizing mathematical constraints over learned policies. Plan-R1 bridges these worlds by combining learned social behaviors with rule-based safety alignment, occupying a distinct position between purely data-driven and purely optimization-centric approaches.

Among the thirteen candidates examined through limited semantic search, none clearly refute the three core contributions. For the two-stage decoupling framework, one candidate was examined and no refutation was found. For Variance-Decoupled GRPO, the most extensively analyzed contribution, ten candidates were examined without identifying overlapping prior work. For the formulation of trajectory planning as a principle-aligned prediction task, two candidates were examined, again with no refutations. These statistics suggest that, within the examined scope, the specific combination of GRPO adaptation for trajectory planning and the variance-decoupling mechanism appears novel, though the limited search scale means potentially relevant work in adjacent reinforcement learning or planning communities may not have been captured.

Based on the top-thirteen semantic matches and the sparse taxonomy leaf containing only two sibling papers, Plan-R1 appears to introduce a distinctive approach to social-aware planning. The absence of refutations across all contributions within this limited scope suggests novelty in the specific technical mechanisms, particularly the variance-decoupled GRPO adaptation. However, the small search scale and the paper's position in an underexplored taxonomy leaf mean that broader connections to reinforcement learning from human feedback or safety-critical RL literature may warrant deeper investigation beyond the current analysis.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 13
Refutable Papers: 0

Research Landscape Overview

Core task: safe and feasible trajectory planning for autonomous driving. The field has evolved into a rich landscape organized around several complementary perspectives. Social Interaction and Risk-Aware Planning emphasizes modeling human behavior and anticipating interactive scenarios, often drawing on game-theoretic or prediction-driven frameworks to handle multi-agent coordination. Learning-Based Planning Approaches leverage neural networks and reinforcement learning to discover policies from data, while Optimization-Based Planning Methods rely on mathematical programming to enforce hard constraints and optimality criteria. Hierarchical and Decoupled Planning Frameworks separate high-level decision-making from low-level trajectory generation, and Sampling-Based and Hybrid Planning Methods combine discrete search with continuous optimization. Safety Verification and Fail-Safe Planning focuses on formal guarantees and backup strategies, Uncertainty-Aware Planning addresses sensor noise and prediction errors, and Scenario-Specific branches target particular driving contexts such as highway merging or urban intersections. Survey and Review Studies synthesize these threads, while Specialized Techniques explore emerging tools like occupancy grids and spatio-temporal representations.

Within this taxonomy, a particularly active line of work centers on integrating prediction and planning in socially aware settings. Plan-R1[0] sits squarely in the Social-Aware Trajectory Prediction and Planning cluster, where the emphasis is on reasoning about other agents' intentions and adapting the ego vehicle's trajectory accordingly. Nearby works such as S4TP[1] and SA-TP[2] similarly prioritize social context, but they may differ in how they encode interaction models or balance computational efficiency with prediction fidelity.
In contrast, branches like Safety Verification and Fail-Safe Planning (e.g., Fail-Safe Motion[5], Fail-Safe Convex Optimization[16]) focus more on worst-case guarantees than on nuanced social reasoning, while Learning-Based Planning Approaches (e.g., RL Trajectory Planning[40]) often trade interpretability for end-to-end adaptability. Plan-R1[0] thus occupies a middle ground: it inherits the interactive modeling ethos of its branch while remaining distinct from purely optimization-centric or purely data-driven paradigms, reflecting ongoing debates about how best to fuse prediction, planning, and safety in complex traffic scenarios.

Claimed Contributions

Plan-R1: Two-stage trajectory planning framework decoupling principle alignment from behavior learning

A framework that first pre-trains a general trajectory predictor on expert data to capture diverse human-like driving behaviors, then fine-tunes it with rule-based rewards using reinforcement learning to explicitly align ego planning with principles such as safety, comfort, and traffic rule compliance without requiring additional expert or preference data.

1 retrieved paper
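The two-stage recipe just described can be sketched as a toy training loop. Everything below (the one-parameter "policy", the sampler, the rule terms, the group size, and the update rule) is a hypothetical stand-in chosen for illustration, not the paper's actual interfaces: stage 1 (imitation pre-training) is assumed done, and stage 2 nudges the policy toward samples whose rule-based reward exceeds the group baseline.

```python
# Runnable toy sketch of the two-stage Plan-R1 recipe. All interfaces and
# constants here are illustrative assumptions, not the paper's implementation.
import random

def rule_based_reward(traj, speed_limit=15.0):
    """Stage-2 reward from explicit rules (no expert or preference labels).
    A trajectory is just a list of speeds; the rule terms are toy placeholders."""
    speeding = sum(max(0.0, v - speed_limit) for v in traj)   # rule compliance
    jerk = sum(abs(b - a) for a, b in zip(traj, traj[1:]))    # comfort proxy
    return -speeding - 0.1 * jerk

def sample_trajectory(mean_speed, horizon=5, rng=random):
    """Stand-in for the pre-trained (stage-1) trajectory sampler."""
    return [mean_speed + rng.uniform(-2.0, 2.0) for _ in range(horizon)]

def stage2_align(mean_speed, steps=300, lr=0.1, group_size=8, seed=0):
    """GRPO-style stage 2: sample a group per scene, score with rules,
    and move the (single-parameter) policy toward above-baseline samples."""
    rng = random.Random(seed)
    for _ in range(steps):
        group = [sample_trajectory(mean_speed, rng=rng) for _ in range(group_size)]
        rewards = [rule_based_reward(t) for t in group]
        baseline = sum(rewards) / len(rewards)          # group centering
        for traj, r in zip(group, rewards):
            direction = sum(traj) / len(traj) - mean_speed
            mean_speed += lr * (r - baseline) * direction / group_size
    return mean_speed

# Starting from a speeding policy (18 m/s), the rule-based reward pulls the
# policy back under the assumed 15 m/s limit without any new expert data.
aligned = stage2_align(mean_speed=18.0)
```

The point of the sketch is the division of labor: imitation supplies the sampler, while the rule-based reward, not additional demonstrations, supplies the alignment signal.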
Variance-Decoupled GRPO (VD-GRPO) for safety-critical optimization

An improved reinforcement learning optimization method that addresses a key limitation of standard GRPO by replacing per-group normalization with centering and fixed scaling, thereby preserving absolute reward magnitudes so that rare, high-variance safety-violation cases generate larger gradients and remain prioritized during training.

10 retrieved papers
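The variance-decoupling mechanism described above can be sketched in a few lines: replace the per-group standard deviation with a fixed constant so absolute reward magnitudes survive into the advantage. The scaling constant and reward values below are our illustrative choices, not values from the paper.

```python
# Hedged sketch of a VD-GRPO-style advantage: center within the group,
# then divide by a fixed constant instead of the group's own std.
import numpy as np

SCALE = 10.0  # fixed scaling constant; an assumption for illustration

def vd_grpo_advantages(rewards, scale=SCALE):
    """Centering plus fixed scaling: no per-group variance normalization."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / scale

unsafe_group = [0.0, -10.0, 0.0, -10.0]   # rare safety-violation group
safe_group   = [0.98, 1.00, 0.99, 1.01]   # abundant low-variance safe group

# The safety-violation group now carries advantages over 300x larger than
# the safe group's, so safety-critical objectives dominate the gradient.
print(np.abs(vd_grpo_advantages(unsafe_group)).max())  # 0.5
print(np.abs(vd_grpo_advantages(safe_group)).max())    # ≈ 0.0015
```

Contrast this with standard GRPO's per-group standardization, under which both groups would produce advantages of roughly the same order-one magnitude.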
Formulation of trajectory planning as principle-aligned prediction task

A novel problem formulation that extends autoregressive trajectory prediction by explicitly incorporating high-level planning principles, enabling the model to generate ego trajectories that are both human-like and compliant with safety and traffic rules while addressing limitations of expert demonstrations.

2 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Plan-R1: Two-stage trajectory planning framework decoupling principle alignment from behavior learning

Contribution

Variance-Decoupled GRPO (VD-GRPO) for safety-critical optimization

Contribution

Formulation of trajectory planning as principle-aligned prediction task