RAP: 3D Rasterization Augmented End-to-End Planning

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Autonomous Driving, Planning, Sim-to-Real
Abstract:

Imitation learning for end-to-end driving trains policies only on expert demonstrations. Once deployed in a closed loop, such policies lack recovery data: small mistakes cannot be corrected and quickly compound into failures. A promising direction is to generate alternative viewpoints and trajectories beyond the logged path. Prior work explores photorealistic digital twins via neural rendering or game engines, but these methods are prohibitively slow and costly, and thus mainly used for evaluation. In this work, we argue that photorealism is unnecessary for training end-to-end planners. What matters is semantic fidelity and scalability: driving depends on geometry and dynamics, not textures or lighting. Motivated by this, we propose 3D Rasterization, which replaces costly rendering with lightweight rasterization of annotated primitives, enabling augmentations such as counterfactual recovery maneuvers and cross-agent view synthesis. To transfer these synthetic views effectively to real-world deployment, we introduce a Raster-to-Real (R2R) feature-space alignment that bridges the sim-to-real gap at the representation level. Together, these components form the Rasterization Augmented Planning (RAP) pipeline, a scalable data augmentation framework for planning. RAP achieves state-of-the-art closed-loop robustness and long-tail generalization, ranking 1st on four major benchmarks: NAVSIM v1/v2, Waymo Open Dataset Vision-based E2E Driving, and Bench2Drive. Our results demonstrate that lightweight rasterization with feature alignment suffices to scale end-to-end training, offering a practical alternative to photorealistic rendering. Code will be released.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a rasterization-based augmentation framework for end-to-end driving, replacing photorealistic rendering with lightweight primitive rasterization to generate counterfactual recovery maneuvers and cross-agent viewpoints. Within the taxonomy, it occupies the 'Rasterization and Primitive-Based Rendering' leaf under 'Synthetic Data Generation and Augmentation Methods'. This leaf currently contains only the original paper itself, with no sibling papers identified. This isolation suggests the rasterization-based approach represents a relatively sparse research direction compared to neighboring leaves like 'Diffusion-Based Video and Image Generation' (four papers) or 'Simulation-Based Data Generation' (two papers).

The taxonomy reveals a crowded landscape of synthetic augmentation methods. Adjacent leaves include diffusion-based synthesis (Synthetic Diffusion Driving, four papers), style transfer techniques (one paper), and simulation-based generation (two papers). The scope notes clarify boundaries: diffusion methods prioritize photorealism via generative models, while simulation-based approaches use physics engines or world models. The original paper explicitly diverges by arguing photorealism is unnecessary—semantic fidelity and scalability matter more. This positions the work as a computational efficiency alternative to heavier generative pipelines, though it shares the broader goal of expanding training data beyond logged expert demonstrations.

Among eighteen candidates examined across three contributions, none were found to clearly refute the proposed methods. The '3D Rasterization Pipeline' examined six candidates with zero refutations; 'Raster-to-Real Feature Alignment' examined ten with zero refutations; 'RAP Framework with Counterfactual Augmentation' examined two with zero refutations. This absence of overlapping prior work within the limited search scope suggests the specific combination of lightweight rasterization, feature-space sim-to-real alignment, and counterfactual augmentation strategies has not been directly addressed in the top-eighteen semantically similar papers. However, the search scale is modest and does not cover the full breadth of autonomous driving or graphics literature.

Based on the limited search scope, the work appears to introduce a distinct technical approach within a broader augmentation landscape. The taxonomy structure shows active research in diffusion-based and simulation-based generation, but the rasterization-based direction remains sparsely populated. The contribution-level statistics indicate no direct prior work among examined candidates, though this reflects top-eighteen semantic matches rather than exhaustive coverage. The novelty assessment is thus conditional on the search boundaries and may shift with deeper exploration of graphics-oriented or real-time rendering communities.

Taxonomy

Core-task taxonomy papers: 30
Claimed contributions: 3
Contribution candidate papers compared: 18
Refutable papers: 0

Research Landscape Overview

Core task: end-to-end autonomous driving planning with data augmentation. The field addresses the challenge of training robust driving policies by combining neural architectures that map sensor inputs directly to control outputs with techniques that expand or enrich the training data.

The taxonomy reveals five main branches. (1) Synthetic Data Generation and Augmentation Methods explores how to create or transform driving scenarios, ranging from diffusion-based scene synthesis (Synthetic Diffusion Driving[2]) to rasterization and primitive-based rendering (RAP Rasterization Planning[0]), to overcome data scarcity and improve generalization. (2) End-to-End Learning Architectures and Training Frameworks focuses on neural network designs and optimization strategies that enable direct sensor-to-action mappings (Modularization Free Driving[1]). (3) Imitation Learning and Behavioral Cloning investigates how to distill expert demonstrations into policies, often addressing distribution shift and covariate mismatch (Robust Behavioral Cloning[28], Imitation Data Augmentation[9]). (4) Trajectory Selection and Multi-Modal Planning deals with generating and scoring multiple candidate trajectories to handle uncertainty and diverse driving behaviors (Multimodal Trajectory Scoring[19], DistillDrive Multimode[10]). (5) Reinforcement Learning for Autonomous Driving examines reward-driven policy optimization, including safe exploration and sample efficiency (Expert Parking RL[15]).

A particularly active line of work centers on how synthetic augmentation interacts with end-to-end architectures: some studies leverage diffusion models or style transfer (Style Transfer Augmentation[20]) to diversify training scenes, while others use lightweight rasterization to rapidly generate varied traffic configurations. RAP Rasterization Planning[0] sits within the rasterization and primitive-based rendering cluster, emphasizing efficient scene rendering to augment planning data. This contrasts with heavier generative approaches such as Synthetic Diffusion Driving[2], which prioritizes photorealistic synthesis at higher computational cost, and with simulation-based methods (Robust Control Simulation[3]) that rely on physics engines.

Meanwhile, the imitation learning branch grapples with the trade-off between data quality and quantity: augmenting expert demonstrations can mitigate overfitting, but poorly designed augmentations risk introducing spurious correlations. Across these directions, open questions remain about the optimal balance between synthetic diversity and real-world fidelity, and about how to integrate augmentation seamlessly into multi-modal planning frameworks that must reason over diverse candidate trajectories.

Claimed Contributions

3D Rasterization Pipeline for Driving Scene Reconstruction

The authors introduce a lightweight, training-free rasterization method that converts annotated driving logs (lane polylines, agent cuboids) into perspective camera views. This approach prioritizes semantic and geometric fidelity over photorealism, enabling fast and controllable scene generation for end-to-end planning.

6 retrieved papers
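To make the pipeline description above concrete, the following is a minimal, hypothetical sketch of how annotated primitives (lane polylines and agent cuboids) could be projected into a perspective camera view with a pinhole model. All function names, intrinsics, and coordinate conventions here are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: projecting annotated primitives into a perspective view.
# Camera frame convention assumed: x right, y down, z forward (pinhole model).

def project_point(p_cam, fx, fy, cx, cy):
    """Project a 3D point in camera coordinates to pixel coordinates."""
    x, y, z = p_cam
    if z <= 0:  # behind the camera: not visible
        return None
    return (fx * x / z + cx, fy * y / z + cy)

def cuboid_corners(center, size):
    """Eight corners of an axis-aligned agent cuboid given center and (l, w, h)."""
    px, py, pz = center
    l, w, h = size
    return [(px + dx, py + dy, pz + dz)
            for dx in (-l / 2, l / 2)
            for dy in (-w / 2, w / 2)
            for dz in (-h / 2, h / 2)]

def rasterize_primitives(polylines, cuboids, fx=1000.0, fy=1000.0,
                         cx=960.0, cy=540.0):
    """Project lane polylines and cuboid corners into pixel space,
    dropping points that fall behind the camera."""
    out = {"lanes": [], "agents": []}
    for pl in polylines:
        pts = [project_point(p, fx, fy, cx, cy) for p in pl]
        out["lanes"].append([p for p in pts if p is not None])
    for center, size in cuboids:
        pts = [project_point(c, fx, fy, cx, cy)
               for c in cuboid_corners(center, size)]
        out["agents"].append([p for p in pts if p is not None])
    return out
```

Because this step is pure projection with no learned components, it is training-free and cheap, which is the property the contribution claims over photorealistic rendering.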
Raster-to-Real (R2R) Feature-Space Alignment Module

The authors propose a feature-space alignment technique that minimizes the domain gap between rasterized and real images at both spatial and global levels, using MSE loss and gradient reversal for domain confusion. This enables effective transfer from synthetic rasterized inputs to real-world deployment.

10 retrieved papers
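The two alignment signals described above can be sketched as follows. This is a minimal illustration, assuming DANN-style gradient reversal for domain confusion; shapes, names, and the reversal strength are assumptions rather than the paper's specification.

```python
# Hypothetical sketch of the two R2R alignment signals:
# (1) spatial MSE between rasterized and real feature maps,
# (2) a gradient-reversal hook for adversarial domain confusion.

def mse_alignment_loss(feat_raster, feat_real):
    """Mean squared error between two equally-sized feature maps
    (flat lists of floats here, standing in for C x H x W tensors)."""
    assert len(feat_raster) == len(feat_real)
    return sum((a - b) ** 2
               for a, b in zip(feat_raster, feat_real)) / len(feat_raster)

class GradReverse:
    """Identity in the forward pass; scales gradients by -lambda in the
    backward pass, so the feature extractor learns to confuse a domain
    classifier (DANN-style adversarial alignment)."""
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x  # features pass through unchanged

    def backward(self, grad):
        return [-self.lam * g for g in grad]  # reversed, scaled gradient
```

The MSE term pulls rasterized features toward their real counterparts pointwise, while the reversed gradient makes the shared encoder's global statistics indistinguishable across domains, matching the spatial-plus-global framing in the contribution summary.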
RAP Framework with Counterfactual Augmentation Strategies

The authors develop RAP, a complete data augmentation framework that combines 3D rasterization with recovery-oriented trajectory perturbations and cross-agent view synthesis. This framework addresses covariate shift in imitation learning by generating diverse training scenarios beyond logged expert demonstrations.

2 retrieved papers
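A recovery-oriented perturbation of the kind described above can be illustrated as follows: displace the ego start pose laterally off the logged path, then synthesize a target trajectory that decays back onto the expert waypoints. The linear decay schedule and the offset magnitude are illustrative assumptions, not the paper's actual scheme.

```python
# Hypothetical sketch of a recovery-oriented counterfactual: start displaced
# from the expert path and blend back onto it over the planning horizon.

def recovery_trajectory(expert_xy, lateral_offset, horizon=None):
    """Given expert waypoints [(x, y), ...], shift the first pose by
    `lateral_offset` in y and decay the offset linearly to zero, producing
    a 'recover toward the demonstration' training target."""
    n = len(expert_xy) if horizon is None else min(horizon, len(expert_xy))
    out = []
    for i in range(n):
        w = 1.0 - i / max(n - 1, 1)  # 1 at the start, 0 at the last point
        x, y = expert_xy[i]
        out.append((x, y + w * lateral_offset))
    return out
```

Pairing such perturbed trajectories with views rasterized from the displaced pose is what supplies the recovery data that pure imitation on logged demonstrations lacks.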

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

3D Rasterization Pipeline for Driving Scene Reconstruction

Six candidate papers were examined for this contribution; none were found to refute it.

Contribution

Raster-to-Real (R2R) Feature-Space Alignment Module

Ten candidate papers were examined for this contribution; none were found to refute it.

Contribution

RAP Framework with Counterfactual Augmentation Strategies

Two candidate papers were examined for this contribution; none were found to refute it.