DriveMamba: Task-Centric Scalable State Space Model for Efficient End-to-End Autonomous Driving

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Autonomous Driving · Task-Centric Paradigm · Scalable State Space Model
Abstract:

Recent advances in End-to-End Autonomous Driving (E2E-AD) focus on integrating modular designs into a unified framework for joint optimization. Most of these approaches follow a sequential paradigm (i.e., perception-prediction-planning) built on separable Transformer decoders and rely on dense BEV features to encode scene representations. However, such a manually ordered design inevitably causes information loss and cumulative errors, and lacks flexible, diverse relation modeling among modules and sensors. Meanwhile, insufficient training of the image backbone and the quadratic complexity of the attention mechanism hinder the scalability and efficiency of E2E-AD systems on spatiotemporal input. To this end, we propose DriveMamba, a Task-Centric Scalable paradigm for efficient E2E-AD that integrates dynamic task relation modeling, implicit view correspondence learning, and long-term temporal fusion into a single-stage Unified Mamba decoder. Specifically, both extracted image features and expected task outputs are first converted into token-level sparse representations, which are then sorted by their instantiated positions in 3D space. The linear-complexity operator enables efficient long-context sequential token modeling that captures task-related inter-dependencies simultaneously. Additionally, a bidirectional trajectory-guided "local-to-global" scan method preserves spatial locality from the ego perspective, thereby facilitating ego-planning. Extensive experiments on the nuScenes and Bench2Drive datasets demonstrate the superiority, generalizability, and efficiency of DriveMamba.
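As a rough illustration of why a linear-complexity operator matters for long token sequences, the sketch below runs a generic (non-selective) state space recurrence over a token sequence in a single O(L) pass, in contrast to the O(L²) pairwise interactions of attention. The scalars `A`, `B`, `C` and the function `ssm_scan` are illustrative placeholders, not DriveMamba's actual parameterization.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Minimal linear-time state space scan (illustrative only).

    Processes a length-L token sequence with the recurrence
    h_t = A * h_{t-1} + B * x_t and readout y_t = C * h_t,
    so cost grows as O(L) rather than the O(L^2) of attention.
    """
    L, d = x.shape
    h = np.zeros(d)
    ys = np.empty_like(x)
    for t in range(L):
        h = A * h + B * x[t]   # state update: O(d) work per token
        ys[t] = C * h          # readout at each step
    return ys

tokens = np.random.randn(16, 4)   # 16 tokens, 4-dim features
out = ssm_scan(tokens, A=0.9, B=0.5, C=1.0)
print(out.shape)  # (16, 4)
```

Real Mamba layers make `A`, `B`, `C` input-dependent and compute the scan in parallel on hardware; the sequential loop here only conveys the asymptotic cost.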

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholar search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes DriveMamba, a unified end-to-end autonomous driving framework that integrates perception, prediction, and planning using Mamba state space models. According to the taxonomy, this work resides in the 'Mamba-Based Unified End-to-End Driving' leaf, which currently contains no sibling papers. This leaf sits within the broader 'State Space Model Architectures for Perception and Planning' branch, which includes only four total papers across four distinct leaves. The sparse population of this branch suggests that applying Mamba architectures to unified end-to-end driving is an emerging and relatively unexplored research direction within the field.

The taxonomy reveals that neighboring research directions include 'Mamba for Temporal BEV Perception' (BevMamba), 'Mamba for Multi-Modal Video Understanding', and 'Trajectory Prediction with Selective State Spaces', each focusing on specific subtasks rather than unified end-to-end systems. The broader field shows substantial activity in 'Latent World Model-Based End-to-End Driving' (seven papers across four leaves) and 'Deep Reinforcement Learning for End-to-End Driving' (five papers across five leaves). DriveMamba diverges from these directions by eschewing explicit latent dynamics modeling or pure RL optimization in favor of direct state space sequential processing across all driving tasks simultaneously.

Among the three identified contributions, the literature search examined 28 total candidates and found no refutable pairs. Contribution A (Task-Centric Scalable paradigm) examined eight candidates with zero refutations; Contribution B (bidirectional trajectory-guided scan) examined ten candidates with zero refutations; Contribution C (Unified Mamba decoder) examined ten candidates with zero refutations. This limited search scope, covering top-K semantic matches rather than exhaustive field coverage, suggests that within the examined candidate pool no prior work directly overlaps with the proposed technical innovations. However, the small search scale means substantial related work may exist outside the examined set.

Given the sparse taxonomy leaf (zero siblings) and the absence of refutations among 28 examined candidates, the work appears to occupy a relatively novel position within the limited search scope. The integration of Mamba state space models into a single-stage unified driving architecture represents a distinct approach compared to the examined latent world model and RL-based methods. However, the analysis is constrained by the top-K semantic search methodology and does not constitute an exhaustive literature review, leaving open the possibility of relevant prior work in adjacent research communities or recent preprints.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 0

Research Landscape Overview

Core task: end-to-end autonomous driving with state space models. The field encompasses a diverse set of approaches that leverage state space representations to model vehicle dynamics, perception, and planning in a unified framework. At the highest level, the taxonomy reveals several major branches: latent world model-based methods that learn compressed scene representations for prediction and control (e.g., Latent World Model[1], World4drive[2]); state space model architectures that integrate perception and planning modules using modern sequential models; deep reinforcement learning techniques that optimize driving policies through trial and error (e.g., Dueling Double DQN[3], Imitation to Exploration[8]); model predictive control formulations that exploit state space dynamics for real-time optimization (e.g., Neural State-Space MPC[11], State Lattice MPC[37]); and specialized branches addressing state estimation, data-driven vehicle modeling, path tracking, collision avoidance, decision-making, world models, safety verification, and control optimization.

These branches reflect complementary emphases: some prioritize interpretability and safety guarantees through classical control theory, while others pursue end-to-end learning for scalability and generalization. A particularly active line of work explores latent world models that compress high-dimensional sensor data into compact state representations, enabling efficient planning and prediction (Latent World Model[1], Semantic Masked World[5], DriveWorld[32]). In contrast, model predictive control approaches often rely on explicit vehicle dynamics and optimization-based planning, trading off computational cost for formal guarantees.

DriveMamba[0] sits within the emerging cluster of Mamba-based unified architectures, which apply state space models directly to perception and planning tasks.
Compared to latent world model methods like Latent World Model[1] or World4drive[2], DriveMamba[0] emphasizes the use of structured state space layers for efficient sequential processing, rather than learning a separate latent dynamics model. This positions it alongside works like BevMamba[17] and Hybrid State Space[9], which similarly explore modern state space architectures for autonomous driving, but with different emphases on sensor fusion, temporal modeling, and integration with classical control paradigms.

Claimed Contributions

DriveMamba: Task-Centric Scalable State Space Model paradigm for E2E-AD

The authors introduce DriveMamba, a novel end-to-end autonomous driving framework that uses a unified Mamba decoder to simultaneously model task relations, learn view correspondences, and fuse temporal information. This paradigm operates on sparse token-level representations rather than dense BEV features, enabling efficient and scalable processing with linear complexity.

8 retrieved papers
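To make the "sparse token-level representations rather than dense BEV features" claim concrete, here is a minimal sketch of one plausible sparsification scheme: keep only the strongest cells of a dense feature map as (position, feature) tokens. The function `sparsify_features` and its norm-based saliency rule are hypothetical illustrations, not the paper's actual tokenization.

```python
import numpy as np

def sparsify_features(feat_map, keep=64):
    """Hypothetical sketch: keep the top-`keep` activations of a dense
    H x W x C feature map as sparse (feature, position) tokens, instead
    of carrying the full dense grid forward."""
    H, W, C = feat_map.shape
    scores = np.linalg.norm(feat_map, axis=-1).ravel()  # saliency per cell
    idx = np.argsort(scores)[-keep:]                    # strongest cells
    ys, xs = np.unravel_index(idx, (H, W))
    tokens = feat_map[ys, xs]                           # (keep, C) features
    positions = np.stack([xs, ys], axis=-1)             # token coordinates
    return tokens, positions

feat = np.random.randn(32, 32, 8)       # dense map: 1024 cells
tok, pos = sparsify_features(feat)      # sparse: 64 tokens
print(tok.shape, pos.shape)  # (64, 8) (64, 2)
```

The payoff is that downstream sequence length scales with the number of kept tokens (here 64) rather than the full grid size (here 1024), which is what makes linear-complexity sequential processing attractive.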
Bidirectional trajectory-guided local-to-global scan method

The authors design a hybrid spatiotemporal scanning method that organizes tokens based on their 3D positions and ego-vehicle trajectory. This scan preserves spatial locality from the ego-vehicle perspective and captures task-related dependencies in a manner suited for interactive planning.

10 retrieved papers
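A minimal sketch of what a "local-to-global" token ordering could look like, under the assumption that locality is measured by Euclidean distance from the ego position: nearby tokens are scanned first, far tokens last, and a reversed second pass gives the bidirectional variant. The function `local_to_global_order` is an illustrative stand-in, not the paper's exact trajectory-guided algorithm.

```python
import numpy as np

def local_to_global_order(token_xy, ego_xy):
    """Rank tokens by distance from the ego position so a sequential
    scan visits near (local) tokens before far (global) ones."""
    d = np.linalg.norm(token_xy - ego_xy, axis=-1)
    return np.argsort(d)

token_xy = np.array([[10.0, 0.0],   # far ahead
                     [1.0, 1.0],    # nearby
                     [5.0, -2.0]])  # mid-range
ego_xy = np.array([0.0, 0.0])

order = local_to_global_order(token_xy, ego_xy)
print(order)            # nearest-first: [1 2 0]
reverse = order[::-1]   # second pass for a bidirectional scan
```

A trajectory-guided version would presumably recompute or warp this ordering along the planned ego path rather than from a single point, so that spatial locality is preserved at every step of the planned motion.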
Unified Mamba decoder for parallel task modeling

The authors propose a unified decoder architecture based on bidirectional Mamba blocks that processes task queries and sensor tokens in parallel. This design enables dynamic task relation modeling without manual sequential ordering, supporting scalability through simple layer stacking with linear complexity.

10 retrieved papers
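The "parallel task modeling" idea can be sketched as follows: task queries and sensor tokens are concatenated into one sequence, and stacked bidirectional recurrent blocks refine all of them jointly, with no hand-imposed perception-then-prediction-then-planning order. The block below is a toy stand-in for a real bidirectional Mamba block (a plain exponential-decay recurrence, no selectivity, gating, or normalization), and the query/token counts are invented for illustration.

```python
import numpy as np

def bidirectional_block(seq, decay=0.9):
    """Toy bidirectional SSM block: run a simple linear recurrence
    forward and backward over the sequence and sum the two passes."""
    def scan(s):
        h = np.zeros(s.shape[1])
        out = np.empty_like(s)
        for t in range(len(s)):
            h = decay * h + s[t]
            out[t] = h
        return out
    return scan(seq) + scan(seq[::-1])[::-1]

# Task queries (e.g. detection / motion / planning) and sensor tokens
# share one sequence, so every task can attend to every other in one pass.
queries = np.random.randn(6, 16)     # 6 task queries, 16-dim
sensors = np.random.randn(50, 16)    # 50 sparse sensor tokens
seq = np.concatenate([sensors, queries], axis=0)

for _ in range(3):                   # scalability via simple layer stacking
    seq = bidirectional_block(seq)

refined_queries = seq[-6:]           # task outputs read back from the sequence
print(refined_queries.shape)  # (6, 16)
```

Because each block is linear in sequence length, adding more tasks, cameras, or history frames grows cost linearly, which is the scalability argument the contribution makes.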

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution A (Task-Centric Scalable State Space Model paradigm for E2E-AD): eight candidate papers examined, no refutations found.

Contribution B (Bidirectional trajectory-guided local-to-global scan method): ten candidate papers examined, no refutations found.

Contribution C (Unified Mamba decoder for parallel task modeling): ten candidate papers examined, no refutations found.

Descriptions of each contribution are given under Claimed Contributions above.