DriveMamba: Task-Centric Scalable State Space Model for Efficient End-to-End Autonomous Driving

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Autonomous Driving · Task-Centric Paradigm · Scalable State Space Model
Abstract:

Recent advances in End-to-End Autonomous Driving (E2E-AD) focus on integrating modular designs into a unified framework for joint optimization. Most of these approaches follow a sequential paradigm (i.e., perception-prediction-planning) built on separable Transformer decoders and rely on dense BEV features to encode scene representations. However, such a manually ordered design inevitably causes information loss and cumulative errors, and lacks flexible, diverse relation modeling among modules and sensors. Meanwhile, insufficient training of the image backbone and the quadratic complexity of the attention mechanism hinder the scalability and efficiency of E2E-AD systems on spatiotemporal input. To this end, we propose DriveMamba, a Task-Centric Scalable paradigm for efficient E2E-AD that integrates dynamic task relation modeling, implicit view correspondence learning, and long-term temporal fusion into a single-stage Unified Mamba decoder. Specifically, both extracted image features and expected task outputs are first converted into token-level sparse representations, which are then sorted by their instantiated positions in 3D space. The linear-complexity operator enables efficient long-context sequential token modeling that captures task-related inter-dependencies simultaneously. Additionally, a bidirectional trajectory-guided "local-to-global" scan method preserves spatial locality from the ego perspective, thereby facilitating ego-planning. Extensive experiments on the nuScenes and Bench2Drive datasets demonstrate the superiority, generalizability, and efficiency of DriveMamba.
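As a rough illustration of why a linear-complexity operator matters for long token sequences, the sketch below runs a generic (non-selective) state space recurrence over a token sequence in a single O(L) pass, in contrast to the O(L²) pairwise interactions of attention. The scalars `A`, `B`, `C` and the function `ssm_scan` are illustrative placeholders, not DriveMamba's actual parameterization.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Minimal linear-time state space scan (illustrative only).

    Processes a length-L token sequence with the recurrence
    h_t = A * h_{t-1} + B * x_t and readout y_t = C * h_t,
    so cost grows as O(L) rather than the O(L^2) of attention.
    """
    L, d = x.shape
    h = np.zeros(d)
    ys = np.empty_like(x)
    for t in range(L):
        h = A * h + B * x[t]   # state update: O(d) work per token
        ys[t] = C * h          # readout at each step
    return ys

tokens = np.random.randn(16, 4)   # 16 tokens, 4-dim features
out = ssm_scan(tokens, A=0.9, B=0.5, C=1.0)
print(out.shape)  # (16, 4)
```

Real Mamba layers make `A`, `B`, `C` input-dependent and compute the scan in parallel on hardware; the sequential loop here only conveys the asymptotic cost.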

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholar search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes DriveMamba, a unified end-to-end autonomous driving framework that integrates perception, prediction, and planning using Mamba state space models. According to the taxonomy, this work resides in the 'Mamba-Based Unified End-to-End Driving' leaf, which currently contains no sibling papers. This leaf sits within the broader 'State Space Model Architectures for Perception and Planning' branch, which includes only four total papers across four distinct leaves. The sparse population of this branch suggests that applying Mamba architectures to unified end-to-end driving is an emerging and relatively unexplored research direction within the field.

The taxonomy reveals that neighboring research directions include 'Mamba for Temporal BEV Perception' (BevMamba), 'Mamba for Multi-Modal Video Understanding', and 'Trajectory Prediction with Selective State Spaces', each focusing on specific subtasks rather than unified end-to-end systems. The broader field shows substantial activity in 'Latent World Model-Based End-to-End Driving' (seven papers across four leaves) and 'Deep Reinforcement Learning for End-to-End Driving' (five papers across five leaves). DriveMamba diverges from these directions by eschewing explicit latent dynamics modeling or pure RL optimization in favor of direct state space sequential processing across all driving tasks simultaneously.

Among the three identified contributions, the literature search examined 28 total candidates and found no refutable pairs. Contribution A (Task-Centric Scalable paradigm) examined eight candidates with zero refutations; Contribution B (bidirectional trajectory-guided scan) examined ten candidates with zero refutations; Contribution C (Unified Mamba decoder) examined ten candidates with zero refutations. This limited search scope, covering top-K semantic matches rather than exhaustive field coverage, suggests that within the examined candidate pool no prior work directly overlaps with the proposed technical innovations. However, the small search scale means substantial related work may exist outside the examined set.

Given the sparse taxonomy leaf (zero siblings) and the absence of refutations among 28 examined candidates, the work appears to occupy a relatively novel position within the limited search scope. The integration of Mamba state space models into a single-stage unified driving architecture represents a distinct approach compared to the examined latent world model and RL-based methods. However, the analysis is constrained by the top-K semantic search methodology and does not constitute an exhaustive literature review, leaving open the possibility of relevant prior work in adjacent research communities or recent preprints.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 0

Research Landscape Overview

Core task: end-to-end autonomous driving with state space models. The field encompasses a diverse set of approaches that leverage state space representations to model vehicle dynamics, perception, and planning in a unified framework. At the highest level, the taxonomy reveals several major branches: latent world model-based methods that learn compressed scene representations for prediction and control (e.g., Latent World Model[1], World4drive[2]); state space model architectures that integrate perception and planning modules using modern sequential models; deep reinforcement learning techniques that optimize driving policies through trial and error (e.g., Dueling Double DQN[3], Imitation to Exploration[8]); model predictive control formulations that exploit state space dynamics for real-time optimization (e.g., Neural State-Space MPC[11], State Lattice MPC[37]); and specialized branches addressing state estimation, data-driven vehicle modeling, path tracking, collision avoidance, decision-making, world models, safety verification, and control optimization.

These branches reflect complementary emphases: some prioritize interpretability and safety guarantees through classical control theory, while others pursue end-to-end learning for scalability and generalization. A particularly active line of work explores latent world models that compress high-dimensional sensor data into compact state representations, enabling efficient planning and prediction (Latent World Model[1], Semantic Masked World[5], DriveWorld[32]). In contrast, model predictive control approaches often rely on explicit vehicle dynamics and optimization-based planning, trading off computational cost for formal guarantees.

DriveMamba[0] sits within the emerging cluster of Mamba-based unified architectures, which apply state space models directly to perception and planning tasks.
Compared to latent world model methods like Latent World Model[1] or World4drive[2], DriveMamba[0] emphasizes the use of structured state space layers for efficient sequential processing, rather than learning a separate latent dynamics model. This positions it alongside works like BevMamba[17] and Hybrid State Space[9], which similarly explore modern state space architectures for autonomous driving, but with different emphases on sensor fusion, temporal modeling, and integration with classical control paradigms.

Claimed Contributions

DriveMamba: Task-Centric Scalable State Space Model paradigm for E2E-AD

The authors introduce DriveMamba, a novel end-to-end autonomous driving framework that uses a unified Mamba decoder to simultaneously model task relations, learn view correspondences, and fuse temporal information. This paradigm operates on sparse token-level representations rather than dense BEV features, enabling efficient and scalable processing with linear complexity.

8 retrieved papers
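To make the "sparse token-level representations rather than dense BEV features" claim concrete, here is a minimal sketch of one plausible sparsification scheme: keep only the strongest cells of a dense feature map as (position, feature) tokens. The function `sparsify_features` and its norm-based saliency rule are hypothetical illustrations, not the paper's actual tokenization.

```python
import numpy as np

def sparsify_features(feat_map, keep=64):
    """Hypothetical sketch: keep the top-`keep` activations of a dense
    H x W x C feature map as sparse (feature, position) tokens, instead
    of carrying the full dense grid forward."""
    H, W, C = feat_map.shape
    scores = np.linalg.norm(feat_map, axis=-1).ravel()  # saliency per cell
    idx = np.argsort(scores)[-keep:]                    # strongest cells
    ys, xs = np.unravel_index(idx, (H, W))
    tokens = feat_map[ys, xs]                           # (keep, C) features
    positions = np.stack([xs, ys], axis=-1)             # token coordinates
    return tokens, positions

feat = np.random.randn(32, 32, 8)       # dense map: 1024 cells
tok, pos = sparsify_features(feat)      # sparse: 64 tokens
print(tok.shape, pos.shape)  # (64, 8) (64, 2)
```

The payoff is that downstream sequence length scales with the number of kept tokens (here 64) rather than the full grid size (here 1024), which is what makes linear-complexity sequential processing attractive.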
Bidirectional trajectory-guided local-to-global scan method

The authors design a hybrid spatiotemporal scanning method that organizes tokens based on their 3D positions and ego-vehicle trajectory. This scan preserves spatial locality from the ego-vehicle perspective and captures task-related dependencies in a manner suited for interactive planning.

10 retrieved papers
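A minimal sketch of what a "local-to-global" token ordering could look like, under the assumption that locality is measured by Euclidean distance from the ego position: nearby tokens are scanned first, far tokens last, and a reversed second pass gives the bidirectional variant. The function `local_to_global_order` is an illustrative stand-in, not the paper's exact trajectory-guided algorithm.

```python
import numpy as np

def local_to_global_order(token_xy, ego_xy):
    """Rank tokens by distance from the ego position so a sequential
    scan visits near (local) tokens before far (global) ones."""
    d = np.linalg.norm(token_xy - ego_xy, axis=-1)
    return np.argsort(d)

token_xy = np.array([[10.0, 0.0],   # far ahead
                     [1.0, 1.0],    # nearby
                     [5.0, -2.0]])  # mid-range
ego_xy = np.array([0.0, 0.0])

order = local_to_global_order(token_xy, ego_xy)
print(order)            # nearest-first: [1 2 0]
reverse = order[::-1]   # second pass for a bidirectional scan
```

A trajectory-guided version would presumably recompute or warp this ordering along the planned ego path rather than from a single point, so that spatial locality is preserved at every step of the planned motion.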
Unified Mamba decoder for parallel task modeling

The authors propose a unified decoder architecture based on bidirectional Mamba blocks that processes task queries and sensor tokens in parallel. This design enables dynamic task relation modeling without manual sequential ordering, supporting scalability through simple layer stacking with linear complexity.

10 retrieved papers
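The "parallel task modeling" idea can be sketched as follows: task queries and sensor tokens are concatenated into one sequence, and stacked bidirectional recurrent blocks refine all of them jointly, with no hand-imposed perception-then-prediction-then-planning order. The block below is a toy stand-in for a real bidirectional Mamba block (a plain exponential-decay recurrence, no selectivity, gating, or normalization), and the query/token counts are invented for illustration.

```python
import numpy as np

def bidirectional_block(seq, decay=0.9):
    """Toy bidirectional SSM block: run a simple linear recurrence
    forward and backward over the sequence and sum the two passes."""
    def scan(s):
        h = np.zeros(s.shape[1])
        out = np.empty_like(s)
        for t in range(len(s)):
            h = decay * h + s[t]
            out[t] = h
        return out
    return scan(seq) + scan(seq[::-1])[::-1]

# Task queries (e.g. detection / motion / planning) and sensor tokens
# share one sequence, so every task can attend to every other in one pass.
queries = np.random.randn(6, 16)     # 6 task queries, 16-dim
sensors = np.random.randn(50, 16)    # 50 sparse sensor tokens
seq = np.concatenate([sensors, queries], axis=0)

for _ in range(3):                   # scalability via simple layer stacking
    seq = bidirectional_block(seq)

refined_queries = seq[-6:]           # task outputs read back from the sequence
print(refined_queries.shape)  # (6, 16)
```

Because each block is linear in sequence length, adding more tasks, cameras, or history frames grows cost linearly, which is the scalability argument the contribution makes.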

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution A (Task-Centric Scalable State Space Model paradigm for E2E-AD): eight candidate papers examined, no refutations found.

Contribution B (Bidirectional trajectory-guided local-to-global scan method): ten candidate papers examined, no refutations found.

Contribution C (Unified Mamba decoder for parallel task modeling): ten candidate papers examined, no refutations found.

Descriptions of each contribution are given under Claimed Contributions above.