AsyncBEV: Cross-modal flow alignment in Asynchronous 3D Object Detection

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: multi-modal 3D object detection · autonomous driving · asynchronous fusion
Abstract:

In autonomous driving, multi-modal perception tasks like 3D object detection typically rely on well-synchronized sensors, both at training and inference time. However, despite the use of hardware- or software-based synchronization algorithms, perfect synchrony is rarely guaranteed: sensors may operate at different frequencies, and real-world factors such as network latency, hardware failures, or processing bottlenecks often introduce time offsets between sensors. Such asynchrony degrades perception performance, especially for dynamic objects. To address this challenge, we propose AsyncBEV, a lightweight, trainable, and generic module that improves the robustness of 3D Bird's Eye View (BEV) object detection models against sensor asynchrony. Inspired by scene flow estimation, AsyncBEV first estimates the 2D flow between the BEV features of two different sensor modalities, taking into account the known time offset between their measurements. The predicted feature flow is then used to warp and spatially align the feature maps, and we show that this step can easily be integrated into different current BEV detector architectures (e.g., BEV grid-based and token-based). Extensive experiments demonstrate that AsyncBEV improves robustness against both small and large asynchrony between LiDAR and camera sensors in both the token-based CMT and the grid-based UniBEV, especially for dynamic objects. We significantly outperform the ego-motion-compensated CMT and UniBEV baselines, notably by 16.6% and 11.9% NDS on dynamic objects in the worst-case scenario of a 0.5 s time offset. Code will be released upon acceptance.
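The core mechanism the abstract describes (predict a 2D flow in BEV space, scale it by the known time offset, and warp the lagging modality's features onto the reference timestamp) can be sketched in a few lines. This is a minimal numpy illustration under assumed conventions, not the authors' implementation: the function name `warp_bev`, the cells-per-second flow units, and the nearest-neighbor backward sampling are all simplifications. In the paper's setting the flow would come from a trainable network, and a differentiable (e.g., bilinear) sampler would replace the rounding here.

```python
import numpy as np

def warp_bev(feat, flow, dt):
    """Backward-warp a BEV feature map by a per-cell 2D flow scaled by dt.

    feat: (C, H, W) BEV features of the lagging sensor.
    flow: (2, H, W) predicted motion in cells per second (x, y order).
    dt:   time offset in seconds between the two sensor measurements.

    Nearest-neighbor sampling keeps the sketch short; a trainable module
    would use differentiable bilinear sampling instead.
    """
    c, h, w = feat.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Each output cell reads from the source location displaced by -flow * dt.
    src_y = np.clip(np.rint(ys - flow[1] * dt), 0, h - 1).astype(int)
    src_x = np.clip(np.rint(xs - flow[0] * dt), 0, w - 1).astype(int)
    return feat[:, src_y, src_x]

# A point feature moving at 2 cells/s along x, observed with a 0.5 s lag,
# is shifted by one cell to its reference-time position.
feat = np.zeros((1, 8, 8))
feat[0, 5, 5] = 1.0
flow = np.zeros((2, 8, 8))
flow[0] = 2.0  # constant x-motion everywhere
aligned = warp_bev(feat, flow, 0.5)
```

Because the offset enters the warp multiplicatively, the same flow head can in principle compensate both small and large asynchrony, which matches the abstract's claim of robustness across offset magnitudes.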

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes AsyncBEV, a module that predicts 2D flow in bird's-eye-view feature space to align asynchronous LiDAR-camera data for 3D object detection. It resides in the 'BEV Feature Flow Prediction for Sensor Asynchrony' leaf, which contains only two papers including this one. This represents a relatively sparse research direction within the broader taxonomy of asynchronous multi-modal detection, suggesting the specific approach of BEV-space flow prediction for temporal alignment is not yet heavily explored compared to attention-based or cooperative perception methods.

The taxonomy reveals neighboring directions that address asynchrony differently. The sibling leaf 'Vehicle-Infrastructure Flow-Based Cooperative Fusion' applies flow prediction in V2I scenarios rather than single-vehicle settings. Adjacent branches include 'Attention-Based Multi-Modal and Cooperative Fusion' with temporal attention mechanisms, and 'Calibration-Robust and Geometry-Aware Fusion' emphasizing geometric constraints. AsyncBEV diverges from these by focusing on explicit BEV-space flow modeling for single-vehicle sensor synchronization, rather than attention-driven fusion or multi-agent cooperation, occupying a distinct methodological niche.

Among the twenty-six candidates examined, the AsyncBEV module contribution has one refutable candidate out of ten examined, and the cross-modal flow alignment approach has two out of six. The generic integration framework contribution appears more novel, with zero refutable candidates among ten examined. The limited search scope means these statistics reflect top-K semantic matches and citation expansion, not exhaustive coverage. The flow-based contributions face more substantial prior-work overlap, while the architectural integration aspect appears less contested within the examined literature.

Based on the limited search scope of twenty-six candidates, the work appears to occupy a sparsely populated research direction with modest prior work overlap. The taxonomy structure suggests flow-based temporal alignment in BEV space is less crowded than attention-based or cooperative approaches. However, the analysis cannot confirm novelty beyond the examined candidate set, and the presence of refutable candidates for core contributions indicates meaningful prior work exists in this specific methodological space.

Taxonomy

Core-task Taxonomy Papers: 30
Claimed Contributions: 3
Contribution Candidate Papers Compared: 26
Refutable Papers: 3

Research Landscape Overview

Core task: Asynchronous multi-modal 3D object detection in autonomous driving. The field addresses the challenge of fusing data from sensors (cameras, LiDAR, radar, event cameras) that operate at different frequencies and with varying latencies, requiring methods to align temporal mismatches while preserving spatial accuracy. The taxonomy reveals several complementary research directions: Temporal Alignment and Flow-Based Fusion Methods focus on predicting intermediate representations to bridge asynchronous gaps, often using optical or scene flow in bird's-eye view (BEV) space. Attention-Based Multi-Modal and Cooperative Fusion emphasizes learning-driven alignment through cross-modal attention mechanisms. Calibration-Robust and Geometry-Aware Fusion tackles extrinsic parameter uncertainties and geometric consistency across modalities. Event-Based Asynchronous Perception leverages neuromorphic sensors for high-temporal-resolution inputs. Multi-Agent Cooperative Perception Systems extend fusion to vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) scenarios, where communication delays and pose uncertainties compound asynchrony challenges. Specialized Sensor Fusion Architectures and Foundational Methods provide system-level integration strategies, while Survey and Review Literature synthesizes progress across these branches.

Recent work has intensified around flow-based temporal compensation and cooperative perception under real-world constraints. AsyncBEV[0] exemplifies the flow prediction approach by forecasting BEV features to synchronize asynchronous sensor streams, closely related to Velocity driven vision[1], which similarly exploits motion cues for temporal alignment. This contrasts with attention-driven methods like Multi-Modal 3D Object Detection[3] and cooperative frameworks such as V2X-ViT[10] and Practical collaborative perception[11], which prioritize learned feature interactions over explicit flow modeling. Meanwhile, event-based approaches like Ev-3DOD[21] and SpikiLi[8] offer alternative pathways by capturing continuous temporal information. AsyncBEV[0] sits within the temporal alignment cluster, sharing methodological DNA with flow-based works but distinguished by its focus on BEV-space prediction rather than image-space warping or purely attention-based fusion, addressing a practical gap in handling multi-rate sensor inputs for robust 3D detection.

Claimed Contributions

AsyncBEV module for asynchronous 3D object detection

The authors introduce AsyncBEV, a module that estimates 2D flow from BEV features of different sensor modalities while accounting for known time offsets, then uses this flow to warp and align feature maps. This module is designed to be lightweight, trainable, and easily integrated into different BEV detector architectures.

10 retrieved papers · Can Refute
Cross-modal flow alignment approach using scene flow estimation

The method draws inspiration from scene flow estimation to predict feature flow between asynchronous sensor modalities. This predicted flow is then used to spatially align feature maps from sensors with time offsets, addressing the asynchrony problem in multi-modal perception.

6 retrieved papers · Can Refute
Generic integration framework for BEV detector architectures

The authors demonstrate that their AsyncBEV module can be generically integrated into various existing BEV detector architectures, including both grid-based and token-based approaches, making it a flexible solution for handling sensor asynchrony across different detection frameworks.

10 retrieved papers
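The third claimed contribution, generic integration into grid-based and token-based detectors, amounts to placing the alignment step immediately before the detector's existing fusion stage. The sketch below illustrates that plug-in point under assumed names: `fuse_with_alignment`, `align_fn` (standing in for the AsyncBEV module), and `fuse_fn` (standing in for the host detector's fusion step, here plain channel concatenation) are all hypothetical, not APIs from the paper.

```python
import numpy as np

def fuse_with_alignment(ref_bev, lag_bev, dt, align_fn, fuse_fn):
    """Hypothetical integration point: whether the downstream detector
    fuses BEV grids (e.g., concatenation, as in grid-based models) or
    tokens (e.g., cross-attention, as in token-based models), the lagging
    modality's features are first aligned to the reference timestamp.
    Only align_fn depends on the alignment module; fuse_fn is unchanged."""
    return fuse_fn(ref_bev, align_fn(lag_bev, dt))

# Placeholder components: identity alignment, channel-wise concatenation.
cam = np.ones((4, 8, 8))
lidar = np.zeros((4, 8, 8))
fused = fuse_with_alignment(
    cam, lidar, 0.1,
    align_fn=lambda f, dt: f,
    fuse_fn=lambda a, b: np.concatenate([a, b], axis=0),
)
```

Keeping the fusion step untouched is what makes the module architecture-agnostic in this reading: only the pre-fusion feature maps are rewritten.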

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: AsyncBEV module for asynchronous 3D object detection

Contribution 2: Cross-modal flow alignment approach using scene flow estimation

Contribution 3: Generic integration framework for BEV detector architectures
