AsyncBEV: Cross-modal flow alignment in Asynchronous 3D Object Detection

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: multi-modal 3D object detection · autonomous driving · asynchronous fusion
Abstract:

In autonomous driving, multi-modal perception tasks like 3D object detection typically rely on well-synchronized sensors, both at training and inference time. However, despite the use of hardware- or software-based synchronization algorithms, perfect synchrony is rarely guaranteed: sensors may operate at different frequencies, and real-world factors such as network latency, hardware failures, or processing bottlenecks often introduce time offsets between sensors. Such asynchrony degrades perception performance, especially for dynamic objects. To address this challenge, we propose AsyncBEV, a lightweight, trainable, and generic module that improves the robustness of 3D Bird's Eye View (BEV) object detection models against sensor asynchrony. Inspired by scene flow estimation, AsyncBEV first estimates the 2D flow between the BEV features of two different sensor modalities, taking into account the known time offset between their measurements. The predicted feature flow is then used to warp and spatially align the feature maps, and we show that this step can easily be integrated into different current BEV detector architectures (e.g., BEV grid-based and token-based). Extensive experiments demonstrate that AsyncBEV improves robustness against both small and large asynchrony between LiDAR and camera sensors in both the token-based CMT and the grid-based UniBEV, especially for dynamic objects. We significantly outperform the ego-motion-compensated CMT and UniBEV baselines, notably by 16.6% and 11.9% NDS on dynamic objects in the worst-case scenario of a 0.5 s time offset. Code will be released upon acceptance.
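The core mechanism the abstract describes (predict a 2D flow in BEV space, scale it by the known time offset, and warp the lagging modality's features onto the reference timestamp) can be sketched in a few lines. This is a minimal numpy illustration under assumed conventions, not the authors' implementation: the function name `warp_bev`, the cells-per-second flow units, and the nearest-neighbor backward sampling are all simplifications. In the paper's setting the flow would come from a trainable network, and a differentiable (e.g., bilinear) sampler would replace the rounding here.

```python
import numpy as np

def warp_bev(feat, flow, dt):
    """Backward-warp a BEV feature map by a per-cell 2D flow scaled by dt.

    feat: (C, H, W) BEV features of the lagging sensor.
    flow: (2, H, W) predicted motion in cells per second (x, y order).
    dt:   time offset in seconds between the two sensor measurements.

    Nearest-neighbor sampling keeps the sketch short; a trainable module
    would use differentiable bilinear sampling instead.
    """
    c, h, w = feat.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Each output cell reads from the source location displaced by -flow * dt.
    src_y = np.clip(np.rint(ys - flow[1] * dt), 0, h - 1).astype(int)
    src_x = np.clip(np.rint(xs - flow[0] * dt), 0, w - 1).astype(int)
    return feat[:, src_y, src_x]

# A point feature moving at 2 cells/s along x, observed with a 0.5 s lag,
# is shifted by one cell to its reference-time position.
feat = np.zeros((1, 8, 8))
feat[0, 5, 5] = 1.0
flow = np.zeros((2, 8, 8))
flow[0] = 2.0  # constant x-motion everywhere
aligned = warp_bev(feat, flow, 0.5)
```

Because the offset enters the warp multiplicatively, the same flow head can in principle compensate both small and large asynchrony, which matches the abstract's claim of robustness across offset magnitudes.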

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes AsyncBEV, a module that predicts 2D flow in bird's-eye-view feature space to align asynchronous LiDAR-camera data for 3D object detection. It resides in the 'BEV Feature Flow Prediction for Sensor Asynchrony' leaf, which contains only two papers including this one. This represents a relatively sparse research direction within the broader taxonomy of asynchronous multi-modal detection, suggesting the specific approach of BEV-space flow prediction for temporal alignment is not yet heavily explored compared to attention-based or cooperative perception methods.

The taxonomy reveals neighboring directions that address asynchrony differently. The sibling leaf 'Vehicle-Infrastructure Flow-Based Cooperative Fusion' applies flow prediction in V2I scenarios rather than single-vehicle settings. Adjacent branches include 'Attention-Based Multi-Modal and Cooperative Fusion' with temporal attention mechanisms, and 'Calibration-Robust and Geometry-Aware Fusion' emphasizing geometric constraints. AsyncBEV diverges from these by focusing on explicit BEV-space flow modeling for single-vehicle sensor synchronization, rather than attention-driven fusion or multi-agent cooperation, occupying a distinct methodological niche.

Among the twenty-six candidates examined, the AsyncBEV module contribution has one refutable candidate out of ten examined, and the cross-modal flow alignment approach has two out of six. The generic integration framework contribution appears more novel, with zero refutable candidates among ten examined. The limited search scope means these statistics reflect top-K semantic matches and citation expansion, not exhaustive coverage. The flow-based contributions face more substantial prior-work overlap, while the architectural integration aspect appears less contested within the examined literature.

Based on the limited search scope of twenty-six candidates, the work appears to occupy a sparsely populated research direction with modest prior work overlap. The taxonomy structure suggests flow-based temporal alignment in BEV space is less crowded than attention-based or cooperative approaches. However, the analysis cannot confirm novelty beyond the examined candidate set, and the presence of refutable candidates for core contributions indicates meaningful prior work exists in this specific methodological space.

Taxonomy

Core-task Taxonomy Papers: 30
Claimed Contributions: 3
Contribution Candidate Papers Compared: 26
Refutable Papers: 3

Research Landscape Overview

Core task: Asynchronous multi-modal 3D object detection in autonomous driving. The field addresses the challenge of fusing data from sensors (cameras, LiDAR, radar, event cameras) that operate at different frequencies and with varying latencies, requiring methods to align temporal mismatches while preserving spatial accuracy. The taxonomy reveals several complementary research directions: Temporal Alignment and Flow-Based Fusion Methods focus on predicting intermediate representations to bridge asynchronous gaps, often using optical or scene flow in bird's-eye view (BEV) space. Attention-Based Multi-Modal and Cooperative Fusion emphasizes learning-driven alignment through cross-modal attention mechanisms. Calibration-Robust and Geometry-Aware Fusion tackles extrinsic parameter uncertainties and geometric consistency across modalities. Event-Based Asynchronous Perception leverages neuromorphic sensors for high-temporal-resolution inputs. Multi-Agent Cooperative Perception Systems extend fusion to vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) scenarios, where communication delays and pose uncertainties compound asynchrony challenges. Specialized Sensor Fusion Architectures and Foundational Methods provide system-level integration strategies, while Survey and Review Literature synthesizes progress across these branches.

Recent work has intensified around flow-based temporal compensation and cooperative perception under real-world constraints. AsyncBEV[0] exemplifies the flow prediction approach by forecasting BEV features to synchronize asynchronous sensor streams, closely related to Velocity driven vision[1], which similarly exploits motion cues for temporal alignment. This contrasts with attention-driven methods like Multi-Modal 3D Object Detection[3] and cooperative frameworks such as V2X-ViT[10] and Practical collaborative perception[11], which prioritize learned feature interactions over explicit flow modeling. Meanwhile, event-based approaches like Ev-3DOD[21] and SpikiLi[8] offer alternative pathways by capturing continuous temporal information. AsyncBEV[0] sits within the temporal alignment cluster, sharing methodological DNA with flow-based works but distinguished by its focus on BEV-space prediction rather than image-space warping or purely attention-based fusion, addressing a practical gap in handling multi-rate sensor inputs for robust 3D detection.

Claimed Contributions

AsyncBEV module for asynchronous 3D object detection

The authors introduce AsyncBEV, a module that estimates 2D flow from BEV features of different sensor modalities while accounting for known time offsets, then uses this flow to warp and align feature maps. This module is designed to be lightweight, trainable, and easily integrated into different BEV detector architectures.

10 retrieved papers · Can Refute
Cross-modal flow alignment approach using scene flow estimation

The method draws inspiration from scene flow estimation to predict feature flow between asynchronous sensor modalities. This predicted flow is then used to spatially align feature maps from sensors with time offsets, addressing the asynchrony problem in multi-modal perception.

6 retrieved papers · Can Refute
Generic integration framework for BEV detector architectures

The authors demonstrate that their AsyncBEV module can be generically integrated into various existing BEV detector architectures, including both grid-based and token-based approaches, making it a flexible solution for handling sensor asynchrony across different detection frameworks.

10 retrieved papers
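The third claimed contribution, generic integration into grid-based and token-based detectors, amounts to placing the alignment step immediately before the detector's existing fusion stage. The sketch below illustrates that plug-in point under assumed names: `fuse_with_alignment`, `align_fn` (standing in for the AsyncBEV module), and `fuse_fn` (standing in for the host detector's fusion step, here plain channel concatenation) are all hypothetical, not APIs from the paper.

```python
import numpy as np

def fuse_with_alignment(ref_bev, lag_bev, dt, align_fn, fuse_fn):
    """Hypothetical integration point: whether the downstream detector
    fuses BEV grids (e.g., concatenation, as in grid-based models) or
    tokens (e.g., cross-attention, as in token-based models), the lagging
    modality's features are first aligned to the reference timestamp.
    Only align_fn depends on the alignment module; fuse_fn is unchanged."""
    return fuse_fn(ref_bev, align_fn(lag_bev, dt))

# Placeholder components: identity alignment, channel-wise concatenation.
cam = np.ones((4, 8, 8))
lidar = np.zeros((4, 8, 8))
fused = fuse_with_alignment(
    cam, lidar, 0.1,
    align_fn=lambda f, dt: f,
    fuse_fn=lambda a, b: np.concatenate([a, b], axis=0),
)
```

Keeping the fusion step untouched is what makes the module architecture-agnostic in this reading: only the pre-fusion feature maps are rewritten.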

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: AsyncBEV module for asynchronous 3D object detection

Contribution 2: Cross-modal flow alignment approach using scene flow estimation

Contribution 3: Generic integration framework for BEV detector architectures
