SiMO: Single-Modality-Operable Multimodal Collaborative Perception

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: collaborative perception, multimodal, modal failure, modality competition
Abstract:

Collaborative perception integrates multi-agent perspectives to extend the sensing range and overcome occlusion. While existing multimodal approaches leverage complementary sensors to improve performance, they are highly prone to failure, especially when a key sensor such as LiDAR becomes unavailable. The root cause is that feature fusion creates semantic mismatches between single-modality features and the downstream modules. This paper addresses this challenge for the first time in the field of collaborative perception, introducing Single-Modality-Operable Multimodal Collaborative Perception (SiMO). By adopting the proposed Length-Adaptive Multi-Modal Fusion (LAMMA), SiMO can adaptively handle the remaining modal features during modal failures while maintaining consistency of the semantic space. Additionally, leveraging the innovative "Pretrain-Align-Fuse-RD" training strategy, SiMO addresses modality competition, an issue generally overlooked by existing methods, and ensures the independence of each individual modality branch. Experiments demonstrate that SiMO effectively aligns multimodal features while preserving modality-specific features, enabling it to maintain optimal performance across all individual modalities.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces SiMO, a framework enabling multimodal collaborative perception to operate effectively when reduced to single-modality input during sensor failures. It resides in the 'Adaptive Fusion with Degradation Awareness' leaf, which contains four papers total. This leaf sits within the broader 'Sensor Fusion Architectures for Robustness' branch, indicating a moderately populated research direction focused on maintaining perception performance under sensor degradation. The taxonomy reveals this is an active but not overcrowded area, with sibling papers addressing related robustness challenges through different mechanisms.

The taxonomy structure shows neighboring leaves addressing complementary approaches: 'Cross-Modal Translation and Reconstruction' (three papers) focuses on synthesizing missing modalities, while 'Unified Canonical Space Fusion' (two papers) projects features into shared representations. The parent branch 'Sensor Fusion Architectures for Robustness' excludes multi-agent collaborative systems, yet SiMO explicitly targets collaborative perception scenarios. This positions the work at the intersection of two major branches—collaborative perception and sensor fusion robustness—suggesting it bridges a gap between typically separate research directions within the field's organizational structure.

Among thirty candidates examined, the SiMO contribution shows one refutable candidate from ten examined, while LAMMA shows zero from ten, and PAFR shows one from ten. The statistics indicate that the core framework concept (SiMO) and training strategy (PAFR) each encounter at least one prior work with overlapping ideas within the limited search scope, whereas the specific fusion mechanism (LAMMA) appears more distinctive among examined candidates. The modest search scale (thirty total candidates) means these findings reflect top semantic matches rather than exhaustive coverage, leaving open questions about less-cited or domain-specific prior work.

Based on the limited literature search, the work appears to occupy a meaningful position bridging collaborative perception and modality-robust fusion, though the search scope constrains definitive novelty assessment. The taxonomy reveals this intersection is relatively underexplored compared to either branch independently. The contribution-level statistics suggest the fusion mechanism may be the most distinctive element, while the overall framework and training approach show some overlap with examined prior work, though the specific combination in collaborative settings remains less explored within the analyzed candidate set.

Taxonomy

Core-task Taxonomy Papers: 26
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 2

Research Landscape Overview

Core task: Multimodal collaborative perception with single-modality operation under sensor failures. The field addresses how autonomous systems can maintain robust perception when sensors fail or degrade, particularly in collaborative multi-agent settings.

The taxonomy organizes work into four main branches: Collaborative Multi-Agent Perception Systems explore how multiple agents share and fuse information (e.g., Multi-UAV Perception[1], BM2CP[3], Cooperative Perception Survey[5]); Sensor Fusion Architectures for Robustness develop fusion mechanisms that gracefully handle missing or degraded modalities (e.g., Robust Multi-Sensor Fusion[6], Degradation-Aware SLAM[7]); Training Strategies for Modality Robustness focus on learning approaches that prepare models for sensor failures (e.g., Unsupervised Sensor Failures[12], Missing-Modality Handling[16]); and Application-Specific Fusion Systems tailor solutions to domains like autonomous driving (Multi-Sensor Autonomous Driving[8]) or robotics (Soft Robot Perception[15], Collaborative Sorting Arms[2]). These branches reflect a shared concern: ensuring that perception pipelines remain functional despite the inherent unreliability of real-world sensors.

A particularly active line of work centers on adaptive fusion mechanisms that dynamically adjust to sensor quality, contrasting with static fusion schemes that assume all modalities are always available. Within this landscape, SiMO[0] sits alongside methods like Robust Multi-Sensor Fusion[6] and Degradation-Aware SLAM[7] in the Adaptive Fusion with Degradation Awareness cluster. While Degradation-Aware SLAM[7] emphasizes real-time quality assessment for SLAM tasks and Robust Multi-Sensor Fusion[6] focuses on general-purpose robustness, SiMO[0] specifically targets collaborative perception scenarios where agents must operate effectively even when reduced to single-modality input.
This emphasis on collaborative settings distinguishes it from purely single-agent approaches like Missing-Modality Inference[24] or Fusion Weight Regularization[26], which address modality dropout but not the multi-agent coordination challenge. The interplay between collaboration and robustness remains an open question, as systems must balance communication overhead with the benefits of shared perception under uncertain sensor conditions.

Claimed Contributions

Single-Modality-Operable Multimodal Collaborative Perception (SiMO)

SiMO is a novel framework that enables multimodal collaborative perception systems to maintain functionality when individual sensors (especially LiDAR) fail. Unlike existing methods that collapse during modal failures, SiMO allows the system to operate with any available modality by maintaining semantic consistency across single-modal and multimodal features.

10 retrieved papers; verdict: Can Refute
Length-Adaptive Multi-Modal Fusion (LAMMA)

LAMMA is a plug-and-play fusion module that adaptively handles varying numbers of modal features during sensor failures. It structurally ensures consistent feature processing across modalities and preserves semantic alignment before and after fusion through attention-based mechanisms that degrade gracefully to self-attention when modalities are missing.

10 retrieved papers
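The "degrades gracefully to self-attention" property described above can be made concrete with a minimal sketch. The following is an illustrative NumPy implementation of length-adaptive attention fusion, not the paper's actual LAMMA code (the function name and pooling choice are assumptions): the available modality features are stacked as a variable-length token set, attention runs over whatever tokens are present, and with a single surviving modality the attention matrix collapses to 1, so the output is exactly that modality's feature.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_modalities(feats):
    """Illustrative length-adaptive fusion (NOT the official LAMMA code).

    feats: list of per-modality feature vectors, each of shape (d,).
    Any non-empty subset of modalities may be present at inference time.
    """
    X = np.stack(feats)                 # (M, d), M = number of available modalities
    d = X.shape[1]
    A = softmax(X @ X.T / np.sqrt(d))   # (M, M) self-attention over modality tokens
    fused = A @ X                       # attended modality tokens
    return fused.mean(axis=0)           # (d,) output, shape-invariant in M
```

Because the output shape does not depend on the number of input modalities, a downstream head trained on the fused feature can consume single-modality input without any architectural change, which is the structural consistency the contribution describes.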
Pretrain-Align-Fuse-RD (PAFR) training strategy

PAFR is a multi-stage training strategy that addresses modality competition by independently pre-training each modality branch before fusion. This approach ensures balanced multimodal learning and preserves the independent functionality of each modality, avoiding the imbalanced training that occurs in naive joint learning.

10 retrieved papers; verdict: Can Refute
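The staged structure of the strategy can be sketched as a training schedule that unfreezes only one parameter group per stage. This is a control-flow illustration only: the stage names come from the report, but the parameter-group names, the concrete losses, and the meaning of the "RD" stage are not spelled out there, so everything below beyond the stage ordering is an assumption.

```python
# Hedged sketch of a Pretrain-Align-Fuse-RD-style schedule (not the paper's
# actual recipe). Each stage updates only its listed parameter groups, so the
# modality branches pretrained in the first stages stay intact later on,
# which is how staged training sidesteps modality competition.

STAGES = [
    ("pretrain_lidar",  {"lidar_branch"}),               # branch trained alone
    ("pretrain_camera", {"camera_branch"}),              # branch trained alone
    ("align",           {"alignment_head"}),             # branches frozen
    ("fuse",            {"fusion_module"}),              # branches stay frozen
    ("rd",              {"fusion_module", "alignment_head"}),  # "RD": unexpanded in report
]

ALL_GROUPS = {"lidar_branch", "camera_branch", "alignment_head", "fusion_module"}

def trainable_groups(stage_name):
    """Return the parameter groups that receive gradients in a given stage."""
    for name, groups in STAGES:
        if name == stage_name:
            return groups
    raise KeyError(stage_name)

def run_schedule():
    """Walk the stages, recording (stage, trainable, frozen) per phase."""
    log = []
    for name, groups in STAGES:
        frozen = ALL_GROUPS - groups
        log.append((name, sorted(groups), sorted(frozen)))
        # ...one training phase would run here, updating only `groups`...
    return log
```

The key invariant is that the per-modality branches are never updated after their own pretraining stage, so each branch remains independently functional, which is the property the contribution claims.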

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
