SiMO: Single-Modality-Operable Multimodal Collaborative Perception

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: collaborative perception, multimodal, modal failure, modality competition
Abstract:

Collaborative perception integrates multi-agent perspectives to extend the sensing range and overcome occlusion. While existing multimodal approaches leverage complementary sensors to improve performance, they are highly prone to failure, especially when a key sensor such as LiDAR becomes unavailable. The root cause is that feature fusion creates semantic mismatches between single-modality features and the downstream modules. This paper addresses this challenge for the first time in the field of collaborative perception, introducing Single-Modality-Operable Multimodal Collaborative Perception (SiMO). By adopting the proposed Length-Adaptive Multi-Modal Fusion (LAMMA), SiMO can adaptively handle the remaining modal features during modal failures while maintaining consistency of the semantic space. Additionally, leveraging the innovative "Pretrain-Align-Fuse-RD" training strategy, SiMO addresses modality competition, an issue generally overlooked by existing methods, and ensures the independence of each individual modality branch. Experiments demonstrate that SiMO effectively aligns multimodal features while preserving modality-specific features, enabling it to maintain optimal performance across all individual modalities.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces SiMO, a framework enabling multimodal collaborative perception to operate effectively when reduced to single-modality input during sensor failures. It resides in the 'Adaptive Fusion with Degradation Awareness' leaf, which contains four papers total. This leaf sits within the broader 'Sensor Fusion Architectures for Robustness' branch, indicating a moderately populated research direction focused on maintaining perception performance under sensor degradation. The taxonomy reveals this is an active but not overcrowded area, with sibling papers addressing related robustness challenges through different mechanisms.

The taxonomy structure shows neighboring leaves addressing complementary approaches: 'Cross-Modal Translation and Reconstruction' (three papers) focuses on synthesizing missing modalities, while 'Unified Canonical Space Fusion' (two papers) projects features into shared representations. The parent branch 'Sensor Fusion Architectures for Robustness' excludes multi-agent collaborative systems, yet SiMO explicitly targets collaborative perception scenarios. This positions the work at the intersection of two major branches—collaborative perception and sensor fusion robustness—suggesting it bridges a gap between typically separate research directions within the field's organizational structure.

Among thirty candidates examined, the SiMO contribution shows one refutable candidate from ten examined, while LAMMA shows zero from ten, and PAFR shows one from ten. The statistics indicate that the core framework concept (SiMO) and training strategy (PAFR) each encounter at least one prior work with overlapping ideas within the limited search scope, whereas the specific fusion mechanism (LAMMA) appears more distinctive among examined candidates. The modest search scale (thirty total candidates) means these findings reflect top semantic matches rather than exhaustive coverage, leaving open questions about less-cited or domain-specific prior work.

Based on the limited literature search, the work appears to occupy a meaningful position bridging collaborative perception and modality-robust fusion, though the search scope constrains definitive novelty assessment. The taxonomy reveals this intersection is relatively underexplored compared to either branch independently. The contribution-level statistics suggest the fusion mechanism may be the most distinctive element, while the overall framework and training approach show some overlap with examined prior work, though the specific combination in collaborative settings remains less explored within the analyzed candidate set.

Taxonomy

Core-task Taxonomy Papers: 26
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 2

Research Landscape Overview

Core task: Multimodal collaborative perception with single-modality operation under sensor failures. The field addresses how autonomous systems can maintain robust perception when sensors fail or degrade, particularly in collaborative multi-agent settings.

The taxonomy organizes work into four main branches: Collaborative Multi-Agent Perception Systems explore how multiple agents share and fuse information (e.g., Multi-UAV Perception[1], BM2CP[3], Cooperative Perception Survey[5]); Sensor Fusion Architectures for Robustness develop fusion mechanisms that gracefully handle missing or degraded modalities (e.g., Robust Multi-Sensor Fusion[6], Degradation-Aware SLAM[7]); Training Strategies for Modality Robustness focus on learning approaches that prepare models for sensor failures (e.g., Unsupervised Sensor Failures[12], Missing-Modality Handling[16]); and Application-Specific Fusion Systems tailor solutions to domains like autonomous driving (Multi-Sensor Autonomous Driving[8]) or robotics (Soft Robot Perception[15], Collaborative Sorting Arms[2]). These branches reflect a shared concern: ensuring that perception pipelines remain functional despite the inherent unreliability of real-world sensors.

A particularly active line of work centers on adaptive fusion mechanisms that dynamically adjust to sensor quality, contrasting with static fusion schemes that assume all modalities are always available. Within this landscape, SiMO[0] sits alongside methods like Robust Multi-Sensor Fusion[6] and Degradation-Aware SLAM[7] in the Adaptive Fusion with Degradation Awareness cluster. While Degradation-Aware SLAM[7] emphasizes real-time quality assessment for SLAM tasks and Robust Multi-Sensor Fusion[6] focuses on general-purpose robustness, SiMO[0] specifically targets collaborative perception scenarios where agents must operate effectively even when reduced to single-modality input.
This emphasis on collaborative settings distinguishes it from purely single-agent approaches like Missing-Modality Inference[24] or Fusion Weight Regularization[26], which address modality dropout but not the multi-agent coordination challenge. The interplay between collaboration and robustness remains an open question, as systems must balance communication overhead with the benefits of shared perception under uncertain sensor conditions.

Claimed Contributions

Single-Modality-Operable Multimodal Collaborative Perception (SiMO)

SiMO is a novel framework that enables multimodal collaborative perception systems to maintain functionality when individual sensors (especially LiDAR) fail. Unlike existing methods that collapse during modal failures, SiMO allows the system to operate with any available modality by maintaining semantic consistency across single-modal and multimodal features.

10 retrieved papers; verdict: Can Refute
Length-Adaptive Multi-Modal Fusion (LAMMA)

LAMMA is a plug-and-play fusion module that adaptively handles varying numbers of modal features during sensor failures. It structurally ensures consistent feature processing across modalities and preserves semantic alignment before and after fusion through attention-based mechanisms that degrade gracefully to self-attention when modalities are missing.

10 retrieved papers
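The "degrades gracefully to self-attention" property described above can be made concrete with a minimal sketch. The following is an illustrative NumPy implementation of length-adaptive attention fusion, not the paper's actual LAMMA code (the function name and pooling choice are assumptions): the available modality features are stacked as a variable-length token set, attention runs over whatever tokens are present, and with a single surviving modality the attention matrix collapses to 1, so the output is exactly that modality's feature.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_modalities(feats):
    """Illustrative length-adaptive fusion (NOT the official LAMMA code).

    feats: list of per-modality feature vectors, each of shape (d,).
    Any non-empty subset of modalities may be present at inference time.
    """
    X = np.stack(feats)                 # (M, d), M = number of available modalities
    d = X.shape[1]
    A = softmax(X @ X.T / np.sqrt(d))   # (M, M) self-attention over modality tokens
    fused = A @ X                       # attended modality tokens
    return fused.mean(axis=0)           # (d,) output, shape-invariant in M
```

Because the output shape does not depend on the number of input modalities, a downstream head trained on the fused feature can consume single-modality input without any architectural change, which is the structural consistency the contribution describes.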
Pretrain-Align-Fuse-RD (PAFR) training strategy

PAFR is a multi-stage training strategy that addresses modality competition by independently pre-training each modality branch before fusion. This approach ensures balanced multimodal learning and preserves the independent functionality of each modality, avoiding the imbalanced training that occurs in naive joint learning.

10 retrieved papers; verdict: Can Refute
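The staged structure of the strategy can be sketched as a training schedule that unfreezes only one parameter group per stage. This is a control-flow illustration only: the stage names come from the report, but the parameter-group names, the concrete losses, and the meaning of the "RD" stage are not spelled out there, so everything below beyond the stage ordering is an assumption.

```python
# Hedged sketch of a Pretrain-Align-Fuse-RD-style schedule (not the paper's
# actual recipe). Each stage updates only its listed parameter groups, so the
# modality branches pretrained in the first stages stay intact later on,
# which is how staged training sidesteps modality competition.

STAGES = [
    ("pretrain_lidar",  {"lidar_branch"}),               # branch trained alone
    ("pretrain_camera", {"camera_branch"}),              # branch trained alone
    ("align",           {"alignment_head"}),             # branches frozen
    ("fuse",            {"fusion_module"}),              # branches stay frozen
    ("rd",              {"fusion_module", "alignment_head"}),  # "RD": unexpanded in report
]

ALL_GROUPS = {"lidar_branch", "camera_branch", "alignment_head", "fusion_module"}

def trainable_groups(stage_name):
    """Return the parameter groups that receive gradients in a given stage."""
    for name, groups in STAGES:
        if name == stage_name:
            return groups
    raise KeyError(stage_name)

def run_schedule():
    """Walk the stages, recording (stage, trainable, frozen) per phase."""
    log = []
    for name, groups in STAGES:
        frozen = ALL_GROUPS - groups
        log.append((name, sorted(groups), sorted(frozen)))
        # ...one training phase would run here, updating only `groups`...
    return log
```

The key invariant is that the per-modality branches are never updated after their own pretraining stage, so each branch remains independently functional, which is the property the contribution claims.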

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
