GT-Space: Enhancing Heterogeneous Collaborative Perception with Ground Truth Feature Space

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: collaborative perception, multi-modality, multi-agent, sensor fusion
Abstract:

In autonomous driving, multi-agent collaborative perception enhances sensing capabilities by enabling agents to share perceptual data. A key challenge lies in handling heterogeneous features from agents equipped with different sensing modalities or model architectures, which complicates data fusion. Existing approaches often require retraining encoders or designing interpreter modules for pairwise feature alignment, but these solutions are not scalable in practice. To address this, we propose GT-Space, a flexible and scalable collaborative perception framework for heterogeneous agents. GT-Space constructs a common feature space from ground-truth labels, providing a unified reference for feature alignment. With this shared space, agents only need a single adapter module to project their features, eliminating the need for pairwise interactions with other agents. Furthermore, we design a fusion network trained with contrastive losses across diverse modality combinations. Extensive experiments on simulation datasets (OPV2V and V2XSet) and a real-world dataset (RCooper) demonstrate that GT-Space consistently outperforms baselines in detection accuracy while delivering robust performance.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

GT-Space proposes a framework for heterogeneous collaborative perception that constructs a common feature space from ground-truth labels, enabling agents with different modalities or architectures to align features via single adapter modules. The paper resides in the Heterogeneous Multi-Modal Fusion Frameworks leaf, which contains six papers addressing feature alignment and fusion across diverse collaborative agents. This leaf sits within the broader Collaborative Perception Frameworks and Architectures branch, indicating a moderately populated research direction focused on reconciling sensor and model heterogeneity in multi-agent systems.

The taxonomy reveals neighboring leaves addressing homogeneous collaborative perception (four papers) and unified multi-agent platforms (two papers), suggesting the field distinguishes between heterogeneous and homogeneous settings. Within the heterogeneous fusion leaf, sibling works like HM-ViT and Hetecooper tackle similar alignment challenges but differ in their fusion mechanisms. The broader Frameworks branch excludes communication optimization and sensor deployment, which are addressed in separate sibling categories, clarifying that GT-Space focuses specifically on feature-level fusion rather than bandwidth management or infrastructure placement.

Among the thirty candidates examined (ten per contribution), the GT-Space framework contribution has one refutable candidate, suggesting some prior work addresses similar heterogeneous fusion architectures. For the ground-truth-derived common feature space, none of the ten candidates clearly refute it, indicating this specific alignment strategy may be less explored. The combinatorial contrastive loss training strategy likewise shows no refutations among its ten candidates. These statistics reflect a limited semantic-search scope rather than exhaustive coverage; the framework-level contribution appears more connected to existing work than the specific technical mechanisms.

Based on top-thirty semantic matches, GT-Space appears to occupy a moderately crowded research area where heterogeneous fusion is actively studied, but its specific approach using ground-truth-derived feature spaces and combinatorial contrastive training shows less direct overlap with examined prior work. The analysis does not cover broader fusion literature outside collaborative perception or recent preprints beyond the search scope.

Taxonomy

- Core-task taxonomy papers: 50
- Claimed contributions: 3
- Contribution candidate papers compared: 30
- Refutable papers: 1

Research Landscape Overview

Core task: Heterogeneous collaborative perception for autonomous driving. The field organizes around several major branches that address complementary challenges in enabling vehicles and infrastructure to share and fuse sensory information. Collaborative Perception Frameworks and Architectures develop methods for integrating data from diverse agents and modalities, with works like HM-ViT[13] and Hetecooper[29] tackling heterogeneous multi-modal fusion. Communication and Resource Optimization focuses on bandwidth-efficient sharing strategies, while Sensor Deployment and Coverage Enhancement examines optimal placement of sensing assets. Datasets and Benchmarks such as V2X-Sim[1] and OPV2V[21] provide evaluation testbeds, and Federated and Distributed Learning branches explore privacy-preserving cooperative training. Single-Vehicle Multi-Sensor Fusion addresses on-board integration, Cooperative Vehicle-Infrastructure Systems target real-world deployment scenarios, and Surveys and Reviews synthesize progress across these areas. Incentive Mechanisms and Multi-Agent Learning consider the strategic and economic dimensions of collaboration.

A particularly active line of work centers on heterogeneous multi-modal fusion frameworks, where the challenge is to reconcile different sensor types, viewpoints, and agent capabilities. GT-Space[0] sits within this cluster, constructing a ground-truth-derived common feature space to handle heterogeneity. Nearby efforts like Heterogeneous Multiscale Cooperative[2] and Hecofuse[14] similarly address multi-scale or multi-modal integration, while Polymorphic Feature Interpreter[35] explores adaptive feature representations. Compared to these neighbors, GT-Space[0] appears to prioritize alignment against an explicit shared reference space over purely pairwise learned fusion strategies.
Broader tensions in the field include trade-offs between communication overhead and perception accuracy, the gap between simulation benchmarks and real-world deployment, and the challenge of designing frameworks that generalize across diverse agent configurations and environmental conditions.

Claimed Contributions

GT-Space framework for heterogeneous collaborative perception

The authors introduce GT-Space, a collaborative perception framework that constructs a common feature space from ground-truth labels to enable heterogeneous agents with different sensing modalities or model architectures to align their features. Each agent requires only a single adapter module to project features into this shared space, eliminating the need for pairwise interactions.
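The single-adapter design can be sketched as follows. This is a minimal NumPy illustration, assuming hypothetical channel sizes and a 1x1-convolution-style adapter; the paper's actual adapter architecture is not specified in this report:

```python
import numpy as np

class Adapter:
    """Projects agent-specific BEV features into the common feature space.

    Written as a per-cell linear map (the NumPy equivalent of a 1x1
    convolution); in practice the weights would be learned.
    """
    def __init__(self, c_in: int, c_common: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((c_in, c_common)) * 0.02
        self.b = np.zeros(c_common)

    def __call__(self, feat: np.ndarray) -> np.ndarray:
        # feat: (H, W, c_in) BEV feature map -> (H, W, c_common)
        return feat @ self.W + self.b

# One adapter per agent type suffices: N agent types need N adapters,
# not O(N^2) pairwise interpreter modules.
lidar_adapter = Adapter(c_in=64, c_common=128)
camera_adapter = Adapter(c_in=256, c_common=128)

lidar_feat = np.zeros((100, 100, 64))    # hypothetical LiDAR BEV features
camera_feat = np.zeros((100, 100, 256))  # hypothetical camera BEV features

# Both projections land in the same shared space and can be fused directly.
common_lidar = lidar_adapter(lidar_feat)
common_camera = camera_adapter(camera_feat)
```

The key scaling property is that adding a new agent type requires training only its one adapter against the fixed common space, rather than one interpreter per existing agent type.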

Retrieved papers compared: 10 (1 can refute)
Ground-truth derived common feature space

The method explicitly constructs a common feature space by encoding ground-truth object information (locations, sizes, and properties) into bird's-eye view features. This space provides a shared, accurate reference for aligning heterogeneous features and offers strong intermediate supervision signals beyond final detection outputs.
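A minimal sketch of how ground-truth boxes might be encoded into a BEV reference map. The grid resolution, channel layout, and center-cell-only encoding below are illustrative assumptions; the actual GT-Space encoding likely rasterizes full box footprints:

```python
import numpy as np

def gt_to_bev(boxes, grid_size=(100, 100), cell_m=0.5, channels=3):
    """Encode ground-truth boxes (x, y in metres, length, width, heading)
    into a BEV map that can serve as a shared reference feature space."""
    H, W = grid_size
    bev = np.zeros((H, W, channels))
    for x, y, length, width, yaw in boxes:
        # Map the metric box centre to grid indices (ego at grid centre).
        i = int(y / cell_m) + H // 2
        j = int(x / cell_m) + W // 2
        if 0 <= i < H and 0 <= j < W:
            # Write size and heading as channel values at the centre cell.
            bev[i, j] = (length, width, yaw)
    return bev

# One hypothetical box: centre (4.0 m, 2.0 m), 4.5 m x 1.8 m, yaw 0.3 rad.
boxes = [(4.0, 2.0, 4.5, 1.8, 0.3)]
ref = gt_to_bev(boxes)
```

Because this map is derived directly from labels, it can supervise intermediate features (e.g. via alignment losses against adapter outputs), not just the final detection head.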

Retrieved papers compared: 10 (none refute)
Combinatorial contrastive loss training strategy

The authors propose training the fusion network using combinatorial contrastive losses computed over all possible modality pairs. This strategy enables the model to effectively fuse any combination of input modalities at inference time and enhances the model's ability to capture object-relevant information across different modalities.
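The combinatorial strategy can be illustrated with a one-directional InfoNCE loss summed over every modality pair. The modality names, embedding sizes, and temperature below are illustrative assumptions, not the paper's settings:

```python
import itertools
import numpy as np

def info_nce(za, zb, temperature=0.1):
    """One-directional InfoNCE over two sets of N embeddings (N, D):
    row i of za should match row i of zb and mismatch all other rows."""
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / temperature  # (N, N) cosine-similarity logits
    # Row-wise log-softmax; the diagonal holds the positive pairs.
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(logp))

def combinatorial_contrastive_loss(modality_feats, temperature=0.1):
    """Sum the pairwise contrastive loss over every modality pair, so the
    fusion network is trained on all combinations it may see at inference."""
    total = 0.0
    for za, zb in itertools.combinations(modality_feats.values(), 2):
        total += info_nce(za, zb, temperature)
    return total

rng = np.random.default_rng(0)
feats = {
    "lidar": rng.standard_normal((8, 32)),
    "camera": rng.standard_normal((8, 32)),
    "radar": rng.standard_normal((8, 32)),
}
# Three modalities yield three pairs: (lidar, camera), (lidar, radar),
# (camera, radar).
loss = combinatorial_contrastive_loss(feats)
```

Training over all pairs, rather than a fixed fusion order, is what lets the fused model accept whichever subset of modalities happens to be present at inference time.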

Retrieved papers compared: 10 (none refute)

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

