GT-Space: Enhancing Heterogeneous Collaborative Perception with Ground Truth Feature Space

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: collaborative perception, multi-modality, multi-agent, sensor fusion
Abstract:

In autonomous driving, multi-agent collaborative perception enhances sensing capabilities by enabling agents to share perceptual data. A key challenge lies in handling heterogeneous features from agents equipped with different sensing modalities or model architectures, which complicates data fusion. Existing approaches often require retraining encoders or designing interpreter modules for pairwise feature alignment, but these solutions are not scalable in practice. To address this, we propose GT-Space, a flexible and scalable collaborative perception framework for heterogeneous agents. GT-Space constructs a common feature space from ground-truth labels, providing a unified reference for feature alignment. With this shared space, agents only need a single adapter module to project their features, eliminating the need for pairwise interactions with other agents. Furthermore, we design a fusion network trained with contrastive losses across diverse modality combinations. Extensive experiments on simulation datasets (OPV2V and V2XSet) and a real-world dataset (RCooper) demonstrate that GT-Space consistently outperforms baselines in detection accuracy while delivering robust performance.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

GT-Space proposes a framework for heterogeneous collaborative perception that constructs a common feature space from ground-truth labels, enabling agents with different modalities or architectures to align features via single adapter modules. The paper resides in the Heterogeneous Multi-Modal Fusion Frameworks leaf, which contains six papers addressing feature alignment and fusion across diverse collaborative agents. This leaf sits within the broader Collaborative Perception Frameworks and Architectures branch, indicating a moderately populated research direction focused on reconciling sensor and model heterogeneity in multi-agent systems.

The taxonomy reveals neighboring leaves addressing homogeneous collaborative perception (four papers) and unified multi-agent platforms (two papers), suggesting the field distinguishes between heterogeneous and homogeneous settings. Within the heterogeneous fusion leaf, sibling works like HM-ViT and Hetecooper tackle similar alignment challenges but differ in their fusion mechanisms. The broader Frameworks branch excludes communication optimization and sensor deployment, which are addressed in separate sibling categories, clarifying that GT-Space focuses specifically on feature-level fusion rather than bandwidth management or infrastructure placement.

Among the thirty candidates examined (ten per contribution), the GT-Space framework contribution has one refutable candidate, suggesting some prior work addresses similar heterogeneous fusion architectures. For the ground-truth-derived common feature space, none of the ten candidates clearly refute it, indicating this specific alignment strategy may be less explored. The combinatorial contrastive loss training strategy likewise shows no refutations among its ten candidates. These statistics reflect a limited semantic-search scope rather than exhaustive coverage; the framework-level contribution appears more connected to existing work than the specific technical mechanisms.

Based on top-thirty semantic matches, GT-Space appears to occupy a moderately crowded research area where heterogeneous fusion is actively studied, but its specific approach using ground-truth-derived feature spaces and combinatorial contrastive training shows less direct overlap with examined prior work. The analysis does not cover broader fusion literature outside collaborative perception or recent preprints beyond the search scope.

Taxonomy

- Core-task taxonomy papers: 50
- Claimed contributions: 3
- Contribution candidate papers compared: 30
- Refutable papers: 1

Research Landscape Overview

Core task: Heterogeneous collaborative perception for autonomous driving. The field organizes around several major branches that address complementary challenges in enabling vehicles and infrastructure to share and fuse sensory information. Collaborative Perception Frameworks and Architectures develop methods for integrating data from diverse agents and modalities, with works like HM-ViT[13] and Hetecooper[29] tackling heterogeneous multi-modal fusion. Communication and Resource Optimization focuses on bandwidth-efficient sharing strategies, while Sensor Deployment and Coverage Enhancement examines optimal placement of sensing assets. Datasets and Benchmarks such as V2X-Sim[1] and OPV2V[21] provide evaluation testbeds, and Federated and Distributed Learning branches explore privacy-preserving cooperative training. Single-Vehicle Multi-Sensor Fusion addresses on-board integration, Cooperative Vehicle-Infrastructure Systems target real-world deployment scenarios, and Surveys and Reviews synthesize progress across these areas. Incentive Mechanisms and Multi-Agent Learning consider the strategic and economic dimensions of collaboration.

A particularly active line of work centers on heterogeneous multi-modal fusion frameworks, where the challenge is to reconcile different sensor types, viewpoints, and agent capabilities. GT-Space[0] sits within this cluster, constructing a ground-truth-derived common feature space to handle heterogeneity. Nearby efforts like Heterogeneous Multiscale Cooperative[2] and Hecofuse[14] similarly address multi-scale or multi-modal integration, while Polymorphic Feature Interpreter[35] explores adaptive feature representations. Compared to these neighbors, GT-Space[0] appears to prioritize alignment against an explicit shared reference space over purely pairwise learned fusion strategies.
Broader tensions in the field include trade-offs between communication overhead and perception accuracy, the gap between simulation benchmarks and real-world deployment, and the challenge of designing frameworks that generalize across diverse agent configurations and environmental conditions.

Claimed Contributions

GT-Space framework for heterogeneous collaborative perception

The authors introduce GT-Space, a collaborative perception framework that constructs a common feature space from ground-truth labels to enable heterogeneous agents with different sensing modalities or model architectures to align their features. Each agent requires only a single adapter module to project features into this shared space, eliminating the need for pairwise interactions.
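The single-adapter design can be sketched as follows. This is a minimal NumPy illustration, assuming hypothetical channel sizes and a 1x1-convolution-style adapter; the paper's actual adapter architecture is not specified in this report:

```python
import numpy as np

class Adapter:
    """Projects agent-specific BEV features into the common feature space.

    Written as a per-cell linear map (the NumPy equivalent of a 1x1
    convolution); in practice the weights would be learned.
    """
    def __init__(self, c_in: int, c_common: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((c_in, c_common)) * 0.02
        self.b = np.zeros(c_common)

    def __call__(self, feat: np.ndarray) -> np.ndarray:
        # feat: (H, W, c_in) BEV feature map -> (H, W, c_common)
        return feat @ self.W + self.b

# One adapter per agent type suffices: N agent types need N adapters,
# not O(N^2) pairwise interpreter modules.
lidar_adapter = Adapter(c_in=64, c_common=128)
camera_adapter = Adapter(c_in=256, c_common=128)

lidar_feat = np.zeros((100, 100, 64))    # hypothetical LiDAR BEV features
camera_feat = np.zeros((100, 100, 256))  # hypothetical camera BEV features

# Both projections land in the same shared space and can be fused directly.
common_lidar = lidar_adapter(lidar_feat)
common_camera = camera_adapter(camera_feat)
```

The key scaling property is that adding a new agent type requires training only its one adapter against the fixed common space, rather than one interpreter per existing agent type.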

Retrieved papers compared: 10 (1 can refute)
Ground-truth derived common feature space

The method explicitly constructs a common feature space by encoding ground-truth object information (locations, sizes, and properties) into bird's-eye view features. This space provides a shared, accurate reference for aligning heterogeneous features and offers strong intermediate supervision signals beyond final detection outputs.
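A minimal sketch of how ground-truth boxes might be encoded into a BEV reference map. The grid resolution, channel layout, and center-cell-only encoding below are illustrative assumptions; the actual GT-Space encoding likely rasterizes full box footprints:

```python
import numpy as np

def gt_to_bev(boxes, grid_size=(100, 100), cell_m=0.5, channels=3):
    """Encode ground-truth boxes (x, y in metres, length, width, heading)
    into a BEV map that can serve as a shared reference feature space."""
    H, W = grid_size
    bev = np.zeros((H, W, channels))
    for x, y, length, width, yaw in boxes:
        # Map the metric box centre to grid indices (ego at grid centre).
        i = int(y / cell_m) + H // 2
        j = int(x / cell_m) + W // 2
        if 0 <= i < H and 0 <= j < W:
            # Write size and heading as channel values at the centre cell.
            bev[i, j] = (length, width, yaw)
    return bev

# One hypothetical box: centre (4.0 m, 2.0 m), 4.5 m x 1.8 m, yaw 0.3 rad.
boxes = [(4.0, 2.0, 4.5, 1.8, 0.3)]
ref = gt_to_bev(boxes)
```

Because this map is derived directly from labels, it can supervise intermediate features (e.g. via alignment losses against adapter outputs), not just the final detection head.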

Retrieved papers compared: 10 (none refute)
Combinatorial contrastive loss training strategy

The authors propose training the fusion network using combinatorial contrastive losses computed over all possible modality pairs. This strategy enables the model to effectively fuse any combination of input modalities at inference time and enhances the model's ability to capture object-relevant information across different modalities.
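The combinatorial strategy can be illustrated with a one-directional InfoNCE loss summed over every modality pair. The modality names, embedding sizes, and temperature below are illustrative assumptions, not the paper's settings:

```python
import itertools
import numpy as np

def info_nce(za, zb, temperature=0.1):
    """One-directional InfoNCE over two sets of N embeddings (N, D):
    row i of za should match row i of zb and mismatch all other rows."""
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / temperature  # (N, N) cosine-similarity logits
    # Row-wise log-softmax; the diagonal holds the positive pairs.
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(logp))

def combinatorial_contrastive_loss(modality_feats, temperature=0.1):
    """Sum the pairwise contrastive loss over every modality pair, so the
    fusion network is trained on all combinations it may see at inference."""
    total = 0.0
    for za, zb in itertools.combinations(modality_feats.values(), 2):
        total += info_nce(za, zb, temperature)
    return total

rng = np.random.default_rng(0)
feats = {
    "lidar": rng.standard_normal((8, 32)),
    "camera": rng.standard_normal((8, 32)),
    "radar": rng.standard_normal((8, 32)),
}
# Three modalities yield three pairs: (lidar, camera), (lidar, radar),
# (camera, radar).
loss = combinatorial_contrastive_loss(feats)
```

Training over all pairs, rather than a fixed fusion order, is what lets the fused model accept whichever subset of modalities happens to be present at inference time.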

Retrieved papers compared: 10 (none refute)

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

