TerraFM: A Scalable Foundation Model for Unified Multisensor Earth Observation

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Remote Sensing, Foundation Model, Geospatial
Abstract:

Modern Earth observation (EO) increasingly leverages deep learning to harness the scale and diversity of satellite imagery across sensors and regions. While recent foundation models have demonstrated promising generalization across EO tasks, many remain limited by the scale, geographical coverage, and spectral diversity of their training data, factors critical for learning globally transferable representations. In this work, we introduce TerraFM, a scalable self-supervised learning model that leverages globally distributed Sentinel-1 and Sentinel-2 imagery together with large spatial tiles and land-cover-aware sampling to enrich spatial and semantic coverage. By treating sensing modalities as natural augmentations in our self-supervised approach, we unify radar and optical inputs via modality-specific patch embeddings and adaptive cross-attention fusion. Our training strategy integrates local-global contrastive learning and introduces a dual-centering mechanism that incorporates class-frequency-aware regularization to address long-tailed distributions in land cover. TerraFM achieves strong generalization on both classification and segmentation tasks, outperforming prior models on GEO-Bench and Copernicus-Bench. Our code and pretrained models will be publicly released.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

TerraFM contributes a self-supervised learning framework that unifies Sentinel-1 radar and Sentinel-2 optical imagery through modality-specific patch embeddings, adaptive cross-attention fusion, and a dual-centering mechanism for long-tailed land-cover distributions. The paper resides in the Attention-Based Cross-Modal Alignment leaf, which contains only three papers total, indicating a relatively sparse research direction within the broader taxonomy. This leaf sits under Multimodal Fusion Mechanisms, one of six major branches in the field, suggesting that while fusion strategies are well-studied, attention-based alignment remains a less crowded niche compared to contrastive pretraining or masked modeling approaches.

The taxonomy reveals that TerraFM's neighboring research directions include Early and Late Fusion Strategies (two papers) and Incomplete Modality Handling (one paper), both of which address multimodal integration but through different mechanisms—concatenation-based fusion versus robustness to missing sensors. The broader Pretraining Objectives and Architectures branch contains substantially more work, particularly in Contrastive Learning Approaches (eight papers across three sub-leaves) and Masked Autoencoding (three papers). TerraFM's emphasis on cross-attention distinguishes it from these pretraining-centric methods, which typically treat fusion as a secondary concern rather than a core architectural innovation.

Among the three contributions analyzed, the literature search examined twenty-four candidates total. The modality-specific patch embedding mechanism was evaluated against ten candidates with zero refutations, the cross-attention fusion treating sensors as augmentations was similarly examined against ten candidates with no clear prior work, and the dual-centering strategy for long-tailed distributions was assessed against four candidates, again with no refutations. These statistics reflect a limited search scope—top-K semantic matches plus citation expansion—rather than exhaustive coverage. The absence of refutations across all contributions suggests that within this bounded candidate set, the specific combination of techniques appears novel.

Based on the limited search of twenty-four candidates, TerraFM's contributions appear to occupy a relatively unexplored intersection of attention-based fusion and land-cover-aware regularization. However, the sparse population of the Attention-Based Cross-Modal Alignment leaf (three papers) and the modest search scope mean that broader prior work in adjacent areas—such as hybrid pretraining strategies or specialized encoder architectures—may not have been fully captured. The analysis provides evidence of novelty within the examined sample but does not constitute an exhaustive field survey.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 0

Research Landscape Overview

Core task: self-supervised learning for multisensor Earth observation imagery. The field organizes around several complementary branches that reflect both methodological and application-oriented concerns. Pretraining Objectives and Architectures explores foundational self-supervised paradigms—contrastive learning, masked modeling, and hybrid strategies—that enable models to extract robust representations from unlabeled satellite data. Multimodal Fusion Mechanisms addresses the challenge of integrating heterogeneous sensor modalities (optical, SAR, hyperspectral) through attention-based alignment, early or late fusion, and cross-modal consistency constraints. Application-Driven Methods tailors these techniques to specific downstream tasks such as land cover classification, change detection, and building extraction, while Domain Adaptation and Transfer Learning tackles the distribution shifts that arise when models trained on one geographic region or sensor type are deployed elsewhere. Finally, Datasets, Benchmarks, and Frameworks (e.g., SSL4EO-S12[30], Mmearth[49]) provide the infrastructure for reproducible experimentation, and Reviews and Surveys (Self-supervised Remote Sensing Review[3], Orbit to Ground Review[50]) synthesize emerging trends across these branches.

Recent work highlights a tension between general-purpose foundation models and task-specific fusion strategies. On one hand, large-scale pretraining efforts like OmniSat[19] and AnySat[46] aim to build unified representations across diverse sensors and resolutions, leveraging masked modeling or contrastive objectives at scale. On the other hand, attention-based cross-modal alignment methods—exemplified by TerraFM[0]—focus on learning fine-grained correspondences between modalities, often using transformer architectures to dynamically weight sensor contributions.
TerraFM[0] sits within this latter cluster, emphasizing cross-modal attention mechanisms that align optical and SAR features for improved downstream performance. Compared to broader multimodal frameworks like Self-supervised Multimodal EO[1], which may employ simpler concatenation or early fusion, TerraFM[0] invests more heavily in learned alignment, trading off architectural complexity for potentially richer inter-sensor synergies. This design choice reflects an ongoing question in the field: whether to prioritize scalable, modality-agnostic pretraining or to encode domain-specific sensor relationships directly into the fusion architecture.

Claimed Contributions

Modality-specific patch embedding mechanism for heterogeneous remote sensing data

The authors propose modality-specific patch embeddings that replace the shared projection in standard ViTs with modality-aware embeddings. This enables flexible handling of sensor-specific spectral profiles (e.g., Sentinel-1 SAR and Sentinel-2 optical bands) while preserving spatial structure.

10 retrieved papers
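The core idea of this contribution can be illustrated with a minimal numpy sketch: one projection matrix per sensor replaces the single shared patch projection of a standard ViT, so inputs with different band counts land in the same token space. All names, dimensions, and the weight initialization here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
patch, dim = 16, 32  # patch size and shared embedding width (illustrative)

# Channel counts per modality (Sentinel-1 SAR: 2 bands, Sentinel-2 optical: 13).
channels = {"s1": 2, "s2": 13}

# One projection matrix per modality replaces the single shared ViT projection.
proj = {m: rng.normal(0, 0.02, (c * patch * patch, dim)) for m, c in channels.items()}

def embed(x, modality):
    """Split an image (C, H, W) into non-overlapping patches and project each
    with the modality's own matrix into the shared token space."""
    c, h, w = x.shape
    gh, gw = h // patch, w // patch
    # (C, gh, P, gw, P) -> (gh*gw, C*P*P): one flattened vector per patch
    patches = (x.reshape(c, gh, patch, gw, patch)
                .transpose(1, 3, 0, 2, 4)
                .reshape(gh * gw, -1))
    return patches @ proj[modality]

s1_tokens = embed(rng.normal(size=(2, 64, 64)), "s1")
s2_tokens = embed(rng.normal(size=(13, 64, 64)), "s2")
# Both sensors yield tokens of the same shape despite different band counts.
print(s1_tokens.shape, s2_tokens.shape)  # (16, 32) (16, 32)
```

The point of the sketch is only the shape contract: spatial structure (one token per patch) is preserved, while the per-modality matrices absorb the sensor-specific spectral profiles.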
Cross-attention fusion treating sensor modalities as natural augmentations

The authors interpret different aligned modalities (S1-SAR, S2-L1C, S2-L2A) as complementary views of the same scene and introduce a cross-attention fusion module that dynamically aggregates modality-specific tokens using learnable spatial queries within a single DINO-style multi-crop backbone.

10 retrieved papers
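A minimal sketch of the fusion mechanism described above, under simplifying assumptions: a small set of learnable spatial queries attends over the concatenated modality tokens with single-head dot-product attention, so each fused token is an adaptively weighted mix of radar and optical features. Dimensions, weights, and function names are hypothetical and stand in for the paper's actual module.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_queries = 32, 4  # token width and number of learnable spatial queries

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Learnable spatial queries (fixed here for illustration) and key/value maps.
queries = rng.normal(0, 0.02, (n_queries, dim))
w_k = rng.normal(0, 0.02, (dim, dim))
w_v = rng.normal(0, 0.02, (dim, dim))

def cross_attn_fuse(token_sets):
    """Queries attend jointly over all modality tokens, so the attention
    weights decide per query how much each sensor contributes."""
    tokens = np.concatenate(token_sets, axis=0)     # (sum_i N_i, dim)
    k, v = tokens @ w_k, tokens @ w_v
    attn = softmax(queries @ k.T / np.sqrt(dim))    # (n_queries, N)
    return attn @ v                                 # (n_queries, dim)

s1 = rng.normal(size=(16, dim))  # tokens from the Sentinel-1 SAR view
s2 = rng.normal(size=(16, dim))  # tokens from a Sentinel-2 optical view
fused = cross_attn_fuse([s1, s2])
print(fused.shape)  # (4, 32)
```

Because the modalities are treated as views of the same scene, the same fusion can be applied inside a DINO-style multi-crop pipeline, with each crop drawn from whichever aligned sensor is available.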
Dual-centering strategy for addressing long-tailed land-cover distributions

The authors introduce a dual-centering mechanism into the distillation process that leverages WorldCover-derived class statistics to compute a frequency-aware center. This improves balance across dominant and rare semantic categories without requiring supervised objectives.

4 retrieved papers
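The dual-centering idea can be sketched as a small extension of DINO-style teacher centering: in addition to the usual EMA center over teacher outputs, a second, frequency-aware center derived from land-cover class statistics is subtracted before sharpening. The class frequencies, their alignment with prototype dimensions, and all hyperparameters below are invented for illustration; this is not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_proto = 8  # output dimension of the distillation head (illustrative)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Standard DINO-style EMA center, updated from teacher outputs each step.
ema_center = np.zeros(n_proto)

# Hypothetical land-cover class frequencies (e.g. WorldCover-derived):
# dominant classes receive a larger correction so rare ones are not washed out.
class_freq = np.array([0.40, 0.25, 0.15, 0.08, 0.05, 0.04, 0.02, 0.01])
freq_center = np.log(class_freq)  # logit-scale frequency prior

def teacher_targets(logits, momentum=0.9, tau=0.04):
    """Dual centering: subtract both the EMA batch center and the
    frequency-aware center before sharpening with a low temperature."""
    global ema_center
    ema_center = momentum * ema_center + (1 - momentum) * logits.mean(axis=0)
    centered = logits - ema_center - freq_center
    return softmax(centered / tau)

targets = teacher_targets(rng.normal(size=(4, n_proto)))
print(targets.shape)  # each row is a valid probability distribution
```

Subtracting `log(class_freq)` down-weights dimensions associated with frequent classes, which is how a long-tailed prior can be folded into the distillation targets without any supervised loss term.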

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

