TerraFM: A Scalable Foundation Model for Unified Multisensor Earth Observation
Overview
Overall Novelty Assessment
TerraFM contributes a self-supervised learning framework that unifies Sentinel-1 radar and Sentinel-2 optical imagery through modality-specific patch embeddings, adaptive cross-attention fusion, and a dual-centering mechanism for long-tailed land-cover distributions. The paper resides in the Attention-Based Cross-Modal Alignment leaf, which contains only three papers total, indicating a relatively sparse research direction within the broader taxonomy. This leaf sits under Multimodal Fusion Mechanisms, one of six major branches in the field, suggesting that while fusion strategies are well-studied, attention-based alignment remains a less crowded niche compared to contrastive pretraining or masked modeling approaches.
The taxonomy reveals that TerraFM's neighboring research directions include Early and Late Fusion Strategies (two papers) and Incomplete Modality Handling (one paper), both of which address multimodal integration but through different mechanisms: concatenation-based fusion versus robustness to missing sensors. The broader Pretraining Objectives and Architectures branch contains substantially more work, particularly in Contrastive Learning Approaches (eight papers across three sub-leaves) and Masked Autoencoding (three papers). TerraFM's emphasis on cross-attention distinguishes it from these pretraining-centric methods, which typically treat fusion as a secondary concern rather than a core architectural innovation.
Across the three contributions analyzed, the literature search examined twenty-four candidates in total. The modality-specific patch embedding mechanism was evaluated against ten candidates with no refutations; the cross-attention fusion treating sensors as augmentations was examined against a further ten candidates with no clear prior work; and the dual-centering strategy for long-tailed distributions was assessed against four candidates, again without refutation. These figures reflect a bounded search scope (top-K semantic matches plus citation expansion) rather than exhaustive coverage. The absence of refutations therefore suggests that, within this candidate set, the specific combination of techniques appears novel.
Based on the limited search of twenty-four candidates, TerraFM's contributions appear to occupy a relatively unexplored intersection of attention-based fusion and land-cover-aware regularization. However, the sparse population of the Attention-Based Cross-Modal Alignment leaf (three papers) and the modest search scope mean that broader prior work in adjacent areas (such as hybrid pretraining strategies or specialized encoder architectures) may not have been fully captured. The analysis provides evidence of novelty within the examined sample but does not constitute an exhaustive field survey.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose modality-specific patch embeddings that replace the shared projection in standard ViTs with modality-aware embeddings. This enables flexible handling of sensor-specific spectral profiles (e.g., Sentinel-1 SAR and Sentinel-2 optical bands) while preserving spatial structure.
The authors interpret different aligned modalities (S1-SAR, S2-L1C, S2-L2A) as complementary views of the same scene and introduce a cross-attention fusion module that dynamically aggregates modality-specific tokens using learnable spatial queries within a single DINO-style multi-crop backbone.
The authors introduce a dual-centering mechanism into the distillation process that leverages WorldCover-derived class statistics to compute a frequency-aware center. This improves balance across dominant and rare semantic categories without requiring supervised objectives.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[6] Deep Feature Correlation Learning for Multi-Modal Remote Sensing Image Registration
[19] OmniSat: Self-Supervised Modality Fusion for Earth Observation
Contribution Analysis
Detailed comparisons for each claimed contribution
Modality-specific patch embedding mechanism for heterogeneous remote sensing data
The authors propose modality-specific patch embeddings that replace the shared projection in standard ViTs with modality-aware embeddings. This enables flexible handling of sensor-specific spectral profiles (e.g., Sentinel-1 SAR and Sentinel-2 optical bands) while preserving spatial structure.
[65] SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery
[66] SCViT: A spatial-channel feature preserving vision transformer for remote sensing image scene classification
[67] Context-aware masking and learnable diffusion-guided patch refinement in transformers via sparse supervision for hyperspectral image classification
[68] Diffformer: A differential spatial-spectral transformer for hyperspectral image classification
[69] SpectralFormer: Rethinking hyperspectral image classification with transformers
[70] Multiscale spatial-spectral transformer network for hyperspectral and multispectral image fusion
[71] Squeeze-SwinFormer: Spectral Squeeze and Excitation Swin Transformer Network for Hyperspectral Image Classification
[72] Coarse-to-Fine Sparse Transformer for Hyperspectral Image Reconstruction
[73] Masked Vision Transformers for Hyperspectral Image Classification
[74] Dual Branch Masked Transformer for Hyperspectral Image Classification
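The modality-specific embedding idea described above can be sketched in a few lines: each sensor keeps its own patch projection (sized to its channel count) while the output dimension, and therefore the shared transformer trunk, is identical across sensors. The channel counts, modality names, patch size, and initialization below are illustrative assumptions, not values taken from the TerraFM paper.

```python
import numpy as np

EMBED_DIM = 64
PATCH = 4
# Illustrative channel counts per sensor (assumptions, not from the paper).
MODALITY_CHANNELS = {"s1_sar": 2, "s2_l1c": 13, "s2_l2a": 12}

rng = np.random.default_rng(0)
# One projection matrix per modality: (C * P * P) -> EMBED_DIM.
proj = {
    m: rng.normal(0, 0.02, size=(c * PATCH * PATCH, EMBED_DIM))
    for m, c in MODALITY_CHANNELS.items()
}

def patchify(img: np.ndarray, patch: int) -> np.ndarray:
    """(C, H, W) -> (num_patches, C * patch * patch), row-major patch order."""
    c, h, w = img.shape
    img = img.reshape(c, h // patch, patch, w // patch, patch)
    img = img.transpose(1, 3, 0, 2, 4)  # (H/p, W/p, C, p, p)
    return img.reshape(-1, c * patch * patch)

def embed(img: np.ndarray, modality: str) -> np.ndarray:
    """Project patches with the modality's own weights; the output shape is
    the same for every sensor, so the shared ViT trunk is unchanged."""
    return patchify(img, PATCH) @ proj[modality]

sar = rng.normal(size=(2, 16, 16))    # Sentinel-1-like input
opt = rng.normal(size=(13, 16, 16))   # Sentinel-2-like input
tokens_sar = embed(sar, "s1_sar")
tokens_opt = embed(opt, "s2_l1c")
assert tokens_sar.shape == tokens_opt.shape == (16, EMBED_DIM)
```

The key design point is that only the first projection layer is modality-aware; everything after tokenization can be shared, which is what lets a single backbone serve heterogeneous spectral profiles.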
Cross-attention fusion treating sensor modalities as natural augmentations
The authors interpret different aligned modalities (S1-SAR, S2-L1C, S2-L2A) as complementary views of the same scene and introduce a cross-attention fusion module that dynamically aggregates modality-specific tokens using learnable spatial queries within a single DINO-style multi-crop backbone.
[55] Cross on cross attention: Deep fusion transformer for image captioning
[56] Attention-based multimodal fusion for video description
[57] Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers
[58] Event-Based Fusion for Motion Deblurring with Cross-modal Attention
[59] SFusion: Self-attention based n-to-one multimodal fusion block
[60] ATTSF-Net: Attention-based Similarity Fusion Network for Audio-Visual Emotion Recognition
[61] Dual-attention transformer-based hybrid network for multi-modal medical image segmentation
[62] Attention driven fusion for multi-modal emotion recognition
[63] Encoder Fusion Network with Co-Attention Embedding for Referring Image Segmentation
[64] Attention-based multimodal image feature fusion module for transmission line detection
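The cross-attention fusion claim above can be illustrated as a set of learnable spatial queries attending over the concatenated tokens of whichever modalities are present, yielding a fixed-size fused representation. This is a single-head, NumPy-level simplification under assumed dimensions, not the paper's exact module.

```python
import numpy as np

D = 64    # token dimension (illustrative)
N_Q = 16  # number of learnable spatial queries (illustrative)

rng = np.random.default_rng(0)
queries = rng.normal(0, 0.02, size=(N_Q, D))  # learned parameters
W_k = rng.normal(0, 0.02, size=(D, D))
W_v = rng.normal(0, 0.02, size=(D, D))

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse(modality_tokens: list) -> np.ndarray:
    """Learnable queries (N_Q, D) cross-attend over all modality tokens."""
    kv = np.concatenate(modality_tokens, axis=0)  # (sum_N, D)
    k, v = kv @ W_k, kv @ W_v
    attn = softmax(queries @ k.T / np.sqrt(D))    # (N_Q, sum_N)
    return attn @ v                               # (N_Q, D)

sar_tokens = rng.normal(size=(16, D))
opt_tokens = rng.normal(size=(16, D))
fused = fuse([sar_tokens, opt_tokens])
assert fused.shape == (N_Q, D)
# The same module also accepts a single modality, which is what makes
# treating sensors as interchangeable augmented views convenient:
assert fuse([sar_tokens]).shape == (N_Q, D)
```

Because the output size depends only on the number of queries, the fused representation plugs into a DINO-style multi-crop pipeline regardless of which sensor views feed a given crop.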
Dual-centering strategy for addressing long-tailed land-cover distributions
The authors introduce a dual-centering mechanism into the distillation process that leverages WorldCover-derived class statistics to compute a frequency-aware center. This improves balance across dominant and rare semantic categories without requiring supervised objectives.
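One way to picture the dual-centering claim: standard DINO subtracts a running (EMA) center from the teacher logits, and the dual variant adds a second, frequency-aware center derived from land-cover class statistics. The section above does not give the exact formula, so the sketch below makes two loudly labeled assumptions: the frequency-aware center is a log-prior (logit-adjustment-style) term, and the two centers are combined by simple averaging.

```python
import numpy as np

K = 8  # number of prototypes / pseudo-classes (illustrative)
rng = np.random.default_rng(0)

# Illustrative long-tailed class frequencies (stand-in for WorldCover stats).
class_freq = np.array([0.40, 0.25, 0.15, 0.08, 0.05, 0.04, 0.02, 0.01])

class DualCenter:
    def __init__(self, dim: int, momentum: float = 0.9, tau: float = 1.0):
        self.ema_center = np.zeros(dim)   # standard DINO running center
        self.momentum = momentum
        # Assumed frequency-aware center: log-prior of the class
        # distribution, so subtracting it boosts rare classes.
        self.freq_center = tau * np.log(class_freq)

    def __call__(self, teacher_logits: np.ndarray) -> np.ndarray:
        batch_center = teacher_logits.mean(axis=0)
        self.ema_center = (self.momentum * self.ema_center
                           + (1 - self.momentum) * batch_center)
        # Assumed combination rule: average the two centers.
        return teacher_logits - 0.5 * (self.ema_center + self.freq_center)

centering = DualCenter(K)
logits = rng.normal(size=(4, K))
centered = centering(logits)
# Rare classes (low frequency, rightmost columns) receive a larger boost
# than dominant ones, which is the intended rebalancing effect.
boost = (centered - logits).mean(axis=0)
assert boost[-1] > boost[0]
```

The point of the sketch is the mechanism, not the exact rule: any frequency-aware center that penalizes dominant classes in the teacher distribution pushes the distillation targets toward rare land-cover categories without a supervised loss.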