MoRA: Mobility as the Backbone for Geospatial Representation Learning at Scale

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: GeoAI, spatial representation learning, location embedding, multi-modal, contrastive learning
Abstract:

Representation learning of geospatial locations remains a core challenge in achieving general geospatial intelligence, with increasingly diverging philosophies and techniques. While Earth observation paradigms excel at depicting locations in their physical states, we propose that a location’s full characterization requires grounding in both its physical attributes and its internal human activity patterns, the latter being particularly crucial for understanding its human-centric functions. We present MoRA, a human-centric geospatial framework that leverages a mobility graph as its core backbone to fuse diverse data modalities, aiming to learn embeddings that represent the socio-economic context and functional role of a location. MoRA achieves this through the integration of spatial tokenization, GNNs, and asymmetric contrastive learning to align 100M+ POIs, massive remote sensing imagery, and structured demographic statistics with a billion-edge mobility graph, ensuring the three auxiliary modalities are interpreted through the lens of fundamental human dynamics. To rigorously evaluate MoRA, we construct a benchmark dataset composed of 9 downstream prediction tasks across social and economic domains. Experiments show that MoRA, with four input modalities and a compact 128-dimensional representation space, outperforms state-of-the-art models by an average of 12.9%. Echoing LLM scaling laws, we further demonstrate scaling behavior in geospatial representation learning. We open-source code and pretrained models at: https://anonymous.4open.science/r/MoRA-.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

MoRA proposes a human-centric geospatial framework that uses a mobility graph as its core backbone to fuse POIs, remote sensing imagery, and demographic statistics, learning embeddings that represent socio-economic context and functional roles of locations. The paper resides in the 'Scalable Geospatial Representation Learning' leaf under 'Foundation Models and Large-Scale Geospatial Learning', which contains only three papers total. This is a relatively sparse research direction within the broader taxonomy of fifty papers, suggesting that large-scale, general-purpose geospatial representation learning remains an emerging area compared to more crowded branches like multimodal fusion or mobility prediction.

The taxonomy reveals that MoRA's neighboring work spans multiple branches: 'Multimodal Fusion Frameworks' (thirteen papers across three sub-leaves) explores integration strategies but often without mobility as the primary organizing principle, while 'Mobility-Driven Region Representation Learning' (six papers) focuses on mobility patterns but typically without the scale or multimodal scope MoRA claims. The 'Contrastive and Self-Supervised Multimodal Learning' sub-leaf (four papers) shares methodological overlap in using contrastive objectives, yet those works do not explicitly position mobility graphs as the interpretive lens for auxiliary modalities. MoRA's approach of grounding physical and demographic data through human dynamics appears to bridge these directions.

Of the thirty candidates examined (ten per contribution), two were judged refutable for the framework contribution, two for the benchmark contribution, and one for the scaling-laws contribution. These statistics indicate that while some prior work exists in each area, the search scope was limited and the majority of examined candidates did not clearly overlap. The framework's emphasis on asymmetric contrastive learning and billion-edge mobility graphs at scale distinguishes it from smaller-scale or single-modality methods, though the extent of novelty depends on how thoroughly the limited candidate pool represents the field.

Based on the top-thirty semantic matches and citation expansion, MoRA appears to occupy a relatively under-explored intersection of mobility-centric reasoning and large-scale multimodal fusion. The analysis does not cover exhaustive literature in urban computing or remote sensing communities, and the taxonomy's sparse 'Scalable Geospatial Representation Learning' leaf suggests this direction is still maturing. The contribution-level statistics hint at incremental overlap rather than wholesale redundancy, though a broader search might reveal additional related efforts.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 5

Research Landscape Overview

Core task: geospatial representation learning using multimodal data and mobility graphs. The field has evolved into several interconnected branches that address different facets of learning meaningful region embeddings. Mobility-Driven Region Representation Learning focuses on extracting patterns from human movement data, often leveraging trajectory flows and transition graphs to capture functional connectivity between areas. Multimodal Fusion Frameworks for Region Representation integrate diverse data sources, such as points of interest, satellite imagery, and social media, to build richer semantic profiles, as seen in works like Effective urban region representation[2] and MGRL4RE[5]. Graph Neural Network Architectures for Regions provide the structural backbone for encoding spatial relationships, while Remote Sensing and Spatial Context Integration emphasizes the role of imagery and environmental features. Mobility Prediction and Forecasting applies these representations to anticipate future flows, and Foundation Models and Large-Scale Geospatial Learning explores scalable pretraining strategies that generalize across cities and tasks. Specialized Applications and Emerging Paradigms address domain-specific challenges, from urban planning to location recommendation. Recent efforts have increasingly turned toward scalable, self-supervised methods that can handle the heterogeneity and volume of geospatial data.

Within the Foundation Models and Large-Scale Geospatial Learning branch, MoRA[0] sits alongside other scalable approaches like MobCLIP[27] and Temporal Embeddings[50], emphasizing efficient representation learning that can transfer across diverse urban contexts. While MobCLIP[27] leverages contrastive learning on mobility traces and MoRA[0] focuses on mobility-aware region aggregation, both share the goal of reducing reliance on task-specific labels. In contrast, earlier methods such as Region representation learning via[1] and Unsupervised Representation Learning of[18] laid foundational ideas but operated at smaller scales. The central tension across these lines of work involves balancing expressiveness, capturing fine-grained spatial semantics, with computational efficiency and the ability to generalize to new regions with limited supervision.

Claimed Contributions

MoRA framework using mobility as backbone for multimodal geospatial representation learning

The authors introduce MoRA, a framework that positions human mobility graphs as the central structural backbone for aligning multiple geospatial data modalities (POIs, satellite imagery, demographics). This mobility-centric design ensures all auxiliary modalities are interpreted through fundamental human dynamics, producing comprehensive location embeddings for socio-economic inference.
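The mobility-as-anchor alignment described above can be sketched as a one-directional (asymmetric) InfoNCE objective. Everything below is illustrative: the paper does not publish this code, and the function name, the temperature value, and the choice to treat the mobility side as fixed anchors are assumptions made for the sketch, not MoRA's actual implementation.

```python
import numpy as np

def asymmetric_info_nce(mob_emb, aux_emb, temperature=0.07):
    """One-directional InfoNCE sketch: auxiliary embeddings (POIs, imagery,
    demographics) are pulled toward mobility-graph anchors; matched
    location pairs sit on the diagonal of the similarity matrix."""
    a = mob_emb / np.linalg.norm(mob_emb, axis=1, keepdims=True)  # mobility anchors
    q = aux_emb / np.linalg.norm(aux_emb, axis=1, keepdims=True)  # auxiliary queries
    logits = q @ a.T / temperature
    logits -= logits.max(axis=1, keepdims=True)                   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                           # positives on diagonal

rng = np.random.default_rng(0)
mob = rng.normal(size=(8, 128))                        # 8 locations, 128-d embeddings
poi_aligned = mob + 0.01 * rng.normal(size=(8, 128))   # near-copies of the anchors
poi_random = rng.normal(size=(8, 128))                 # unrelated embeddings
print(asymmetric_info_nce(mob, poi_aligned) < asymmetric_info_nce(mob, poi_random))  # → True
```

The asymmetry here is that gradients (in a full training setup) would flow only into the auxiliary encoder, keeping the mobility backbone as the shared interpretive frame, which mirrors the paper's stated design intent.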

Retrieved papers: 10. Verdict: Can Refute.
Benchmark dataset for human-centric geospatial representation evaluation

The authors curate a benchmark comprising 9 diverse downstream tasks spanning social and economic domains at multiple spatial scales (point, grid, county, city). This benchmark enables rigorous evaluation of geospatial representation quality for human-centric inference tasks.
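Benchmarks of this kind typically evaluate frozen embeddings with a lightweight probe per downstream task. The sketch below uses closed-form ridge regression on synthetic 128-dimensional embeddings; the `ridge_probe` helper, the data, and the train/test split are hypothetical placeholders, not the benchmark's actual protocol.

```python
import numpy as np

def ridge_probe(X_train, y_train, X_test, lam=1.0):
    """Closed-form ridge regression: w = (X^T X + lam I)^-1 X^T y."""
    d = X_train.shape[1]
    w = np.linalg.solve(X_train.T @ X_train + lam * np.eye(d), X_train.T @ y_train)
    return X_test @ w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 128))              # frozen 128-d embeddings for 200 locations
w_true = rng.normal(size=128)
y = X @ w_true + 0.1 * rng.normal(size=200)  # synthetic socio-economic target
pred = ridge_probe(X[:150], y[:150], X[150:])
ss_res = ((y[150:] - pred) ** 2).sum()
ss_tot = ((y[150:] - y[150:].mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot                     # held-out R^2 for this task
print(round(r2, 3))
```

Reporting a per-task metric such as held-out R² over all 9 tasks, then averaging, is one plausible way the 12.9% aggregate improvement could be computed.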

Retrieved papers: 10. Verdict: Can Refute.
Empirical evidence of scaling laws in geospatial representation learning

The authors demonstrate that geospatial representation learning exhibits scaling behavior analogous to large language models: increasing pretraining data size and spatial coverage from local to national scales consistently improves downstream task performance, revealing predictable performance gains with scale.
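LLM-style scaling behavior is typically summarized by fitting a power law, err(N) ≈ a · N^(−b), to (pretraining data size, downstream error) points; a linear fit in log-log space recovers the exponent. The numbers below are synthetic, chosen only to illustrate the fitting procedure, not the paper's measurements.

```python
import numpy as np

sizes = np.array([1e5, 1e6, 1e7, 1e8])   # pretraining data sizes (synthetic)
errors = 2.0 * sizes ** (-0.25)          # synthetic downstream errors, true b = 0.25
# Power law err = a * N**(-b) is linear in log-log space:
# log(err) = log(a) - b * log(N), so the slope of a degree-1 fit is -b.
slope, intercept = np.polyfit(np.log(sizes), np.log(errors), 1)
b_hat = -slope                           # estimated scaling exponent
print(round(b_hat, 3))                   # → 0.25
```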

Retrieved papers: 10. Verdict: Can Refute.

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

MoRA framework using mobility as backbone for multimodal geospatial representation learning

The authors introduce MoRA, a framework that positions human mobility graphs as the central structural backbone for aligning multiple geospatial data modalities (POIs, satellite imagery, demographics). This mobility-centric design ensures all auxiliary modalities are interpreted through fundamental human dynamics, producing comprehensive location embeddings for socio-economic inference.

Contribution

Benchmark dataset for human-centric geospatial representation evaluation

The authors curate a benchmark comprising 9 diverse downstream tasks spanning social and economic domains at multiple spatial scales (point, grid, county, city). This benchmark enables rigorous evaluation of geospatial representation quality for human-centric inference tasks.

Contribution

Empirical evidence of scaling laws in geospatial representation learning

The authors demonstrate that geospatial representation learning exhibits scaling behavior analogous to large language models: increasing pretraining data size and spatial coverage from local to national scales consistently improves downstream task performance, revealing predictable performance gains with scale.