Earth-Agent: Unlocking the Full Landscape of Earth Observation with Agents

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Earth observation, Earth-Agent, Earth-Bench
Abstract:

Earth observation (EO) is essential for understanding the evolving states of the Earth system. Although recent MLLMs have advanced EO research, they still lack the capability to tackle complex tasks that require multi-step reasoning and the use of domain-specific tools. Agent-based methods offer a promising direction, but current attempts remain in their infancy, confined to RGB perception, shallow reasoning, and lacking systematic evaluation protocols. To overcome these limitations, we introduce Earth-Agent, the first agentic framework that unifies RGB and spectral EO data within an MCP-based tool ecosystem, enabling cross-modal, multi-step, and quantitative spatiotemporal reasoning beyond pretrained MLLMs. Earth-Agent supports complex scientific tasks such as geophysical parameter retrieval and quantitative spatiotemporal analysis by dynamically invoking expert tools and models across modalities. To support comprehensive evaluation, we further propose Earth-Bench, a benchmark of 248 expert-curated tasks with 13,729 images, spanning spectrum, products and RGB modalities, and equipped with a dual-level evaluation protocol that assesses both reasoning trajectories and final outcomes. We conduct comprehensive experiments varying different LLM backbones, comparisons with general agent frameworks, and comparisons with MLLMs on remote sensing benchmarks, demonstrating both the effectiveness and potential of Earth-Agent. Earth-Agent establishes a new paradigm for EO analysis, moving the field toward scientifically grounded, next-generation applications of LLMs in Earth observation. Our code and dataset will be publicly released.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

Earth-Agent introduces an agentic framework unifying RGB and spectral Earth observation data within an MCP-based tool ecosystem, enabling cross-modal, multi-step reasoning for geophysical parameter retrieval and spatiotemporal analysis. The paper resides in the Remote Sensing-Specific Agents leaf, which contains eight papers focused on agent systems specialized for remote sensing image analysis and EO-specific reasoning tasks. This leaf sits within the broader Agentic Frameworks and Multi-Agent Systems branch, indicating a moderately populated research direction where agent-based approaches to Earth observation are actively explored but not yet saturated.

The taxonomy reveals neighboring work in General-Purpose Geospatial Agents (five papers on broad geospatial task automation) and Interactive and Conversational Agents (one paper on multi-turn dialogue for environmental analysis). The Vision-Language Models for Earth Observation branch offers an alternative paradigm without explicit agent architecture, while Tool-Use Mechanisms and Evaluation provides complementary infrastructure for assessing tool invocation. Earth-Agent bridges these areas by combining agent planning with multi-modal vision-language understanding and systematic tool orchestration, positioning itself at the intersection of agentic reasoning and domain-specific EO capabilities.

Among the thirty candidates examined, none of the ten retrieved for the framework-level contribution (Contribution A) clearly refutes it, suggesting the integrated MCP-based architecture may offer distinctive design choices. For multi-spectral image processing (Contribution B), one of the ten candidates examined can refute the claim, indicating that prior work already addresses spectral data handling. For interactive reasoning with external tools (Contribution C), four of the ten candidates can refute the claim, reflecting established precedent in tool-augmented EO agents. Because the search scope is limited, these statistics capture the top semantic matches rather than exhaustive field coverage, and the dual-level evaluation protocol for Earth-Bench appears less examined in the candidate set.
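The per-contribution counts in the paragraph above can be tallied in a few lines. This is only a bookkeeping sketch: the dictionary layout and variable names are illustrative assumptions, and the zero for Contribution A encodes the report's "no clear refutation" finding rather than a figure stated explicitly.

```python
# Counts as reported in the assessment; names and layout are illustrative.
candidates = {
    "A: Earth-Agent framework": {"examined": 10, "refutable": 0},
    "B: Multi-spectral image processing": {"examined": 10, "refutable": 1},
    "C: Interactive reasoning with external tools": {"examined": 10, "refutable": 4},
}

# Totals across the three claimed contributions.
total_examined = sum(c["examined"] for c in candidates.values())
total_refutable = sum(c["refutable"] for c in candidates.values())

for name, c in candidates.items():
    share = c["refutable"] / c["examined"]
    print(f"{name}: {c['refutable']}/{c['examined']} refutable ({share:.0%})")

print(f"Total: {total_refutable}/{total_examined} refutable")
```

The totals agree with the summary figures of the report: thirty candidate papers compared, five of which are flagged as refutable.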

Because the search examined only thirty papers obtained through semantic retrieval, the analysis provides a snapshot of closely related work rather than comprehensive field coverage. The framework's integration of spectral data, MCP-based tools, and dual-level evaluation appears to combine elements present separately in prior work, though the specific architectural synthesis may offer incremental advances. The benchmark's scale (248 tasks, 13,729 images) and cross-modal scope warrant attention, but the novelty assessment remains constrained by the limited literature sample and by the overlapping tool-use and multi-spectral capabilities found in the examined candidates.

Taxonomy

Core-task Taxonomy Papers: 40
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 5

Research Landscape Overview

Core task: Multi-step reasoning and tool use for Earth observation analysis.

The field structure reflects a convergence of agentic AI capabilities with domain-specific Earth observation challenges. The taxonomy organizes work into several main branches: Agentic Frameworks and Multi-Agent Systems focus on building autonomous or collaborative agents that orchestrate complex geospatial workflows; Vision-Language Models for Earth Observation adapt multimodal foundation models to interpret satellite and aerial imagery alongside textual queries; Tool-Use Mechanisms and Evaluation address how agents select, invoke, and assess specialized geospatial tools; Instruction-Following Datasets and Training Resources provide the supervised signals needed to teach models domain conventions; Task-Specific Applications and Domain Integration demonstrate end-to-end systems for problems like flood detection or forest monitoring; and Foundational Methods and Cross-Domain Techniques supply general reasoning strategies that transfer across domains. Representative works such as ThinkGeo[2] and Teochat[1] illustrate how agents can chain reasoning steps with tool calls, while Geo-olm[11] and Naiad[33] show efforts to ground vision-language understanding in geospatial semantics.

A particularly active line of work explores remote sensing-specific agents that combine large language models with geospatial APIs and domain knowledge bases. Earth-Agent[0] sits within this cluster, emphasizing multi-step reasoning pipelines that decompose complex Earth observation queries into executable tool sequences. Nearby efforts like Multi-agent Remote Sensing[5] and Agentic AI Remote Sensing[29] similarly pursue collaborative or hierarchical agent architectures, though they differ in whether they prioritize single-agent orchestration or multi-agent negotiation. Compared to more general agentic frameworks, Earth-Agent[0] and its neighbors integrate domain-specific constraints (such as coordinate reference systems, temporal resolution trade-offs, and sensor modality selection) directly into the reasoning loop. Open questions remain around how to evaluate reasoning quality beyond task success, how to handle noisy or incomplete geospatial metadata, and whether hybrid approaches that blend symbolic planning with neural tool selection offer better generalization than end-to-end learned policies.
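To make the idea of decomposing a query into an executable tool sequence concrete, the following is a minimal sketch. It is not Earth-Agent's implementation and does not use MCP. The tool registry, the plan format, the `run_plan` helper, and the sample band values are all hypothetical. The only domain fact assumed is the standard NDVI formula, (NIR - Red) / (NIR + Red). The executor also records the order of tool calls, which is the kind of trajectory that could be inspected when judging reasoning quality beyond final task success.

```python
from typing import Callable, Dict, List, Tuple

Band = List[float]  # a flattened image band; toy stand-in for a raster


def ndvi(red: Band, nir: Band) -> Band:
    """Per-pixel NDVI = (NIR - Red) / (NIR + Red); 0.0 where the sum is zero."""
    return [(n - r) / (n + r) if (n + r) != 0 else 0.0 for r, n in zip(red, nir)]


def fraction_above(values: Band, threshold: float) -> float:
    """Share of pixels whose value exceeds the threshold."""
    if not values:
        return 0.0
    return sum(v > threshold for v in values) / len(values)


# Hypothetical tool registry: tool name -> callable.
TOOLS: Dict[str, Callable] = {"ndvi": ndvi, "fraction_above": fraction_above}

# A plan step is (tool name, names of inputs in the state, name of the output).
Step = Tuple[str, List[str], str]


def run_plan(plan: List[Step], inputs: dict) -> Tuple[dict, List[str]]:
    """Execute the steps in order, returning the final state and the tool trajectory."""
    state = dict(inputs)
    trajectory: List[str] = []
    for tool_name, arg_names, out_name in plan:
        args = [state[a] for a in arg_names]
        state[out_name] = TOOLS[tool_name](*args)
        trajectory.append(tool_name)
    return state, trajectory


# Toy query: "What fraction of pixels is vegetated (NDVI above 0.3)?"
plan: List[Step] = [
    ("ndvi", ["red", "nir"], "ndvi_map"),
    ("fraction_above", ["ndvi_map", "threshold"], "veg_fraction"),
]
inputs = {
    "red": [0.1, 0.2, 0.3, 0.4],  # made-up reflectance values
    "nir": [0.5, 0.6, 0.3, 0.2],
    "threshold": 0.3,
}
state, trajectory = run_plan(plan, inputs)
print(trajectory, state["veg_fraction"])
```

In a real agent the plan would be produced by the language model at run time rather than written by hand; the fixed plan here only illustrates how intermediate outputs feed later tool calls.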

Claimed Contributions

Contribution A: Earth-Agent framework for comprehensive Earth observation tasks

The authors propose Earth-Agent, a framework that extends beyond existing MLLM-based and agent-based Earth observation research by supporting multi-spectral imagery, processing numerous images simultaneously, and performing complex multi-step reasoning while integrating external tools and expert models.

10 retrieved papers

Contribution B: Multi-spectral image processing capability

Unlike prior work limited to RGB images, Earth-Agent can handle both multi-spectral and RGB imagery, expanding the range of Earth observation data that can be analyzed.

10 retrieved papers (Can Refute)

Contribution C: Interactive reasoning with external tools and models

The framework enables complex multi-step interactive reasoning by integrating with external tools and expert models, making it extensible beyond the capabilities of standalone models.

10 retrieved papers (Can Refute)
