Zephyrus: An Agentic Framework for Weather Science

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Agents, Large Language Models, Weather Science, Code Generation
Abstract:

Foundation models for weather science are pre-trained on vast amounts of structured numerical data and outperform traditional weather forecasting systems. However, these models lack language-based reasoning capabilities, limiting their utility in interactive scientific workflows. Large language models (LLMs) excel at understanding and generating text but cannot reason about high-dimensional meteorological datasets. We bridge this gap by building a novel agentic framework for weather science. Our framework includes a Python code-based environment for agents (ZephyrusWorld) to interact with weather data, featuring tools such as an interface to the WeatherBench 2 dataset, geoquerying for geographical masks from natural language, weather forecasting, and climate simulation capabilities. We design Zephyrus, a multi-turn LLM-based weather agent that iteratively analyzes weather datasets, observes results, and refines its approach through conversational feedback loops. We accompany the agent with a new benchmark, ZephyrusBench, with a scalable data generation pipeline that constructs diverse question-answer pairs across weather-related tasks, from basic lookups to advanced forecasting, extreme event detection, and counterfactual reasoning. Experiments on this benchmark demonstrate the strong performance of Zephyrus agents over text-only baselines, outperforming them by up to 35 percentage points in correctness. However, on harder tasks, Zephyrus performs similarly to text-only baselines, highlighting the challenging nature of our benchmark and suggesting promising directions for future work.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces an agentic framework for weather science comprising three components: ZephyrusWorld (a code-based environment with tools for dataset interaction, geoquerying, and forecasting), Zephyrus (a multi-turn LLM agent performing iterative analysis), and ZephyrusBench (a benchmark with scalable question-answer generation). It resides in the 'Multi-Scale Weather Reasoning and Report Generation' leaf under 'Agentic Weather Reasoning and Code-Based Analysis', which contains only three papers total. This represents a relatively sparse research direction within the broader taxonomy of 34 papers across 19 leaf nodes, suggesting the work targets an emerging rather than saturated area.

The taxonomy reveals neighboring branches focused on geospatial weather agents (integrating infrastructure and environmental context) and broader multimodal forecasting systems. The paper's emphasis on code execution and tool-based interaction distinguishes it from passive conversational interfaces (e.g., ChatClimate, VayuChat) and from multimodal visual interpretation systems that process satellite imagery. Its sibling papers—Hierarchical AI Meteorologist and Modular Weather Interpretation—share the multi-scale reasoning theme but differ in architectural choices. The taxonomy's scope notes clarify that this branch excludes single-scale forecasting and non-agentic interpretation, positioning the work at the intersection of language models and executable meteorological analysis.

Across the three contributions, 25 candidates were examined, and none was flagged as clearly refuting the work. For the agentic environment (ZephyrusWorld), 10 candidates were examined with no refutable overlap; for the multi-turn agent (Zephyrus), 5 candidates, again with none; and for the benchmark (ZephyrusBench), 10 candidates, also without refutation. Within the limited search scope (top-K semantic matches plus citation expansion), no prior work was found that provides a directly overlapping implementation of a code-based weather agent environment, a multi-turn reasoning framework, or an accompanying benchmark. All three contributions therefore appear novel relative to the examined candidate set, though the search was not exhaustive.

Given the sparse taxonomy leaf (three papers) and the absence of refuting candidates among 25 examined, the work appears to occupy a distinct position within agentic weather reasoning. The limited search scope means undiscovered prior work may exist, particularly in adjacent domains like general scientific agents or climate modeling tools. The analysis covers semantic proximity and citation networks but does not guarantee comprehensive coverage of all relevant meteorological AI systems or code-generation frameworks applied to atmospheric science.

Taxonomy

Core-task Taxonomy Papers: 34
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Papers: 0

Research Landscape Overview

Core task: Bridging language models with meteorological data for interactive weather reasoning. The field encompasses a diverse set of approaches that connect natural language capabilities with atmospheric science. At the highest level, the taxonomy distinguishes between branches focused on multimodal data interpretation (integrating satellite imagery, radar, and numerical outputs), forecasting and prediction pipelines that leverage language models for temporal reasoning, conversational systems designed for user-facing weather information delivery, and agentic frameworks that perform code-based analysis and multi-scale reasoning. Additional branches address knowledge integration from climate science, text-based event classification, application-driven systems for specific domains like agriculture or disaster response, and the underlying conversational AI architectures that enable these interactions. Works such as WeatherQA[9] and ClimateIQA[11] illustrate efforts to build question-answering benchmarks, while systems like ChatClimate[6] and VayuChat[27] exemplify conversational interfaces that translate technical meteorological content into accessible dialogue.

Within this landscape, a particularly active line of work centers on agentic weather reasoning and code-based analysis, where systems generate and execute code to process meteorological datasets and produce interpretable reports. Zephyrus[0] sits squarely in this branch, emphasizing multi-scale reasoning and report generation that spans local to synoptic phenomena. It shares thematic ground with Hierarchical AI Meteorologist[17], which similarly adopts a hierarchical approach to weather interpretation, and with Modular Weather Interpretation[25], which decomposes reasoning into modular components. In contrast, works like AirGPT[3] and Cllmate[5] focus more on integrating domain-specific knowledge bases or retrieval-augmented generation to ground language model outputs in authoritative climate data.

The central trade-off across these directions involves balancing end-to-end neural generation with explicit symbolic reasoning or code execution, and determining how much domain expertise to encode directly versus retrieving on demand. Zephyrus[0] leans toward the code-execution paradigm, enabling transparent, reproducible analysis at multiple spatial and temporal scales.

Claimed Contributions

ZephyrusWorld: agentic environment for weather science

The authors introduce a comprehensive execution environment that unifies weather science capabilities through Python APIs, including interfaces to WeatherBench 2 dataset, geoquerying functionality, state-of-the-art forecasting models, and physics-based simulators, enabling LLMs to interact programmatically with meteorological data.

10 retrieved papers
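
To make the idea of programmatic interaction with gridded weather data concrete, here is a minimal sketch of the kind of operation such an environment might expose: building a geographic mask and taking an area-weighted regional mean. Everything in it is an assumption for illustration: the 1-degree grid, the synthetic temperature field, and the helper names `bounding_box_mask` and `area_weighted_mean` are hypothetical and are not taken from ZephyrusWorld or WeatherBench 2.

```python
import numpy as np

# Hypothetical 1-degree global grid; not the actual WeatherBench 2 layout.
lats = np.arange(-90.0, 91.0, 1.0)   # 181 latitudes
lons = np.arange(0.0, 360.0, 1.0)    # 360 longitudes


def bounding_box_mask(lat_min, lat_max, lon_min, lon_max):
    """Boolean (lat, lon) mask for a box; a stand-in for a geoquery result."""
    lat_ok = (lats >= lat_min) & (lats <= lat_max)
    lon_ok = (lons >= lon_min) & (lons <= lon_max)
    return lat_ok[:, None] & lon_ok[None, :]


def area_weighted_mean(field, mask):
    """Mean of `field` over `mask`, weighting each cell by cos(latitude)."""
    weights = np.cos(np.deg2rad(lats))[:, None] * np.ones_like(field)
    weights = np.where(mask, weights, 0.0)
    return float((field * weights).sum() / weights.sum())


# Synthetic 2 m temperature in kelvin: warm at the equator, cold at the poles.
t2m = 250.0 + 50.0 * np.cos(np.deg2rad(lats))[:, None] * np.ones((1, lons.size))

tropics = bounding_box_mask(-10.0, 10.0, 0.0, 359.0)
arctic = bounding_box_mask(70.0, 90.0, 0.0, 359.0)
tropics_mean = area_weighted_mean(t2m, tropics)
arctic_mean = area_weighted_mean(t2m, arctic)
```

The cosine-latitude weighting reflects the fact that cells on a regular latitude-longitude grid shrink toward the poles; a real geoquery tool would presumably return irregular masks for named regions rather than boxes.
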
Zephyrus: multi-turn LLM-based weather agents

The authors design two LLM-based agent systems with different execution strategies: ZEPHYRUS-DIRECT generates complete solutions in one attempt, while ZEPHYRUS-REFLECTIVE implements a multi-turn workflow that alternates between code generation and execution phases with iterative refinement through conversational feedback loops.

5 retrieved papers
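
The contrast between a single-attempt strategy and a multi-turn generate-execute-refine loop can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: `scripted_model` is a hypothetical stand-in for an LLM call that returns a buggy draft first and a fix once it sees feedback, and `run_code`, `direct_agent`, and `reflective_agent` are names invented for this example.

```python
import contextlib
import io
import traceback


def run_code(code):
    """Execute code in a fresh namespace; return (ok, stdout or traceback)."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})
        return True, buffer.getvalue()
    except Exception:
        return False, traceback.format_exc()


def scripted_model(history):
    """Hypothetical LLM stand-in: buggy draft first, fixed code after feedback."""
    if len(history) == 1:
        return "temps = [281.2, 284.9, 279.4]\nprint(max(temp))"  # NameError
    return "temps = [281.2, 284.9, 279.4]\nprint(max(temps))"


def direct_agent(question, model):
    """Single attempt: generate once, execute once."""
    return run_code(model([question]))


def reflective_agent(question, model, max_turns=3):
    """Multi-turn: feed each failed execution back until the code succeeds."""
    history = [question]
    ok, observation = False, ""
    for _ in range(max_turns):
        code = model(history)
        ok, observation = run_code(code)
        if ok:
            break
        history.append(observation)  # the error message becomes feedback
    return ok, observation


question = "What is the maximum temperature in kelvin?"
direct_ok, direct_out = direct_agent(question, scripted_model)
reflect_ok, reflect_out = reflective_agent(question, scripted_model)
```

In this toy run the direct strategy stops at the first error, while the reflective loop recovers on its second turn. The sketch only feeds back failures; a fuller loop would also let the model inspect successful outputs before committing to an answer.
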
ZephyrusBench: weather reasoning benchmark with scalable data generation pipeline

The authors construct a comprehensive benchmark built on ERA5 reanalysis data with a scalable data generation pipeline that combines human-authored and semi-synthetic tasks spanning diverse weather-related problems, from basic lookups to advanced forecasting, extreme event detection, and counterfactual reasoning, accompanied by robust evaluation schemes.

10 retrieved papers
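
A template-based generation pipeline with automatic scoring might look roughly like the sketch below, in which ground-truth answers are computed directly from the data so the question set can scale with it. All specifics are assumptions made for illustration: the synthetic temperature array stands in for reanalysis data, and the question template, the 1% relative tolerance, and the function names are hypothetical rather than drawn from ZephyrusBench.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for reanalysis data: 10 days x 3 locations, in kelvin.
locations = ["A", "B", "C"]
temps = 280.0 + 10.0 * rng.random((10, len(locations)))


def make_questions(data, names):
    """Fill one template per location; ground truth is computed from the data."""
    items = []
    for j, name in enumerate(names):
        items.append({
            "question": f"What was the maximum temperature (K) at location {name}?",
            "answer": float(data[:, j].max()),
        })
    return items


def is_correct(predicted, truth, rel_tol=0.01):
    """Numeric answers count as correct within a relative tolerance."""
    return abs(predicted - truth) <= rel_tol * abs(truth)


def accuracy(predictions, items):
    """Fraction of predictions that match their ground-truth answers."""
    hits = [is_correct(p, item["answer"]) for p, item in zip(predictions, items)]
    return sum(hits) / len(items)


items = make_questions(temps, locations)
exact = [item["answer"] for item in items]
off_by_far = [item["answer"] + 50.0 for item in items]
```

Calling `accuracy(exact, items)` scores a perfect answer set and `accuracy(off_by_far, items)` scores one that misses every tolerance window. Harder task types such as extreme event detection or counterfactual reasoning would need different templates and scoring rules than this numeric check.
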
