Zephyrus: An Agentic Framework for Weather Science

ICLR 2026 Conference Submission

Anonymous Authors

OpenReview Score: 7.0

Keywords: Agents, Large Language Models, Weather Science, Code Generation

Abstract:

Foundation models for weather science are pre-trained on vast amounts of structured numerical data and outperform traditional weather forecasting systems. However, these models lack language-based reasoning capabilities, limiting their utility in interactive scientific workflows. Large language models (LLMs) excel at understanding and generating text but cannot reason about high-dimensional meteorological datasets. We bridge this gap by building a novel agentic framework for weather science. Our framework includes a Python code-based environment for agents (ZephyrusWorld) to interact with weather data, featuring tools like an interface to the WeatherBench 2 dataset, geoquerying for geographical masks from natural language, weather forecasting, and climate simulation capabilities. We design Zephyrus, a multi-turn LLM-based weather agent that iteratively analyzes weather datasets, observes results, and refines its approach through conversational feedback loops. We accompany the agent with a new benchmark, ZephyrusBench, with a scalable data generation pipeline that constructs diverse question-answer pairs across weather-related tasks, from basic lookups to advanced forecasting, extreme event detection, and counterfactual reasoning. Experiments on this benchmark demonstrate the strong performance of Zephyrus agents over text-only baselines, outperforming them by up to 35 percentage points in correctness. However, on harder tasks, Zephyrus performs similarly to text-only baselines, highlighting the challenging nature of our benchmark and suggesting promising directions for future work.

Disclaimer

This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.

NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.

If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces an agentic framework for weather science comprising three components: ZephyrusWorld (a code-based environment with tools for dataset interaction, geoquerying, and forecasting), Zephyrus (a multi-turn LLM agent performing iterative analysis), and ZephyrusBench (a benchmark with scalable question-answer generation). It resides in the 'Multi-Scale Weather Reasoning and Report Generation' leaf under 'Agentic Weather Reasoning and Code-Based Analysis', which contains only three papers total. This represents a relatively sparse research direction within the broader taxonomy of 34 papers across 19 leaf nodes, suggesting the work targets an emerging rather than saturated area.

The taxonomy reveals neighboring branches focused on geospatial weather agents (integrating infrastructure and environmental context) and broader multimodal forecasting systems. The paper's emphasis on code execution and tool-based interaction distinguishes it from passive conversational interfaces (e.g., ChatClimate, VayuChat) and from multimodal visual interpretation systems that process satellite imagery. Its sibling papers—Hierarchical AI Meteorologist and Modular Weather Interpretation—share the multi-scale reasoning theme but differ in architectural choices. The taxonomy's scope notes clarify that this branch excludes single-scale forecasting and non-agentic interpretation, positioning the work at the intersection of language models and executable meteorological analysis.

Among 25 candidates examined across three contributions, none were flagged as clearly refuting the work. For the agentic environment (ZephyrusWorld), 10 candidates were examined with zero refutable overlaps; for the multi-turn agent (Zephyrus), 5 candidates were examined, also with none; and for the benchmark (ZephyrusBench), 10 candidates were examined, again without refutation. This suggests that within the limited search scope (top-K semantic matches plus citation expansion), no prior work provides directly overlapping implementations of a code-based weather agent environment, multi-turn reasoning framework, and accompanying benchmark. All three contributions therefore appear novel relative to the examined candidate set, though the search was not exhaustive.

Given the sparse taxonomy leaf (three papers) and the absence of refuting candidates among 25 examined, the work appears to occupy a distinct position within agentic weather reasoning. The limited search scope means undiscovered prior work may exist, particularly in adjacent domains like general scientific agents or climate modeling tools. The analysis covers semantic proximity and citation networks but does not guarantee comprehensive coverage of all relevant meteorological AI systems or code-generation frameworks applied to atmospheric science.

Taxonomy

Research Landscape Overview

Core task: Bridging language models with meteorological data for interactive weather reasoning. The field encompasses a diverse set of approaches that connect natural language capabilities with atmospheric science. At the highest level, the taxonomy distinguishes between branches focused on multimodal data interpretation (integrating satellite imagery, radar, and numerical outputs), forecasting and prediction pipelines that leverage language models for temporal reasoning, conversational systems designed for user-facing weather information delivery, and agentic frameworks that perform code-based analysis and multi-scale reasoning. Additional branches address knowledge integration from climate science, text-based event classification, application-driven systems for specific domains like agriculture or disaster response, and the underlying conversational AI architectures that enable these interactions. Works such as WeatherQA[9] and ClimateIQA[11] illustrate efforts to build question-answering benchmarks, while systems like ChatClimate[6] and VayuChat[27] exemplify conversational interfaces that translate technical meteorological content into accessible dialogue.

Within this landscape, a particularly active line of work centers on agentic weather reasoning and code-based analysis, where systems generate and execute code to process meteorological datasets and produce interpretable reports. Zephyrus[0] sits squarely in this branch, emphasizing multi-scale reasoning and report generation that spans local to synoptic phenomena. It shares thematic ground with Hierarchical AI Meteorologist[17], which similarly adopts a hierarchical approach to weather interpretation, and with Modular Weather Interpretation[25], which decomposes reasoning into modular components. In contrast, works like AirGPT[3] and Cllmate[5] focus more on integrating domain-specific knowledge bases or retrieval-augmented generation to ground language model outputs in authoritative climate data.

The central trade-off across these directions involves balancing end-to-end neural generation with explicit symbolic reasoning or code execution, and determining how much domain expertise to encode directly versus retrieving on demand. Zephyrus[0] leans toward the code-execution paradigm, enabling transparent, reproducible analysis at multiple spatial and temporal scales.

Claimed Contributions

ZephyrusWorld agentic environment for weather science

10 retrieved papers

The authors introduce a comprehensive execution environment that unifies weather science capabilities through Python APIs, including an interface to the WeatherBench 2 dataset, geoquerying functionality, state-of-the-art forecasting models, and physics-based simulators, enabling LLMs to interact programmatically with meteorological data.
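As a rough illustration of what programmatic interaction with gridded weather data might look like, the sketch below resolves a region name to a boolean mask over a small latitude/longitude grid and averages a field over it. Everything in it is hypothetical: the `REGION_BOXES` table, the functions `region_mask` and `masked_mean`, and the toy grid are assumptions made for illustration, not the ZephyrusWorld API, which this report does not describe in detail.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (lat_min, lat_max, lon_min, lon_max)

# Hypothetical lookup table with rough, illustrative boxes (not real boundaries).
REGION_BOXES: Dict[str, Box] = {
    "tropics": (-23.5, 23.5, 0.0, 360.0),
    "northern_box": (30.0, 60.0, 0.0, 90.0),
}


def region_mask(name: str, lats: List[float], lons: List[float]) -> List[List[bool]]:
    """Return a lat-by-lon boolean mask that is True inside the named box."""
    lat_min, lat_max, lon_min, lon_max = REGION_BOXES[name]
    return [
        [lat_min <= lat <= lat_max and lon_min <= lon <= lon_max for lon in lons]
        for lat in lats
    ]


def masked_mean(field: List[List[float]], mask: List[List[bool]]) -> float:
    """Average the field over cells where the mask is True (no area weighting)."""
    values = [
        value
        for field_row, mask_row in zip(field, mask)
        for value, keep in zip(field_row, mask_row)
        if keep
    ]
    if not values:
        raise ValueError("mask selects no grid cells")
    return sum(values) / len(values)


# Toy 5x4 grid; each row holds a constant, made-up temperature in kelvin.
lats = [-60.0, -30.0, 0.0, 30.0, 60.0]
lons = [0.0, 90.0, 180.0, 270.0]
field = [[280.0 + i for _ in lons] for i in range(len(lats))]

mask = region_mask("northern_box", lats, lons)
mean_temp = masked_mean(field, mask)
```

A real geoquerying tool would presumably map free-form place names to polygon masks and weight grid cells by area; the bounding-box lookup and unweighted mean here are simplifications chosen to keep the sketch self-contained.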

Zephyrus multi-turn LLM-based weather agents

5 retrieved papers

The authors design two LLM-based agent systems with different execution strategies: ZEPHYRUS-DIRECT generates complete solutions in one attempt, while ZEPHYRUS-REFLECTIVE implements a multi-turn workflow that alternates between code generation and execution phases with iterative refinement through conversational feedback loops.
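The difference between a one-shot strategy and a generate-execute-refine loop can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the authors' implementation: `toy_model` is a hypothetical stand-in for an LLM call, `run_snippet` and `reflective_loop` are names invented here, and `exec` is used without any sandboxing, which a real system would need.

```python
import contextlib
import io
import traceback
from typing import Callable, List, Tuple

# Hypothetical interface: a "model" maps the conversation so far to a code snippet.
Model = Callable[[List[str]], str]


def run_snippet(code: str, env: dict) -> Tuple[bool, str]:
    """Execute code in env and return (succeeded, captured stdout or error text)."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, env)  # illustration only: no sandboxing
        return True, buffer.getvalue().strip()
    except Exception:
        return False, traceback.format_exc(limit=1).strip()


def reflective_loop(question: str, model: Model, max_turns: int = 3) -> str:
    """Alternate code generation and execution until a snippet runs cleanly."""
    history = [f"QUESTION: {question}"]
    env: dict = {}
    for _ in range(max_turns):
        code = model(history)
        ok, observation = run_snippet(code, env)
        # Feed both the code and what happened back into the conversation.
        history.append(f"CODE: {code}")
        history.append(f"OBSERVATION: {observation}")
        if ok:
            return observation
    return "no answer within turn budget"


def toy_model(history: List[str]) -> str:
    """Stand-in for an LLM: the first attempt is buggy, the retry is fixed."""
    if any(entry.startswith("OBSERVATION:") for entry in history):
        return "temps = [280.0, 282.0, 284.0]\nprint(sum(temps) / len(temps))"
    return "print(sum(temps) / len(temps))"  # fails: temps is undefined


answer = reflective_loop("What is the mean temperature?", toy_model)
one_shot = reflective_loop("What is the mean temperature?", toy_model, max_turns=1)
```

With the default turn budget the loop recovers from the initial `NameError` and returns the printed mean, while `max_turns=1` approximates a single-attempt strategy and fails on the same toy model.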

ZephyrusBench weather reasoning benchmark with scalable data generation pipeline

10 retrieved papers

The authors construct a comprehensive benchmark built on ERA5 reanalysis data with a scalable data generation pipeline that combines human-authored and semi-synthetic tasks spanning diverse weather-related problems, from basic lookups to advanced forecasting, extreme event detection, and counterfactual reasoning, accompanied by robust evaluation schemes.

Core Task Comparisons

Comparisons with papers in the same taxonomy category

[17] Hierarchical AI-Meteorologist: LLM-Agent System for Multi-Scale and Explainable Weather Forecast Reporting

Daniil Sukhorukov, Andrei Zakharov, Nikita Glazkov, Katsiaryna Yanchanka, Vladimir Kirilin, Maxim Dubovitsky, Roman Sultimov, Yuri Maksimov, Ilya Makarov (2025)

[25] A Modular LLM-Agent System for Transparent Multi-Parameter Weather Interpretation

Daniil Sukhorukov, Andrei Zakharov, Nikita Glazkov, Katsiaryna Yanchanka, Vladimir Kirilin, Maxim Dubovitsky, Roman Sultimov, Yuri Maksimov, Ilya Makarov (2025)

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: ZephyrusWorld agentic environment for weather science

[40] MetPy: A meteorological Python library for data analysis and visualization (Cannot Refute)

[41] Wind Energy Plugins for Weather Prediction Models (Cannot Refute)

[42] MAchinE Learning for Scalable meTeoROlogy and climate (Cannot Refute)

[43] Data Analytics and Machine Learning in Agro-Meteorology (Cannot Refute)

[44] The Weather On-Demand Framework (Cannot Refute)

[45] IoT-driven real-time weather measurement and forecasting mobile application with machine learning integration (Cannot Refute)

[46] Weather forecasting using application programming interface (Cannot Refute)

[47] Time series forecasting in python (Cannot Refute)

[48] WB-CPI: Weather based crop prediction in India using big data analytics (Cannot Refute)

[49] Development of Weather Forecast Application Using API (Cannot Refute)

Contribution: Zephyrus multi-turn LLM-based weather agents

[35] From powerpoint ui sketches to web-based applications: Pattern-driven code generation for gis dashboard development using knowledge-augmented llms, context … (Cannot Refute)

[36] GeoCogent: an LLM-based agent for geospatial code generation (Cannot Refute)

[37] An llm agent for automatic geospatial data analysis (Cannot Refute)

[38] LLM-Agents Driven Automated Simulation Testing and Analysis of small Uncrewed Aerial Systems (Cannot Refute)

[39] CLIMATEAGENT: Multi-Agent Orchestration for Complex Climate Data Science Workflows (Cannot Refute)

Contribution: ZephyrusBench weather reasoning benchmark with scalable data generation pipeline

[7] ClimaEmpact: Domain-Aligned Small Language Models and Datasets for Extreme Weather Analytics (Cannot Refute)

[50] WeatherBench 2: A benchmark for the next generation of data-driven global weather models (Cannot Refute)

[51] Cllmate: A multimodal benchmark for weather and climate events forecasting (Cannot Refute)

[52] WeatherBench: a benchmark data set for data-driven weather forecasting (Cannot Refute)

[53] Deep learning-based weather prediction: a survey (Cannot Refute)

[54] Utility of Graph Neural Networks in Short-to Medium-Range Weather Forecasting (Cannot Refute)

[55] Challenges and benchmark datasets for machine learning in the atmospheric sciences: Definition, status, and outlook (Cannot Refute)

[56] WxC-Bench: A Novel Dataset for Weather and Climate Downstream Tasks (Cannot Refute)

[57] SuryaBench: Benchmark Dataset for Advancing Machine Learning in Heliophysics and Space Weather Prediction (Cannot Refute)

[58] A Benchmark for AI-based Weather Data Assimilation (Cannot Refute)
