USTBench: Benchmarking and Dissecting Spatiotemporal Reasoning Capabilities of LLMs as Urban Agents

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: large language models, spatiotemporal reasoning, urban science
Abstract:

Large language models (LLMs) have shown emerging potential in spatiotemporal reasoning, making them promising candidates for building urban agents that support diverse downstream urban applications. Despite these benefits, existing studies primarily evaluate urban LLM agents on outcome-level metrics (e.g., prediction accuracy, traffic efficiency), offering limited insight into their underlying reasoning processes. As a result, the strengths and limitations of urban LLM agents in spatiotemporal reasoning remain poorly understood. To this end, we introduce USTBench, the first benchmark to evaluate LLMs’ spatiotemporal reasoning abilities as urban agents across four decomposed dimensions: spatiotemporal understanding, forecasting, planning, and reflection. Specifically, USTBench supports five diverse urban decision-making tasks and four spatiotemporal prediction tasks, all running within our constructed interactive city environment, UAgentEnv. The benchmark includes 62,466 structured QA pairs for process-based evaluation and standardized end-to-end task assessments, enabling fine-grained diagnostics and broad task-level comparison across diverse urban scenarios. Through extensive evaluation of fourteen leading LLMs, we reveal that although LLMs show promising potential across various urban downstream tasks, they still struggle with long-horizon planning and reflective adaptation in dynamic urban contexts. Notably, recent advanced reasoning models (e.g., DeepSeek-R1) trained on general logic or mathematical problems do not consistently outperform non-reasoning LLMs. This discrepancy highlights the need for domain-specialized adaptation methods to enhance urban spatiotemporal reasoning. Overall, USTBench provides a foundation for building more adaptive and effective LLM-based urban agents and broader smart-city applications. Our project is available at https://anonymous.4open.science/r/USTBench.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces USTBench, a benchmark evaluating LLM spatiotemporal reasoning as urban agents across four dimensions: understanding, forecasting, planning, and reflection. It resides in the 'Comprehensive Urban Agent Benchmarks' leaf, which contains only two papers, including this one. This sparse population suggests the research direction—systematic multi-dimensional evaluation of urban LLM agents—is relatively nascent. The taxonomy shows the broader 'Benchmarking and Evaluation Frameworks' branch contains just five papers in total, indicating that rigorous evaluation methodologies for urban LLM agents remain an emerging area compared to application-driven systems.

The taxonomy reveals neighboring work in 'Task-Specific Evaluation Benchmarks' (three papers focused on POI QA or traffic monitoring) and 'Urban-Specialized LLM Architectures' (seven papers developing spatiotemporal-enhanced models). USTBench diverges from task-specific benchmarks by pursuing comprehensive multi-task evaluation rather than depth in single domains. It connects to application branches like 'Urban Mobility and Behavior Simulation' and 'Spatiotemporal Prediction and Forecasting' by providing evaluation infrastructure for capabilities these systems require. The taxonomy's scope notes clarify that comprehensive benchmarks explicitly exclude single-task focus, positioning USTBench as infrastructure for broad capability assessment.

Among thirty candidates examined, none clearly refutes the three core contributions. For the USTBench benchmark contribution, ten candidates were examined with zero refutable overlaps; the UAgentEnv interactive environment and the process-based evaluation methodology likewise showed no clear prior work among ten candidates each. Within this limited search scope, the specific combination—comprehensive multi-dimensional urban agent evaluation with interactive environments and process-based assessment—appears distinctive. However, the analysis covers only the top thirty semantic matches, not exhaustive field coverage, leaving open whether related evaluation frameworks exist beyond this search radius.

Given the sparse taxonomy leaf and absence of refuting candidates in the limited search, the work appears to occupy relatively uncontested ground within the examined scope. The combination of comprehensive task coverage, interactive environments, and process-level diagnostics distinguishes it from neighboring task-specific benchmarks. Limitations include the restricted search scale and the possibility that related evaluation methodologies exist in adjacent research communities not captured by semantic search over this candidate set.

Taxonomy

Core-task taxonomy papers: 42
Claimed contributions: 3
Contribution candidate papers compared: 30
Refutable papers: 0

Research Landscape Overview

Core task: Evaluating the spatiotemporal reasoning capabilities of large language models as urban agents. The field structure reflects a maturing ecosystem in which researchers pursue complementary directions. Benchmarking and Evaluation Frameworks establish systematic ways to measure LLM performance on urban tasks, often through comprehensive testbeds that probe spatial understanding, temporal dynamics, and agent-like decision-making. Urban-Specialized LLM Architectures and Training focuses on developing models tailored to urban contexts, incorporating domain knowledge through specialized pre-training, fine-tuning on urban corpora, or integrating structured knowledge graphs like those in Urbankgent[5]. Application-Driven Urban LLM Systems emphasizes deploying LLMs for concrete urban challenges—traffic management, urban planning, mobility prediction—where works such as Citygpt[2] and Urbangpt[1] demonstrate practical utility. Foundational Research and Conceptual Frameworks explores theoretical underpinnings, examining how LLMs capture urban science concepts and proposing hybrid intelligence paradigms that combine human expertise with machine reasoning.

Particularly active lines of work reveal tensions between generalist evaluation and specialized application. Comprehensive benchmarks like USTBench[0] and its predecessor USTBench[7] aim to systematically assess spatiotemporal reasoning across diverse urban scenarios, providing standardized metrics for comparing model capabilities. These evaluation-focused efforts contrast with application-driven systems that prioritize end-task performance in specific domains like traffic forecasting or urban planning. USTBench[0] sits squarely within the benchmarking branch, emphasizing rigorous assessment of core reasoning abilities rather than optimizing for particular applications. Compared to neighboring comprehensive benchmarks, it likely shares the goal of broad coverage across spatial and temporal dimensions while potentially differing in task granularity, dataset scale, or the specific reasoning competencies tested.

Open questions persist around whether general-purpose LLMs can match specialized urban models, how to balance benchmark comprehensiveness with practical relevance, and whether spatiotemporal reasoning transfers across different urban contexts.

Claimed Contributions

USTBench benchmark for evaluating LLM spatiotemporal reasoning as urban agents

The authors introduce USTBench, a novel benchmark that systematically evaluates large language models' spatiotemporal reasoning capabilities in urban contexts. It decomposes reasoning into four dimensions (understanding, forecasting, planning, reflection) and includes 62,466 structured QA pairs for process-based evaluation alongside standardized end-to-end task assessments across nine urban tasks.

10 retrieved papers
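The four-dimension decomposition and structured QA format described above can be sketched as a minimal data record. All field names and example values below are illustrative assumptions for this report, not USTBench's actual schema.

```python
from dataclasses import dataclass

# Hypothetical schema for one process-level QA record. Field names are
# illustrative assumptions; the real benchmark format is not given here.
@dataclass
class STReasoningQA:
    dimension: str       # one of the four decomposed reasoning dimensions
    task: str            # which of the nine urban tasks produced this item
    question: str
    choices: list[str]   # candidate answers for multiple-choice scoring
    answer: str          # gold answer

DIMENSIONS = ("understanding", "forecasting", "planning", "reflection")

qa = STReasoningQA(
    dimension="forecasting",
    task="traffic-flow-prediction",
    question="Given flows for the last 6 intervals, what is the next value?",
    choices=["120", "135", "150", "165"],
    answer="135",
)
assert qa.dimension in DIMENSIONS and qa.answer in qa.choices
```

Keeping the reasoning dimension as an explicit field is what would let an evaluator aggregate accuracy per dimension rather than only overall.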
UAgentEnv interactive city environment

The authors develop UAgentEnv, an interactive urban environment that supports five diverse urban decision-making tasks and four spatiotemporal prediction tasks. This environment enables agents to perceive, interact with, and respond to dynamic urban contexts, facilitating both benchmark dataset collection and uniform downstream task evaluation.

10 retrieved papers
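The perceive-act loop this description attributes to UAgentEnv can be sketched as a generic agent/environment interface. The class, method names, and the integer congestion signal below are hypothetical stand-ins, not UAgentEnv's real API.

```python
# Minimal sketch of an interactive urban environment loop, under the
# assumption of a simple observe/step interface (names are hypothetical).
class UrbanEnvSketch:
    def __init__(self, horizon: int):
        self.horizon = horizon  # number of decision steps in one episode
        self.t = 0

    def observe(self) -> dict:
        # Return a snapshot of the (toy) urban state at step t;
        # congestion is a made-up integer signal that rises over time.
        return {"step": self.t, "congestion": 50 + self.t}

    def step(self, action: str) -> tuple[dict, bool]:
        # Apply the agent's decision and advance the environment clock.
        self.t += 1
        return self.observe(), self.t >= self.horizon

def run_episode(env: UrbanEnvSketch, policy) -> list[str]:
    obs, done, trace = env.observe(), False, []
    while not done:
        action = policy(obs)   # in practice this would be an LLM call
        trace.append(action)
        obs, done = env.step(action)
    return trace

actions = run_episode(
    UrbanEnvSketch(horizon=3),
    policy=lambda obs: "hold" if obs["congestion"] < 52 else "reroute",
)
# → ["hold", "hold", "reroute"]
```

An interface of this shape would support both uses the report mentions: logging (observation, action) pairs for dataset collection, and scoring full episodes for uniform downstream evaluation.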
Process-based evaluation methodology for urban spatiotemporal reasoning

The authors propose a dual-level evaluation framework that moves beyond outcome-based metrics by providing process-based diagnostics of intermediate reasoning steps. This methodology enables fine-grained analysis of where reasoning succeeds or fails across the understand-forecast-plan-reflect loop, combined with standardized end-to-end performance assessment.

10 retrieved papers
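The dual-level idea, process-level accuracy per reasoning dimension alongside a separate end-to-end outcome metric, can be sketched as follows. The record format and numbers are illustrative assumptions, not values from the paper.

```python
from collections import defaultdict

# Hedged sketch of process-level scoring: accuracy per reasoning
# dimension, computed from (dimension, predicted, gold) triples.
def process_level_scores(records):
    hits, totals = defaultdict(int), defaultdict(int)
    for dim, pred, gold in records:
        totals[dim] += 1
        hits[dim] += int(pred == gold)
    return {dim: hits[dim] / totals[dim] for dim in totals}

# Made-up example records spanning the understand-forecast-plan-reflect loop.
records = [
    ("understanding", "B", "B"),
    ("understanding", "A", "C"),
    ("forecasting", "135", "135"),
    ("planning", "route-2", "route-1"),
    ("reflection", "revise", "revise"),
]
per_dim = process_level_scores(records)
# per_dim pinpoints where reasoning fails (here: planning), while an
# end-to-end outcome metric (e.g., traffic efficiency) is reported separately.
```

Splitting the two levels this way is what allows the diagnosis described above: an agent can score well end-to-end while still failing a specific step of the reasoning loop, or vice versa.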

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

USTBench benchmark for evaluating LLM spatiotemporal reasoning as urban agents


Contribution

UAgentEnv interactive city environment


Contribution

Process-based evaluation methodology for urban spatiotemporal reasoning