USTBench: Benchmarking and Dissecting Spatiotemporal Reasoning Capabilities of LLMs as Urban Agents

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: large language models, spatiotemporal reasoning, urban science
Abstract:

Large language models (LLMs) have shown emerging potential in spatiotemporal reasoning, making them promising candidates for building urban agents that support diverse downstream urban applications. Despite these benefits, existing studies primarily evaluate urban LLM agents on outcome-level metrics (e.g., prediction accuracy, traffic efficiency), offering limited insight into their underlying reasoning processes. As a result, the strengths and limitations of urban LLM agents in spatiotemporal reasoning remain poorly understood. To this end, we introduce USTBench, the first benchmark to evaluate LLMs’ spatiotemporal reasoning abilities as urban agents across four decomposed dimensions: spatiotemporal understanding, forecasting, planning, and reflection. Specifically, USTBench supports five diverse urban decision-making tasks and four spatiotemporal prediction tasks, all running within our constructed interactive city environment, UAgentEnv. The benchmark includes 62,466 structured QA pairs for process-based evaluation and standardized end-to-end task assessments, enabling fine-grained diagnostics and broad task-level comparison across diverse urban scenarios. Through extensive evaluation of fourteen leading LLMs, we reveal that although LLMs show promising potential across various urban downstream tasks, they still struggle with long-horizon planning and reflective adaptation in dynamic urban contexts. Notably, recent advanced reasoning models (e.g., DeepSeek-R1) trained on general logic or mathematical problems do not consistently outperform non-reasoning LLMs. This discrepancy highlights the need for domain-specialized adaptation methods to enhance urban spatiotemporal reasoning. Overall, USTBench provides a foundation for building more adaptive and effective LLM-based urban agents and broader smart-city applications. Our project is available at https://anonymous.4open.science/r/USTBench.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces USTBench, a benchmark evaluating LLM spatiotemporal reasoning as urban agents across four dimensions: understanding, forecasting, planning, and reflection. It resides in the 'Comprehensive Urban Agent Benchmarks' leaf, which contains only two papers, including this one. This sparse population suggests the research direction—systematic multi-dimensional evaluation of urban LLM agents—is relatively nascent. The taxonomy shows the broader 'Benchmarking and Evaluation Frameworks' branch contains just five papers in total, indicating that rigorous evaluation methodologies for urban LLM agents remain an emerging area compared to application-driven systems.

The taxonomy reveals neighboring work in 'Task-Specific Evaluation Benchmarks' (three papers focused on POI QA or traffic monitoring) and 'Urban-Specialized LLM Architectures' (seven papers developing spatiotemporal-enhanced models). USTBench diverges from task-specific benchmarks by pursuing comprehensive multi-task evaluation rather than depth in single domains. It connects to application branches like 'Urban Mobility and Behavior Simulation' and 'Spatiotemporal Prediction and Forecasting' by providing evaluation infrastructure for capabilities these systems require. The taxonomy's scope notes clarify that comprehensive benchmarks explicitly exclude single-task focus, positioning USTBench as infrastructure for broad capability assessment.

Among thirty candidates examined, none clearly refutes the three core contributions. For the USTBench benchmark contribution, ten candidates were examined with zero refutable overlaps; the UAgentEnv interactive environment and the process-based evaluation methodology likewise showed no clear prior work among ten candidates each. Within this limited search scope, the specific combination—comprehensive multi-dimensional urban agent evaluation with interactive environments and process-based assessment—appears distinctive. However, the analysis covers only the top thirty semantic matches, not exhaustive field coverage, leaving open whether related evaluation frameworks exist beyond this search radius.

Given the sparse taxonomy leaf and absence of refuting candidates in the limited search, the work appears to occupy relatively uncontested ground within the examined scope. The combination of comprehensive task coverage, interactive environments, and process-level diagnostics distinguishes it from neighboring task-specific benchmarks. Limitations include the restricted search scale and the possibility that related evaluation methodologies exist in adjacent research communities not captured by semantic search over this candidate set.

Taxonomy

Core-task taxonomy papers: 42
Claimed contributions: 3
Contribution candidate papers compared: 30
Refutable papers: 0

Research Landscape Overview

Core task: Evaluating the spatiotemporal reasoning capabilities of large language models as urban agents. The field structure reflects a maturing ecosystem in which researchers pursue complementary directions. Benchmarking and Evaluation Frameworks establish systematic ways to measure LLM performance on urban tasks, often through comprehensive testbeds that probe spatial understanding, temporal dynamics, and agent-like decision-making. Urban-Specialized LLM Architectures and Training focuses on developing models tailored to urban contexts, incorporating domain knowledge through specialized pre-training, fine-tuning on urban corpora, or integrating structured knowledge graphs like those in Urbankgent[5]. Application-Driven Urban LLM Systems emphasizes deploying LLMs for concrete urban challenges—traffic management, urban planning, mobility prediction—where works such as Citygpt[2] and Urbangpt[1] demonstrate practical utility. Foundational Research and Conceptual Frameworks explores theoretical underpinnings, examining how LLMs capture urban science concepts and proposing hybrid intelligence paradigms that combine human expertise with machine reasoning.

Particularly active lines of work reveal tensions between generalist evaluation and specialized application. Comprehensive benchmarks like USTBench[0] and its predecessor USTBench[7] aim to systematically assess spatiotemporal reasoning across diverse urban scenarios, providing standardized metrics for comparing model capabilities. These evaluation-focused efforts contrast with application-driven systems that prioritize end-task performance in specific domains like traffic forecasting or urban planning. USTBench[0] sits squarely within the benchmarking branch, emphasizing rigorous assessment of core reasoning abilities rather than optimizing for particular applications. Compared to neighboring comprehensive benchmarks, it likely shares the goal of broad coverage across spatial and temporal dimensions while potentially differing in task granularity, dataset scale, or the specific reasoning competencies tested.

Open questions persist around whether general-purpose LLMs can match specialized urban models, how to balance benchmark comprehensiveness with practical relevance, and whether spatiotemporal reasoning transfers across different urban contexts.

Claimed Contributions

USTBench benchmark for evaluating LLM spatiotemporal reasoning as urban agents

The authors introduce USTBench, a novel benchmark that systematically evaluates large language models' spatiotemporal reasoning capabilities in urban contexts. It decomposes reasoning into four dimensions (understanding, forecasting, planning, reflection) and includes 62,466 structured QA pairs for process-based evaluation alongside standardized end-to-end task assessments across nine urban tasks.

10 retrieved papers
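The four-dimension decomposition and structured QA format described above can be sketched as a minimal data record. All field names and example values below are illustrative assumptions for this report, not USTBench's actual schema.

```python
from dataclasses import dataclass

# Hypothetical schema for one process-level QA record. Field names are
# illustrative assumptions; the real benchmark format is not given here.
@dataclass
class STReasoningQA:
    dimension: str       # one of the four decomposed reasoning dimensions
    task: str            # which of the nine urban tasks produced this item
    question: str
    choices: list[str]   # candidate answers for multiple-choice scoring
    answer: str          # gold answer

DIMENSIONS = ("understanding", "forecasting", "planning", "reflection")

qa = STReasoningQA(
    dimension="forecasting",
    task="traffic-flow-prediction",
    question="Given flows for the last 6 intervals, what is the next value?",
    choices=["120", "135", "150", "165"],
    answer="135",
)
assert qa.dimension in DIMENSIONS and qa.answer in qa.choices
```

Keeping the reasoning dimension as an explicit field is what would let an evaluator aggregate accuracy per dimension rather than only overall.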
UAgentEnv interactive city environment

The authors develop UAgentEnv, an interactive urban environment that supports five diverse urban decision-making tasks and four spatiotemporal prediction tasks. This environment enables agents to perceive, interact with, and respond to dynamic urban contexts, facilitating both benchmark dataset collection and uniform downstream task evaluation.

10 retrieved papers
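The perceive-act loop this description attributes to UAgentEnv can be sketched as a generic agent/environment interface. The class, method names, and the integer congestion signal below are hypothetical stand-ins, not UAgentEnv's real API.

```python
# Minimal sketch of an interactive urban environment loop, under the
# assumption of a simple observe/step interface (names are hypothetical).
class UrbanEnvSketch:
    def __init__(self, horizon: int):
        self.horizon = horizon  # number of decision steps in one episode
        self.t = 0

    def observe(self) -> dict:
        # Return a snapshot of the (toy) urban state at step t;
        # congestion is a made-up integer signal that rises over time.
        return {"step": self.t, "congestion": 50 + self.t}

    def step(self, action: str) -> tuple[dict, bool]:
        # Apply the agent's decision and advance the environment clock.
        self.t += 1
        return self.observe(), self.t >= self.horizon

def run_episode(env: UrbanEnvSketch, policy) -> list[str]:
    obs, done, trace = env.observe(), False, []
    while not done:
        action = policy(obs)   # in practice this would be an LLM call
        trace.append(action)
        obs, done = env.step(action)
    return trace

actions = run_episode(
    UrbanEnvSketch(horizon=3),
    policy=lambda obs: "hold" if obs["congestion"] < 52 else "reroute",
)
# → ["hold", "hold", "reroute"]
```

An interface of this shape would support both uses the report mentions: logging (observation, action) pairs for dataset collection, and scoring full episodes for uniform downstream evaluation.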
Process-based evaluation methodology for urban spatiotemporal reasoning

The authors propose a dual-level evaluation framework that moves beyond outcome-based metrics by providing process-based diagnostics of intermediate reasoning steps. This methodology enables fine-grained analysis of where reasoning succeeds or fails across the understand-forecast-plan-reflect loop, combined with standardized end-to-end performance assessment.

10 retrieved papers
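The dual-level idea, process-level accuracy per reasoning dimension alongside a separate end-to-end outcome metric, can be sketched as follows. The record format and numbers are illustrative assumptions, not values from the paper.

```python
from collections import defaultdict

# Hedged sketch of process-level scoring: accuracy per reasoning
# dimension, computed from (dimension, predicted, gold) triples.
def process_level_scores(records):
    hits, totals = defaultdict(int), defaultdict(int)
    for dim, pred, gold in records:
        totals[dim] += 1
        hits[dim] += int(pred == gold)
    return {dim: hits[dim] / totals[dim] for dim in totals}

# Made-up example records spanning the understand-forecast-plan-reflect loop.
records = [
    ("understanding", "B", "B"),
    ("understanding", "A", "C"),
    ("forecasting", "135", "135"),
    ("planning", "route-2", "route-1"),
    ("reflection", "revise", "revise"),
]
per_dim = process_level_scores(records)
# per_dim pinpoints where reasoning fails (here: planning), while an
# end-to-end outcome metric (e.g., traffic efficiency) is reported separately.
```

Splitting the two levels this way is what allows the diagnosis described above: an agent can score well end-to-end while still failing a specific step of the reasoning loop, or vice versa.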

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

USTBench benchmark for evaluating LLM spatiotemporal reasoning as urban agents


Contribution

UAgentEnv interactive city environment


Contribution

Process-based evaluation methodology for urban spatiotemporal reasoning