FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Benchmark, Future Prediction, Agent
Abstract:

Future prediction is a complex task for LLM agents, requiring a high level of analytical thinking, information gathering, contextual understanding, and decision-making under uncertainty. Agents must not only gather and interpret vast amounts of dynamic information but also integrate diverse data sources, weigh uncertainties, and adapt predictions based on emerging trends, just as human experts do in fields like politics, economics, and finance. Despite its importance, no large-scale benchmark exists for evaluating agents on future prediction, largely due to challenges in handling real-time updates and retrieving timely, accurate answers. To address this, we introduce FutureX, a dynamic and live evaluation benchmark specifically designed for LLM agents performing future prediction tasks. FutureX is the largest and most diverse live benchmark for future prediction, supporting real-time daily updates and eliminating data contamination through an automated pipeline for question gathering and answer collection. We evaluate 25 LLM/agent models, including those with reasoning, search capabilities, and integration of external tools such as the open-source Deep Research Agent and closed-source Deep Research models. This comprehensive evaluation assesses agents’ adaptive reasoning and performance in dynamic environments. Our goal is to establish a dynamic, contamination-free evaluation standard that drives the development of LLM agents capable of performing at the level of professional human analysts in complex reasoning and predictive thinking.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces FutureX, a dynamic live benchmark for evaluating LLM agents on future prediction tasks across diverse domains. It resides in the 'Benchmarking and Evaluation of Predictive Agents' leaf, which contains only two papers in the taxonomy (FutureX and Realm Bench). This is a relatively sparse research direction within a 50-paper taxonomy spanning 21 leaf nodes, suggesting that standardized evaluation frameworks for predictive agents remain underdeveloped compared to domain-specific forecasting applications, which occupy multiple crowded branches like Financial Market Prediction (9 papers) and Social Behavior Simulation (7 papers).

The taxonomy reveals that most neighboring work focuses on building predictive systems for specific domains rather than cross-domain evaluation infrastructure. Branches like Domain-Specific Forecasting Applications (10 papers across healthcare, energy, education, and industrial contexts) and World Model Construction (7 papers) emphasize system design and application. The Benchmarking leaf's scope explicitly covers 'benchmarks or evaluation frameworks assessing agent performance on prediction or planning tasks,' distinguishing it from application-focused branches. FutureX bridges this gap by providing evaluation infrastructure that complements domain-specific efforts like FinArena and AgentsBench Legal mentioned in the taxonomy narrative.

Among 29 candidates examined, the first contribution (dynamic live benchmark) shows one refutable candidate out of 10 examined, while the automated pipeline and model evaluation contributions show no clear refutations across 10 and 9 candidates respectively. The limited refutation suggests that among the top-30 semantic matches, most prior benchmarks either focus on static datasets, narrow domains, or lack the real-time update mechanism that FutureX emphasizes. The automated pipeline and comprehensive evaluation appear more novel within this search scope, though the analysis does not claim exhaustive coverage of all benchmarking literature.

Based on the limited search scope of 29 candidates, the work appears to occupy a relatively underexplored niche at the intersection of live evaluation and cross-domain future prediction. The taxonomy structure confirms that evaluation infrastructure lags behind application development in this field. However, the analysis reflects top-K semantic similarity and does not guarantee discovery of all relevant benchmarking efforts, particularly those published in specialized venues or using different terminology.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 1

Research Landscape Overview

Core task: future prediction by LLM agents in dynamic real-world environments. This emerging field spans a diverse set of application domains and methodological approaches, organized into thirteen major branches. Financial Market Prediction and Trading explores how agents forecast stock movements and execute trades, with works like StockAgent[1] and EconAgent[2] demonstrating domain-specific reasoning. Social and Human Behavior Simulation investigates how agents model collective dynamics and individual actions, as seen in Socioverse[4] and Social Opinions Prediction[5]. Domain-Specific Forecasting Applications addresses specialized prediction tasks ranging from load forecasting (Interactive Load Forecasting[6]) to pandemic modeling (Pandemic Forecasting LLM[21]). World Model Construction and State Prediction focuses on building internal representations of environment dynamics, with surveys like World Models Survey[3] and systems such as WorldGPT[11] and World Modelling Agents[13].

Autonomous Driving and Navigation, Time Series Analysis and Contextualization, and Proactive and Context-Aware Agent Systems each tackle distinct aspects of temporal reasoning and anticipatory behavior. Meanwhile, Agent Planning and Decision-Making Frameworks, Process Automation and Intelligent Assistance, and Foundation Models for Decision-Making[22] provide the architectural and algorithmic underpinnings. Explainability and Interpretability of Agent Behavior addresses transparency concerns, while Benchmarking and Evaluation of Predictive Agents establishes rigorous assessment protocols.

A central tension across these branches involves balancing domain-specific expertise with general-purpose reasoning capabilities, as highlighted by works like Prompting Not Enough[12], which question whether prompting alone suffices for complex prediction tasks. The field also grapples with how to validate agent predictions in open-ended environments where ground truth may be ambiguous or delayed.

FutureX[0] sits squarely within the Benchmarking and Evaluation branch, alongside Realm Bench[36], addressing the critical need for standardized evaluation frameworks that can assess predictive performance across diverse scenarios. While many works focus on building predictive systems for specific domains, FutureX[0] emphasizes the meta-level challenge of how to rigorously measure and compare agent forecasting abilities, complementing domain-focused efforts like FinArena[35] and AgentsBench Legal[34] by providing cross-domain evaluation infrastructure that can reveal whether predictive capabilities generalize beyond narrow task settings.

Claimed Contributions

FutureX: A dynamic live benchmark for LLM agents in future prediction

The authors introduce FutureX, a large-scale, continuously updated benchmark that evaluates LLM agents on future prediction tasks. It features daily updates, eliminates data contamination by focusing on future events, and covers 195 diverse websites across 11 domains including politics, economics, finance, sports, and technology.

10 retrieved papers · Can refute
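To make this first contribution concrete, the sketch below shows one way a single FutureX-style question could be represented; the domain count, website count, and future-resolution structure come from the description above, while the field names, types, and example values are illustrative assumptions rather than the authors' actual schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class FutureEvent:
    """Hypothetical FutureX-style benchmark item; field names are illustrative."""
    question: str              # e.g. "Will candidate X win the 2026 primary?"
    domain: str                # one of the 11 covered domains, e.g. "politics", "finance"
    source_url: str            # one of the ~195 monitored websites
    release_date: date         # when agents are asked to predict (event start date)
    resolution_date: date      # when the ground-truth outcome becomes known
    answer: str | None = None  # filled in only after the event resolves

    def is_contamination_free(self, model_cutoff: date) -> bool:
        # A question is contamination-free if its outcome is decided only
        # after the model's training cutoff, so the answer cannot appear
        # in any pre-training corpus.
        return self.resolution_date > model_cutoff
```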
Automated pipeline for event collection, curation, and evaluation

The authors develop a fully automated evaluation pipeline that daily updates future questions, runs various LLM agents on event start dates, collects outcomes after resolution dates, and evaluates agent performance without manual intervention, ensuring timeliness and scalability.

10 retrieved papers · No clear refutation
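The description above outlines a four-stage daily loop: gather new questions, query agents on event start dates, collect outcomes after resolution, and score. A minimal sketch of such a loop is given below; `question_store`, `agents`, and `scorer` are assumed interfaces used for illustration, not the authors' actual components.

```python
from datetime import date


def run_daily_pipeline(today: date, question_store, agents, scorer):
    """Hypothetical daily loop mirroring the described pipeline stages."""
    # 1. Question gathering: pull newly posted future events from monitored sites.
    for event in question_store.fetch_new_events(today):
        question_store.add(event)

    # 2. Prediction: on an event's start date, ask every agent for a forecast.
    for event in question_store.events_starting(today):
        for agent in agents:
            prediction = agent.predict(event.question)
            question_store.record_prediction(event, agent.name, prediction)

    # 3. Resolution: once the outcome date has passed, collect the ground truth.
    for event in question_store.events_resolving(today):
        event.answer = question_store.fetch_outcome(event)

    # 4. Evaluation: score each stored prediction against the resolved answer.
    return {
        agent.name: scorer.score(question_store.predictions_for(agent.name))
        for agent in agents
    }
```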
Comprehensive evaluation of 25 LLM/agent models across four difficulty tiers

The authors conduct a comprehensive evaluation of 25 models spanning base LLMs, LLMs with search capabilities, and open-source and closed-source Deep Research agents. They assess performance across four difficulty tiers (basic, wide search, deep search, super agent) that test progressively more complex planning, reasoning, and searching skills.

9 retrieved papers · No clear refutation
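Since results are reported across four difficulty tiers, a simple per-tier scoring routine illustrates how such results might be aggregated. The tier names come from the contribution description; the exact-match accuracy metric is an illustrative assumption.

```python
from collections import defaultdict

# Tier names taken from the contribution description above.
TIERS = ("basic", "wide_search", "deep_search", "super_agent")


def per_tier_accuracy(results):
    """results: iterable of (tier, predicted, actual) tuples for one model."""
    correct, total = defaultdict(int), defaultdict(int)
    for tier, predicted, actual in results:
        total[tier] += 1
        correct[tier] += int(predicted == actual)
    # Report accuracy only for tiers with at least one resolved question.
    return {t: correct[t] / total[t] for t in TIERS if total[t]}


# Example: a model that handles basic questions but fails a deep-search one.
print(per_tier_accuracy([
    ("basic", "yes", "yes"),
    ("basic", "no", "no"),
    ("deep_search", "yes", "no"),
]))  # {'basic': 1.0, 'deep_search': 0.0}
```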

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: FutureX: A dynamic live benchmark for LLM agents in future prediction

Contribution 2: Automated pipeline for event collection, curation, and evaluation

Contribution 3: Comprehensive evaluation of 25 LLM/agent models across four difficulty tiers

Each contribution is described in full under Claimed Contributions above.
