FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction
Overview
Overall Novelty Assessment
The paper introduces FutureX, a dynamic live benchmark for evaluating LLM agents on future prediction tasks across diverse domains. It resides in the 'Benchmarking and Evaluation of Predictive Agents' leaf, which contains only two papers in the taxonomy (FutureX and REALM-Bench). This is a sparse research direction within a 50-paper taxonomy spanning 21 leaf nodes, suggesting that standardized evaluation frameworks for predictive agents remain underdeveloped relative to domain-specific forecasting applications, which fill crowded branches such as Financial Market Prediction (9 papers) and Social Behavior Simulation (7 papers).
The taxonomy reveals that most neighboring work builds predictive systems for specific domains rather than cross-domain evaluation infrastructure. Branches such as Domain-Specific Forecasting Applications (10 papers across healthcare, energy, education, and industrial contexts) and World Model Construction (7 papers) emphasize system design and application. The Benchmarking leaf's scope explicitly covers 'benchmarks or evaluation frameworks assessing agent performance on prediction or planning tasks,' distinguishing it from the application-focused branches. FutureX bridges this gap by providing evaluation infrastructure that complements domain-specific efforts such as FinArena and AgentsBench Legal, both mentioned in the taxonomy narrative.
Among the 29 candidates examined, the first contribution (the dynamic live benchmark) has one potentially refuting candidate out of 10 examined, while the automated-pipeline and model-evaluation contributions show no clear refutations across 10 and 9 candidates, respectively. The scarcity of refutations suggests that, among the top-30 semantic matches, most prior benchmarks rely on static datasets, cover narrow domains, or lack the real-time update mechanism that FutureX emphasizes. The automated pipeline and the comprehensive evaluation therefore appear more novel within this search scope, though the analysis does not claim exhaustive coverage of the benchmarking literature.
Based on the limited search scope of 29 candidates, the work appears to occupy a relatively underexplored niche at the intersection of live evaluation and cross-domain future prediction. The taxonomy structure confirms that evaluation infrastructure lags behind application development in this field. However, the analysis reflects top-K semantic similarity and does not guarantee discovery of all relevant benchmarking efforts, particularly those published in specialized venues or using different terminology.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce FutureX, a large-scale, continuously updated benchmark that evaluates LLM agents on future prediction tasks. It features daily updates, eliminates data contamination by focusing on future events, and covers 195 diverse websites across 11 domains including politics, economics, finance, sports, and technology.
The authors develop a fully automated evaluation pipeline that updates future-event questions daily, runs the various LLM agents on each event's start date, collects outcomes after resolution dates, and scores agent performance without manual intervention, ensuring timeliness and scalability (an illustrative sketch of such a loop follows the list of contributions below).
The authors conduct a comprehensive evaluation of 25 models spanning base LLMs, LLMs with search capabilities, and open-source and closed-source Deep Research agents. They assess performance across four difficulty tiers (basic, wide search, deep search, super agent) that test progressively more complex planning, reasoning, and searching skills.
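To make the pipeline contribution concrete, the following is a minimal sketch of the kind of daily collect-predict-resolve-score loop described above, assuming a simple in-memory question store. Every name in it (Question, daily_cycle, the constant-answer baseline agent) is a hypothetical illustration for this report, not FutureX's actual implementation or API.

    from dataclasses import dataclass, field
    from datetime import date

    # Illustrative only: class, function, and agent names below are assumptions
    # made for this sketch, not the FutureX codebase.

    @dataclass
    class Question:
        qid: str
        text: str
        event_start: date                    # agents must commit predictions on this date
        resolution_date: date                # ground truth becomes available after this date
        outcome: str | None = None
        predictions: dict[str, str] = field(default_factory=dict)

    def daily_cycle(today: date, questions: list[Question], agents: dict) -> None:
        """One pass of the collect -> predict -> resolve -> score loop."""
        for q in questions:
            # On the event start date, query every agent for its prediction.
            if q.event_start == today:
                for name, predict in agents.items():
                    q.predictions[name] = predict(q.text)
            # Once the resolution date has passed and the outcome has been recorded,
            # score each stored prediction (exact match used here for brevity).
            if today > q.resolution_date and q.outcome is not None:
                scores = {name: float(p == q.outcome) for name, p in q.predictions.items()}
                print(q.qid, scores)

    # Toy usage: one hard-coded question and a trivial constant-answer "agent".
    q = Question("q1", "Which team wins the 2026 final?", date(2026, 7, 1), date(2026, 7, 15))
    daily_cycle(date(2026, 7, 1), [q], {"baseline": lambda text: "Team A"})

In an actual deployment this loop would be driven by a scheduler and backed by scrapers for question collection and outcome resolution; the sketch only shows the control flow the contribution describes.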
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[36] REALM-Bench: A Real-World Planning Benchmark for LLMs and Multi-Agent Systems
Contribution Analysis
Detailed comparisons for each claimed contribution
FutureX: A dynamic live benchmark for LLM agents in future prediction
The authors introduce FutureX, a large-scale, continuously updated benchmark that evaluates LLM agents on future prediction tasks. It features daily updates, eliminates data contamination by focusing on future events, and covers 195 diverse websites across 11 domains including politics, economics, finance, sports, and technology.
[53] ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities
[51] Bench to the Future: A Pastcasting Benchmark for Forecasting Agents
[52] Yeah, Un, Oh: Continuous and Real-Time Backchannel Prediction with Fine-Tuning of Voice Activity Projection
[54] Multimodal Transformer Models for Turn-Taking Prediction: Effects on Conversational Dynamics of Human-Agent Interaction during Cooperative Gameplay
[55] Benchmarking Large Language Models Under Data Contamination: A Survey from Static to Dynamic Evaluation
[56] Real-Time Multimodal Turn-Taking Prediction to Enhance Cooperative Dialogue during Human-Agent Interaction
[57] CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency
[58] Real-Time Progress Prediction in Reasoning Language Models
[59] Revisiting Dynamic Evaluation: Online Adaptation for Large Language Models
[60] Dynamic Evaluation of Transformer Language Models
Automated pipeline for event collection, curation, and evaluation
The authors develop a fully automated evaluation pipeline that updates future-event questions daily, runs the various LLM agents on each event's start date, collects outcomes after resolution dates, and scores agent performance without manual intervention, ensuring timeliness and scalability.
[61] A Template-Based Approach for Question Answering over Knowledge Bases
[62] A Virtual Patient Dialogue System Based on Question-Answering on Clinical Records
[63] FinTextQA: A Dataset for Long-Form Financial Question Answering
[64] AutoPaperBench: An MLLM-Based Framework for Automatic Generation of Paper Understanding Evaluation Benchmarks
[65] Benchmarking Foundation Models with Language-Model-as-an-Examiner
[66] CUS-QA: Local-Knowledge-Oriented Open-Ended Question Answering Dataset
[67] Automatic Question Generation and Answer Assessment: A Survey
[68] LaMP-QA: A Benchmark for Personalized Long-Form Question Answering
[69] VideoVista: A Versatile Benchmark for Video Understanding and Reasoning
[70] Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA
Comprehensive evaluation of 25 LLM/agent models across four difficulty tiers
The authors conduct a comprehensive evaluation of 25 models spanning base LLMs, LLMs with search capabilities, and open-source and closed-source Deep Research agents. They assess performance across four difficulty tiers (basic, wide search, deep search, super agent) that test progressively more complex planning, reasoning, and searching skills.
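As a purely illustrative companion to this evaluation setup, the sketch below shows one way per-tier accuracy could be aggregated across models. The four tier names follow the paper, but the record format, exact-match scoring, and model names are assumptions made for the example, not the paper's reported protocol.

    from collections import defaultdict

    # Hypothetical aggregation helper; only the tier names come from the paper.
    TIERS = ("basic", "wide search", "deep search", "super agent")

    def accuracy_by_tier(results):
        """results: iterable of (model, tier, correct: bool) records."""
        totals = defaultdict(lambda: [0, 0])       # (model, tier) -> [num_correct, num_total]
        for model, tier, correct in results:
            assert tier in TIERS, f"unknown tier: {tier}"
            bucket = totals[(model, tier)]
            bucket[0] += int(correct)
            bucket[1] += 1
        return {key: bucket[0] / bucket[1] for key, bucket in totals.items()}

    # Toy usage with two placeholder models and two of the four tiers.
    demo = [("model-a", "basic", True), ("model-a", "deep search", False),
            ("agent-b", "basic", True), ("agent-b", "deep search", True)]
    print(accuracy_by_tier(demo))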