Pitfalls in Evaluating Language Model Forecasters

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: forecasting, evaluation, criticism, leakage, standards, LLMs, prediction, future, benchmarks
Abstract:

Large language models (LLMs) have recently been applied to forecasting tasks, with some works claiming these systems match or exceed human performance. In this paper, we argue that, as a community, we should be careful about such conclusions, because evaluating LLM forecasters presents unique challenges. We identify two broad categories of issues: (1) difficulty in trusting evaluation results due to many forms of temporal leakage, and (2) difficulty in extrapolating from evaluation performance to real-world forecasting. Through systematic analysis and concrete examples from prior work, we demonstrate how evaluation flaws can raise concerns about current and future performance claims. We argue that more rigorous evaluation methodologies are needed to confidently assess the forecasting abilities of LLMs.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper contributes a critical methodological analysis of LLM forecasting evaluation, identifying temporal leakage issues and extrapolation challenges that may undermine performance claims. It resides in the Real-World Event Forecasting leaf, which contains four papers in total, including this one. The three sibling papers focus on benchmark construction and comparative performance assessment (ForecastBench, Expert Forecasters Comparison, Forecasting Strategies), making this a moderately populated but focused research direction. The paper's emphasis on evaluation validity, rather than forecasting performance itself, represents a distinct angle within this cluster.

The taxonomy reveals that Real-World Event Forecasting sits alongside Time Series and Sequential Forecasting and Domain-Specific Forecasting Applications within the broader Forecasting and Prediction Tasks branch. Neighboring branches include Clinical and Healthcare Prediction and Biological and Scientific Prediction, which address specialized forecasting domains. The General LLM Evaluation Frameworks branch contains methodological work on evaluation metrics and LLM-as-Evaluator approaches, which share conceptual overlap with this paper's concerns about evaluation rigor. The paper bridges practical forecasting assessment with broader evaluation methodology critiques found in works like Unfair Evaluators and Model Written Evaluations.

Among the thirty candidates examined, all three contributions show evidence of prior-work overlap. For the temporal leakage contribution, ten candidates were examined, of which two appear to provide overlapping analysis; for the extrapolation challenges contribution, ten candidates were examined, with one refutable match; for the concrete demonstrations contribution, ten candidates were examined, with two potential overlaps. These statistics suggest that, within the limited search scope, concerns about evaluation methodology in LLM forecasting have received some prior attention, though the specific framing and systematic analysis may differ across works.

Based on the top-thirty semantic matches examined, the paper appears to occupy a methodologically critical position within an active forecasting evaluation area. The analysis does not cover the full breadth of evaluation methodology literature or all forecasting domains, and the refutable matches may address overlapping concerns from different angles rather than identical contributions. The limited search scope means additional relevant prior work may exist beyond the candidates examined.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 5

Research Landscape Overview

Core task: Evaluating language model forecasting capabilities. The field has evolved into a rich landscape spanning multiple dimensions. General LLM Evaluation Frameworks and Methodologies (e.g., LLM Evaluation Survey[1], Holistic Evaluation[4]) establish foundational principles for assessing model performance across diverse tasks. Forecasting and Prediction Tasks form a central branch, encompassing real-world event forecasting, time series analysis (Time Series LLMs[19]), and domain-specific applications such as climate prediction (Climate Forecasting LLMs[29]) and energy forecasting (EnergyGPT[28]). Clinical and Healthcare Prediction and Biological and Scientific Prediction represent specialized branches where models tackle medical outcomes (Clinical Prediction LLM[9], Perioperative Risk Prediction[11]) and biological phenomena (Evolutionary Protein Structure[8]). Human Behavior and Cognitive Modeling explores how LLMs capture decision-making patterns, while Domain-Specific LLM Applications span finance (Pixiu Finance[6]), code generation (Code LLM Evaluation[22]), and other verticals. Model Scaling, Training, and Optimization addresses technical foundations, and LLM Impact and Broader Perspectives examines societal implications.

Within forecasting tasks, a particularly active tension exists between general-purpose evaluation frameworks and domain-specific benchmarks. Works like ForecastBench[18] and Expert Forecasters Comparison[33] systematically assess LLM forecasting abilities on real-world events, while Forecasting Strategies[45] explores methodological variations in prompting and reasoning. Forecaster Evaluation Pitfalls[0] sits squarely within this real-world event forecasting cluster, closely aligned with ForecastBench[18] and Expert Forecasters Comparison[33] in examining how LLMs perform on concrete prediction tasks. However, where neighboring works often focus on benchmark construction or comparative performance, Forecaster Evaluation Pitfalls[0] emphasizes methodological rigor and potential evaluation biases, concerns echoed more broadly in works like Unfair Evaluators[12] and Model Written Evaluations[2]. This positions the original paper as bridging practical forecasting assessment with critical examination of evaluation validity.

Claimed Contributions

Identification of temporal leakage issues in LLM forecasting evaluation

The authors systematically identify and categorize various forms of temporal leakage that compromise the trustworthiness of LLM forecasting evaluations, including logical leakage from backtesting, unreliable date-restricted retrieval, and over-reliance on model cutoff dates (a toy audit of these failure modes is sketched below).

10 retrieved papers compared; verdict: Can Refute.
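To make these leakage modes concrete, here is a minimal sketch assuming a hypothetical evaluation harness; the `Question`/`Evidence` schema and the `audit_temporal_leakage` helper are our own illustrative names, not anything from the paper. It mechanically flags two of the failure modes described above: retrieved evidence dated after the forecast date, and questions that resolved before the model's training cutoff.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical audit helpers; the schema and function below are
# illustrative assumptions, not the paper's method.

@dataclass
class Evidence:
    url: str
    published: date  # publication date reported by the retriever

@dataclass
class Question:
    text: str
    forecast_date: date    # date at which the forecast is nominally made
    resolution_date: date  # date the outcome became known

def audit_temporal_leakage(q: Question, evidence: list[Evidence],
                           model_cutoff: date) -> list[str]:
    """Flag crude, mechanically detectable forms of temporal leakage."""
    flags = []
    # Unreliable date-restricted retrieval: documents published after the
    # forecast date may contain, or telegraph, the outcome itself.
    for ev in evidence:
        if ev.published > q.forecast_date:
            flags.append(f"post-forecast evidence: {ev.url}")
    # Over-reliance on cutoff dates: if the question resolved before the
    # model's training cutoff, the model may simply recall the outcome.
    if q.resolution_date <= model_cutoff:
        flags.append("question resolved before model training cutoff")
    return flags
```

Checks like these catch only the crude cases; logical leakage from backtesting, where the framing of a retroactively written question already encodes later information, resists this kind of mechanical detection.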
Analysis of extrapolation challenges from benchmarks to real-world forecasting

The authors analyze how good benchmark performance may not translate to real-world forecasting ability, identifying issues such as piggybacking on human forecasts, gaming benchmarks through strategic betting, and skewed data distributions that limit generalizability (see the toy example below).

10 retrieved papers compared; verdict: Can Refute.
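As a toy illustration of the skew concern (our own example, not the paper's), the snippet below shows why a low Brier score on a skewed question set is weak evidence of skill: a bettor that always predicts the base rate, using no information about individual questions, scores far better than a coin flip.

```python
import random

# Toy illustration: on a skewed benchmark where ~90% of binary questions
# resolve "no", an uninformed bettor that always predicts the base rate
# earns a low Brier score without any forecasting skill.

random.seed(0)
outcomes = [1 if random.random() < 0.10 else 0 for _ in range(10_000)]

def brier(preds, outs):
    return sum((p - o) ** 2 for p, o in zip(preds, outs)) / len(outs)

coin_flip = [0.5] * len(outcomes)   # maximally uninformative
base_rate = [0.10] * len(outcomes)  # uses only the label skew

print(f"coin flip: {brier(coin_flip, outcomes):.3f}")  # ~0.250
print(f"base rate: {brier(base_rate, outcomes):.3f}")  # ~0.090
```

The practical implication is that headline Brier scores on skewed benchmarks should be read against base-rate baselines rather than against an absolute scale.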
Concrete demonstrations of evaluation flaws in prior forecasting work

The authors provide empirical evidence and concrete examples from existing LLM forecasting papers to demonstrate how various evaluation flaws manifest in practice, showing that these issues affect real published benchmarks and may have led to overly optimistic assessments (a simple diagnostic in this spirit is sketched below).

10 retrieved papers compared; verdict: Can Refute.
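One simple diagnostic in this spirit, sketched below under an assumed record schema (the `memorization_gap` helper and field names are hypothetical, not the paper's code), is to split benchmark questions around the model's training cutoff and compare Brier scores: a markedly worse score after the cutoff suggests that pre-cutoff performance was propped up by memorized outcomes.

```python
from datetime import date
from statistics import mean

# Hypothetical diagnostic: compare a model's Brier score on questions
# that resolved before vs. after its training cutoff. The record schema
# is assumed, not taken from the paper.

def mean_brier(records):
    return mean((r["pred"] - r["outcome"]) ** 2 for r in records)

def memorization_gap(records, cutoff: date) -> float:
    """Positive gap = worse scores after the cutoff, hinting that
    pre-cutoff performance leaned on memorized outcomes."""
    before = [r for r in records if r["resolution_date"] <= cutoff]
    after = [r for r in records if r["resolution_date"] > cutoff]
    return mean_brier(after) - mean_brier(before)

# Minimal usage with made-up records:
records = [
    {"resolution_date": date(2023, 5, 1), "pred": 0.95, "outcome": 1},
    {"resolution_date": date(2023, 8, 1), "pred": 0.05, "outcome": 0},
    {"resolution_date": date(2024, 2, 1), "pred": 0.60, "outcome": 0},
    {"resolution_date": date(2024, 3, 1), "pred": 0.40, "outcome": 1},
]
print(memorization_gap(records, cutoff=date(2023, 12, 31)))  # positive gap
```

A near-zero gap does not certify an evaluation as clean, of course; it only rules out the most blatant cutoff-memorization effect.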

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Identification of temporal leakage issues in LLM forecasting evaluation

Contribution 2: Analysis of extrapolation challenges from benchmarks to real-world forecasting

Contribution 3: Concrete demonstrations of evaluation flaws in prior forecasting work

(Descriptions for each contribution appear under Claimed Contributions above.)