Pitfalls in Evaluating Language Model Forecasters

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: forecasting, evaluation, criticism, leakage, standards, LLMs, prediction, future, benchmarks
Abstract:

Large language models (LLMs) have recently been applied to forecasting tasks, with some works claiming these systems match or exceed human performance. In this paper, we argue that, as a community, we should be careful about such conclusions, because evaluating LLM forecasters presents unique challenges. We identify two broad categories of issues: (1) difficulty in trusting evaluation results due to many forms of temporal leakage, and (2) difficulty in extrapolating from evaluation performance to real-world forecasting. Through systematic analysis and concrete examples from prior work, we demonstrate how evaluation flaws can raise concerns about current and future performance claims. We argue that more rigorous evaluation methodologies are needed to confidently assess the forecasting abilities of LLMs.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper contributes a critical methodological analysis of LLM forecasting evaluation, identifying temporal leakage issues and extrapolation challenges that may undermine performance claims. It resides in the Real-World Event Forecasting leaf, which contains four papers in total, including this one. The three sibling papers focus on benchmark construction and comparative performance assessment (ForecastBench, Expert Forecasters Comparison, Forecasting Strategies), making this a moderately populated but focused research direction. The paper's emphasis on evaluation validity, rather than forecasting performance itself, represents a distinct angle within this cluster.

The taxonomy reveals that Real-World Event Forecasting sits alongside Time Series and Sequential Forecasting and Domain-Specific Forecasting Applications within the broader Forecasting and Prediction Tasks branch. Neighboring branches include Clinical and Healthcare Prediction and Biological and Scientific Prediction, which address specialized forecasting domains. The General LLM Evaluation Frameworks branch contains methodological work on evaluation metrics and LLM-as-Evaluator approaches, which share conceptual overlap with this paper's concerns about evaluation rigor. The paper bridges practical forecasting assessment with broader evaluation methodology critiques found in works like Unfair Evaluators and Model Written Evaluations.

Among the thirty candidates examined, all three contributions show evidence of prior-work overlap. For the temporal leakage contribution, ten candidates were examined, of which two appear to provide overlapping analysis; for the extrapolation challenges contribution, ten candidates were examined, with one refutable match; for the concrete demonstrations contribution, ten candidates were examined, with two potential overlaps. These statistics suggest that, within the limited search scope, concerns about evaluation methodology in LLM forecasting have received some prior attention, though the specific framing and systematic analysis may differ across works.

Based on the top-thirty semantic matches examined, the paper appears to occupy a methodologically critical position within an active forecasting evaluation area. The analysis does not cover the full breadth of evaluation methodology literature or all forecasting domains, and the refutable matches may address overlapping concerns from different angles rather than identical contributions. The limited search scope means additional relevant prior work may exist beyond the candidates examined.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 5

Research Landscape Overview

Core task: Evaluating language model forecasting capabilities. The field has evolved into a rich landscape spanning multiple dimensions. General LLM Evaluation Frameworks and Methodologies (e.g., LLM Evaluation Survey[1], Holistic Evaluation[4]) establish foundational principles for assessing model performance across diverse tasks. Forecasting and Prediction Tasks form a central branch, encompassing real-world event forecasting, time series analysis (Time Series LLMs[19]), and domain-specific applications such as climate prediction (Climate Forecasting LLMs[29]) and energy forecasting (EnergyGPT[28]). Clinical and Healthcare Prediction and Biological and Scientific Prediction represent specialized branches where models tackle medical outcomes (Clinical Prediction LLM[9], Perioperative Risk Prediction[11]) and biological phenomena (Evolutionary Protein Structure[8]). Human Behavior and Cognitive Modeling explores how LLMs capture decision-making patterns, while Domain-Specific LLM Applications span finance (Pixiu Finance[6]), code generation (Code LLM Evaluation[22]), and other verticals. Model Scaling, Training, and Optimization addresses technical foundations, and LLM Impact and Broader Perspectives examines societal implications.

Within forecasting tasks, a particularly active tension exists between general-purpose evaluation frameworks and domain-specific benchmarks. Works like ForecastBench[18] and Expert Forecasters Comparison[33] systematically assess LLM forecasting abilities on real-world events, while Forecasting Strategies[45] explores methodological variations in prompting and reasoning. Forecaster Evaluation Pitfalls[0] sits squarely within this real-world event forecasting cluster, closely aligned with ForecastBench[18] and Expert Forecasters Comparison[33] in examining how LLMs perform on concrete prediction tasks. However, where neighboring works often focus on benchmark construction or comparative performance, Forecaster Evaluation Pitfalls[0] emphasizes methodological rigor and potential evaluation biases, concerns echoed more broadly in works like Unfair Evaluators[12] and Model Written Evaluations[2]. This positions the original paper as bridging practical forecasting assessment with critical examination of evaluation validity.

Claimed Contributions

Identification of temporal leakage issues in LLM forecasting evaluation

The authors systematically identify and categorize various forms of temporal leakage that compromise the trustworthiness of LLM forecasting evaluations, including logical leakage from backtesting, unreliable date-restricted retrieval, and over-reliance on model cutoff dates (a toy audit of these failure modes is sketched below).

10 retrieved papers compared; verdict: Can Refute.
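To make these leakage modes concrete, here is a minimal sketch assuming a hypothetical evaluation harness; the `Question`/`Evidence` schema and the `audit_temporal_leakage` helper are our own illustrative names, not anything from the paper. It mechanically flags two of the failure modes described above: retrieved evidence dated after the forecast date, and questions that resolved before the model's training cutoff.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical audit helpers; the schema and function below are
# illustrative assumptions, not the paper's method.

@dataclass
class Evidence:
    url: str
    published: date  # publication date reported by the retriever

@dataclass
class Question:
    text: str
    forecast_date: date    # date at which the forecast is nominally made
    resolution_date: date  # date the outcome became known

def audit_temporal_leakage(q: Question, evidence: list[Evidence],
                           model_cutoff: date) -> list[str]:
    """Flag crude, mechanically detectable forms of temporal leakage."""
    flags = []
    # Unreliable date-restricted retrieval: documents published after the
    # forecast date may contain, or telegraph, the outcome itself.
    for ev in evidence:
        if ev.published > q.forecast_date:
            flags.append(f"post-forecast evidence: {ev.url}")
    # Over-reliance on cutoff dates: if the question resolved before the
    # model's training cutoff, the model may simply recall the outcome.
    if q.resolution_date <= model_cutoff:
        flags.append("question resolved before model training cutoff")
    return flags
```

Checks like these catch only the crude cases; logical leakage from backtesting, where the framing of a retroactively written question already encodes later information, resists this kind of mechanical detection.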
Analysis of extrapolation challenges from benchmarks to real-world forecasting

The authors analyze how good benchmark performance may not translate to real-world forecasting ability, identifying issues such as piggybacking on human forecasts, gaming benchmarks through strategic betting, and skewed data distributions that limit generalizability (see the toy example below).

10 retrieved papers compared; verdict: Can Refute.
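As a toy illustration of the skew concern (our own example, not the paper's), the snippet below shows why a low Brier score on a skewed question set is weak evidence of skill: a bettor that always predicts the base rate, using no information about individual questions, scores far better than a coin flip.

```python
import random

# Toy illustration: on a skewed benchmark where ~90% of binary questions
# resolve "no", an uninformed bettor that always predicts the base rate
# earns a low Brier score without any forecasting skill.

random.seed(0)
outcomes = [1 if random.random() < 0.10 else 0 for _ in range(10_000)]

def brier(preds, outs):
    return sum((p - o) ** 2 for p, o in zip(preds, outs)) / len(outs)

coin_flip = [0.5] * len(outcomes)   # maximally uninformative
base_rate = [0.10] * len(outcomes)  # uses only the label skew

print(f"coin flip: {brier(coin_flip, outcomes):.3f}")  # ~0.250
print(f"base rate: {brier(base_rate, outcomes):.3f}")  # ~0.090
```

The practical implication is that headline Brier scores on skewed benchmarks should be read against base-rate baselines rather than against an absolute scale.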
Concrete demonstrations of evaluation flaws in prior forecasting work

The authors provide empirical evidence and concrete examples from existing LLM forecasting papers to demonstrate how various evaluation flaws manifest in practice, showing that these issues affect real published benchmarks and may have led to overly optimistic assessments (a simple diagnostic in this spirit is sketched below).

10 retrieved papers compared; verdict: Can Refute.
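One simple diagnostic in this spirit, sketched below under an assumed record schema (the `memorization_gap` helper and field names are hypothetical, not the paper's code), is to split benchmark questions around the model's training cutoff and compare Brier scores: a markedly worse score after the cutoff suggests that pre-cutoff performance was propped up by memorized outcomes.

```python
from datetime import date
from statistics import mean

# Hypothetical diagnostic: compare a model's Brier score on questions
# that resolved before vs. after its training cutoff. The record schema
# is assumed, not taken from the paper.

def mean_brier(records):
    return mean((r["pred"] - r["outcome"]) ** 2 for r in records)

def memorization_gap(records, cutoff: date) -> float:
    """Positive gap = worse scores after the cutoff, hinting that
    pre-cutoff performance leaned on memorized outcomes."""
    before = [r for r in records if r["resolution_date"] <= cutoff]
    after = [r for r in records if r["resolution_date"] > cutoff]
    return mean_brier(after) - mean_brier(before)

# Minimal usage with made-up records:
records = [
    {"resolution_date": date(2023, 5, 1), "pred": 0.95, "outcome": 1},
    {"resolution_date": date(2023, 8, 1), "pred": 0.05, "outcome": 0},
    {"resolution_date": date(2024, 2, 1), "pred": 0.60, "outcome": 0},
    {"resolution_date": date(2024, 3, 1), "pred": 0.40, "outcome": 1},
]
print(memorization_gap(records, cutoff=date(2023, 12, 31)))  # positive gap
```

A near-zero gap does not certify an evaluation as clean, of course; it only rules out the most blatant cutoff-memorization effect.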

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Identification of temporal leakage issues in LLM forecasting evaluation

Contribution 2: Analysis of extrapolation challenges from benchmarks to real-world forecasting

Contribution 3: Concrete demonstrations of evaluation flaws in prior forecasting work

(Descriptions for each contribution appear under Claimed Contributions above.)