AlphaBench: Benchmarking Large Language Models in Formulaic Alpha Factor Mining

ICLR 2026 Conference Submission · Anonymous Authors
Alpha Mining · LLM Benchmark · LLM Agent · Data Science and Engineering
Abstract:

Formulaic alpha factor mining (FAFM) is a central problem in quantitative investment, where interpretable formulas are designed to extract predictive signals from historical financial series. With the emergence of large language models (LLMs), recent studies have begun to explore their role in FAFM, yet their capabilities across different tasks and configurations remain unclear. In this work, we introduce AlphaBench, the first systematic benchmark for evaluating LLMs in FAFM. AlphaBench covers three core tasks: factor generation, factor evaluation, and factor searching, all of which are common tasks in the workflow of quantitative researchers. Beyond task-level evaluation, we further analyze how different LLM settings, including model type, prompting paradigm, and reasoning strategy, influence performance. Our experiments on a range of open-source and closed-source models reveal that LLMs hold strong potential for automating factor mining, while also facing persistent challenges in robustness, search efficiency, and practical usability.
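
To ground the terminology, the following is a minimal, self-contained sketch (not drawn from the paper) of what a formulaic alpha factor and a basic evaluation of it might look like; the synthetic price data, the 5-day lookback, and the column layout are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy daily close prices: rows are dates, columns are assets (illustrative only).
rng = np.random.default_rng(0)
dates = pd.date_range("2024-01-01", periods=120, freq="B")
close = pd.DataFrame(
    100 * np.exp(np.cumsum(rng.normal(0, 0.01, size=(120, 5)), axis=0)),
    index=dates, columns=[f"asset_{i}" for i in range(5)],
)

# A simple formulaic alpha: 5-day mean reversion, cross-sectionally ranked.
#   alpha_t = rank( -(close_t / close_{t-5} - 1) )
alpha = (-(close / close.shift(5) - 1)).rank(axis=1, pct=True)

# Factor evaluation: the information coefficient (IC) is the cross-sectional
# rank correlation between today's factor values and next-day returns.
fwd_ret = close.shift(-1) / close - 1
ic = alpha.corrwith(fwd_ret, axis=1, method="spearman")
print(f"mean IC: {ic.mean():.4f}, IC std: {ic.std():.4f}")
```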

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces AlphaBench, a systematic benchmark for evaluating large language models across three core tasks in formulaic alpha factor mining: factor generation, evaluation, and searching. According to the taxonomy, the work is positioned in the 'Comprehensive LLM Benchmarking and Surveys' leaf under 'Large Language Model-Based Factor Generation', which contains only one other paper. This places the work in a relatively sparse research direction within the broader LLM-based factor generation branch, which itself comprises six distinct leaves covering prompt-based generation, LLM-guided search, multi-agent collaboration, code-based evolution, hybrid integration, and benchmarking.

The taxonomy reveals that LLM-based factor generation sits alongside two major alternative paradigms: reinforcement learning-based mining (three leaves, nine papers) and evolutionary approaches (two leaves, three papers). The LLM branch appears to be the most actively explored methodological stream, with prompt-based generation and LLM-guided search each containing two to three papers. AlphaBench's focus on systematic evaluation across multiple tasks and configurations distinguishes it from method-specific papers in sibling leaves, such as those proposing novel prompting strategies or search algorithms. The taxonomy's scope and exclusion notes clarify that benchmarking work explicitly excludes method-specific contributions, positioning AlphaBench as infrastructure rather than a new mining technique.

Among the three contributions analyzed, the first (the systematic benchmark) was compared against nine candidates with zero refutable matches, suggesting novelty within the limited search scope. The second (the formal definition and unified perspective) was compared against ten candidates, one of which was judged refutable, indicating some overlap with prior conceptual frameworks. The third (the multi-task evaluation framework) was compared against ten candidates with no refutations. Across the twenty-nine candidates reviewed in total, these statistics reflect a targeted literature search rather than exhaustive coverage: the analysis captures top semantic matches and immediate citations but may not encompass all relevant prior work in the broader quantitative finance or LLM evaluation literature.

Based on the limited search scope of twenty-nine candidates, the benchmark contribution appears relatively novel within the specific intersection of LLMs and formulaic alpha mining, though the formal framing shows some conceptual overlap with existing work. The taxonomy structure suggests this is an emerging research direction with fewer established benchmarks compared to the more mature RL and evolutionary branches. The analysis does not cover potential overlaps with general LLM benchmarking literature outside the alpha mining domain or with proprietary industry practices that may not appear in academic publications.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Paper: 1

Research Landscape Overview

Core task: formulaic alpha factor mining. The field centers on discovering mathematical expressions that predict asset returns, and the taxonomy reveals several distinct methodological streams. Reinforcement Learning-Based Factor Mining treats formula construction as a sequential decision problem, where agents learn to compose operators and features through reward signals tied to financial performance. Large Language Model-Based Factor Generation leverages pre-trained models to propose candidate formulas via prompting or fine-tuning, often drawing on textual financial knowledge. Evolutionary and Genetic Programming Approaches apply mutation and crossover to evolve factor populations over generations, while Dynamic Factor Combination and Weighting focuses on adaptively blending existing factors rather than discovering new primitives. Foundational Alpha Factor Research includes seminal collections like 101 Formulaic Alphas[34] that catalog hand-crafted expressions, while Automated Trading Systems and Applications addresses end-to-end deployment concerns. Auxiliary Methods and Cross-Domain Techniques bring in tools from neighboring domains, such as sentiment analysis or conformal prediction.

Within Reinforcement Learning-Based Factor Mining, a particularly active line emphasizes single-factor discovery with carefully engineered reward functions that balance profitability, risk, and diversity; representative works include QuantFactor REINFORCE[1], which uses policy-gradient methods to optimize formula construction, and Expert Factors[5], which incorporates domain heuristics into the reward structure. These works contrast with evolutionary methods like Evolutionary Alpha Generation[35] or Hierarchical Genetic Algorithms[40], which rely on population-based search rather than gradient-driven policy updates.

A recurring tension across branches is the trade-off between interpretability, which favors simple, transparent formulas, and predictive power, which sometimes demands complex compositions. AlphaBench[0], positioned in the LLM-based branch as a benchmark rather than a new mining method, addresses this by standardizing evaluation protocols, enabling fairer comparisons between RL methods, LLM-based generators like AlphaForge[10], and hybrid approaches such as Hybrid Alpha Discovery[17].
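
As a deliberately tiny caricature of the population-based search loop described above (not an implementation of any cited method), the sketch below mutates a minimal formula representation and keeps candidates with a higher mean information coefficient; the operator set, mutation scheme, and synthetic data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Toy panel of close prices (dates x assets), purely illustrative.
dates = pd.date_range("2024-01-01", periods=250, freq="B")
close = pd.DataFrame(
    100 * np.exp(np.cumsum(rng.normal(0, 0.01, size=(250, 20)), axis=0)),
    index=dates, columns=[f"asset_{i}" for i in range(20)],
)
fwd_ret = close.shift(-1) / close - 1  # next-day returns used as the target

# A "formula" here is just (transform name, lookback window); real systems
# search over expression trees built from many operators and data fields.
def evaluate(formula):
    name, win = formula
    if name == "momentum":
        factor = close / close.shift(win) - 1
    elif name == "reversal":
        factor = -(close / close.shift(win) - 1)
    else:  # "volatility"
        factor = -(close.pct_change().rolling(win).std())
    factor = factor.rank(axis=1, pct=True)
    ic = factor.corrwith(fwd_ret, axis=1, method="spearman")
    return ic.mean()  # fitness: mean daily information coefficient

def mutate(formula):
    name, win = formula
    if rng.random() < 0.5:
        name = str(rng.choice(["momentum", "reversal", "volatility"]))
    else:
        win = int(np.clip(win + rng.integers(-3, 4), 2, 60))
    return (name, win)

# Tiny (1+1)-style evolutionary loop: mutate the incumbent, keep the better one.
best = ("momentum", 5)
best_fit = evaluate(best)
for _ in range(30):
    cand = mutate(best)
    fit = evaluate(cand)
    if fit > best_fit:
        best, best_fit = cand, fit
print(f"best formula: {best}, mean IC: {best_fit:.4f}")
```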

Claimed Contributions

AlphaBench: First Systematic Benchmark for LLMs in FAFM

The authors introduce AlphaBench, the first benchmark designed to systematically evaluate large language models in formulaic alpha factor mining. It covers three core tasks (factor generation, factor evaluation, and factor searching) and analyzes how different LLM settings influence performance.

9 retrieved papers
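
The following is a hypothetical sketch of what one piece of such a benchmark harness could look like for the factor-generation task: building a constrained prompt and validating the formulas an LLM returns. The prompt wording, the FORMULA(...) output convention, and the call_llm stand-in are assumptions for illustration, not the paper's actual protocol.

```python
import re

# Stand-in for whatever model client the benchmark would use; the real
# interface is not specified here and must be supplied by the user.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an actual model client here")

FIELDS = ["open", "high", "low", "close", "volume"]
OPERATORS = ["rank", "delay", "ts_mean", "ts_std", "delta", "log", "abs"]

def build_generation_prompt(n_factors: int = 3) -> str:
    # Constrain the model to a known vocabulary of fields and operators.
    return (
        "You are a quantitative researcher. Propose formulaic alpha factors "
        "for daily equity data.\n"
        f"Allowed fields: {', '.join(FIELDS)}\n"
        f"Allowed operators: {', '.join(OPERATORS)}\n"
        f"Return exactly {n_factors} formulas, one per line, wrapped as "
        "FORMULA(<expression>)."
    )

def parse_formulas(response: str) -> list[str]:
    # Extract the expressions and keep only those built from allowed tokens.
    exprs = re.findall(r"FORMULA\((.+?)\)\s*$", response, flags=re.MULTILINE)
    allowed = set(FIELDS) | set(OPERATORS)
    kept = []
    for e in exprs:
        tokens = re.findall(r"[A-Za-z_]+", e)
        if all(t in allowed for t in tokens):
            kept.append(e.strip())
    return kept

# Example with a canned response showing what a model reply might look like:
fake_reply = "FORMULA(rank(delta(close, 5)))\nFORMULA(ts_mean(volume, 10))"
print(parse_formulas(fake_reply))  # ['rank(delta(close, 5))', 'ts_mean(volume, 10)']
```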

Formal Definition and Unified Perspective on LLMs in FAFM

The authors formally define the role of LLMs in formulaic alpha factor mining and provide a unified perspective on how LLMs can be applied across different tasks in the factor discovery workflow.

10 retrieved papers
Can Refute

Multi-Task Evaluation Framework with Diverse Metrics

The authors design and evaluate multiple tasks (generation, evaluation, searching) with diverse metrics to systematically uncover the strengths and limitations of different LLMs across various dimensions in the FAFM setting.

10 retrieved papers
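
As a concrete illustration of the "diverse metrics" mentioned in the third contribution, the sketch below computes a few factor-quality statistics that are common in this literature; the specific metric set and the synthetic inputs are assumptions and need not match the benchmark's actual choices.

```python
import numpy as np
import pandas as pd

def factor_metrics(factor: pd.DataFrame, fwd_ret: pd.DataFrame) -> dict:
    """Compute a few common factor-quality metrics.

    factor, fwd_ret: dates x assets DataFrames of factor values and
    next-period returns (the benchmark's exact metric set may differ).
    """
    ic = factor.corrwith(fwd_ret, axis=1, method="pearson")        # information coefficient
    rank_ic = factor.corrwith(fwd_ret, axis=1, method="spearman")  # rank IC
    # Daily turnover proxy: mean absolute change in cross-sectional ranks.
    ranks = factor.rank(axis=1, pct=True)
    turnover = ranks.diff().abs().mean(axis=1)
    return {
        "IC_mean": float(ic.mean()),
        "ICIR": float(ic.mean() / ic.std()),  # IC information ratio
        "RankIC_mean": float(rank_ic.mean()),
        "turnover_mean": float(turnover.mean()),
    }

# Minimal usage with random placeholder data.
rng = np.random.default_rng(2)
dates = pd.date_range("2024-01-01", periods=60, freq="B")
cols = [f"asset_{i}" for i in range(10)]
factor = pd.DataFrame(rng.normal(size=(60, 10)), index=dates, columns=cols)
fwd_ret = pd.DataFrame(rng.normal(0, 0.02, size=(60, 10)), index=dates, columns=cols)
print(factor_metrics(factor, fwd_ret))
```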

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

AlphaBench: First Systematic Benchmark for LLMs in FAFM

The authors introduce AlphaBench, the first benchmark designed to systematically evaluate large language models in formulaic alpha factor mining. It covers three core tasks (factor generation, factor evaluation, and factor searching) and analyzes how different LLM settings influence performance.

Contribution

Formal Definition and Unified Perspective on LLMs in FAFM

The authors formally define the role of LLMs in formulaic alpha factor mining and provide a unified perspective on how LLMs can be applied across different tasks in the factor discovery workflow.

Contribution

Multi-Task Evaluation Framework with Diverse Metrics

The authors design and evaluate multiple tasks (generation, evaluation, searching) with diverse metrics to systematically uncover the strengths and limitations of different LLMs across various dimensions in the FAFM setting.