AlphaBench: Benchmarking Large Language Models in Formulaic Alpha Factor Mining

ICLR 2026 Conference Submission · Anonymous Authors
Alpha Mining · LLM Benchmark · LLM Agent · Data Science and Engineering
Abstract:

Formulaic alpha factor mining (FAFM) is a central problem in quantitative investment, where interpretable formulas are designed to extract predictive signals from historical financial series. With the emergence of large language models (LLMs), recent studies have begun to explore their role in FAFM, yet their capabilities across different tasks and configurations remain unclear. In this work, we introduce AlphaBench, the first systematic benchmark for evaluating LLMs in FAFM. AlphaBench covers three core tasks: factor generation, factor evaluation, and factor searching, all of which are common tasks in the workflow of quantitative researchers. Beyond task-level evaluation, we further analyze how different LLM settings, including model type, prompting paradigm, and reasoning strategy, influence performance. Our experiments on a range of open-source and closed-source models reveal that LLMs hold strong potential for automating factor mining, while also facing persistent challenges in robustness, search efficiency, and practical usability.
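
To ground the terminology, the following is a minimal, self-contained sketch (not drawn from the paper) of what a formulaic alpha factor and a basic evaluation of it might look like; the synthetic price data, the 5-day lookback, and the column layout are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy daily close prices: rows are dates, columns are assets (illustrative only).
rng = np.random.default_rng(0)
dates = pd.date_range("2024-01-01", periods=120, freq="B")
close = pd.DataFrame(
    100 * np.exp(np.cumsum(rng.normal(0, 0.01, size=(120, 5)), axis=0)),
    index=dates, columns=[f"asset_{i}" for i in range(5)],
)

# A simple formulaic alpha: 5-day mean reversion, cross-sectionally ranked.
#   alpha_t = rank( -(close_t / close_{t-5} - 1) )
alpha = (-(close / close.shift(5) - 1)).rank(axis=1, pct=True)

# Factor evaluation: the information coefficient (IC) is the cross-sectional
# rank correlation between today's factor values and next-day returns.
fwd_ret = close.shift(-1) / close - 1
ic = alpha.corrwith(fwd_ret, axis=1, method="spearman")
print(f"mean IC: {ic.mean():.4f}, IC std: {ic.std():.4f}")
```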

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces AlphaBench, a systematic benchmark for evaluating large language models across three core tasks in formulaic alpha factor mining: factor generation, evaluation, and searching. According to the taxonomy, the work is positioned in the 'Comprehensive LLM Benchmarking and Surveys' leaf under 'Large Language Model-Based Factor Generation', which contains only one other paper. This places the work in a relatively sparse research direction within the broader LLM-based factor generation branch, which itself comprises six distinct leaves covering prompt-based generation, LLM-guided search, multi-agent collaboration, code-based evolution, hybrid integration, and benchmarking.

The taxonomy reveals that LLM-based factor generation sits alongside two major alternative paradigms: reinforcement learning-based mining (three leaves, nine papers) and evolutionary approaches (two leaves, three papers). The LLM branch appears to be the most actively explored methodological stream, with prompt-based generation and LLM-guided search each containing two to three papers. AlphaBench's focus on systematic evaluation across multiple tasks and configurations distinguishes it from method-specific papers in sibling leaves, such as those proposing novel prompting strategies or search algorithms. The taxonomy's scope and exclusion notes clarify that benchmarking work explicitly excludes method-specific contributions, positioning AlphaBench as infrastructure rather than a new mining technique.

Among the three contributions analyzed, the first (the systematic benchmark) was compared against nine candidates with zero refutable matches, suggesting novelty within the limited search scope. The second (the formal definition and unified perspective) was compared against ten candidates, one of which was judged refutable, indicating some overlap with prior conceptual frameworks. The third (the multi-task evaluation framework) was compared against ten candidates with no refutations. Across the twenty-nine candidates reviewed in total, these statistics reflect a targeted literature search rather than exhaustive coverage: the analysis captures top semantic matches and immediate citations but may not encompass all relevant prior work in the broader quantitative finance or LLM evaluation literature.

Based on the limited search scope of twenty-nine candidates, the benchmark contribution appears relatively novel within the specific intersection of LLMs and formulaic alpha mining, though the formal framing shows some conceptual overlap with existing work. The taxonomy structure suggests this is an emerging research direction with fewer established benchmarks compared to the more mature RL and evolutionary branches. The analysis does not cover potential overlaps with general LLM benchmarking literature outside the alpha mining domain or with proprietary industry practices that may not appear in academic publications.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Paper: 1

Research Landscape Overview

Core task: formulaic alpha factor mining. The field centers on discovering mathematical expressions that predict asset returns, and the taxonomy reveals several distinct methodological streams. Reinforcement Learning-Based Factor Mining treats formula construction as a sequential decision problem, where agents learn to compose operators and features through reward signals tied to financial performance. Large Language Model-Based Factor Generation leverages pre-trained models to propose candidate formulas via prompting or fine-tuning, often drawing on textual financial knowledge. Evolutionary and Genetic Programming Approaches apply mutation and crossover to evolve factor populations over generations, while Dynamic Factor Combination and Weighting focuses on adaptively blending existing factors rather than discovering new primitives. Foundational Alpha Factor Research includes seminal collections like 101 Formulaic Alphas[34] that catalog hand-crafted expressions, while Automated Trading Systems and Applications addresses end-to-end deployment concerns. Auxiliary Methods and Cross-Domain Techniques bring in tools from neighboring domains, such as sentiment analysis or conformal prediction.

Within Reinforcement Learning-Based Factor Mining, a particularly active line emphasizes single-factor discovery with carefully engineered reward functions that balance profitability, risk, and diversity; representative works include QuantFactor REINFORCE[1], which uses policy-gradient methods to optimize formula construction, and Expert Factors[5], which incorporates domain heuristics into the reward structure. These works contrast with evolutionary methods like Evolutionary Alpha Generation[35] or Hierarchical Genetic Algorithms[40], which rely on population-based search rather than gradient-driven policy updates.

A recurring tension across branches is the trade-off between interpretability, which favors simple, transparent formulas, and predictive power, which sometimes demands complex compositions. AlphaBench[0], positioned in the LLM-based branch as a benchmark rather than a new mining method, addresses this by standardizing evaluation protocols, enabling fairer comparisons between RL methods, LLM-based generators like AlphaForge[10], and hybrid approaches such as Hybrid Alpha Discovery[17].
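
As a deliberately tiny caricature of the population-based search loop described above (not an implementation of any cited method), the sketch below mutates a minimal formula representation and keeps candidates with a higher mean information coefficient; the operator set, mutation scheme, and synthetic data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Toy panel of close prices (dates x assets), purely illustrative.
dates = pd.date_range("2024-01-01", periods=250, freq="B")
close = pd.DataFrame(
    100 * np.exp(np.cumsum(rng.normal(0, 0.01, size=(250, 20)), axis=0)),
    index=dates, columns=[f"asset_{i}" for i in range(20)],
)
fwd_ret = close.shift(-1) / close - 1  # next-day returns used as the target

# A "formula" here is just (transform name, lookback window); real systems
# search over expression trees built from many operators and data fields.
def evaluate(formula):
    name, win = formula
    if name == "momentum":
        factor = close / close.shift(win) - 1
    elif name == "reversal":
        factor = -(close / close.shift(win) - 1)
    else:  # "volatility"
        factor = -(close.pct_change().rolling(win).std())
    factor = factor.rank(axis=1, pct=True)
    ic = factor.corrwith(fwd_ret, axis=1, method="spearman")
    return ic.mean()  # fitness: mean daily information coefficient

def mutate(formula):
    name, win = formula
    if rng.random() < 0.5:
        name = str(rng.choice(["momentum", "reversal", "volatility"]))
    else:
        win = int(np.clip(win + rng.integers(-3, 4), 2, 60))
    return (name, win)

# Tiny (1+1)-style evolutionary loop: mutate the incumbent, keep the better one.
best = ("momentum", 5)
best_fit = evaluate(best)
for _ in range(30):
    cand = mutate(best)
    fit = evaluate(cand)
    if fit > best_fit:
        best, best_fit = cand, fit
print(f"best formula: {best}, mean IC: {best_fit:.4f}")
```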

Claimed Contributions

AlphaBench: First Systematic Benchmark for LLMs in FAFM

The authors introduce AlphaBench, the first benchmark designed to systematically evaluate large language models in formulaic alpha factor mining. It covers three core tasks (factor generation, factor evaluation, and factor searching) and analyzes how different LLM settings influence performance.

9 retrieved papers
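
The following is a hypothetical sketch of what one piece of such a benchmark harness could look like for the factor-generation task: building a constrained prompt and validating the formulas an LLM returns. The prompt wording, the FORMULA(...) output convention, and the call_llm stand-in are assumptions for illustration, not the paper's actual protocol.

```python
import re

# Stand-in for whatever model client the benchmark would use; the real
# interface is not specified here and must be supplied by the user.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an actual model client here")

FIELDS = ["open", "high", "low", "close", "volume"]
OPERATORS = ["rank", "delay", "ts_mean", "ts_std", "delta", "log", "abs"]

def build_generation_prompt(n_factors: int = 3) -> str:
    # Constrain the model to a known vocabulary of fields and operators.
    return (
        "You are a quantitative researcher. Propose formulaic alpha factors "
        "for daily equity data.\n"
        f"Allowed fields: {', '.join(FIELDS)}\n"
        f"Allowed operators: {', '.join(OPERATORS)}\n"
        f"Return exactly {n_factors} formulas, one per line, wrapped as "
        "FORMULA(<expression>)."
    )

def parse_formulas(response: str) -> list[str]:
    # Extract the expressions and keep only those built from allowed tokens.
    exprs = re.findall(r"FORMULA\((.+?)\)\s*$", response, flags=re.MULTILINE)
    allowed = set(FIELDS) | set(OPERATORS)
    kept = []
    for e in exprs:
        tokens = re.findall(r"[A-Za-z_]+", e)
        if all(t in allowed for t in tokens):
            kept.append(e.strip())
    return kept

# Example with a canned response showing what a model reply might look like:
fake_reply = "FORMULA(rank(delta(close, 5)))\nFORMULA(ts_mean(volume, 10))"
print(parse_formulas(fake_reply))  # ['rank(delta(close, 5))', 'ts_mean(volume, 10)']
```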

Formal Definition and Unified Perspective on LLMs in FAFM

The authors formally define the role of LLMs in formulaic alpha factor mining and provide a unified perspective on how LLMs can be applied across different tasks in the factor discovery workflow.

10 retrieved papers
Can Refute

Multi-Task Evaluation Framework with Diverse Metrics

The authors design and evaluate multiple tasks (generation, evaluation, searching) with diverse metrics to systematically uncover the strengths and limitations of different LLMs across various dimensions in the FAFM setting.

10 retrieved papers
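
As a concrete illustration of the "diverse metrics" mentioned in the third contribution, the sketch below computes a few factor-quality statistics that are common in this literature; the specific metric set and the synthetic inputs are assumptions and need not match the benchmark's actual choices.

```python
import numpy as np
import pandas as pd

def factor_metrics(factor: pd.DataFrame, fwd_ret: pd.DataFrame) -> dict:
    """Compute a few common factor-quality metrics.

    factor, fwd_ret: dates x assets DataFrames of factor values and
    next-period returns (the benchmark's exact metric set may differ).
    """
    ic = factor.corrwith(fwd_ret, axis=1, method="pearson")        # information coefficient
    rank_ic = factor.corrwith(fwd_ret, axis=1, method="spearman")  # rank IC
    # Daily turnover proxy: mean absolute change in cross-sectional ranks.
    ranks = factor.rank(axis=1, pct=True)
    turnover = ranks.diff().abs().mean(axis=1)
    return {
        "IC_mean": float(ic.mean()),
        "ICIR": float(ic.mean() / ic.std()),  # IC information ratio
        "RankIC_mean": float(rank_ic.mean()),
        "turnover_mean": float(turnover.mean()),
    }

# Minimal usage with random placeholder data.
rng = np.random.default_rng(2)
dates = pd.date_range("2024-01-01", periods=60, freq="B")
cols = [f"asset_{i}" for i in range(10)]
factor = pd.DataFrame(rng.normal(size=(60, 10)), index=dates, columns=cols)
fwd_ret = pd.DataFrame(rng.normal(0, 0.02, size=(60, 10)), index=dates, columns=cols)
print(factor_metrics(factor, fwd_ret))
```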

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

AlphaBench: First Systematic Benchmark for LLMs in FAFM

The authors introduce AlphaBench, the first benchmark designed to systematically evaluate large language models in formulaic alpha factor mining. It covers three core tasks (factor generation, factor evaluation, and factor searching) and analyzes how different LLM settings influence performance.

Contribution

Formal Definition and Unified Perspective on LLMs in FAFM

The authors formally define the role of LLMs in formulaic alpha factor mining and provide a unified perspective on how LLMs can be applied across different tasks in the factor discovery workflow.

Contribution

Multi-Task Evaluation Framework with Diverse Metrics

The authors design and evaluate multiple tasks (generation, evaluation, searching) with diverse metrics to systematically uncover the strengths and limitations of different LLMs across various dimensions in the FAFM setting.