AlphaBench: Benchmarking Large Language Models in Formulaic Alpha Factor Mining
Overview
Overall Novelty Assessment
The paper introduces AlphaBench, a systematic benchmark for evaluating large language models across three core tasks in formulaic alpha factor mining: factor generation, evaluation, and searching. According to the taxonomy, the work is positioned in the 'Comprehensive LLM Benchmarking and Surveys' leaf under 'Large Language Model-Based Factor Generation', which contains only one other paper. This places the work in a relatively sparse research direction within the broader LLM-based factor generation branch, which itself comprises six distinct leaves covering prompt-based generation, LLM-guided search, multi-agent collaboration, code-based evolution, hybrid integration, and benchmarking.
The taxonomy reveals that LLM-based factor generation sits alongside two major alternative paradigms: reinforcement learning-based mining (three leaves, nine papers) and evolutionary approaches (two leaves, three papers). The LLM branch appears to be the most actively explored methodological stream, with prompt-based generation and LLM-guided search each containing two to three papers. AlphaBench's focus on systematic evaluation across multiple tasks and configurations distinguishes it from method-specific papers in sibling leaves, such as those proposing novel prompting strategies or search algorithms. The taxonomy's scope and exclusion notes clarify that the benchmarking leaf explicitly excludes method-specific contributions, positioning AlphaBench as infrastructure rather than a new mining technique.
Among the three contributions analyzed, the first (systematic benchmark) examined nine candidates with zero refutable matches, suggesting novelty within the limited search scope. The second contribution (formal definition and unified perspective) examined ten candidates and found one refutable match, indicating some overlap with prior conceptual frameworks. The third contribution (multi-task evaluation framework) examined ten candidates with no refutations. Across the twenty-nine candidates reviewed in total, these statistics reflect a targeted literature search rather than exhaustive coverage: the analysis captures top semantic matches and immediate citations but may not encompass all relevant prior work in the broader quantitative finance or LLM evaluation literature.
Based on the limited search scope of twenty-nine candidates, the benchmark contribution appears relatively novel within the specific intersection of LLMs and formulaic alpha mining, though the formal framing shows some conceptual overlap with existing work. The taxonomy structure suggests this is an emerging research direction with fewer established benchmarks compared to the more mature RL and evolutionary branches. The analysis does not cover potential overlaps with general LLM benchmarking literature outside the alpha mining domain or with proprietary industry practices that may not appear in academic publications.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce AlphaBench, the first benchmark designed to systematically evaluate large language models in formulaic alpha factor mining. It covers three core tasks (factor generation, factor evaluation, and factor searching) and analyzes how different LLM settings influence performance.
The authors formally define the role of LLMs in formulaic alpha factor mining and provide a unified perspective on how LLMs can be applied across different tasks in the factor discovery workflow.
The authors design and evaluate multiple tasks (generation, evaluation, searching) with diverse metrics to systematically uncover the strengths and limitations of different LLMs across various dimensions in the FAFM setting.
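To make the setting behind these contributions concrete, a formulaic alpha factor is a symbolic expression over market data (prices, volumes, etc.) whose output is a cross-sectional trading signal. The sketch below is purely illustrative; the tickers, synthetic data, and formula are hypothetical and not drawn from AlphaBench:

```python
import numpy as np
import pandas as pd

# Hypothetical daily close prices for three tickers (synthetic random walk).
rng = np.random.default_rng(0)
dates = pd.date_range("2024-01-01", periods=30, freq="B")
close = pd.DataFrame(
    100 * np.exp(np.cumsum(rng.normal(0, 0.01, size=(30, 3)), axis=0)),
    index=dates,
    columns=["AAA", "BBB", "CCC"],
)

# A toy formulaic alpha: rank((close / delay(close, 5)) - 1),
# i.e. the cross-sectional percentile rank of 5-day price momentum.
momentum = close / close.shift(5) - 1
alpha = momentum.rank(axis=1, pct=True)  # daily cross-sectional rank in (0, 1]

print(alpha.dropna().head())
```

Tasks like generation (writing such expressions), evaluation (scoring them), and searching (iteratively refining them) all operate on objects of this form.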
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[1] QuantFactor REINFORCE: Mining Steady Formulaic Alpha Factors With Variance-Bounded REINFORCE
[5] Learning from Expert Factors: Trajectory-level Reward Shaping for Formulaic Alpha Mining
Contribution Analysis
Detailed comparisons for each claimed contribution
AlphaBench: First Systematic Benchmark for LLMs in FAFM
The authors introduce AlphaBench, the first benchmark designed to systematically evaluate large language models in formulaic alpha factor mining. It covers three core tasks (factor generation, factor evaluation, and factor searching) and analyzes how different LLM settings influence performance.
[11] Automate Strategy Finding with LLM in Quant Investment
[16] A Survey on Large Language Model-Based Alpha Mining
[62] AlphaEval: A Comprehensive and Efficient Evaluation Framework for Formula Alpha Mining
[63] EFS: Evolutionary Factor Searching for Sparse Portfolio Optimization Using Large Language Models
[65] PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance
[66] AlphaAgent: LLM-Driven Alpha Mining with Regularized Exploration to Counteract Alpha Decay
[67] LiveTradeBench: Seeking Real-World Alpha with Large Language Models
[68] TradExpert: Revolutionizing Trading with Mixture of Expert LLMs
[69] Alpha-R1: Alpha Screening with LLM Reasoning via Reinforcement Learning
Formal Definition and Unified Perspective on LLMs in FAFM
The authors formally define the role of LLMs in formulaic alpha factor mining and provide a unified perspective on how LLMs can be applied across different tasks in the factor discovery workflow.
[16] A Survey on Large Language Model-Based Alpha Mining
[2] Navigating the Alpha Jungle: An LLM-Powered MCTS Framework for Formulaic Factor Mining
[8] Sentiment-Aware Stock Price Prediction with Transformer and LLM-Generated Formulaic Alpha
[9] GPT-Signal: Generative AI for Semi-Automated Feature Engineering in the Alpha Research Process
[13] Chain-of-Alpha: Unleashing the Power of Large Language Models for Alpha Mining in Quantitative Trading
[15] Adaptive Alpha Weighting with PPO: Enhancing Prompt-Based LLM-Generated Alphas in Quant Trading
[61] Alpha-GPT: Human-AI Interactive Alpha Mining for Quantitative Investment
[62] AlphaEval: A Comprehensive and Efficient Evaluation Framework for Formula Alpha Mining
[63] EFS: Evolutionary Factor Searching for Sparse Portfolio Optimization Using Large Language Models
[64] Can Large Language Models Mine Interpretable Financial Factors More Effectively? A Neural-Symbolic Factor Mining Agent Model
Multi-Task Evaluation Framework with Diverse Metrics
The authors design and evaluate multiple tasks (generation, evaluation, searching) with diverse metrics to systematically uncover the strengths and limitations of different LLMs across various dimensions in the FAFM setting.
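As an illustration of the kind of metric such a multi-task evaluation typically relies on, the rank information coefficient (IC) is a standard measure in the alpha-mining literature: the per-period cross-sectional Spearman correlation between factor values and forward returns, often summarized by its mean and its mean-to-volatility ratio (ICIR). Whether AlphaBench uses exactly this formulation is an assumption; the sketch below uses synthetic data with a planted signal:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_days, n_stocks = 60, 50

# Synthetic factor values and forward returns with a planted linear signal
# (hypothetical data, for illustration only).
factor = pd.DataFrame(rng.normal(size=(n_days, n_stocks)))
fwd_ret = 0.3 * factor + rng.normal(size=(n_days, n_stocks))

# Rank IC: per-day cross-sectional Spearman correlation between the factor
# and next-period returns; ICIR is its mean divided by its volatility.
daily_ic = factor.corrwith(fwd_ret, axis=1, method="spearman")
mean_ic = daily_ic.mean()
icir = mean_ic / daily_ic.std()
print(f"mean rank IC: {mean_ic:.3f}, ICIR: {icir:.2f}")
```

Metrics of this style score factor evaluation and searching directly, while generation is typically also judged on validity (does the expression parse and execute) and diversity of the produced formulas.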