FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Agent Benchmark, Financial Search, Financial Reasoning
Abstract:

Search has emerged as core infrastructure for LLM-based agents and is widely viewed as critical on the path toward more general intelligence. Finance is a particularly demanding proving ground: analysts routinely conduct complex, multi-step searches over time-sensitive, domain-specific data, making the field ideal for assessing both search proficiency and knowledge-grounded reasoning. Yet no existing open financial dataset evaluates the data-searching capability of end-to-end agents, largely because constructing realistic, complex tasks requires deep financial expertise and because time-sensitive data is hard to evaluate. We present FinSearchComp, the first fully open-source agent benchmark for realistic, open-domain financial search and reasoning. FinSearchComp comprises three tasks (Time-Sensitive Data Fetching, Simple Historical Lookup, and Complex Historical Investigation) that closely reproduce real-world financial analyst workflows. To ensure difficulty and reliability, we engage 70 professional financial experts for annotation and implement a rigorous multi-stage quality-assurance pipeline. The benchmark includes 635 questions spanning global and Greater China markets, on which we evaluate 21 models (products). Grok 4 (web) tops the global subset, approaching expert-level accuracy, while DouBao (web) leads on the Greater China subset. Experimental analyses show that equipping agents with web search and financial plugins substantially improves results on FinSearchComp, and that the country of origin of models and tools significantly affects performance. By aligning with realistic analyst tasks and providing end-to-end evaluation, FinSearchComp offers a professional, high-difficulty testbed for complex financial search and reasoning.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces FinSearchComp, a benchmark for evaluating LLM-based agents on realistic financial search and reasoning tasks. It resides in the 'Real-Time Financial Search and Data Fetching' leaf, which contains only two papers, including this work. This is a notably sparse research direction within the broader taxonomy of 50 papers across 36 topics, suggesting that real-time, open-domain financial search remains an under-explored area despite its practical importance. The benchmark's focus on time-sensitive data retrieval and multi-step reasoning workflows positions it at the intersection of information retrieval and domain-specific agent evaluation.

The taxonomy reveals several neighboring research directions that contextualize this work's boundaries. Adjacent leaves include 'Document-Based QA and Structured Extraction' (six papers focusing on static financial documents) and 'Multi-Step Reasoning for Financial QA' (one paper on iterative inference). The broader 'Financial Question Answering and Information Retrieval' branch encompasses nine papers total, while sibling branches address trading execution and specialized analysis tasks. FinSearchComp explicitly excludes trading decisions and static document extraction, instead targeting the dynamic search infrastructure that precedes such downstream applications. This positioning suggests the work fills a gap between retrieval-focused systems and decision-making frameworks.

Among 30 candidates examined through semantic search, none were found to clearly refute any of the three core contributions. For the benchmark itself, 10 candidates were reviewed with zero refutable matches; the same pattern holds for the curated dataset contribution and the evaluation study. This absence of overlapping prior work across all contributions is notable but must be interpreted carefully: the search examined a limited candidate pool rather than exhaustively surveying all financial benchmarking literature. The statistics indicate that within this bounded scope, the specific combination of real-time search tasks, expert-annotated deterministic answers, and comprehensive model evaluation appears distinctive.
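The candidate screening described above, which ranks prior papers by semantic similarity to each claimed contribution and inspects the top matches, can be sketched as follows. This is a minimal illustration under stated assumptions: the embeddings are precomputed dense vectors, and `top_k_candidates` and the toy data are hypothetical, not part of the report's actual pipeline.

```python
import numpy as np

def top_k_candidates(query_vec, paper_vecs, k=10):
    """Rank candidate papers by cosine similarity to a contribution's embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    P = paper_vecs / np.linalg.norm(paper_vecs, axis=1, keepdims=True)
    scores = P @ q                      # cosine similarity of each paper to the query
    order = np.argsort(-scores)[:k]     # indices of the k most similar papers
    return order, scores[order]

# Toy example: 30 candidate papers with 8-dimensional embeddings.
rng = np.random.default_rng(0)
papers = rng.normal(size=(30, 8))
query = rng.normal(size=8)
idx, sims = top_k_candidates(query, papers, k=10)
```

The top-k papers would then be read by a human or an LLM to check whether any of them refutes the contribution; the similarity scores only bound the search, they do not establish novelty on their own.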

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: financial search and reasoning with LLM-based agents. The field organizes around several complementary branches that reflect both the technical infrastructure and the diverse application scenarios in finance. Agent Architectures and Frameworks for Financial Applications establish foundational designs, ranging from single-agent systems like Finrobot[4] to multi-agent collaborations such as Multi-Agent Finance[12], that enable modular tool use and reasoning. Financial Trading and Investment Decision-Making focuses on portfolio optimization, market simulation, and trading strategies, with works like Trading-r1[6] and StockAgent[23] exploring how agents can interpret market signals and execute trades. Financial Question Answering and Information Retrieval addresses the challenge of extracting and synthesizing information from reports, news, and real-time data streams, exemplified by systems such as Real-Time Financial Agent[9] and FinMem[1]. Specialized Financial Reasoning and Analysis Tasks target domain-specific problems like credit assessment, anomaly detection, and regulatory compliance, while Benchmarks and Evaluation Frameworks (e.g., INVESTORBENCH[7], Finagentbench[15]) provide standardized testbeds. Data Generation and Knowledge Enhancement, Cross-Domain and Theoretical Perspectives, and Auxiliary Capabilities and Tool Integration round out the taxonomy by supporting knowledge augmentation, broader AI insights, and integration of external APIs or databases.

Within this landscape, a particularly active line of work centers on real-time data fetching and dynamic reasoning under market volatility. FinSearchComp[0] sits squarely in the Financial Question Answering and Information Retrieval branch, specifically targeting Real-Time Financial Search and Data Fetching. It shares this niche with Real-Time Financial Agent[9], which similarly emphasizes live data integration and timely response generation.
Compared to broader QA systems like FinMem[1]—which leverages memory mechanisms for historical context—or Fincon[2], which focuses on conversational interfaces, FinSearchComp[0] prioritizes the speed and accuracy of search operations in rapidly changing financial environments. This emphasis on immediacy distinguishes it from trading-focused agents such as Trading-r1[6] or LLM Trade Simulation[3], which concentrate more on decision-making and strategy execution than on the underlying search and retrieval infrastructure. The work thus addresses a critical bottleneck: ensuring that agents can reliably access and reason over up-to-date financial information before making high-stakes recommendations.

Claimed Contributions

FinSearchComp benchmark for financial search and reasoning

The authors present FinSearchComp, a benchmark comprising 635 expert-curated queries spanning global and Greater China markets across three analyst-style task families (Time-Sensitive Data Fetching, Simple Historical Lookup, Complex Historical Investigation). It is designed to evaluate LLM-based agents on realistic financial search and reasoning tasks requiring tool use, time-sensitive data retrieval, and multi-source evidence integration.

10 retrieved papers
Curated dataset with deterministic answers and open-source evaluation harness

The authors provide a dataset with expert-verified answers and a complete evaluation framework. The benchmark includes multi-stage quality control involving 70 professional financial experts, rubric-based scoring guidelines, and an LLM-as-a-Judge evaluation protocol validated against human judgments.
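Because the dataset's answers are deterministic, an LLM-as-a-Judge protocol like the one described can reduce to asking a judge model for a binary verdict per item and aggregating accuracy. The sketch below is hypothetical: `call_judge_llm` stands in for any chat-completion API, the prompt wording and the `stub_judge` exact-match stand-in are assumptions for illustration, not the paper's actual rubric.

```python
# Hypothetical LLM-as-a-Judge scoring loop; `call_judge_llm` is a stand-in
# for a real chat-completion call and is injected as a parameter.

JUDGE_PROMPT = """You are grading a financial QA answer.
Question: {question}
Reference answer: {reference}
Model answer: {answer}
Reply with exactly one word: CORRECT or INCORRECT."""

def score_answers(items, call_judge_llm):
    """Return accuracy over (question, reference, answer) triples."""
    correct = 0
    for question, reference, answer in items:
        verdict = call_judge_llm(
            JUDGE_PROMPT.format(question=question, reference=reference, answer=answer)
        )
        correct += verdict.strip().upper().startswith("CORRECT")
    return correct / len(items)

# Stub judge for demonstration: exact string match against the reference.
def stub_judge(prompt):
    reference = prompt.split("Reference answer: ")[1].split("\n")[0]
    answer = prompt.split("Model answer: ")[1].split("\n")[0]
    return "CORRECT" if reference == answer else "INCORRECT"

items = [("Q1", "42", "42"), ("Q2", "3.5%", "3.4%")]
accuracy = score_answers(items, stub_judge)  # one of two answers matches -> 0.5
```

Validating such a judge against human graders, as the report says the authors do, amounts to computing agreement between the judge's verdicts and expert labels on a held-out sample.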

10 retrieved papers
Comprehensive evaluation study of 21 models with analysis of search and plugin effects

The authors evaluate 21 mainstream models on FinSearchComp, demonstrating that web search capabilities and financial plugins substantially improve performance. Their analysis identifies recurring failure modes such as shallow search, outdated evidence retrieval, and report-calendar misalignment, and shows that model origin significantly impacts performance on different market subsets.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

FinSearchComp benchmark for financial search and reasoning

The authors present FinSearchComp, a benchmark comprising 635 expert-curated queries spanning global and Greater China markets across three analyst-style task families (Time-Sensitive Data Fetching, Simple Historical Lookup, Complex Historical Investigation). It is designed to evaluate LLM-based agents on realistic financial search and reasoning tasks requiring tool use, time-sensitive data retrieval, and multi-source evidence integration.

Contribution

Curated dataset with deterministic answers and open-source evaluation harness

The authors provide a dataset with expert-verified answers and a complete evaluation framework. The benchmark includes multi-stage quality control involving 70 professional financial experts, rubric-based scoring guidelines, and an LLM-as-a-Judge evaluation protocol validated against human judgments.

Contribution

Comprehensive evaluation study of 21 models with analysis of search and plugin effects

The authors evaluate 21 mainstream models on FinSearchComp, demonstrating that web search capabilities and financial plugins substantially improve performance. Their analysis identifies recurring failure modes such as shallow search, outdated evidence retrieval, and report-calendar misalignment, and shows that model origin significantly impacts performance on different market subsets.