FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Agent Benchmark, Financial Search, Financial Reasoning
Abstract:

Search has emerged as core infrastructure for LLM-based agents and is widely viewed as critical on the path toward more general intelligence. Finance is a particularly demanding proving ground: analysts routinely conduct complex, multi-step searches over time-sensitive, domain-specific data, making the field ideal for assessing both search proficiency and knowledge-grounded reasoning. Yet no existing open financial dataset evaluates the data-searching capability of end-to-end agents, largely because constructing realistic, complex tasks requires deep financial expertise and because time-sensitive data is hard to evaluate. We present FinSearchComp, the first fully open-source agent benchmark for realistic, open-domain financial search and reasoning. FinSearchComp comprises three tasks (Time-Sensitive Data Fetching, Simple Historical Lookup, and Complex Historical Investigation) that closely reproduce real-world financial analyst workflows. To ensure difficulty and reliability, we engage 70 professional financial experts for annotation and implement a rigorous multi-stage quality-assurance pipeline. The benchmark includes 635 questions spanning global and Greater China markets, on which we evaluate 21 models (products). Grok 4 (web) tops the global subset, approaching expert-level accuracy, while DouBao (web) leads on the Greater China subset. Experimental analyses show that equipping agents with web search and financial plugins substantially improves results on FinSearchComp, and that the country of origin of models and tools significantly affects performance. By aligning with realistic analyst tasks and providing end-to-end evaluation, FinSearchComp offers a professional, high-difficulty testbed for complex financial search and reasoning.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces FinSearchComp, a benchmark for evaluating LLM-based agents on realistic financial search and reasoning tasks. It resides in the 'Real-Time Financial Search and Data Fetching' leaf, which contains only two papers, including this work. This is a notably sparse research direction within the broader taxonomy of 50 papers across 36 topics, suggesting that real-time, open-domain financial search remains an under-explored area despite its practical importance. The benchmark's focus on time-sensitive data retrieval and multi-step reasoning workflows positions it at the intersection of information retrieval and domain-specific agent evaluation.

The taxonomy reveals several neighboring research directions that contextualize this work's boundaries. Adjacent leaves include 'Document-Based QA and Structured Extraction' (six papers focusing on static financial documents) and 'Multi-Step Reasoning for Financial QA' (one paper on iterative inference). The broader 'Financial Question Answering and Information Retrieval' branch encompasses nine papers total, while sibling branches address trading execution and specialized analysis tasks. FinSearchComp explicitly excludes trading decisions and static document extraction, instead targeting the dynamic search infrastructure that precedes such downstream applications. This positioning suggests the work fills a gap between retrieval-focused systems and decision-making frameworks.

Among 30 candidates examined through semantic search, none were found to clearly refute any of the three core contributions. For the benchmark itself, 10 candidates were reviewed with zero refutable matches; the same pattern holds for the curated dataset contribution and the evaluation study. This absence of overlapping prior work across all contributions is notable but must be interpreted carefully: the search examined a limited candidate pool rather than exhaustively surveying all financial benchmarking literature. The statistics indicate that within this bounded scope, the specific combination of real-time search tasks, expert-annotated deterministic answers, and comprehensive model evaluation appears distinctive.
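The candidate screening described above, which ranks prior papers by semantic similarity to each claimed contribution and inspects the top matches, can be sketched as follows. This is a minimal illustration under stated assumptions: the embeddings are precomputed dense vectors, and `top_k_candidates` and the toy data are hypothetical, not part of the report's actual pipeline.

```python
import numpy as np

def top_k_candidates(query_vec, paper_vecs, k=10):
    """Rank candidate papers by cosine similarity to a contribution's embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    P = paper_vecs / np.linalg.norm(paper_vecs, axis=1, keepdims=True)
    scores = P @ q                      # cosine similarity of each paper to the query
    order = np.argsort(-scores)[:k]     # indices of the k most similar papers
    return order, scores[order]

# Toy example: 30 candidate papers with 8-dimensional embeddings.
rng = np.random.default_rng(0)
papers = rng.normal(size=(30, 8))
query = rng.normal(size=8)
idx, sims = top_k_candidates(query, papers, k=10)
```

The top-k papers would then be read by a human or an LLM to check whether any of them refutes the contribution; the similarity scores only bound the search, they do not establish novelty on their own.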

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: financial search and reasoning with LLM-based agents. The field organizes around several complementary branches that reflect both the technical infrastructure and the diverse application scenarios in finance. Agent Architectures and Frameworks for Financial Applications establish foundational designs, ranging from single-agent systems like Finrobot[4] to multi-agent collaborations such as Multi-Agent Finance[12], that enable modular tool use and reasoning. Financial Trading and Investment Decision-Making focuses on portfolio optimization, market simulation, and trading strategies, with works like Trading-r1[6] and StockAgent[23] exploring how agents can interpret market signals and execute trades. Financial Question Answering and Information Retrieval addresses the challenge of extracting and synthesizing information from reports, news, and real-time data streams, exemplified by systems such as Real-Time Financial Agent[9] and FinMem[1]. Specialized Financial Reasoning and Analysis Tasks target domain-specific problems like credit assessment, anomaly detection, and regulatory compliance, while Benchmarks and Evaluation Frameworks (e.g., INVESTORBENCH[7], Finagentbench[15]) provide standardized testbeds. Data Generation and Knowledge Enhancement, Cross-Domain and Theoretical Perspectives, and Auxiliary Capabilities and Tool Integration round out the taxonomy by supporting knowledge augmentation, broader AI insights, and integration of external APIs or databases.

Within this landscape, a particularly active line of work centers on real-time data fetching and dynamic reasoning under market volatility. FinSearchComp[0] sits squarely in the Financial Question Answering and Information Retrieval branch, specifically targeting Real-Time Financial Search and Data Fetching. It shares this niche with Real-Time Financial Agent[9], which similarly emphasizes live data integration and timely response generation.
Compared to broader QA systems like FinMem[1]—which leverages memory mechanisms for historical context—or Fincon[2], which focuses on conversational interfaces, FinSearchComp[0] prioritizes the speed and accuracy of search operations in rapidly changing financial environments. This emphasis on immediacy distinguishes it from trading-focused agents such as Trading-r1[6] or LLM Trade Simulation[3], which concentrate more on decision-making and strategy execution than on the underlying search and retrieval infrastructure. The work thus addresses a critical bottleneck: ensuring that agents can reliably access and reason over up-to-date financial information before making high-stakes recommendations.

Claimed Contributions

FinSearchComp benchmark for financial search and reasoning

The authors present FinSearchComp, a benchmark comprising 635 expert-curated queries spanning global and Greater China markets across three analyst-style task families (Time-Sensitive Data Fetching, Simple Historical Lookup, Complex Historical Investigation). It is designed to evaluate LLM-based agents on realistic financial search and reasoning tasks requiring tool use, time-sensitive data retrieval, and multi-source evidence integration.

10 retrieved papers
Curated dataset with deterministic answers and open-source evaluation harness

The authors provide a dataset with expert-verified answers and a complete evaluation framework. The benchmark includes multi-stage quality control involving 70 professional financial experts, rubric-based scoring guidelines, and an LLM-as-a-Judge evaluation protocol validated against human judgments.
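Because the dataset's answers are deterministic, an LLM-as-a-Judge protocol like the one described can reduce to asking a judge model for a binary verdict per item and aggregating accuracy. The sketch below is hypothetical: `call_judge_llm` stands in for any chat-completion API, the prompt wording and the `stub_judge` exact-match stand-in are assumptions for illustration, not the paper's actual rubric.

```python
# Hypothetical LLM-as-a-Judge scoring loop; `call_judge_llm` is a stand-in
# for a real chat-completion call and is injected as a parameter.

JUDGE_PROMPT = """You are grading a financial QA answer.
Question: {question}
Reference answer: {reference}
Model answer: {answer}
Reply with exactly one word: CORRECT or INCORRECT."""

def score_answers(items, call_judge_llm):
    """Return accuracy over (question, reference, answer) triples."""
    correct = 0
    for question, reference, answer in items:
        verdict = call_judge_llm(
            JUDGE_PROMPT.format(question=question, reference=reference, answer=answer)
        )
        correct += verdict.strip().upper().startswith("CORRECT")
    return correct / len(items)

# Stub judge for demonstration: exact string match against the reference.
def stub_judge(prompt):
    reference = prompt.split("Reference answer: ")[1].split("\n")[0]
    answer = prompt.split("Model answer: ")[1].split("\n")[0]
    return "CORRECT" if reference == answer else "INCORRECT"

items = [("Q1", "42", "42"), ("Q2", "3.5%", "3.4%")]
accuracy = score_answers(items, stub_judge)  # one of two answers matches -> 0.5
```

Validating such a judge against human graders, as the report says the authors do, amounts to computing agreement between the judge's verdicts and expert labels on a held-out sample.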

10 retrieved papers
Comprehensive evaluation study of 21 models with analysis of search and plugin effects

The authors evaluate 21 mainstream models on FinSearchComp, demonstrating that web search capabilities and financial plugins substantially improve performance. Their analysis identifies recurring failure modes such as shallow search, outdated evidence retrieval, and report-calendar misalignment, and shows that model origin significantly impacts performance on different market subsets.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

FinSearchComp benchmark for financial search and reasoning

The authors present FinSearchComp, a benchmark comprising 635 expert-curated queries spanning global and Greater China markets across three analyst-style task families (Time-Sensitive Data Fetching, Simple Historical Lookup, Complex Historical Investigation). It is designed to evaluate LLM-based agents on realistic financial search and reasoning tasks requiring tool use, time-sensitive data retrieval, and multi-source evidence integration.

Contribution

Curated dataset with deterministic answers and open-source evaluation harness

The authors provide a dataset with expert-verified answers and a complete evaluation framework. The benchmark includes multi-stage quality control involving 70 professional financial experts, rubric-based scoring guidelines, and an LLM-as-a-Judge evaluation protocol validated against human judgments.

Contribution

Comprehensive evaluation study of 21 models with analysis of search and plugin effects

The authors evaluate 21 mainstream models on FinSearchComp, demonstrating that web search capabilities and financial plugins substantially improve performance. Their analysis identifies recurring failure modes such as shallow search, outdated evidence retrieval, and report-calendar misalignment, and shows that model origin significantly impacts performance on different market subsets.