WideSearch: Benchmarking Agentic Broad Info-Seeking

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: LLM Evaluation, Info-Seeking Benchmark, Search Agent
Abstract:

From professional research to everyday planning, many tasks are bottlenecked by wide-scale information seeking, which is more repetitive than cognitively complex. With the rapid development of Large Language Models (LLMs), automated search agents powered by LLMs offer a promising way to liberate humans from this tedious work. However, whether these agents can perform such "wide-context" collection reliably and completely remains largely unevaluated, owing to a lack of suitable benchmarks. To bridge this gap, we introduce WideSearch, a new benchmark engineered to evaluate agent reliability on these large-scale collection tasks. The benchmark features 200 manually curated questions (100 in English, 100 in Chinese) drawn from over 15 diverse domains and grounded in real user queries. Each task requires agents to collect large-scale atomic information, each item of which can be verified objectively, and to arrange it into a well-organized output. A rigorous five-stage quality control pipeline ensures the difficulty, completeness, and verifiability of the dataset. We benchmark over 10 state-of-the-art agentic search systems, including single-agent frameworks, multi-agent frameworks, and end-to-end commercial systems. Most systems achieve overall success rates near 0%, with the best performer reaching just 7%. By contrast, given sufficient time, cross-validation by multiple human testers achieves a near-100% success rate. These results demonstrate that present search agents have critical deficiencies in large-scale information seeking, underscoring urgent areas for future research and development in agentic search.

Disclaimer
This report is AI-generated using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces WideSearch, a benchmark for evaluating LLM-powered search agents on large-scale, multi-entity information-gathering tasks. According to the taxonomy tree, this work occupies the 'Agentic Broad Information Seeking' leaf under 'Information Gathering Frameworks and Benchmarks'. Notably, this leaf contains only the original paper itself; no sibling papers are listed. This positioning suggests the work addresses a relatively sparse research direction within the broader field of information gathering and verification, which encompasses fifty papers across diverse topics including fact-checking, knowledge extraction, and misinformation detection.

The taxonomy reveals several neighboring research areas that provide context for this work. The closest branches include 'Multi-Entity Question Answering', 'Collaborative Information Aggregation', and 'Multi-Robot Coordination for Information Gathering', all within the same parent node. These directions emphasize structured question answering or coordinated retrieval rather than open-ended, wide-context collection. Meanwhile, the 'Fact-Checking and Claim Verification Systems' branch (the largest cluster with multiple subtopics) focuses on verifying specific claims rather than broad information seeking. WideSearch appears to bridge these areas by framing information gathering as an agentic exploration task requiring both breadth and verifiability.

Among thirty candidates examined through semantic search and citation expansion, the contribution-level analysis reveals mixed novelty signals. The core WideSearch benchmark (Contribution A) was compared against ten candidates with zero refutable overlaps, suggesting limited prior work on evaluating agentic broad information seeking. The five-stage quality control pipeline (Contribution B) was compared against ten candidates, one of which was judged a refutable match, indicating some methodological precedent in benchmark construction. The hybrid automated evaluation framework (Contribution C) was likewise compared against ten candidates with no refutations. These statistics reflect a focused search scope rather than exhaustive coverage, and the low refutation counts align with the sparse taxonomy positioning.

Based on the limited search scope of thirty semantically similar papers, the work appears to occupy a relatively underexplored niche at the intersection of agentic systems and broad information gathering. The taxonomy structure confirms that while fact-checking and entity extraction are well-populated areas, systematic evaluation of wide-context collection by autonomous agents remains sparse. However, this assessment is constrained by the top-K semantic matching approach and does not capture potential relevant work outside the examined candidate set or in adjacent communities such as web search evaluation or information retrieval benchmarking.

Taxonomy

This LLM-generated taxonomy tree may contain errors and therefore requires manual review; it could include omissions or duplicates.
Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: large-scale multi-entity information gathering and verification. The field encompasses a diverse set of approaches organized into six main branches:

- Fact-Checking and Claim Verification Systems: automated methods for assessing the truthfulness of statements, often leveraging knowledge bases and retrieval mechanisms (e.g., Computational Fact Checking[4], ChatGPT Fact-Checking[11]).
- Information Gathering Frameworks and Benchmarks: systematic approaches and evaluation protocols for broad information-seeking tasks.
- Knowledge Representation and Entity Relation Extraction: structured extraction of entities and their relationships from text (e.g., Beyond Entities[8], Entity Relation LLM[26]).
- Misinformation Detection and Tracking: identification and monitoring of false-information spread (e.g., MuMiN[10], Patient Zero Tracking[39]).
- Multi-Entity Coordination and Cooperation Systems: how multiple agents or entities can work together, sometimes using distributed ledger technologies (e.g., Distributed Ledger Cooperation[7]).
- Specialized Data Collection and Verification Systems: domain-specific challenges in gathering and validating information across varied contexts.

Several active lines of work reveal key trade-offs between automation and accuracy, and between breadth and depth of verification. Fact-checking systems must balance comprehensive evidence retrieval against computational efficiency, while misinformation detection efforts grapple with rapidly evolving tactics and cross-platform propagation.

WideSearch[0] sits within the Information Gathering Frameworks branch, specifically in the Agentic Broad Information Seeking cluster, emphasizing autonomous exploration across multiple entities at scale. Its approach contrasts with more narrowly scoped verification tools such as Fact-audit[3], which audits specific claims, and differs from retrieval-augmented methods such as LLM Retrieval Augmented[12] by prioritizing breadth of entity coverage. The work aligns closely with emerging frameworks that treat information gathering as an open-ended exploration problem rather than a targeted fact-checking task, reflecting ongoing questions about how to design systems that can handle the complexity and scale of real-world multi-entity scenarios.

Claimed Contributions

WideSearch benchmark for evaluating agentic broad information-seeking

The authors present WideSearch, the first benchmark specifically designed to evaluate LLM-based search agents on wide-context information gathering tasks. The benchmark contains 200 manually curated questions requiring agents to collect large-scale atomic information and arrange it into structured, verifiable outputs.

10 retrieved papers compared; no refutable overlap found.
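The atomic, individually verifiable items described above admit a strict all-or-nothing success criterion, consistent with the overall success rates reported in the abstract. A minimal sketch follows; the field names and the exact-equality check are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: an all-or-nothing per-task success criterion.
# Field names ("hq", "founded") are invented for illustration.

def task_success(pred_items: dict, gold_items: dict) -> bool:
    """A task counts as solved only if every gold item is reproduced exactly."""
    return all(pred_items.get(key) == value for key, value in gold_items.items())

def overall_success_rate(per_task: list) -> float:
    """Fraction of fully solved tasks across the benchmark."""
    return sum(per_task) / len(per_task)

solved = task_success({"hq": "Austin", "founded": 2019},
                      {"hq": "Austin", "founded": 2019})   # all items correct
partial = task_success({"hq": "Austin"},
                       {"hq": "Austin", "founded": 2019})  # one item missing -> fail
rate = overall_success_rate([solved, partial, False, False])
```

Under such a strict criterion, a single missing or wrong cell fails the whole task, which helps explain why per-item accuracy can be moderate while overall success rates sit near zero.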
Five-stage quality control pipeline for benchmark construction

The authors develop a multi-stage data curation and validation process that transforms real-world queries into standardized tasks. This pipeline includes sourcing and refinement, gold standard annotation, parametric knowledge filtering, difficulty-based pruning, and iterative refinement to ensure task quality and alignment between automated and human evaluation.

10 retrieved papers compared; one refutable match found (Can Refute).
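One stage named above, parametric knowledge filtering, can be sketched in code: a task is kept only if a model cannot already recover the gold answers closed-book, without search access. The function `query_llm_without_tools`, the recall threshold, and the example questions are all hypothetical; a constant stub stands in for the LLM call.

```python
# Hedged sketch of a parametric-knowledge filter for benchmark construction.
# `query_llm_without_tools` is a hypothetical hook; a stub stands in here.

def query_llm_without_tools(question: str) -> set:
    """Placeholder for an LLM call with search disabled (assumption)."""
    return {"Paris", "Berlin"}  # pretend closed-book answer set

def keep_task(question: str, gold_items: set, max_recall: float = 0.2) -> bool:
    """Keep a task only if closed-book recall of gold items stays low."""
    answered = query_llm_without_tools(question)
    recall = len(answered & gold_items) / len(gold_items)
    return recall <= max_recall

# A task mostly answerable from memory is dropped; one requiring search is kept.
easy = keep_task("EU capitals?", {"Paris", "Berlin", "Rome"})       # recall 2/3
hard = keep_task("2024 startup HQ cities?", {"Austin", "Tallinn"})  # recall 0
```

The design choice is that filtering rewards tasks whose answers must be gathered from the live web rather than recited, which keeps the benchmark a test of search rather than memorization.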
Hybrid automated evaluation framework with table alignment

The authors introduce an automated evaluation system that combines deterministic rule-based checks with semantic judgments from an LLM-as-a-judge. The framework performs syntax validation, table alignment using primary keys, and hybrid item-level scoring across multiple categories including exact match, numerical approximation, and URL matching.

10 retrieved papers compared; no refutable overlap found.
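The table-alignment and hybrid scoring steps described for Contribution C can be sketched as follows. The column names, the 1% numeric tolerance, and the URL normalization rules are illustrative assumptions; the actual framework additionally uses LLM-as-a-judge for semantic cells, which is omitted here.

```python
# Minimal sketch of key-aligned table scoring with typed cell matching.
# Tolerances and normalization rules are assumptions, not the paper's spec.
from urllib.parse import urlparse

def norm_url(u: str) -> str:
    """Normalize a URL: strip trailing slash, scheme, and leading 'www.'."""
    p = urlparse(u.strip().rstrip("/"))
    return (p.netloc.removeprefix("www.") + p.path).lower()

def cell_match(pred, gold, kind: str) -> bool:
    if kind == "number":  # numerical approximation within 1% (assumed tolerance)
        try:
            return abs(float(pred) - float(gold)) <= 0.01 * abs(float(gold))
        except ValueError:
            return False
    if kind == "url":     # URL matching after normalization
        return norm_url(str(pred)) == norm_url(str(gold))
    return str(pred).strip().lower() == str(gold).strip().lower()  # exact match

def score_table(pred_rows, gold_rows, key: str, kinds: dict) -> float:
    """Align rows by primary key, then return the fraction of gold cells matched."""
    pred_by_key = {row[key]: row for row in pred_rows}
    total = hit = 0
    for gold in gold_rows:
        pred = pred_by_key.get(gold[key], {})
        for col, kind in kinds.items():
            total += 1
            hit += bool(col in pred and cell_match(pred[col], gold[col], kind))
    return hit / total if total else 0.0

gold = [{"name": "A", "year": 2020, "url": "https://www.example.com/a/"},
        {"name": "B", "year": 2021, "url": "https://example.com/b"}]
pred = [{"name": "A", "year": "2020", "url": "https://example.com/a"},
        {"name": "B", "year": "1990", "url": "http://www.example.com/b/"}]
score = score_table(pred, gold, key="name", kinds={"year": "number", "url": "url"})
```

Aligning on a primary key before scoring means row order never matters and a hallucinated extra row cannot displace credit for a correctly retrieved entity.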

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Each of the three claimed contributions, restated under Claimed Contributions above, was compared against ten retrieved candidate papers. For the WideSearch benchmark itself and for the hybrid automated evaluation framework, no refutable overlaps were found. For the five-stage quality control pipeline, one candidate was judged a refutable match, indicating some methodological precedent in benchmark construction.