WideSearch: Benchmarking Agentic Broad Info-Seeking

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: LLM Evaluation, Info-Seeking Benchmark, Search Agent
Abstract:

From professional research to everyday planning, many tasks are bottlenecked by wide-scale information seeking, which is more repetitive than cognitively complex. With the rapid development of Large Language Models (LLMs), automated search agents powered by LLMs offer a promising way to liberate humans from this tedious work. However, whether these agents can perform such "wide-context" collection reliably and completely remains largely unevaluated, owing to a lack of suitable benchmarks. To bridge this gap, we introduce WideSearch, a new benchmark engineered to evaluate agent reliability on these large-scale collection tasks. The benchmark features 200 manually curated questions (100 in English, 100 in Chinese) drawn from over 15 diverse domains and grounded in real user queries. Each task requires agents to collect large-scale atomic information, each item of which can be verified objectively, and to arrange it into a well-organized output. A rigorous five-stage quality control pipeline ensures the difficulty, completeness, and verifiability of the dataset. We benchmark over 10 state-of-the-art agentic search systems, including single-agent frameworks, multi-agent frameworks, and end-to-end commercial systems. Most systems achieve overall success rates near 0%, with the best performer reaching just 7%. By contrast, given sufficient time, cross-validation by multiple human testers achieves a near-100% success rate. These results demonstrate that present search agents have critical deficiencies in large-scale information seeking, underscoring urgent areas for future research and development in agentic search.

Disclaimer
This report is AI-generated using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces WideSearch, a benchmark for evaluating LLM-powered search agents on large-scale, multi-entity information-gathering tasks. According to the taxonomy tree, this work occupies the 'Agentic Broad Information Seeking' leaf under 'Information Gathering Frameworks and Benchmarks'. Notably, this leaf contains only the original paper itself; no sibling papers are listed. This positioning suggests the work addresses a relatively sparse research direction within the broader field of information gathering and verification, which encompasses fifty papers across diverse topics including fact-checking, knowledge extraction, and misinformation detection.

The taxonomy reveals several neighboring research areas that provide context for this work. The closest branches include 'Multi-Entity Question Answering', 'Collaborative Information Aggregation', and 'Multi-Robot Coordination for Information Gathering', all within the same parent node. These directions emphasize structured question answering or coordinated retrieval rather than open-ended, wide-context collection. Meanwhile, the 'Fact-Checking and Claim Verification Systems' branch (the largest cluster with multiple subtopics) focuses on verifying specific claims rather than broad information seeking. WideSearch appears to bridge these areas by framing information gathering as an agentic exploration task requiring both breadth and verifiability.

Among thirty candidates examined through semantic search and citation expansion, the contribution-level analysis reveals mixed novelty signals. The core WideSearch benchmark (Contribution A) was compared against ten candidates with zero refutable overlaps, suggesting limited prior work on evaluating agentic broad information seeking. The five-stage quality control pipeline (Contribution B) was compared against ten candidates, one of which was judged a refutable match, indicating some methodological precedent in benchmark construction. The hybrid automated evaluation framework (Contribution C) was likewise compared against ten candidates with no refutations. These statistics reflect a focused search scope rather than exhaustive coverage, and the low refutation counts align with the sparse taxonomy positioning.

Based on the limited search scope of thirty semantically similar papers, the work appears to occupy a relatively underexplored niche at the intersection of agentic systems and broad information gathering. The taxonomy structure confirms that while fact-checking and entity extraction are well-populated areas, systematic evaluation of wide-context collection by autonomous agents remains sparse. However, this assessment is constrained by the top-K semantic matching approach and does not capture potential relevant work outside the examined candidate set or in adjacent communities such as web search evaluation or information retrieval benchmarking.

Taxonomy

This LLM-generated taxonomy tree may contain errors and therefore requires manual review; it could include omissions or duplicates.
Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: large-scale multi-entity information gathering and verification. The field encompasses a diverse set of approaches organized into six main branches:

- Fact-Checking and Claim Verification Systems: automated methods for assessing the truthfulness of statements, often leveraging knowledge bases and retrieval mechanisms (e.g., Computational Fact Checking[4], ChatGPT Fact-Checking[11]).
- Information Gathering Frameworks and Benchmarks: systematic approaches and evaluation protocols for broad information-seeking tasks.
- Knowledge Representation and Entity Relation Extraction: structured extraction of entities and their relationships from text (e.g., Beyond Entities[8], Entity Relation LLM[26]).
- Misinformation Detection and Tracking: identification and monitoring of false-information spread (e.g., MuMiN[10], Patient Zero Tracking[39]).
- Multi-Entity Coordination and Cooperation Systems: how multiple agents or entities can work together, sometimes using distributed ledger technologies (e.g., Distributed Ledger Cooperation[7]).
- Specialized Data Collection and Verification Systems: domain-specific challenges in gathering and validating information across varied contexts.

Several active lines of work reveal key trade-offs between automation and accuracy, and between breadth and depth of verification. Fact-checking systems must balance comprehensive evidence retrieval against computational efficiency, while misinformation detection efforts grapple with rapidly evolving tactics and cross-platform propagation.

WideSearch[0] sits within the Information Gathering Frameworks branch, specifically in the Agentic Broad Information Seeking cluster, emphasizing autonomous exploration across multiple entities at scale. Its approach contrasts with more narrowly scoped verification tools such as Fact-audit[3], which audits specific claims, and differs from retrieval-augmented methods such as LLM Retrieval Augmented[12] by prioritizing breadth of entity coverage. The work aligns closely with emerging frameworks that treat information gathering as an open-ended exploration problem rather than a targeted fact-checking task, reflecting ongoing questions about how to design systems that can handle the complexity and scale of real-world multi-entity scenarios.

Claimed Contributions

WideSearch benchmark for evaluating agentic broad information-seeking

The authors present WideSearch, the first benchmark specifically designed to evaluate LLM-based search agents on wide-context information gathering tasks. The benchmark contains 200 manually curated questions requiring agents to collect large-scale atomic information and arrange it into structured, verifiable outputs.

10 retrieved papers compared; no refutable overlap found.
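The atomic, individually verifiable items described above admit a strict all-or-nothing success criterion, consistent with the overall success rates reported in the abstract. A minimal sketch follows; the field names and the exact-equality check are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: an all-or-nothing per-task success criterion.
# Field names ("hq", "founded") are invented for illustration.

def task_success(pred_items: dict, gold_items: dict) -> bool:
    """A task counts as solved only if every gold item is reproduced exactly."""
    return all(pred_items.get(key) == value for key, value in gold_items.items())

def overall_success_rate(per_task: list) -> float:
    """Fraction of fully solved tasks across the benchmark."""
    return sum(per_task) / len(per_task)

solved = task_success({"hq": "Austin", "founded": 2019},
                      {"hq": "Austin", "founded": 2019})   # all items correct
partial = task_success({"hq": "Austin"},
                       {"hq": "Austin", "founded": 2019})  # one item missing -> fail
rate = overall_success_rate([solved, partial, False, False])
```

Under such a strict criterion, a single missing or wrong cell fails the whole task, which helps explain why per-item accuracy can be moderate while overall success rates sit near zero.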
Five-stage quality control pipeline for benchmark construction

The authors develop a multi-stage data curation and validation process that transforms real-world queries into standardized tasks. This pipeline includes sourcing and refinement, gold standard annotation, parametric knowledge filtering, difficulty-based pruning, and iterative refinement to ensure task quality and alignment between automated and human evaluation.

10 retrieved papers compared; one refutable match found (Can Refute).
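One stage named above, parametric knowledge filtering, can be sketched in code: a task is kept only if a model cannot already recover the gold answers closed-book, without search access. The function `query_llm_without_tools`, the recall threshold, and the example questions are all hypothetical; a constant stub stands in for the LLM call.

```python
# Hedged sketch of a parametric-knowledge filter for benchmark construction.
# `query_llm_without_tools` is a hypothetical hook; a stub stands in here.

def query_llm_without_tools(question: str) -> set:
    """Placeholder for an LLM call with search disabled (assumption)."""
    return {"Paris", "Berlin"}  # pretend closed-book answer set

def keep_task(question: str, gold_items: set, max_recall: float = 0.2) -> bool:
    """Keep a task only if closed-book recall of gold items stays low."""
    answered = query_llm_without_tools(question)
    recall = len(answered & gold_items) / len(gold_items)
    return recall <= max_recall

# A task mostly answerable from memory is dropped; one requiring search is kept.
easy = keep_task("EU capitals?", {"Paris", "Berlin", "Rome"})       # recall 2/3
hard = keep_task("2024 startup HQ cities?", {"Austin", "Tallinn"})  # recall 0
```

The design choice is that filtering rewards tasks whose answers must be gathered from the live web rather than recited, which keeps the benchmark a test of search rather than memorization.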
Hybrid automated evaluation framework with table alignment

The authors introduce an automated evaluation system that combines deterministic rule-based checks with semantic judgments from an LLM-as-a-judge. The framework performs syntax validation, table alignment using primary keys, and hybrid item-level scoring across multiple categories including exact match, numerical approximation, and URL matching.

10 retrieved papers compared; no refutable overlap found.
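The table-alignment and hybrid scoring steps described for Contribution C can be sketched as follows. The column names, the 1% numeric tolerance, and the URL normalization rules are illustrative assumptions; the actual framework additionally uses LLM-as-a-judge for semantic cells, which is omitted here.

```python
# Minimal sketch of key-aligned table scoring with typed cell matching.
# Tolerances and normalization rules are assumptions, not the paper's spec.
from urllib.parse import urlparse

def norm_url(u: str) -> str:
    """Normalize a URL: strip trailing slash, scheme, and leading 'www.'."""
    p = urlparse(u.strip().rstrip("/"))
    return (p.netloc.removeprefix("www.") + p.path).lower()

def cell_match(pred, gold, kind: str) -> bool:
    if kind == "number":  # numerical approximation within 1% (assumed tolerance)
        try:
            return abs(float(pred) - float(gold)) <= 0.01 * abs(float(gold))
        except ValueError:
            return False
    if kind == "url":     # URL matching after normalization
        return norm_url(str(pred)) == norm_url(str(gold))
    return str(pred).strip().lower() == str(gold).strip().lower()  # exact match

def score_table(pred_rows, gold_rows, key: str, kinds: dict) -> float:
    """Align rows by primary key, then return the fraction of gold cells matched."""
    pred_by_key = {row[key]: row for row in pred_rows}
    total = hit = 0
    for gold in gold_rows:
        pred = pred_by_key.get(gold[key], {})
        for col, kind in kinds.items():
            total += 1
            hit += bool(col in pred and cell_match(pred[col], gold[col], kind))
    return hit / total if total else 0.0

gold = [{"name": "A", "year": 2020, "url": "https://www.example.com/a/"},
        {"name": "B", "year": 2021, "url": "https://example.com/b"}]
pred = [{"name": "A", "year": "2020", "url": "https://example.com/a"},
        {"name": "B", "year": "1990", "url": "http://www.example.com/b/"}]
score = score_table(pred, gold, key="name", kinds={"year": "number", "url": "url"})
```

Aligning on a primary key before scoring means row order never matters and a hallucinated extra row cannot displace credit for a correctly retrieved entity.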

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Each of the three claimed contributions, restated under Claimed Contributions above, was compared against ten retrieved candidate papers. For the WideSearch benchmark itself and for the hybrid automated evaluation framework, no refutable overlaps were found. For the five-stage quality control pipeline, one candidate was judged a refutable match, indicating some methodological precedent in benchmark construction.