Abstract:

RAG (Retrieval-Augmented Generation) systems and web agents are increasingly evaluated on multi-hop deep search tasks, yet current practice suffers from two major limitations. First, most benchmarks leak the reasoning path in the question text, allowing models to follow surface cues rather than discover reasoning chains autonomously. Second, evaluation is typically reduced to a single pass rate, which collapses diverse behaviors into one score and obscures whether failures stem from inadequate search, poor knowledge use, or inappropriate refusal. To address these issues, we present WebDetective, a benchmark of hint-free multi-hop questions paired with a controlled Wikipedia sandbox that ensures full traceability of model actions, and a holistic evaluation framework that separates search sufficiency, knowledge utilization, and refusal behavior. Our evaluation of 25 state-of-the-art models reveals systematic weaknesses across all architectures: models struggle with knowledge utilization despite having sufficient evidence, and appropriate refusal when evidence is lacking is nearly absent. These patterns expose a fundamental gap: today's systems excel at executing given reasoning paths but fail when required to discover them. We develop EvidenceLoop, an agentic workflow that explicitly targets the challenges our benchmark identifies, incorporating verification loops and systematic evidence tracking that improve both search and synthesis capabilities. This baseline demonstrates that WebDetective's diagnostic framework can guide concrete architectural improvements, establishing the benchmark as a critical tool for developing genuinely autonomous reasoning systems rather than pattern-following agents.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces WebDetective, a benchmark for evaluating multi-hop question answering without reasoning path hints, alongside a holistic evaluation framework that separates search sufficiency, knowledge utilization, and refusal behavior. It resides in the 'Hint-Free and Path-Discovery Evaluation' leaf, which contains only two papers in total: this paper and MoreHopQA. This represents a sparse research direction within the broader Benchmark Construction and Dataset Design branch, suggesting the paper addresses a relatively underexplored evaluation paradigm compared to the more populated Static Benchmark Construction leaf (seven papers).

The taxonomy reveals neighboring work in Question Generation and Difficulty Control (four papers on automated question generation) and Static Benchmark Construction (seven papers on manually curated benchmarks). WebDetective diverges from these by emphasizing hint-free evaluation and controlled Wikipedia sandboxes for traceability, whereas sibling work MoreHopQA focuses on compositional understanding. The broader Retrieval and Evidence Gathering branch (nine papers across four leaves) and Reasoning and Inference branch (ten papers across four leaves) address complementary challenges in evidence collection and answer derivation, but do not directly tackle the evaluation methodology gap that WebDetective targets.

Among the fourteen candidates examined, none clearly refutes the three core contributions: three candidates were compared against the WebDetective benchmark, one against the holistic evaluation framework, and ten against the EvidenceLoop workflow, with zero refutations in each case. This suggests that, within the limited search scope, the specific combination of hint-free questions, a controlled Wikipedia sandbox, and factorized evaluation metrics (search/synthesis/refusal) is distinct from prior work. The evaluation framework's decomposition into three behavioral dimensions appears particularly novel among the candidates reviewed.

Based on the top fourteen semantic matches and the sparse taxonomy leaf (two papers), the work appears to occupy a relatively novel position in evaluation methodology. However, the limited search scope means this assessment reflects only a narrow slice of the potentially relevant literature: the analysis does not exhaustively examine all multi-hop QA benchmarks or evaluation frameworks, particularly those outside the top semantic matches or the citation network.

Taxonomy

Core-task Taxonomy Papers: 40
Claimed Contributions: 3
Contribution Candidate Papers Compared: 14
Refutable Papers: 0

Research Landscape Overview

Core task: evaluating multi-hop question answering without reasoning path hints.

The field of multi-hop question answering has evolved into a rich landscape organized around six major branches. Benchmark Construction and Dataset Design focuses on creating evaluation resources that test genuine multi-step reasoning without providing intermediate supervision, including works that emphasize hint-free evaluation and path-discovery challenges such as MoreHopQA[2] and Knowledge Crosswords[7]. Retrieval and Evidence Gathering addresses the challenge of locating relevant information across multiple documents or knowledge sources, with approaches ranging from graph-based methods like Wikipedia Graph Retrieval[9] to dense retrieval techniques such as Beam Dense Retrieval[14]. The Reasoning and Inference branch explores how systems combine retrieved evidence to produce answers, encompassing both neural architectures and prompting strategies like Decomposed Prompting[39]. Robustness and Generalization Analysis examines whether models truly perform multi-hop reasoning or exploit shortcuts, as investigated in Avoiding Reasoning Shortcuts[13]. Training and Optimization develops learning strategies that encourage genuine reasoning behavior, while Explainability and Interpretability seeks to make the reasoning process transparent.

Recent work reveals tension between retrieval complexity and reasoning depth, with some studies emphasizing iterative evidence gathering while others focus on end-to-end inference. Lost in Retrieval[3] highlights challenges when systems must navigate large document collections without explicit path guidance, while Relational Chain Reasoning[5] explores structured approaches to multi-step inference. Deep Search Evaluation[0] situates itself within the hint-free evaluation paradigm alongside Cofca[34], both examining how well systems discover reasoning paths autonomously rather than following provided scaffolding.
Compared to Cofca[34], which emphasizes compositional understanding, Deep Search Evaluation[0] appears to stress the evaluation methodology itself, probing whether models can identify necessary evidence hops without intermediate supervision signals that might artificially simplify the task.

Claimed Contributions

WebDetective benchmark with hint-free multi-hop questions

The authors present WebDetective, a new benchmark that removes both path-hinting and specification-hinting from multi-hop questions, pairing hint-free questions with a controlled Wikipedia sandbox environment. This design forces agents to autonomously discover reasoning chains rather than execute prescribed paths or filter candidates by stated attributes.

3 retrieved papers
Holistic evaluation framework factorising search, synthesis, and refusal

The authors develop an evaluation framework that separates three distinct capabilities: whether agents search sufficiently, whether they synthesise retrieved information effectively, and whether they appropriately refuse when evidence is lacking. This factorisation replaces single aggregate pass rates with diagnostic metrics that reveal specific failure modes.

1 retrieved paper
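To make the factorisation concrete, the sketch below shows one way such a three-dimensional scoring could be implemented. The `Trajectory` fields, metric names, and the exact-match answer check are assumptions for illustration only, not the paper's actual implementation; the key idea is that synthesis is only assessed when search was sufficient, and refusal is only credited when it was not.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical illustration of the factorized evaluation described above.
# All names and decision rules here are assumptions, not the paper's code.

@dataclass
class Trajectory:
    visited_pages: set      # pages the agent actually retrieved
    gold_evidence: set      # pages required to answer the question
    answer: Optional[str]   # None means the agent refused to answer
    gold_answer: str

def evaluate(t: Trajectory) -> dict:
    # Search sufficiency: did the agent reach every required evidence page?
    search_sufficient = t.gold_evidence <= t.visited_pages
    refused = t.answer is None
    correct = (not refused) and (
        t.answer.strip().lower() == t.gold_answer.strip().lower()
    )
    return {
        "search_sufficient": search_sufficient,
        # Synthesis is only assessable when the evidence was actually found.
        "synthesis_success": search_sufficient and correct,
        # Refusal is appropriate only when evidence was NOT fully gathered.
        "appropriate_refusal": refused and not search_sufficient,
    }
```

Under this decomposition, a single pass/fail outcome splits into diagnostic signals: a wrong answer with full evidence indicates a synthesis failure, while a confident answer without full evidence indicates a missing refusal.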
EvidenceLoop agentic workflow baseline

The authors introduce EvidenceLoop, an agentic workflow designed to address the challenges identified by their benchmark. It incorporates verification loops, systematic evidence tracking, context retention, and memory management to improve both search and synthesis capabilities in hint-free deep search scenarios.

10 retrieved papers
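The control flow implied by this description can be sketched minimally as follows. The function names (`search`, `verify_sufficient`, `synthesize`), the step budget, and the refusal-on-exhaustion behavior are assumptions standing in for model calls, not EvidenceLoop's actual interfaces.

```python
# A minimal, hypothetical sketch of an EvidenceLoop-style workflow:
# iterate search, accumulate tracked evidence, verify sufficiency each
# round, and refuse explicitly if the budget runs out without enough
# evidence. The callables are placeholders for model-backed components.

def evidence_loop(question, search, verify_sufficient, synthesize, max_steps=8):
    evidence = []                                  # systematic evidence tracking
    for _ in range(max_steps):
        results = search(question, evidence)       # query conditioned on notes so far
        evidence.extend(results)                   # retained context / memory
        if verify_sufficient(question, evidence):  # verification loop
            return synthesize(question, evidence)  # answer grounded in evidence
    return None                                    # refuse when evidence is lacking
```

The design choice illustrated here mirrors the benchmark's diagnostics: the sufficiency check gates synthesis (targeting knowledge-utilization failures), and the explicit `None` return makes refusal a first-class outcome rather than a degenerate answer.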

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

WebDetective benchmark with hint-free multi-hop questions

The authors present WebDetective, a new benchmark that removes both path-hinting and specification-hinting from multi-hop questions, pairing hint-free questions with a controlled Wikipedia sandbox environment. This design forces agents to autonomously discover reasoning chains rather than execute prescribed paths or filter candidates by stated attributes.

Contribution

Holistic evaluation framework factorising search, synthesis, and refusal

The authors develop an evaluation framework that separates three distinct capabilities: whether agents search sufficiently, whether they synthesise retrieved information effectively, and whether they appropriately refuse when evidence is lacking. This factorisation replaces single aggregate pass rates with diagnostic metrics that reveal specific failure modes.

Contribution

EvidenceLoop agentic workflow baseline

The authors introduce EvidenceLoop, an agentic workflow designed to address the challenges identified by their benchmark. It incorporates verification loops, systematic evidence tracking, context retention, and memory management to improve both search and synthesis capabilities in hint-free deep search scenarios.