Abstract:

RAG (Retrieval-Augmented Generation) systems and web agents are increasingly evaluated on multi-hop deep search tasks, yet current practice suffers from two major limitations. First, most benchmarks leak the reasoning path in the question text, allowing models to follow surface cues rather than discover reasoning chains autonomously. Second, evaluation is typically reduced to a single pass rate, which collapses diverse behaviors into one score and obscures whether failures stem from inadequate search, poor knowledge use, or inappropriate refusal. To address these issues, we present WebDetective, a benchmark of hint-free multi-hop questions paired with a controlled Wikipedia sandbox that ensures full traceability of model actions, and a holistic evaluation framework that separates search sufficiency, knowledge utilization, and refusal behavior. Our evaluation of 25 state-of-the-art models reveals systematic weaknesses across all architectures: models struggle with knowledge utilization despite having sufficient evidence, and appropriate refusal when evidence is lacking is nearly absent. These patterns expose a fundamental gap: today's systems excel at executing given reasoning paths but fail when required to discover them. We develop EvidenceLoop, an agentic workflow that explicitly targets the challenges our benchmark identifies, incorporating verification loops and systematic evidence tracking that improve both search and synthesis capabilities. This baseline demonstrates that WebDetective's diagnostic framework can guide concrete architectural improvements, establishing the benchmark as a critical tool for developing genuinely autonomous reasoning systems rather than pattern-following agents.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces WebDetective, a benchmark for evaluating multi-hop question answering without reasoning path hints, alongside a holistic evaluation framework that separates search sufficiency, knowledge utilization, and refusal behavior. It resides in the 'Hint-Free and Path-Discovery Evaluation' leaf, which contains only two papers in total: this paper and MoreHopQA. This represents a sparse research direction within the broader Benchmark Construction and Dataset Design branch, suggesting the paper addresses a relatively underexplored evaluation paradigm compared to the more populated Static Benchmark Construction leaf (seven papers).

The taxonomy reveals neighboring work in Question Generation and Difficulty Control (four papers on automated question generation) and Static Benchmark Construction (seven papers on manually curated benchmarks). WebDetective diverges from these by emphasizing hint-free evaluation and controlled Wikipedia sandboxes for traceability, whereas sibling work MoreHopQA focuses on compositional understanding. The broader Retrieval and Evidence Gathering branch (nine papers across four leaves) and Reasoning and Inference branch (ten papers across four leaves) address complementary challenges in evidence collection and answer derivation, but do not directly tackle the evaluation methodology gap that WebDetective targets.

Among the fourteen candidates examined, none clearly refutes the three core contributions: three candidates were compared against the WebDetective benchmark, one against the holistic evaluation framework, and ten against the EvidenceLoop workflow, with zero refutations in each case. This suggests that, within the limited search scope, the specific combination of hint-free questions, a controlled Wikipedia sandbox, and factorized evaluation metrics (search/synthesis/refusal) is distinct from prior work. The evaluation framework's decomposition into three behavioral dimensions appears particularly novel among the candidates reviewed.

Based on the top fourteen semantic matches and the sparse taxonomy leaf (two papers), the work appears to occupy a relatively novel position in evaluation methodology. However, the limited search scope means this assessment reflects only a narrow slice of the potentially relevant literature: the analysis does not exhaustively examine all multi-hop QA benchmarks or evaluation frameworks, particularly those outside the top semantic matches or the citation network.

Taxonomy

Core-task Taxonomy Papers: 40
Claimed Contributions: 3
Contribution Candidate Papers Compared: 14
Refutable Papers: 0

Research Landscape Overview

Core task: evaluating multi-hop question answering without reasoning path hints.

The field of multi-hop question answering has evolved into a rich landscape organized around six major branches. Benchmark Construction and Dataset Design focuses on creating evaluation resources that test genuine multi-step reasoning without providing intermediate supervision, including works that emphasize hint-free evaluation and path-discovery challenges such as MoreHopQA[2] and Knowledge Crosswords[7]. Retrieval and Evidence Gathering addresses the challenge of locating relevant information across multiple documents or knowledge sources, with approaches ranging from graph-based methods like Wikipedia Graph Retrieval[9] to dense retrieval techniques such as Beam Dense Retrieval[14]. The Reasoning and Inference branch explores how systems combine retrieved evidence to produce answers, encompassing both neural architectures and prompting strategies like Decomposed Prompting[39]. Robustness and Generalization Analysis examines whether models truly perform multi-hop reasoning or exploit shortcuts, as investigated in Avoiding Reasoning Shortcuts[13]. Training and Optimization develops learning strategies that encourage genuine reasoning behavior, while Explainability and Interpretability seeks to make the reasoning process transparent.

Recent work reveals tension between retrieval complexity and reasoning depth, with some studies emphasizing iterative evidence gathering while others focus on end-to-end inference. Lost in Retrieval[3] highlights challenges when systems must navigate large document collections without explicit path guidance, while Relational Chain Reasoning[5] explores structured approaches to multi-step inference. Deep Search Evaluation[0] situates itself within the hint-free evaluation paradigm alongside Cofca[34], both examining how well systems discover reasoning paths autonomously rather than following provided scaffolding.
Compared to Cofca[34], which emphasizes compositional understanding, Deep Search Evaluation[0] appears to stress the evaluation methodology itself, probing whether models can identify necessary evidence hops without intermediate supervision signals that might artificially simplify the task.

Claimed Contributions

WebDetective benchmark with hint-free multi-hop questions

The authors present WebDetective, a new benchmark that removes both path-hinting and specification-hinting from multi-hop questions, pairing hint-free questions with a controlled Wikipedia sandbox environment. This design forces agents to autonomously discover reasoning chains rather than execute prescribed paths or filter candidates by stated attributes.

3 retrieved papers
Holistic evaluation framework factorising search, synthesis, and refusal

The authors develop an evaluation framework that separates three distinct capabilities: whether agents search sufficiently, whether they synthesise retrieved information effectively, and whether they appropriately refuse when evidence is lacking. This factorisation replaces single aggregate pass rates with diagnostic metrics that reveal specific failure modes.

1 retrieved paper
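To make the factorisation concrete, the sketch below shows one way such a three-dimensional scoring could be implemented. The `Trajectory` fields, metric names, and the exact-match answer check are assumptions for illustration only, not the paper's actual implementation; the key idea is that synthesis is only assessed when search was sufficient, and refusal is only credited when it was not.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical illustration of the factorized evaluation described above.
# All names and decision rules here are assumptions, not the paper's code.

@dataclass
class Trajectory:
    visited_pages: set      # pages the agent actually retrieved
    gold_evidence: set      # pages required to answer the question
    answer: Optional[str]   # None means the agent refused to answer
    gold_answer: str

def evaluate(t: Trajectory) -> dict:
    # Search sufficiency: did the agent reach every required evidence page?
    search_sufficient = t.gold_evidence <= t.visited_pages
    refused = t.answer is None
    correct = (not refused) and (
        t.answer.strip().lower() == t.gold_answer.strip().lower()
    )
    return {
        "search_sufficient": search_sufficient,
        # Synthesis is only assessable when the evidence was actually found.
        "synthesis_success": search_sufficient and correct,
        # Refusal is appropriate only when evidence was NOT fully gathered.
        "appropriate_refusal": refused and not search_sufficient,
    }
```

Under this decomposition, a single pass/fail outcome splits into diagnostic signals: a wrong answer with full evidence indicates a synthesis failure, while a confident answer without full evidence indicates a missing refusal.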
EvidenceLoop agentic workflow baseline

The authors introduce EvidenceLoop, an agentic workflow designed to address the challenges identified by their benchmark. It incorporates verification loops, systematic evidence tracking, context retention, and memory management to improve both search and synthesis capabilities in hint-free deep search scenarios.

10 retrieved papers
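The control flow implied by this description can be sketched minimally as follows. The function names (`search`, `verify_sufficient`, `synthesize`), the step budget, and the refusal-on-exhaustion behavior are assumptions standing in for model calls, not EvidenceLoop's actual interfaces.

```python
# A minimal, hypothetical sketch of an EvidenceLoop-style workflow:
# iterate search, accumulate tracked evidence, verify sufficiency each
# round, and refuse explicitly if the budget runs out without enough
# evidence. The callables are placeholders for model-backed components.

def evidence_loop(question, search, verify_sufficient, synthesize, max_steps=8):
    evidence = []                                  # systematic evidence tracking
    for _ in range(max_steps):
        results = search(question, evidence)       # query conditioned on notes so far
        evidence.extend(results)                   # retained context / memory
        if verify_sufficient(question, evidence):  # verification loop
            return synthesize(question, evidence)  # answer grounded in evidence
    return None                                    # refuse when evidence is lacking
```

The design choice illustrated here mirrors the benchmark's diagnostics: the sufficiency check gates synthesis (targeting knowledge-utilization failures), and the explicit `None` return makes refusal a first-class outcome rather than a degenerate answer.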

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

WebDetective benchmark with hint-free multi-hop questions

The authors present WebDetective, a new benchmark that removes both path-hinting and specification-hinting from multi-hop questions, pairing hint-free questions with a controlled Wikipedia sandbox environment. This design forces agents to autonomously discover reasoning chains rather than execute prescribed paths or filter candidates by stated attributes.

Contribution

Holistic evaluation framework factorising search, synthesis, and refusal

The authors develop an evaluation framework that separates three distinct capabilities: whether agents search sufficiently, whether they synthesise retrieved information effectively, and whether they appropriately refuse when evidence is lacking. This factorisation replaces single aggregate pass rates with diagnostic metrics that reveal specific failure modes.

Contribution

EvidenceLoop agentic workflow baseline

The authors introduce EvidenceLoop, an agentic workflow designed to address the challenges identified by their benchmark. It incorporates verification loops, systematic evidence tracking, context retention, and memory management to improve both search and synthesis capabilities in hint-free deep search scenarios.