DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

ICLR 2026 Conference Submission · Anonymous Authors
LLM-based Agent · Evaluation · Deep Research
Abstract:

Deep Research Agents (DRAs) are emerging as one of the most practical classes of LLM-based agents. Given an open-ended research task, they find, analyze, and synthesize large numbers of online sources to produce a comprehensive report at the level of a research analyst. This can compress hours of manual desk research into minutes. However, a comprehensive benchmark for systematically evaluating the capabilities of these agents remains absent. To bridge this gap, we introduce DeepResearch Bench, a benchmark consisting of 100 PhD-level research tasks, each meticulously crafted by domain experts across 22 distinct fields. To evaluate DRAs comprehensively, we propose two complementary and fully automated methodologies. The first is a reference-based method with adaptive criteria to assess the quality of generated research reports. The second evaluates a DRA’s information‑retrieval and collection capabilities by assessing its effective citation count and overall citation accuracy. By conducting extensive human consistency experiments, we demonstrate that our evaluation methods are highly aligned with expert judges and faithfully reflect human judgments of quality differences among DRA-generated content. We are open-sourcing DeepResearch Bench and key components of these frameworks to accelerate the development of practical LLM-based agents.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces DeepResearch Bench, a benchmark comprising 100 PhD-level research tasks across 22 fields, alongside two automated evaluation methodologies (RACE for report quality, FACT for citation accuracy). It resides in the 'General Deep Research Benchmarks' leaf, which contains six papers total, indicating a moderately populated research direction. This leaf sits within the broader 'Benchmark Design and Construction' branch, suggesting the paper contributes to an active area focused on standardizing evaluation infrastructure for deep research agents.

The taxonomy reveals neighboring leaves addressing domain-specific benchmarks (medicine, finance, scientific research) and machine learning experimentation tasks, as well as a separate branch for evaluation methodologies. DeepResearch Bench bridges benchmark construction and evaluation methods by proposing both a dataset and assessment frameworks. Its emphasis on general-purpose, cross-domain tasks distinguishes it from domain-specific benchmarks, while its automated evaluation approach connects to the 'Automated Evaluation Frameworks' leaf under 'Evaluation Methodologies,' though it remains classified primarily as a benchmark contribution.

Among the 30 candidates examined, none clearly refutes the three core contributions. For the DeepResearch Bench dataset, 10 candidates were examined with zero refutable overlaps; likewise, 10 candidates were examined for each of the RACE and FACT evaluation frameworks, with no refutations. This suggests that, within the limited search scope, the specific combination of PhD-level task design, adaptive reference-based evaluation, and citation-accuracy metrics appears relatively novel. However, the presence of five sibling papers in the same taxonomy leaf indicates that general deep research benchmarking is an established direction with existing proposals.

Based on the top-30 semantic matches and the taxonomy structure, the work appears to offer a distinct contribution to a moderately crowded benchmark landscape. The lack of refutable overlaps across all three contributions within this limited scope suggests differentiation from prior work, though the analysis is not exhaustive and does not cover adjacent evaluation-methodology papers outside the examined candidates. The taxonomy context indicates the paper extends an active research thread rather than opening an entirely new direction.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Evaluating deep research agents. The field has organized itself around several complementary dimensions. Benchmark Design and Construction focuses on creating standardized testbeds that capture the complexity of research tasks, ranging from general deep research challenges to domain-specific scenarios in medicine, finance, and machine learning. Evaluation Methodologies develops principled frameworks for assessing agent capabilities, including rubric-based scoring and comparative analysis approaches. Agent Training and Optimization explores reinforcement learning and other techniques to improve agent performance, while Agent Architectures and Systems examines the structural designs that enable effective research behavior. Conceptual Foundations and Surveys provide theoretical grounding and landscape overviews, such as Deep Research Survey[4] and Characterizing Deep Research[7]. Improving Existing Systems targets incremental enhancements to deployed agents, and Non-Research Agent Deep Learning addresses related but distinct applications of deep learning in agentic contexts.

Within Benchmark Design and Construction, a particularly active cluster has emerged around general deep research benchmarks that attempt to capture the full scope of research activities, from literature review and hypothesis generation to experimental design and report writing. Works like Deep Research Bench[2], ResearcherBench[5], and LiveResearchBench[13] each propose different task formulations and evaluation protocols, reflecting ongoing debates about what constitutes a faithful representation of research work.

DeepResearch Bench[0] situates itself squarely in this general benchmark cluster, emphasizing comprehensive evaluation across multiple research stages. Compared to ResearcherBench[5], which may focus on particular research subtasks, and LiveResearchBench[13], which incorporates dynamic or real-time elements, DeepResearch Bench[0] appears to prioritize breadth and standardization in capturing the research process, contributing another perspective to the evolving question of how best to measure deep research agent capabilities.

Claimed Contributions

DeepResearch Bench benchmark dataset

A specialized benchmark for evaluating Deep Research Agents, constructed through large-scale analysis of over 96,000 real user queries together with domain-expert collaboration. The benchmark contains 100 tasks across 22 domains, designed to be challenging while reflecting authentic user needs.

10 retrieved papers
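To make the dataset description above concrete, the minimal sketch below shows one plausible way a benchmark task could be represented and loaded. The field names (task_id, domain, prompt, language) and the JSONL layout are assumptions for illustration only, not the released schema.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class BenchmarkTask:
    """One DeepResearch Bench task (illustrative field names, not the released schema)."""
    task_id: str            # e.g. "finance_007" (hypothetical identifier)
    domain: str             # one of the 22 expert-curated domains
    prompt: str             # the PhD-level research question posed to the agent
    language: str = "en"    # assumption: tasks may ship in more than one language


def load_tasks(path: str) -> List[BenchmarkTask]:
    """Load tasks from a JSONL file, one JSON object per line (assumed distribution format)."""
    tasks: List[BenchmarkTask] = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                tasks.append(BenchmarkTask(**json.loads(line)))
    return tasks
```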
RACE evaluation framework

A Reference-based and Adaptive Criteria-driven Evaluation framework with Dynamic Weighting that assesses research report quality. The framework dynamically generates task-specific weights and criteria, then employs reference-based scoring to evaluate reports across four dimensions: Comprehensiveness, Insight/Depth, Instruction-Following, and Readability.

10 retrieved papers
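The sketch below illustrates the reference-relative, dynamically weighted scoring idea in the RACE description above. The four dimension names follow that description; the specific relative-scoring formula, the weight normalization, and the 0-10 judge scores in the usage example are assumptions for illustration, not the paper's exact procedure.

```python
from typing import Dict

DIMENSIONS = ("comprehensiveness", "insight_depth", "instruction_following", "readability")


def race_score(
    target: Dict[str, float],     # judge scores for the evaluated report, per dimension
    reference: Dict[str, float],  # judge scores for the reference report on the same task
    weights: Dict[str, float],    # task-specific dimension weights generated per task
    eps: float = 1e-9,
) -> float:
    """Reference-relative, weight-combined score (illustrative sketch).

    Each dimension is scored relative to the reference report, then the relative
    scores are combined with task-specific weights normalized to sum to 1.
    """
    total_w = sum(weights[d] for d in DIMENSIONS)
    score = 0.0
    for d in DIMENSIONS:
        relative = target[d] / (target[d] + reference[d] + eps)  # in (0, 1)
        score += (weights[d] / total_w) * relative
    return score


# Usage example: a report that matches the reference on most dimensions scores near 0.5.
print(race_score(
    target={"comprehensiveness": 8, "insight_depth": 6, "instruction_following": 9, "readability": 7},
    reference={"comprehensiveness": 8, "insight_depth": 8, "instruction_following": 9, "readability": 7},
    weights={"comprehensiveness": 0.35, "insight_depth": 0.35, "instruction_following": 0.2, "readability": 0.1},
))
```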
FACT evaluation framework

A framework for Factual Abundance and Citation Trustworthiness that evaluates Deep Research Agents' information-retrieval and collection capabilities by assessing effective citation count and overall citation accuracy.

10 retrieved papers
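A minimal sketch of how the two FACT quantities named above could be computed from extracted statement-citation pairs follows. The CitedStatement structure, the boolean support verdict, and the per-report aggregation are assumptions for illustration; the full pipeline (statement extraction, page retrieval, judge verdicts, deduplication) is not reproduced here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CitedStatement:
    """A statement-URL pair extracted from a generated report, with a support verdict.
    In a full pipeline the verdict would come from fetching the page and querying a judge model."""
    statement: str
    url: str
    supported: bool


def fact_metrics(pairs: List[CitedStatement]) -> Dict[str, float]:
    """Illustrative FACT-style metrics for one report.

    citation_accuracy: fraction of checked statement-citation pairs that are supported.
    effective_citations: count of supported statement-citation pairs.
    """
    if not pairs:
        return {"citation_accuracy": 0.0, "effective_citations": 0.0}
    supported = sum(1 for p in pairs if p.supported)
    return {
        "citation_accuracy": supported / len(pairs),
        "effective_citations": float(supported),
    }
```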

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: DeepResearch Bench benchmark dataset. 10 candidate papers were compared against this contribution; no refutable overlap was identified.

Contribution 2: RACE evaluation framework. 10 candidate papers were compared against this contribution; no refutable overlap was identified.

Contribution 3: FACT evaluation framework. 10 candidate papers were compared against this contribution; no refutable overlap was identified.
