DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

ICLR 2026 Conference Submission · Anonymous Authors
LLM-based Agent · Evaluation · Deep Research
Abstract:

Deep Research Agents (DRAs) are emerging as one of the most practical classes of LLM-based agents. Given an open-ended research task, they find, analyze, and synthesize large numbers of online sources to produce a comprehensive report at the level of a research analyst. This can compress hours of manual desk research into minutes. However, a comprehensive benchmark for systematically evaluating the capabilities of these agents remains absent. To bridge this gap, we introduce DeepResearch Bench, a benchmark consisting of 100 PhD-level research tasks, each meticulously crafted by domain experts across 22 distinct fields. To evaluate DRAs comprehensively, we propose two complementary and fully automated methodologies. The first is a reference-based method with adaptive criteria to assess the quality of generated research reports. The second evaluates a DRA’s information‑retrieval and collection capabilities by assessing its effective citation count and overall citation accuracy. By conducting extensive human consistency experiments, we demonstrate that our evaluation methods are highly aligned with expert judges and faithfully reflect human judgments of quality differences among DRA-generated content. We are open-sourcing DeepResearch Bench and key components of these frameworks to accelerate the development of practical LLM-based agents.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces DeepResearch Bench, a benchmark comprising 100 PhD-level research tasks across 22 fields, alongside two automated evaluation methodologies (RACE for report quality, FACT for citation accuracy). It resides in the 'General Deep Research Benchmarks' leaf, which contains six papers total, indicating a moderately populated research direction. This leaf sits within the broader 'Benchmark Design and Construction' branch, suggesting the paper contributes to an active area focused on standardizing evaluation infrastructure for deep research agents.

The taxonomy reveals neighboring leaves addressing domain-specific benchmarks (medicine, finance, scientific research) and machine learning experimentation tasks, as well as a separate branch for evaluation methodologies. DeepResearch Bench bridges benchmark construction and evaluation methods by proposing both a dataset and assessment frameworks. Its emphasis on general-purpose, cross-domain tasks distinguishes it from domain-specific benchmarks, while its automated evaluation approach connects to the 'Automated Evaluation Frameworks' leaf under 'Evaluation Methodologies,' though it remains classified primarily as a benchmark contribution.

Among the 30 candidates examined, none clearly refutes the three core contributions. For the DeepResearch Bench dataset, 10 candidates were examined with zero refutable overlaps; likewise, 10 candidates were examined for each of the RACE and FACT evaluation frameworks, with no refutations. This suggests that, within the limited search scope, the specific combination of PhD-level task design, adaptive reference-based evaluation, and citation-accuracy metrics appears relatively novel. However, the presence of five sibling papers in the same taxonomy leaf indicates that general deep research benchmarking is an established direction with existing proposals.

Based on the top-30 semantic matches and the taxonomy structure, the work appears to offer a distinct contribution to a moderately crowded benchmark landscape. The lack of refutable overlaps across all three contributions within this limited scope suggests differentiation from prior work, though the analysis is not exhaustive and does not cover adjacent evaluation-methodology papers outside the examined candidates. The taxonomy context indicates the paper extends an active research thread rather than opening an entirely new direction.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Evaluating deep research agents. The field has organized itself around several complementary dimensions. Benchmark Design and Construction focuses on creating standardized testbeds that capture the complexity of research tasks, ranging from general deep research challenges to domain-specific scenarios in medicine, finance, and machine learning. Evaluation Methodologies develops principled frameworks for assessing agent capabilities, including rubric-based scoring and comparative analysis approaches. Agent Training and Optimization explores reinforcement learning and other techniques to improve agent performance, while Agent Architectures and Systems examines the structural designs that enable effective research behavior. Conceptual Foundations and Surveys provide theoretical grounding and landscape overviews, such as Deep Research Survey[4] and Characterizing Deep Research[7]. Improving Existing Systems targets incremental enhancements to deployed agents, and Non-Research Agent Deep Learning addresses related but distinct applications of deep learning in agentic contexts.

Within Benchmark Design and Construction, a particularly active cluster has emerged around general deep research benchmarks that attempt to capture the full scope of research activities, from literature review and hypothesis generation to experimental design and report writing. Works like Deep Research Bench[2], ResearcherBench[5], and LiveResearchBench[13] each propose different task formulations and evaluation protocols, reflecting ongoing debates about what constitutes a faithful representation of research work.

DeepResearch Bench[0] situates itself squarely in this general benchmark cluster, emphasizing comprehensive evaluation across multiple research stages. Compared to ResearcherBench[5], which may focus on particular research subtasks, and LiveResearchBench[13], which incorporates dynamic or real-time elements, DeepResearch Bench[0] appears to prioritize breadth and standardization in capturing the research process, contributing another perspective to the evolving question of how best to measure deep research agent capabilities.

Claimed Contributions

DeepResearch Bench benchmark dataset

A specialized benchmark for evaluating Deep Research Agents, constructed through large-scale analysis of over 96,000 real user queries together with domain-expert collaboration. The benchmark contains 100 tasks across 22 domains, designed to be challenging while reflecting authentic user needs.

10 retrieved papers
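To make the dataset description above concrete, the minimal sketch below shows one plausible way a benchmark task could be represented and loaded. The field names (task_id, domain, prompt, language) and the JSONL layout are assumptions for illustration only, not the released schema.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class BenchmarkTask:
    """One DeepResearch Bench task (illustrative field names, not the released schema)."""
    task_id: str            # e.g. "finance_007" (hypothetical identifier)
    domain: str             # one of the 22 expert-curated domains
    prompt: str             # the PhD-level research question posed to the agent
    language: str = "en"    # assumption: tasks may ship in more than one language


def load_tasks(path: str) -> List[BenchmarkTask]:
    """Load tasks from a JSONL file, one JSON object per line (assumed distribution format)."""
    tasks: List[BenchmarkTask] = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                tasks.append(BenchmarkTask(**json.loads(line)))
    return tasks
```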
RACE evaluation framework

A Reference-based and Adaptive Criteria-driven Evaluation framework with Dynamic Weighting that assesses research report quality. The framework dynamically generates task-specific weights and criteria, then employs reference-based scoring to evaluate reports across four dimensions: Comprehensiveness, Insight/Depth, Instruction-Following, and Readability.

10 retrieved papers
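The sketch below illustrates the reference-relative, dynamically weighted scoring idea in the RACE description above. The four dimension names follow that description; the specific relative-scoring formula, the weight normalization, and the 0-10 judge scores in the usage example are assumptions for illustration, not the paper's exact procedure.

```python
from typing import Dict

DIMENSIONS = ("comprehensiveness", "insight_depth", "instruction_following", "readability")


def race_score(
    target: Dict[str, float],     # judge scores for the evaluated report, per dimension
    reference: Dict[str, float],  # judge scores for the reference report on the same task
    weights: Dict[str, float],    # task-specific dimension weights generated per task
    eps: float = 1e-9,
) -> float:
    """Reference-relative, weight-combined score (illustrative sketch).

    Each dimension is scored relative to the reference report, then the relative
    scores are combined with task-specific weights normalized to sum to 1.
    """
    total_w = sum(weights[d] for d in DIMENSIONS)
    score = 0.0
    for d in DIMENSIONS:
        relative = target[d] / (target[d] + reference[d] + eps)  # in (0, 1)
        score += (weights[d] / total_w) * relative
    return score


# Usage example: a report that matches the reference on most dimensions scores near 0.5.
print(race_score(
    target={"comprehensiveness": 8, "insight_depth": 6, "instruction_following": 9, "readability": 7},
    reference={"comprehensiveness": 8, "insight_depth": 8, "instruction_following": 9, "readability": 7},
    weights={"comprehensiveness": 0.35, "insight_depth": 0.35, "instruction_following": 0.2, "readability": 0.1},
))
```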
FACT evaluation framework

A framework for Factual Abundance and Citation Trustworthiness that evaluates Deep Research Agents' information-retrieval and collection capabilities by assessing effective citation count and overall citation accuracy.

10 retrieved papers
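A minimal sketch of how the two FACT quantities named above could be computed from extracted statement-citation pairs follows. The CitedStatement structure, the boolean support verdict, and the per-report aggregation are assumptions for illustration; the full pipeline (statement extraction, page retrieval, judge verdicts, deduplication) is not reproduced here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CitedStatement:
    """A statement-URL pair extracted from a generated report, with a support verdict.
    In a full pipeline the verdict would come from fetching the page and querying a judge model."""
    statement: str
    url: str
    supported: bool


def fact_metrics(pairs: List[CitedStatement]) -> Dict[str, float]:
    """Illustrative FACT-style metrics for one report.

    citation_accuracy: fraction of checked statement-citation pairs that are supported.
    effective_citations: count of supported statement-citation pairs.
    """
    if not pairs:
        return {"citation_accuracy": 0.0, "effective_citations": 0.0}
    supported = sum(1 for p in pairs if p.supported)
    return {
        "citation_accuracy": supported / len(pairs),
        "effective_citations": float(supported),
    }
```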

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: DeepResearch Bench benchmark dataset. 10 candidate papers were compared against this contribution; no refutable overlap was identified.

Contribution 2: RACE evaluation framework. 10 candidate papers were compared against this contribution; no refutable overlap was identified.

Contribution 3: FACT evaluation framework. 10 candidate papers were compared against this contribution; no refutable overlap was identified.
