DeepTRACE: Auditing Deep Research AI Systems for Tracking Reliability Across Citations and Evidence

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: deep research, generative search engines, NLP, audit framework, sociotechnical evaluation, large language models
Abstract:

Generative search engines and deep research LLM agents promise trustworthy, source-grounded synthesis, yet users regularly encounter overconfidence, weak sourcing, and confusing citation practices. We introduce DeepTRACE, a novel sociotechnically grounded audit framework that turns prior community-identified failure cases into eight measurable dimensions spanning answer text, sources, and citations. DeepTRACE uses statement-level analysis (decomposition, confidence scoring) and builds citation and factual-support matrices to audit how systems reason with and attribute evidence end-to-end. Using automated extraction pipelines for popular public models (e.g., GPT-4.5/5, You.com, Perplexity, Copilot/Bing, Gemini) and an LLM judge with validated agreement with human raters, we evaluate both web-search engines and deep-research configurations. Our findings show that generative search engines and deep research agents frequently produce one-sided, highly confident responses on debate queries and include large fractions of statements unsupported by their own listed sources. Deep-research configurations reduce overconfidence and can attain high citation thoroughness, but they remain highly one-sided on debate queries and still exhibit large fractions of unsupported statements, with citation accuracy ranging from 40–80% across systems.
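The citation and factual-support matrices described above can be sketched as follows. This is an illustrative reconstruction, not the authors' released code: the binary support judgments here stand in for the paper's validated LLM judge, and the two metrics shown (citation accuracy, unsupported-statement fraction) are only two of the eight dimensions.

```python
# Illustrative sketch: two DeepTRACE-style metrics computed from a citation
# matrix and a factual-support matrix over decomposed statements.
# Rows = statements extracted from the answer, columns = listed sources.
# citation[i][j] = 1 if statement i cites source j.
# support[i][j]  = 1 if source j actually supports statement i
#                  (in the paper, this judgment comes from an LLM judge).

def citation_accuracy(citation, support):
    """Fraction of cited (statement, source) pairs where the source
    genuinely supports the statement."""
    cited = [(i, j) for i, row in enumerate(citation)
             for j, c in enumerate(row) if c]
    if not cited:
        return 0.0
    return sum(support[i][j] for i, j in cited) / len(cited)

def unsupported_fraction(support):
    """Fraction of statements supported by none of the listed sources."""
    return sum(1 for row in support if not any(row)) / len(support)

citation = [[1, 0, 0],   # statement 0 cites source 0
            [0, 1, 1],   # statement 1 cites sources 1 and 2
            [0, 0, 0]]   # statement 2 cites nothing
support  = [[1, 0, 0],
            [0, 1, 0],
            [0, 0, 0]]

print(citation_accuracy(citation, support))   # 2 of 3 cited pairs supported
print(unsupported_fraction(support))          # statement 2 has no support
```

Separating the two matrices makes the failure modes in the abstract directly measurable: a citation can exist without support (hurting citation accuracy), and a statement can lack support entirely (raising the unsupported fraction).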

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

DeepTRACE introduces an audit framework spanning answer text, sources, and citations through statement-level decomposition and citation-support matrices. It resides in the Citation Quality Assessment in Generative Search leaf, which contains only two papers, including this one. The sibling work, Verifiability Generative Search, established foundational citation metrics, making this a relatively sparse research direction within the broader Verifiability and Attribution Evaluation branch. The limited population suggests this specific angle—end-to-end citation quality auditing for generative search—remains underexplored compared to adjacent areas like automated attribution methods or adversarial robustness testing.

The taxonomy reveals neighboring leaves focused on automated attribution evaluation methods and attribution mechanism surveys, both examining source-claim alignment but through different lenses. DeepTRACE diverges by emphasizing sociotechnically grounded failure cases translated into measurable dimensions, whereas automated attribution methods prioritize benchmark development for verification algorithms. The broader Verifiability branch sits alongside Adversarial Robustness and Bias Auditing branches, indicating the field organizes reliability concerns into complementary evaluation paradigms. DeepTRACE's focus on routine citation quality contrasts with adversarial stress-testing approaches like Robustness Adversarial Questions, highlighting distinct threat models within the reliability landscape.

Of the thirty candidates examined (ten per contribution), the DeepTRACE audit-framework contribution has one refutable candidate, while the dataset and end-to-end evaluation contributions have none. The framework contribution's single overlap suggests some prior work addresses similar audit dimensions, though the limited search scope prevents definitive claims about exhaustive coverage. The dataset and evaluation contributions appear more distinctive within this candidate pool, potentially reflecting a novel application of citation-support matrices to popular public models. These statistics indicate moderate novelty for the framework component and stronger novelty signals for the empirical contributions, contingent on the top-thirty semantic search scope.

Based on the limited literature search covering thirty candidates, DeepTRACE appears to occupy a sparsely populated research direction with modest framework-level overlap and stronger empirical differentiation. The taxonomy structure confirms citation quality assessment remains less crowded than adjacent areas like automated attribution benchmarking or adversarial robustness testing. However, the analysis cannot rule out relevant work outside the top-thirty semantic matches or in parallel evaluation paradigms not captured by the search strategy.

Taxonomy

Core-task taxonomy papers: 29
Claimed contributions: 3
Contribution candidate papers compared: 30
Refutable papers: 1

Research Landscape Overview

Core task: auditing the reliability of generative search engines and deep research agents. The field has organized itself around several complementary concerns. Verifiability and Attribution Evaluation examines whether generated outputs can be traced back to credible sources, with works like Verifiability Generative Search[1] and AttributionBench[21] establishing foundational metrics for citation quality. Adversarial Robustness and Safety Evaluation probes how systems respond to challenging or malicious queries, as seen in Robustness Adversarial Questions[2] and SafeSearch[11]. Bias, Fairness, and Political Accountability Auditing investigates systematic distortions in information presentation, including media source preferences and political slant.

Deep Research Agent Benchmarking focuses on end-to-end evaluation of multi-step reasoning systems like PaperQA[5] and DeepResearch Bench[6], while a parallel branch on System Design and Improvement explores architectural innovations such as Skill-Graph Self-Improvement[26]. Additional branches address User Interaction and Human Oversight, Domain-Specific Applications, Conspiratorial Content Generation risks, and Foundational Frameworks that enable adaptation across retrieval paradigms.

Particularly active lines of work center on the tension between automation and verifiability in complex research tasks. Deep research agents promise comprehensive literature synthesis but raise questions about citation accuracy and hallucination rates that benchmarks like Deep Research Bench[9] and AIRA[10] attempt to quantify. Within the Verifiability branch, DeepTRACE[0] sits alongside Verifiability Generative Search[1] in focusing specifically on citation quality assessment, examining whether attributed sources genuinely support generated claims. While Verifiability Generative Search[1] established early frameworks for this problem, DeepTRACE[0] extends the inquiry into deeper traceability mechanisms for research-oriented outputs.
This cluster contrasts with adversarial evaluation efforts like Robustness Adversarial Factoid[3], which stress-test factual consistency under attack rather than routine attribution fidelity. The interplay between these evaluation paradigms reflects ongoing uncertainty about whether reliability audits should prioritize typical-case verifiability or worst-case robustness.

Claimed Contributions

DeepTRACE audit framework

The authors propose DeepTRACE, an evaluation framework that translates user-identified failure modes from prior community research into eight quantifiable metrics. These metrics assess generative search engines and deep research agents across answer quality, source usage, and citation practices using statement-level decomposition and automated extraction pipelines.

10 retrieved papers (1 can refute)
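The framework description above mentions eight quantifiable metrics derived from statement-level decomposition. As a hedged illustration of what two of the answer-text dimensions might look like, the sketch below scores one-sidedness on a debate query and average statement confidence; the stance labels and confidence scores are assumed to come from an upstream LLM judge, and the names here are illustrative, not the paper's schema.

```python
# Hypothetical sketch of two answer-text dimensions: one-sidedness on a
# debate query and mean per-statement confidence. Inputs are assumed to be
# produced by statement decomposition plus an LLM judge.

def one_sidedness(stances):
    """|pro - con| / total over stance labels ('pro'/'con') for one answer.
    1.0 = fully one-sided, 0.0 = perfectly balanced."""
    pro = sum(1 for s in stances if s == "pro")
    con = sum(1 for s in stances if s == "con")
    total = pro + con
    return abs(pro - con) / total if total else 0.0

def mean_confidence(scores):
    """Average per-statement confidence score in [0, 1]."""
    return sum(scores) / len(scores) if scores else 0.0

stances = ["pro", "pro", "pro", "con"]   # judged stance of each statement
scores = [0.9, 0.8, 0.95, 0.6]           # judged confidence of each statement
print(one_sidedness(stances))    # 0.5: three pro vs one con
print(mean_confidence(scores))   # average confidence of the answer
```

Together these capture the paper's headline failure mode on debate queries: answers that are both one-sided and expressed with high confidence.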
DeepTRACE dataset

The authors create and release a dataset of 303 queries divided into debate questions and expertise questions. This dataset enables systematic evaluation of generative search and deep research systems on real-world, practitioner-relevant queries identified through community-driven research.

10 retrieved papers
End-to-end evaluation of GSE and DR systems

The framework evaluates complete system behavior rather than isolated components, examining how generative search engines and deep research agents retrieve sources, generate citations, express confidence, and ground statements in evidence. This approach complements existing factuality metrics by focusing on sourcing transparency and traceability in user-facing systems.

10 retrieved papers
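The end-to-end evaluation described above can be sketched as a simple audit loop: query the system, decompose its answer, judge support against its own listed sources, and aggregate. The function names (`ask`, `decompose`, `judge_support`) stand in for the paper's extraction pipeline and LLM judge; they are assumed interfaces, not real APIs.

```python
# Minimal sketch of an end-to-end audit loop, under assumed interfaces:
#   ask(query)            -> (answer_text, list_of_sources)   # system under audit
#   decompose(answer)     -> list of statements
#   judge_support(s, src) -> bool (does src support statement s?)

from statistics import mean

def audit_system(ask, decompose, judge_support, queries):
    """Average supported-statement rate across queries for one system."""
    per_query = []
    for q in queries:
        answer, sources = ask(q)
        statements = decompose(answer)
        supported = [any(judge_support(s, src) for src in sources)
                     for s in statements]
        per_query.append(sum(supported) / len(statements))
    return mean(per_query)
```

Because the judgment is made against the system's own listed sources, the score measures traceability of the user-facing output rather than factuality in the abstract, which is how the report distinguishes this contribution from existing factuality metrics.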

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: DeepTRACE audit framework

Contribution: DeepTRACE dataset

Contribution: End-to-end evaluation of GSE and DR systems