DeepTRACE: Auditing Deep Research AI Systems for Tracking Reliability Across Citations and Evidence

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: deep research, generative search engines, NLP, audit framework, sociotechnical evaluation, large language models
Abstract:

Generative search engines and deep research LLM agents promise trustworthy, source-grounded synthesis, yet users regularly encounter overconfidence, weak sourcing, and confusing citation practices. We introduce DeepTRACE, a novel sociotechnically grounded audit framework that turns prior community-identified failure cases into eight measurable dimensions spanning answer text, sources, and citations. DeepTRACE uses statement-level analysis (decomposition, confidence scoring) and builds citation and factual-support matrices to audit how systems reason with and attribute evidence end-to-end. Using automated extraction pipelines for popular public models (e.g., GPT-4.5/5, You.com, Perplexity, Copilot/Bing, Gemini) and an LLM judge with validated agreement with human raters, we evaluate both web-search engines and deep-research configurations. Our findings show that generative search engines and deep research agents frequently produce one-sided, highly confident responses on debate queries and include large fractions of statements unsupported by their own listed sources. Deep-research configurations reduce overconfidence and can attain high citation thoroughness, but they remain highly one-sided on debate queries and still exhibit large fractions of unsupported statements, with citation accuracy ranging from 40–80% across systems.
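The citation and factual-support matrices described above can be sketched as follows. This is an illustrative reconstruction, not the authors' released code: the binary support judgments here stand in for the paper's validated LLM judge, and the two metrics shown (citation accuracy, unsupported-statement fraction) are only two of the eight dimensions.

```python
# Illustrative sketch: two DeepTRACE-style metrics computed from a citation
# matrix and a factual-support matrix over decomposed statements.
# Rows = statements extracted from the answer, columns = listed sources.
# citation[i][j] = 1 if statement i cites source j.
# support[i][j]  = 1 if source j actually supports statement i
#                  (in the paper, this judgment comes from an LLM judge).

def citation_accuracy(citation, support):
    """Fraction of cited (statement, source) pairs where the source
    genuinely supports the statement."""
    cited = [(i, j) for i, row in enumerate(citation)
             for j, c in enumerate(row) if c]
    if not cited:
        return 0.0
    return sum(support[i][j] for i, j in cited) / len(cited)

def unsupported_fraction(support):
    """Fraction of statements supported by none of the listed sources."""
    return sum(1 for row in support if not any(row)) / len(support)

citation = [[1, 0, 0],   # statement 0 cites source 0
            [0, 1, 1],   # statement 1 cites sources 1 and 2
            [0, 0, 0]]   # statement 2 cites nothing
support  = [[1, 0, 0],
            [0, 1, 0],
            [0, 0, 0]]

print(citation_accuracy(citation, support))   # 2 of 3 cited pairs supported
print(unsupported_fraction(support))          # statement 2 has no support
```

Separating the two matrices makes the failure modes in the abstract directly measurable: a citation can exist without support (hurting citation accuracy), and a statement can lack support entirely (raising the unsupported fraction).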

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

DeepTRACE introduces an audit framework spanning answer text, sources, and citations through statement-level decomposition and citation-support matrices. It resides in the Citation Quality Assessment in Generative Search leaf, which contains only two papers, including this one. The sibling work, Verifiability Generative Search, established foundational citation metrics, making this a relatively sparse research direction within the broader Verifiability and Attribution Evaluation branch. The limited population suggests this specific angle—end-to-end citation quality auditing for generative search—remains underexplored compared to adjacent areas like automated attribution methods or adversarial robustness testing.

The taxonomy reveals neighboring leaves focused on automated attribution evaluation methods and attribution mechanism surveys, both examining source-claim alignment but through different lenses. DeepTRACE diverges by emphasizing sociotechnically grounded failure cases translated into measurable dimensions, whereas automated attribution methods prioritize benchmark development for verification algorithms. The broader Verifiability branch sits alongside Adversarial Robustness and Bias Auditing branches, indicating the field organizes reliability concerns into complementary evaluation paradigms. DeepTRACE's focus on routine citation quality contrasts with adversarial stress-testing approaches like Robustness Adversarial Questions, highlighting distinct threat models within the reliability landscape.

Of the thirty candidates examined (ten per contribution), the DeepTRACE audit-framework contribution has one refutable candidate, while the dataset and end-to-end evaluation contributions have none. The framework contribution's single overlap suggests some prior work addresses similar audit dimensions, though the limited search scope prevents definitive claims about exhaustive coverage. The dataset and evaluation contributions appear more distinctive within this candidate pool, potentially reflecting a novel application of citation-support matrices to popular public models. These statistics indicate moderate novelty for the framework component and stronger novelty signals for the empirical contributions, contingent on the top-thirty semantic search scope.

Based on the limited literature search covering thirty candidates, DeepTRACE appears to occupy a sparsely populated research direction with modest framework-level overlap and stronger empirical differentiation. The taxonomy structure confirms citation quality assessment remains less crowded than adjacent areas like automated attribution benchmarking or adversarial robustness testing. However, the analysis cannot rule out relevant work outside the top-thirty semantic matches or in parallel evaluation paradigms not captured by the search strategy.

Taxonomy

Core-task taxonomy papers: 29
Claimed contributions: 3
Contribution candidate papers compared: 30
Refutable papers: 1

Research Landscape Overview

Core task: auditing the reliability of generative search engines and deep research agents. The field has organized itself around several complementary concerns. Verifiability and Attribution Evaluation examines whether generated outputs can be traced back to credible sources, with works like Verifiability Generative Search[1] and AttributionBench[21] establishing foundational metrics for citation quality. Adversarial Robustness and Safety Evaluation probes how systems respond to challenging or malicious queries, as seen in Robustness Adversarial Questions[2] and SafeSearch[11]. Bias, Fairness, and Political Accountability Auditing investigates systematic distortions in information presentation, including media source preferences and political slant.

Deep Research Agent Benchmarking focuses on end-to-end evaluation of multi-step reasoning systems like PaperQA[5] and DeepResearch Bench[6], while a parallel branch on System Design and Improvement explores architectural innovations such as Skill-Graph Self-Improvement[26]. Additional branches address User Interaction and Human Oversight, Domain-Specific Applications, Conspiratorial Content Generation risks, and Foundational Frameworks that enable adaptation across retrieval paradigms.

Particularly active lines of work center on the tension between automation and verifiability in complex research tasks. Deep research agents promise comprehensive literature synthesis but raise questions about citation accuracy and hallucination rates that benchmarks like Deep Research Bench[9] and AIRA[10] attempt to quantify. Within the Verifiability branch, DeepTRACE[0] sits alongside Verifiability Generative Search[1] in focusing specifically on citation quality assessment, examining whether attributed sources genuinely support generated claims. While Verifiability Generative Search[1] established early frameworks for this problem, DeepTRACE[0] extends the inquiry into deeper traceability mechanisms for research-oriented outputs.
This cluster contrasts with adversarial evaluation efforts like Robustness Adversarial Factoid[3], which stress-test factual consistency under attack rather than routine attribution fidelity. The interplay between these evaluation paradigms reflects ongoing uncertainty about whether reliability audits should prioritize typical-case verifiability or worst-case robustness.

Claimed Contributions

DeepTRACE audit framework

The authors propose DeepTRACE, an evaluation framework that translates user-identified failure modes from prior community research into eight quantifiable metrics. These metrics assess generative search engines and deep research agents across answer quality, source usage, and citation practices using statement-level decomposition and automated extraction pipelines.

10 retrieved papers (1 can refute)
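The framework description above mentions eight quantifiable metrics derived from statement-level decomposition. As a hedged illustration of what two of the answer-text dimensions might look like, the sketch below scores one-sidedness on a debate query and average statement confidence; the stance labels and confidence scores are assumed to come from an upstream LLM judge, and the names here are illustrative, not the paper's schema.

```python
# Hypothetical sketch of two answer-text dimensions: one-sidedness on a
# debate query and mean per-statement confidence. Inputs are assumed to be
# produced by statement decomposition plus an LLM judge.

def one_sidedness(stances):
    """|pro - con| / total over stance labels ('pro'/'con') for one answer.
    1.0 = fully one-sided, 0.0 = perfectly balanced."""
    pro = sum(1 for s in stances if s == "pro")
    con = sum(1 for s in stances if s == "con")
    total = pro + con
    return abs(pro - con) / total if total else 0.0

def mean_confidence(scores):
    """Average per-statement confidence score in [0, 1]."""
    return sum(scores) / len(scores) if scores else 0.0

stances = ["pro", "pro", "pro", "con"]   # judged stance of each statement
scores = [0.9, 0.8, 0.95, 0.6]           # judged confidence of each statement
print(one_sidedness(stances))    # 0.5: three pro vs one con
print(mean_confidence(scores))   # average confidence of the answer
```

Together these capture the paper's headline failure mode on debate queries: answers that are both one-sided and expressed with high confidence.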
DeepTRACE dataset

The authors create and release a dataset of 303 queries divided into debate questions and expertise questions. This dataset enables systematic evaluation of generative search and deep research systems on real-world, practitioner-relevant queries identified through community-driven research.

10 retrieved papers
End-to-end evaluation of GSE and DR systems

The framework evaluates complete system behavior rather than isolated components, examining how generative search engines and deep research agents retrieve sources, generate citations, express confidence, and ground statements in evidence. This approach complements existing factuality metrics by focusing on sourcing transparency and traceability in user-facing systems.

10 retrieved papers
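The end-to-end evaluation described above can be sketched as a simple audit loop: query the system, decompose its answer, judge support against its own listed sources, and aggregate. The function names (`ask`, `decompose`, `judge_support`) stand in for the paper's extraction pipeline and LLM judge; they are assumed interfaces, not real APIs.

```python
# Minimal sketch of an end-to-end audit loop, under assumed interfaces:
#   ask(query)            -> (answer_text, list_of_sources)   # system under audit
#   decompose(answer)     -> list of statements
#   judge_support(s, src) -> bool (does src support statement s?)

from statistics import mean

def audit_system(ask, decompose, judge_support, queries):
    """Average supported-statement rate across queries for one system."""
    per_query = []
    for q in queries:
        answer, sources = ask(q)
        statements = decompose(answer)
        supported = [any(judge_support(s, src) for src in sources)
                     for s in statements]
        per_query.append(sum(supported) / len(statements))
    return mean(per_query)
```

Because the judgment is made against the system's own listed sources, the score measures traceability of the user-facing output rather than factuality in the abstract, which is how the report distinguishes this contribution from existing factuality metrics.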

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: DeepTRACE audit framework

Contribution: DeepTRACE dataset

Contribution: End-to-end evaluation of GSE and DR systems