Fathom-DeepResearch: Unlocking Long Horizon Information Retrieval and Synthesis for SLMs

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: DeepResearch, Reasoning, agentic reasoning
Abstract:

Tool-integrated reasoning has emerged as a key focus for enabling agentic applications. Among these, DeepResearch Agents have gained significant attention for their strong performance on complex, open-ended information-seeking tasks. We introduce Fathom-DeepResearch, an agentic system composed of two specialized models. The first is Fathom-Search-4B, a DeepSearch model trained from Qwen3-4B and optimized for evidence-based investigation through live web search and targeted webpage querying. Its training combines three advances: (i) DUETQA, a ∼5K-sample dataset generated via multi-agent self-play that enforces strict web-search dependence and heterogeneous source grounding; (ii) RAPO, a zero-overhead extension of GRPO that stabilizes multi-turn Reinforcement Learning with Verifiable Rewards through curriculum pruning, reward-aware advantage scaling, and per-prompt replay buffers; and (iii) a steerable step-level reward that classifies each tool call by cognitive behavior and marginal utility, enabling explicit control over search trajectory breadth, depth, and horizon. These improvements enable reliable extension of tool-calling beyond 20 calls when warranted. The second is Fathom-Synthesizer-4B, trained from Qwen3-4B, which converts multi-turn DeepSearch traces into structured, citation-dense DeepResearch Reports for comprehensive synthesis. Evaluated on DeepSearch benchmarks (SimpleQA, FRAMES, WebWalker, Seal0, MuSiQue) and DeepResearch-Bench, the system achieves state-of-the-art performance in the open-weights category, closely rivals proprietary closed systems, and demonstrates strong performance on general reasoning benchmarks: HLE, AIME-25, GPQA-Diamond, and MedQA.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Fathom-DeepResearch, a unified deep research agent combining a specialized 4B search model with reinforcement learning and dataset innovations. It resides in the 'Unified Deep Research Agents' leaf, which contains five papers including the original work. This leaf sits within the broader 'Deep Research Agent Architectures and Frameworks' branch, which also includes multi-agent collaborative systems and hierarchical planning frameworks. The taxonomy reveals a moderately populated research direction, with unified agents representing one of three architectural paradigms for long-horizon web search.

The taxonomy structure shows that unified agents occupy a middle ground between multi-agent collaborative systems (four papers) and hierarchical planning frameworks (three papers). Neighboring leaves address training methodologies, context management, and retrieval-augmented generation techniques. The scope note clarifies that unified agents integrate reasoning, exploration, and synthesis within a cohesive architecture, distinguishing them from multi-agent systems that distribute these functions across specialized components. This positioning suggests the paper contributes to an active but not overcrowded research direction focused on streamlined agentic architectures.

Across the three contributions, the analysis examined seven candidate papers in total and found limited prior-work overlap. For RAPO, zero candidates were retrieved, suggesting either a novel algorithmic direction or insufficient semantic matches in the search. For the DUETQA dataset, four candidates were examined with one potential refutation, indicating some precedent for multi-agent self-play data generation. For the steerable step-level reward, three candidates were examined with two refutations, suggesting more substantial prior work on reward shaping for search trajectory control. These statistics reflect a constrained search scope rather than exhaustive coverage.

Based on the limited search of seven candidates, the work appears to introduce meaningful innovations in training methodology and reward design, though the scale of the literature search prevents definitive conclusions. The taxonomy context suggests the paper addresses an active research area with established architectural patterns, while the contribution-level analysis indicates varying degrees of novelty across its three main claims.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 7
Refutable Papers: 3

Research Landscape Overview

Core task: Long horizon information retrieval and synthesis using web search tools. This field addresses the challenge of orchestrating multi-step search and reasoning processes to answer complex queries that require gathering, filtering, and synthesizing information from diverse web sources.

The taxonomy reveals a landscape organized around several complementary themes: Deep Research Agent Architectures and Frameworks focus on end-to-end systems that coordinate search, retrieval, and synthesis (e.g., Mindsearch[11], Webthinker[1]); Training Methodologies explore how agents learn effective search strategies through reinforcement learning and other paradigms; Context Management and Scalability Techniques tackle the problem of handling long interaction histories and large retrieved corpora; Retrieval-Augmented Generation Techniques integrate external knowledge into language model outputs; Evaluation Benchmarks provide standardized testbeds; and Domain-Specific branches examine applications in medicine, education, and other specialized areas. Foundational Theories and Conceptual Frameworks offer broader perspectives on information seeking, while Specialized Retrieval and Search Techniques address technical innovations in query formulation and result ranking.

Recent work has concentrated on building unified deep research agents capable of autonomous multi-turn exploration, reflecting a shift from simple retrieval pipelines to more sophisticated agentic reasoning. Fathom-DeepResearch[0] exemplifies this trend, sitting within the Unified Deep Research Agents cluster alongside systems like Webresearcher[34] and Agentic Reasoning[12]. These approaches emphasize iterative query refinement, dynamic planning, and synthesis across many retrieved documents, contrasting with earlier methods such as Demonstrate-Search-Predict[2] that relied more heavily on fixed retrieval patterns.
A key tension in this space involves balancing exploration breadth with computational cost: some frameworks adopt multi-agent architectures to parallelize search (e.g., Multi-Agent Proactive Information Seeking[3]), while others streamline the process through simpler orchestration (SimpleDeepSearcher[9]). Open questions remain around how to best evaluate long-horizon synthesis quality, manage context windows effectively, and generalize learned search strategies across diverse information needs.

Claimed Contributions

RAPO: Reward-Aware Policy Optimization

A modified policy optimization algorithm that extends GRPO with three mechanisms—dataset pruning, advantage scaling, and replay buffers—to stabilize multi-turn reinforcement learning in tool-augmented environments and enable reliable long-horizon tool use beyond 20 calls.

0 retrieved papers
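The three mechanisms claimed for RAPO can be read as a small layer on top of GRPO's group-relative advantages. The sketch below is illustrative only: the class name, the linear scaling rule, the pruning thresholds, and the buffer policy are all assumptions, since this report does not reproduce RAPO's actual equations.

```python
from collections import defaultdict
from statistics import mean, pstdev

def grpo_advantages(rewards):
    """Standard GRPO: normalize each rollout's reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against zero-variance groups
    return [(r - mu) / sigma for r in rewards]

class RAPOSketch:
    """Hypothetical sketch of RAPO's three additions to GRPO."""

    def __init__(self, scale_low=0.5, scale_high=1.0, buffer_size=4):
        self.replay = defaultdict(list)  # per-prompt buffer of solved rollouts
        self.scale_low, self.scale_high = scale_low, scale_high
        self.buffer_size = buffer_size

    def prune(self, prompt_groups):
        """Curriculum pruning: drop prompts whose rollouts all succeed or
        all fail, since their group-relative advantages carry no signal."""
        return {p: g for p, g in prompt_groups.items()
                if 0.0 < mean(r for _, r in g) < 1.0}

    def process_group(self, prompt, rollouts):
        """rollouts: list of (trajectory, reward in [0, 1]) pairs."""
        # Per-prompt replay: mix in previously solved trajectories so hard
        # prompts keep at least one positive example in every group.
        group = rollouts + self.replay[prompt]
        rewards = [r for _, r in group]
        advantages = grpo_advantages(rewards)
        # Reward-aware scaling: shrink updates from low-reward groups so
        # noisy, mostly-failed groups do not dominate the gradient.
        scale = self.scale_low + (self.scale_high - self.scale_low) * mean(rewards)
        advantages = [a * scale for a in advantages]
        # Refresh the buffer with the best solved rollouts for this prompt.
        solved = sorted((tr for tr in group if tr[1] > 0),
                        key=lambda tr: tr[1], reverse=True)
        self.replay[prompt] = solved[: self.buffer_size]
        return [(traj, a) for (traj, _), a in zip(group, advantages)]
```

Note that because advantages are normalized per group before scaling, they still sum to zero for each prompt; the scaling only modulates update magnitude, which is one plausible reading of "reward-aware advantage scaling".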
DUETQA dataset via multi-agent self-play

A synthetic dataset of 5,000 question-answer pairs created through a multi-agent self-play pipeline that ensures questions are unanswerable without live web search, require diverse source domains, and support multi-hop reasoning with verifiable correctness.

4 retrieved papers
Can Refute
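The dataset's three acceptance criteria (web-search dependence, verifiable correctness, heterogeneous sources) suggest a filtering step that a self-play pipeline of this kind might apply. The function below is a minimal sketch under that assumption; the two solver callables stand in for the LLM agents, and the name `accept_question` is invented here, not taken from the paper.

```python
from urllib.parse import urlparse

def accept_question(qa, solve_closed_book, solve_with_search, min_domains=2):
    """Keep a generated QA pair only if it (i) cannot be answered from
    parametric knowledge alone, (ii) is verifiably answerable with live
    search, and (iii) its evidence spans heterogeneous source domains."""
    if solve_closed_book(qa["question"]) == qa["answer"]:
        return False  # answerable without the web: no search dependence
    verdict = solve_with_search(qa["question"])
    if verdict["answer"] != qa["answer"]:
        return False  # search-based solver cannot verify the answer
    # Heterogeneous grounding: evidence must come from distinct web domains.
    domains = {urlparse(url).netloc for url in verdict["sources"]}
    return len(domains) >= min_domains
```

For example, a pair whose answer a closed-book solver already produces would be rejected outright, while a pair solved only with search and cited from two distinct domains would pass.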
Steerable step-level reward for search trajectory control

A reward function that assigns labels to individual tool calls based on their cognitive behavior (exploration vs. verification) and marginal utility, providing explicit control over the agent's search strategy and preventing reward hacking through redundant tool use.

3 retrieved papers
Can Refute
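One way to make this contribution concrete is a per-call shaping function. The sketch below substitutes simple set heuristics for whatever classifier the paper actually uses; the weights, the exploration/verification labeling rule, and the redundancy penalty value are all assumptions for illustration.

```python
def step_rewards(calls, w_explore=1.0, w_verify=0.5, redundancy_penalty=-0.5):
    """Assign a shaped reward to each tool call in a trajectory.
    calls: list of dicts with a 'query' string and an 'evidence' set of
    fact identifiers returned by the call."""
    seen_queries, seen_evidence = set(), set()
    rewards = []
    for call in calls:
        # Cognitive behavior: re-issuing a known query reads as
        # verification; a fresh query reads as exploration.
        behavior = "verification" if call["query"] in seen_queries else "exploration"
        # Marginal utility: did the call surface any evidence not seen before?
        new_facts = set(call["evidence"]) - seen_evidence
        if not new_facts:
            # Zero marginal utility is penalized, blocking reward hacking
            # through redundant tool calls.
            rewards.append(redundancy_penalty)
        else:
            rewards.append(w_explore if behavior == "exploration" else w_verify)
        seen_queries.add(call["query"])
        seen_evidence |= set(call["evidence"])
    return rewards
```

Tuning `w_explore` against `w_verify` is one plausible mechanism for the claimed steerability over trajectory breadth versus depth, while the penalty bounds the incentive to pad the horizon with repeated calls.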

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

RAPO: Reward-Aware Policy Optimization

A modified policy optimization algorithm that extends GRPO with three mechanisms—dataset pruning, advantage scaling, and replay buffers—to stabilize multi-turn reinforcement learning in tool-augmented environments and enable reliable long-horizon tool use beyond 20 calls.

Contribution

DUETQA dataset via multi-agent self-play

A synthetic dataset of 5,000 question-answer pairs created through a multi-agent self-play pipeline that ensures questions are unanswerable without live web search, require diverse source domains, and support multi-hop reasoning with verifiable correctness.

Contribution

Steerable step-level reward for search trajectory control

A reward function that assigns labels to individual tool calls based on their cognitive behavior (exploration vs. verification) and marginal utility, providing explicit control over the agent's search strategy and preventing reward hacking through redundant tool use.