WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: agent, information seeking, data synthesis, LLM
Abstract:

The advent of Large Language Model (LLM)-powered agents has revolutionized artificial intelligence by enabling solutions to complex, open-ended tasks through web-based information-seeking (IS) capabilities. The scarcity of high-quality training data has limited the development of IS agents. Existing data synthesis approaches typically adopt an information-driven paradigm that first collects information and then refines question-answer pairs through retrieval. However, this may lead to inconsistency between information structure and reasoning structure, as well as between the question and the corresponding answer. To mitigate these inconsistencies, we propose WebShaper, a formalization-driven IS data synthesis framework that systematically formalizes IS tasks using set-theoretic constructs. Central to the formalization is the concept of Knowledge Projections (KP), which enables precise control over reasoning structure through compositions of KP operations. During synthesis, we begin by creating seed tasks and then apply a multi-step expansion process. At each step, an agentic Expander makes the current formal question more complex, using retrieval and validation tools grounded in our formalization. We train our model on the synthesized dataset. Experimental results demonstrate that WebShaper achieves state-of-the-art performance among open-source IS agents on competitive benchmarks.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces WebShaper, a formalization-driven framework for synthesizing training data for information-seeking agents, centered on Knowledge Projections (KP) and set-theoretic constructs. It resides in the 'Synthetic Trajectory and Reasoning Data for Agents' leaf, which contains six papers total, including the original work. This leaf sits within the broader 'Information-Seeking Agent Training and Evaluation' branch, indicating a moderately populated research direction focused on generating reasoning traces and web interaction data for agent training. The taxonomy reveals this is an active but not overcrowded area, with sibling works like Webdancer and Websailor-v2 addressing similar trajectory synthesis challenges.

The taxonomy tree shows neighboring leaves addressing 'Deep Research and Multi-Step Reasoning Benchmarks' (two papers) and 'General Agent Frameworks and Environments' (three papers), suggesting the field distinguishes between trajectory synthesis methods, evaluation frameworks, and broader agent architectures. Adjacent branches cover 'Synthetic Data Generation for Information Retrieval' (nine papers across three leaves) and 'Synthetic Data for Question Answering Systems' (seventeen papers across four leaves), which focus on static QA pairs rather than interactive agent trajectories. WebShaper's formalization-driven approach diverges from these by emphasizing reasoning structure consistency through KP operations, contrasting with information-driven paradigms common in retrieval-focused synthesis.

Among seventeen candidates examined, no contributions were clearly refuted by prior work. The WebShaper framework itself was compared against five candidates with zero refutable overlaps; the Knowledge Projections formalization examined ten candidates with no refutations; and the Agentic Expander strategy reviewed two candidates, also finding no clear prior work. This suggests that within the limited search scope, the formalization-driven paradigm and KP-based reasoning control appear relatively novel. However, the modest candidate pool (seventeen total) means the analysis captures top semantic matches and immediate citations, not an exhaustive field survey.

Based on the limited literature search, the work appears to occupy a distinct position within trajectory synthesis for information-seeking agents, particularly through its set-theoretic formalization and KP operation compositions. The analysis covers top-seventeen semantic matches and does not claim comprehensive coverage of all related agent training or data synthesis methods. The absence of refutable candidates within this scope suggests differentiation from examined prior work, though broader field exploration might reveal additional overlaps.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 17
Refutable Papers: 0

Research Landscape Overview

Core task: data synthesis for information-seeking agents. The field encompasses methods for generating synthetic training data to improve agents that retrieve, reason over, and answer questions from diverse information sources. The taxonomy reveals several major branches: Synthetic Data Generation for Information Retrieval focuses on creating query-document pairs and retrieval corpora, often leveraging techniques like InPars[3] and its extensions; Synthetic Data for Question Answering Systems targets QA datasets across domains and languages; Information-Seeking Agent Training and Evaluation emphasizes trajectory data and reasoning traces for interactive agents; General-Purpose Synthetic Data Generation explores broader data creation and quality assessment methods; Interactive and Active Information Seeking addresses user simulation and dynamic query refinement; and Evaluation and Consistency in Question Answering examines robustness and reliability. These branches collectively address the challenge of scaling agent capabilities without extensive human annotation, with works ranging from domain-specific solutions like Synthetic Financial QA[16] to general frameworks such as Open Data Synthesis[7].

Within Information-Seeking Agent Training and Evaluation, a particularly active line of work centers on generating synthetic trajectories and reasoning chains that capture multi-step agent behavior. WebShaper[0] sits squarely in this cluster, focusing on synthesizing realistic web navigation and interaction data to train agents for complex information-seeking tasks. Nearby efforts like Webdancer[1] and Websailor-v2[50] similarly emphasize trajectory generation for web-based agents, while InfoAgent[5] explores broader reasoning trace synthesis. A key tension across these works involves balancing the diversity and realism of synthetic trajectories against the computational cost of generation and the risk of introducing artifacts that harm downstream performance. WebShaper[0] addresses this by shaping trajectories to reflect authentic user behavior patterns, contrasting with approaches like SimpleDeepSearcher[15] that prioritize simplicity and efficiency. The ongoing challenge remains ensuring that synthetic data faithfully represents the complexity of real-world information-seeking without overfitting to narrow task distributions.

Claimed Contributions

WebShaper formalization-driven data synthesis framework

The authors introduce WebShaper, a novel framework that formalizes information-seeking tasks using set theory and Knowledge Projections (KP). Unlike prior information-driven approaches, this formalization-driven paradigm enables precise control over reasoning structures, broader task coverage, and improved structural consistency between questions and answers.

5 retrieved papers

Knowledge Projections (KP) and set-theoretic formalization

The authors propose Knowledge Projections as the basic unit for formalizing information-seeking tasks, defined using set-theoretic constructs with R-Union and Intersection operations. This formalization allows systematic representation of complex reasoning structures and enables controllable task generation.
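This report does not reproduce the paper's formal definitions, so the following is only a minimal Python sketch under assumed semantics: a Knowledge Projection is taken to be the set of entities that stand in a given relation to at least one member of a target set, R-Union is taken to merge projections that share a relation, and Intersection keeps the entities common to several projections. The toy facts and the names `FACTS`, `kp`, `r_union`, and `intersection` are hypothetical and chosen purely for illustration.

```python
from typing import Iterable

# (subject, relation, object) triples; an invented toy knowledge base.
Triple = tuple[str, str, str]

FACTS: set[Triple] = {
    ("alice", "plays_for", "team_a"),
    ("bob", "plays_for", "team_a"),
    ("carol", "plays_for", "team_b"),
    ("alice", "born_in", "1990"),
    ("bob", "born_in", "1991"),
    ("carol", "born_in", "1990"),
}


def kp(relation: str, targets: Iterable[str]) -> set[str]:
    """Assumed KP: subjects related via `relation` to some member of `targets`."""
    wanted = set(targets)
    return {s for (s, r, o) in FACTS if r == relation and o in wanted}


def r_union(relation: str, *target_sets: Iterable[str]) -> set[str]:
    """Assumed R-Union: union of projections that share the same relation."""
    result: set[str] = set()
    for targets in target_sets:
        result |= kp(relation, targets)
    return result


def intersection(*projections: set[str]) -> set[str]:
    """Entities present in every projection (empty set if none are given)."""
    if not projections:
        return set()
    return set.intersection(*projections)


# "Who played for team_a or team_b AND was born in 1990?"
answer = intersection(
    r_union("plays_for", {"team_a"}, {"team_b"}),
    kp("born_in", {"1990"}),
)
```

In this toy setting the question is just a composition of set operations, which is the sense in which a set-theoretic formalization can make the reasoning structure of a task explicit and controllable.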

10 retrieved papers

Agentic Expander with layer-wise expansion strategy

The authors develop an autonomous agent called Expander that iteratively generates complex questions by expanding seed tasks through multi-step processes. It employs a layer-wise expansion strategy to minimize redundancy and prevent reasoning shortcuts, using retrieval and validation tools aligned with the formal task representation.
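One way to picture a layer-wise strategy is as a tree whose current leaf constants are all expanded once per step, so that new constants wait for the next layer. The sketch below is an illustration under that assumption, not the authors' algorithm: the `Node` class, the `TOY_EXPANSIONS` lookup table (a stand-in for retrieval), and the variable-naming scheme are hypothetical, and the validation step mentioned above is omitted.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Node:
    relation: str  # how this node relates to its parent
    value: str     # a constant, or a variable name once the node is expanded
    children: list["Node"] = field(default_factory=list)


# Hypothetical stand-in for retrieval: constant -> clauses that describe it.
TOY_EXPANSIONS: dict[str, list[tuple[str, str]]] = {
    "team_a": [("based_in", "city_x"), ("founded_in", "1950")],
    "city_x": [("capital_of", "country_y")],
}


def leaf_nodes(node: Node) -> Iterator[Node]:
    """Yield the leaves of the tree in left-to-right order."""
    if not node.children:
        yield node
    else:
        for child in node.children:
            yield from leaf_nodes(child)


def expand_layer(root: Node, layer: int) -> int:
    """Expand each current leaf at most once; return how many were expanded."""
    frontier = list(leaf_nodes(root))  # snapshot: new leaves wait for next layer
    expanded = 0
    for i, leaf in enumerate(frontier):
        clauses = TOY_EXPANSIONS.get(leaf.value)
        if not clauses:
            continue
        leaf.children = [Node(rel, const) for rel, const in clauses]
        leaf.value = f"?x{layer}_{i}"  # hide the constant behind a variable
        expanded += 1
    return expanded


def depth(node: Node) -> int:
    """Number of nodes on the longest root-to-leaf path."""
    return 1 + max((depth(c) for c in node.children), default=0)


# Seed task: "?who plays for team_a", then three expansion layers.
root = Node("answer", "?who", [Node("plays_for", "team_a")])
counts = [expand_layer(root, layer) for layer in (1, 2, 3)]
remaining_constants = [leaf.value for leaf in leaf_nodes(root)]
```

Because only leaves are expanded and each expanded constant is replaced by a variable, the original constant no longer appears in the question, which is one plausible reading of how expanding by layer could limit redundancy and shortcuts.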

2 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

