WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization

ICLR 2026 Conference Submission

Anonymous Authors

OpenReview Score: 6.0

Keywords: agent, information seeking, data synthesis, llm

Abstract:

The advent of Large Language Model (LLM)-powered agents has revolutionized artificial intelligence by enabling solutions to complex, open-ended tasks through web-based information-seeking (IS) capabilities. The scarcity of high-quality training data has limited the development of IS agents. Existing data synthesis approaches typically adopt an information-driven paradigm that first collects information and then refines question-answer pairs through retrieval. However, this may lead to inconsistency between information structure and reasoning structure, as well as between the question and the corresponding answer. To mitigate these issues, we propose a formalization-driven IS data synthesis framework, WebShaper, which systematically formalizes IS tasks using set-theoretic constructs. Central to the formalization is the concept of Knowledge Projections (KP), which enables precise control over reasoning structure through compositions of KP operations. During synthesis, we begin by creating seed tasks, then use a multi-step expansion process. At each step, an agentic Expander makes the current formal question more complex using retrieval and validation tools grounded in our formalization. We train our model on the synthesized dataset. Experimental results demonstrate that WebShaper achieves state-of-the-art performance among open-source IS agents on competitive benchmarks.

Disclaimer

This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.

NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.

If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces WebShaper, a formalization-driven framework for synthesizing training data for information-seeking agents, centered on Knowledge Projections (KP) and set-theoretic constructs. It resides in the 'Synthetic Trajectory and Reasoning Data for Agents' leaf, which contains six papers total, including the original work. This leaf sits within the broader 'Information-Seeking Agent Training and Evaluation' branch, indicating a moderately populated research direction focused on generating reasoning traces and web interaction data for agent training. The taxonomy reveals this is an active but not overcrowded area, with sibling works like Webdancer and Websailor-v2 addressing similar trajectory synthesis challenges.

The taxonomy tree shows neighboring leaves addressing 'Deep Research and Multi-Step Reasoning Benchmarks' (two papers) and 'General Agent Frameworks and Environments' (three papers), suggesting the field distinguishes between trajectory synthesis methods, evaluation frameworks, and broader agent architectures. Adjacent branches cover 'Synthetic Data Generation for Information Retrieval' (nine papers across three leaves) and 'Synthetic Data for Question Answering Systems' (seventeen papers across four leaves), which focus on static QA pairs rather than interactive agent trajectories. WebShaper's formalization-driven approach diverges from these by emphasizing reasoning structure consistency through KP operations, contrasting with information-driven paradigms common in retrieval-focused synthesis.

Among seventeen candidates examined, no contributions were clearly refuted by prior work. The WebShaper framework itself was compared against five candidates with zero refutable overlaps; the Knowledge Projections formalization examined ten candidates with no refutations; and the Agentic Expander strategy reviewed two candidates, also finding no clear prior work. This suggests that within the limited search scope, the formalization-driven paradigm and KP-based reasoning control appear relatively novel. However, the modest candidate pool (seventeen total) means the analysis captures top semantic matches and immediate citations, not an exhaustive field survey.

Based on the limited literature search, the work appears to occupy a distinct position within trajectory synthesis for information-seeking agents, particularly through its set-theoretic formalization and KP operation compositions. The analysis covers top-seventeen semantic matches and does not claim comprehensive coverage of all related agent training or data synthesis methods. The absence of refutable candidates within this scope suggests differentiation from examined prior work, though broader field exploration might reveal additional overlaps.

Taxonomy

[Taxonomy figure not reproduced. Legend: Core-task Taxonomy Papers; Claimed Contributions; Contribution Candidate Papers Compared; Refutable Paper]

Research Landscape Overview

Core task: data synthesis for information-seeking agents. The field encompasses methods for generating synthetic training data to improve agents that retrieve, reason over, and answer questions from diverse information sources. The taxonomy reveals several major branches: Synthetic Data Generation for Information Retrieval focuses on creating query-document pairs and retrieval corpora, often leveraging techniques like InPars[3] and its extensions; Synthetic Data for Question Answering Systems targets QA datasets across domains and languages; Information-Seeking Agent Training and Evaluation emphasizes trajectory data and reasoning traces for interactive agents; General-Purpose Synthetic Data Generation explores broader data creation and quality assessment methods; Interactive and Active Information Seeking addresses user simulation and dynamic query refinement; and Evaluation and Consistency in Question Answering examines robustness and reliability. These branches collectively address the challenge of scaling agent capabilities without extensive human annotation, with works ranging from domain-specific solutions like Synthetic Financial QA[16] to general frameworks such as Open Data Synthesis[7].

Within Information-Seeking Agent Training and Evaluation, a particularly active line of work centers on generating synthetic trajectories and reasoning chains that capture multi-step agent behavior. WebShaper[0] sits squarely in this cluster, focusing on synthesizing training data for complex web-based information-seeking tasks. Nearby efforts like Webdancer[1] and Websailor-v2[50] similarly emphasize trajectory generation for web-based agents, while InfoAgent[5] explores broader reasoning trace synthesis. A key tension across these works involves balancing the diversity and realism of synthetic data against the computational cost of generation and the risk of introducing artifacts that harm downstream performance. WebShaper[0] addresses this through a formalization-driven approach that controls reasoning structure via KP operations, contrasting with approaches like SimpleDeepSearcher[15] that prioritize simplicity and efficiency. The ongoing challenge remains ensuring that synthetic data faithfully represents the complexity of real-world information-seeking without overfitting to narrow task distributions.

Claimed Contributions

WebShaper formalization-driven data synthesis framework

5 retrieved papers

The authors introduce WebShaper, a novel framework that formalizes information-seeking tasks using set theory and Knowledge Projections (KP). Unlike prior information-driven approaches, this formalization-driven paradigm enables precise control over reasoning structures, broader task coverage, and improved structural consistency between questions and answers.

Knowledge Projections (KP) and set-theoretic formalization

10 retrieved papers

The authors propose Knowledge Projections as the basic unit for formalizing information-seeking tasks, defined using set-theoretic constructs with R-Union and Intersection operations. This formalization allows systematic representation of complex reasoning structures and enables controllable task generation.
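
To make the set-theoretic idea concrete, here is a minimal Python sketch. It is not the authors' implementation: the report does not give the formal definitions, so the sketch assumes a KP maps a relation and a set of target entities to the set of entities linked to any of those targets. The names `kp`, `r_union`, `intersection`, and the toy `FACTS` are all hypothetical.

```python
from typing import FrozenSet, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

# Toy knowledge base; every fact is invented for illustration only.
FACTS: Set[Triple] = {
    ("alice", "plays_for", "team_a"),
    ("bob", "plays_for", "team_a"),
    ("carol", "plays_for", "team_b"),
    ("alice", "born_in", "1990"),
    ("carol", "born_in", "1990"),
    ("bob", "born_in", "1991"),
}


def kp(relation: str, targets: Set[str]) -> FrozenSet[str]:
    """Assumed KP: subjects linked by `relation` to some element of `targets`."""
    return frozenset(s for (s, r, o) in FACTS if r == relation and o in targets)


def r_union(relation: str, *target_sets: Set[str]) -> FrozenSet[str]:
    """Union of KPs that share one relation (our reading of R-Union)."""
    result: FrozenSet[str] = frozenset()
    for targets in target_sets:
        result |= kp(relation, targets)
    return result


def intersection(*projections: FrozenSet[str]) -> FrozenSet[str]:
    """Entities that satisfy every projection at once."""
    result = projections[0]
    for projection in projections[1:]:
        result &= projection
    return result


# "Who played for team_a or team_b and was born in 1990?"
players = r_union("plays_for", {"team_a"}, {"team_b"})
born_1990 = kp("born_in", {"1990"})
answer = intersection(players, born_1990)
```

Under these assumptions a question is a composition of set operations, so its reasoning structure can be read directly from the expression that builds `answer`.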

Agentic Expander with layer-wise expansion strategy

2 retrieved papers

The authors develop an autonomous agent called Expander that iteratively generates complex questions by expanding seed tasks through multi-step processes. It employs a layer-wise expansion strategy to minimize redundancy and prevent reasoning shortcuts, using retrieval and validation tools aligned with the formal task representation.
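
A layer-wise expansion pass can be sketched as follows. This illustrates the general idea only, under the assumption that a formal question is a tree whose leaves are constants and that each pass replaces only the leaves present at the start of that pass. `KPNode`, `expand_layer`, `propose`, and the `REWRITES` table are hypothetical, and the stub proposer stands in for the retrieval and validation tools, which the report does not specify.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Union


@dataclass
class KPNode:
    """One node of a formal question: a relation applied to children."""
    relation: str
    children: List[Union["KPNode", str]] = field(default_factory=list)


Proposer = Callable[[str], Optional[KPNode]]


def expand_layer(node: KPNode, propose: Proposer) -> int:
    """Replace each leaf constant present before this call with a sub-question.

    Newly inserted sub-questions are not revisited in the same pass, so each
    call deepens the tree by at most one layer. Returns the number of leaves
    that were expanded.
    """
    expanded = 0
    for i, child in enumerate(node.children):
        if isinstance(child, KPNode):
            expanded += expand_layer(child, propose)
        else:
            sub = propose(child)
            if sub is not None:
                node.children[i] = sub
                expanded += 1
    return expanded


def leaves(node: KPNode) -> List[str]:
    """Collect leaf constants from left to right."""
    out: List[str] = []
    for child in node.children:
        out.extend(leaves(child) if isinstance(child, KPNode) else [child])
    return out


# Hypothetical rewrites; a real Expander would obtain these via retrieval.
REWRITES: Dict[str, Callable[[], KPNode]] = {
    "team_a": lambda: KPNode("won_title_in", ["2015"]),
    "1990": lambda: KPNode("year_of", ["event_x"]),
    "2015": lambda: KPNode("year_after", ["2014"]),
}


def propose(constant: str) -> Optional[KPNode]:
    make = REWRITES.get(constant)
    return make() if make else None


seed = KPNode("intersection", [KPNode("plays_for", ["team_a"]),
                               KPNode("born_in", ["1990"])])
first_pass = expand_layer(seed, propose)
leaves_after_first = leaves(seed)
second_pass = expand_layer(seed, propose)
leaves_after_second = leaves(seed)
```

In this sketch the first pass rewrites both seed constants and the second pass rewrites only the one new leaf that has a known rewrite, so the question grows one layer at a time rather than repeatedly expanding the same node.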

Core Task Comparisons

Comparisons with papers in the same taxonomy category

[1] Webdancer: Towards autonomous information seeking agency

Jialong Wu, Wenbiao Yin, Liwen Zhang, Zhengwei Tao, Gang Fu, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou (2025)

[5] InfoAgent: Advancing Autonomous Information-Seeking Agents

Gongrui Zhang, Jialiang Zhu, Ruiqi Yang, Kai Qiu, Miaosen Zhang, Zhirong Wu, Qi Dai, Bei Liu, Chong Luo, Zhengyuan Yang, Linjie Li, Lijuan Wang, Weizhu Chen, Yuan Zhang, Xin Li, Zhaoyi Liu, Xin Geng, Baining Guo (2025)

[15] SimpleDeepSearcher: Deep information seeking via web-powered reasoning trajectory synthesis

Shuang Sun, Yuhao Wang, Ruiyang Ren, Jinhao Jiang, Jun-Jie Zhang, Fei Bai, Jia Deng, Wayne Xin Zhao, Zheng Liu, Lei Fang, Zhongyuan, Ji-Rong Wen (2025)

[17] Repurposing synthetic data for fine-grained search agent supervision

Yida Zhao, Kuan Li, Xixi Wu, Liwen Zhang, Maojia Song, Zhuo Chen, Chen-xi Wang, Xinyu Wang, Kewei Tu, Pengjun Xie, Jingren Zhou, Yong Jiang (2025)

[50] Websailor-v2: Bridging the chasm to proprietary agents via synthetic data and scalable reinforcement learning

Kuan Li, Zhongwang Zhang, Huifeng Yin, Rui Ye, Yida Zhao, Liwen Zhang, Litu Ou, Dingchu Zhang, Xixi Wu, Jialong Wu, Xinyu Wang, Zile Qiao, Zhen Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou (2025)

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: WebShaper formalization-driven data synthesis framework

[51] Generative Retrieval via Term Set Generation (Cannot Refute)
[52] Evolving Information Retrieval: From Traditional Models to Emerging Paradigms (Cannot Refute)
[53] Box Embeddings as Set-Theoretic Representations for Information Retrieval & Recommender Systems (Cannot Refute)
[54] Formalizing association semantics in terminologies (Cannot Refute)
[55] Simple Search Engine Model: Adaptive Properties for Doubleton (Cannot Refute)

Contribution: Knowledge Projections (KP) and set-theoretic formalization

[58] Reinforcing compositional retrieval: Retrieving step-by-step for composing informative contexts (Cannot Refute)
[59] Decompositional Reasoning for Graph Retrieval with Large Language Models (Cannot Refute)
[60] Multimodal Reasoning Agent for Zero-Shot Composed Image Retrieval (Cannot Refute)
[61] Compositional attention: Disentangling search and retrieval (Cannot Refute)
[62] Set of diverse queries with uncertainty regularization for composed image retrieval (Cannot Refute)
[63] Align and Aggregate: Compositional Reasoning with Video Alignment and Answer Aggregation for Video Question-Answering (Cannot Refute)
[64] RAVU: Retrieval Augmented Video Understanding with Compositional Reasoning over Graph (Cannot Refute)
[65] Neuro-symbolic fact verification (Cannot Refute)
[66] Semantic Sentence Composition Reasoning for Multi-Hop Question Answering (Cannot Refute)
[67] Comprehensive Relationship Reasoning for Composed Query Based Image Retrieval (Cannot Refute)

Contribution: Agentic Expander with layer-wise expansion strategy

[56] Decoupled Planning and Execution: A Hierarchical Reasoning Framework for Deep Search (Cannot Refute)
[57] Hierarchical Sequence Iteration for Heterogeneous Question Answering (Cannot Refute)
