Scaling Generalist Data-Analytic Agents

ICLR 2026 Conference Submission

Anonymous Authors

OpenReview Score: 6.0

Data Analysis, LLM Agents, Agent Training

Abstract:

Data-analytic agents are emerging as a key catalyst for automated scientific discovery and for the vision of Innovating AI. Current approaches, however, rely heavily on prompt engineering over proprietary models, while open-source models struggle with the diverse-format, large-scale data files and the long-horizon, multi-step reasoning that real-world analytics demands. This paper introduces DataMind, a scalable data synthesis and agent training recipe designed to build generalist data-analytic agents. DataMind tackles three key challenges in building open-source data-analytic agents: insufficient data resources, improper training strategy, and unstable code-based multi-turn rollout. Concretely, DataMind applies 1) a fine-grained task taxonomy and a recursive easy-to-hard task composition mechanism to increase the diversity and difficulty of synthesized queries; 2) a knowledge-augmented trajectory sampling strategy followed by model-based and rule-based filtering; 3) a dynamically adjustable training objective combining both SFT and RL losses; and 4) a memory-frugal and stable code-based multi-turn rollout framework. Built on DataMind, we curate DataMind-12K, a high-quality trajectory set spanning diverse domains, task categories, and data file formats for data-analytic tasks. Trained on DataMind-12K, our DataMind-14B achieves state-of-the-art performance with an average score of 71.16% on multiple data analysis benchmarks, outperforming the strongest proprietary baselines DeepSeek-V3.1 and GPT-5. Our DataMind-7B also performs best among all open-source models with a score of 68.10%. We also incorporate some empirical insights gained from our exploratory trials into the analysis experiments, aiming to provide actionable insights about agentic training for the community. We will release DataMind-12K, DataMind-7B, and DataMind-14B for the community's future research.

Disclaimer

This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.

NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.

If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

DataMind proposes a scalable data synthesis and agent training recipe for building generalist data-analytic agents, addressing challenges in open-source model development for diverse-format data and long-horizon reasoning. The paper resides in the 'Generalist Data-Analytic Agents with Scalable Training' leaf, which contains only two papers total. This sparse population suggests the specific combination of scalable training pipelines and generalist data analytics remains relatively underexplored, positioning the work in an emerging rather than saturated research direction within the broader taxonomy of agent systems.

The taxonomy reveals neighboring research directions that contextualize DataMind's positioning. Database-Centric Analytics focuses on SQL and structured query processing, while Agentic Data Systems emphasizes autonomous analysis of heterogeneous sources. Computer-Use Agents tackle GUI interaction rather than data analytics, and Specialized Training Pipelines explore progressive difficulty enhancement for web or research agents. DataMind bridges scalable training methodology with data-analytic capabilities, diverging from database-specific systems by targeting diverse data formats and from computer-use agents by emphasizing analytical reasoning over interface manipulation.

Among thirty candidates examined, none clearly refuted any of DataMind's three core contributions: the scalable training recipe, the DataMind-12K trajectory dataset, or the resulting 7B/14B models. Each contribution was assessed against ten candidates with zero refutable overlaps identified. This suggests that within the limited search scope, the specific combination of fine-grained task taxonomy, knowledge-augmented trajectory sampling, hybrid SFT-RL training, and memory-frugal rollout appears relatively novel. However, the analysis explicitly covers top-K semantic matches rather than exhaustive prior work, leaving open the possibility of relevant work outside this candidate pool.
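The tally described in the paragraph above can be reproduced with a short sketch. It only restates the counts reported in this document (three contributions, ten candidates each, every verdict "Cannot Refute"); the dictionary keys and the `summarize` helper are illustrative names, not part of the report's pipeline.

```python
from collections import Counter

# Verdicts as listed in this report: three contributions, ten candidates each.
# The keys are shorthand labels chosen for this sketch.
verdicts = {
    "training recipe": ["Cannot Refute"] * 10,
    "DataMind-12K dataset": ["Cannot Refute"] * 10,
    "DataMind-7B/14B models": ["Cannot Refute"] * 10,
}


def summarize(verdicts):
    """Return (total candidates examined, number with a refuting verdict)."""
    total = sum(len(labels) for labels in verdicts.values())
    counts = Counter(label for labels in verdicts.values() for label in labels)
    refuted = total - counts["Cannot Refute"]
    return total, refuted


total, refuted = summarize(verdicts)
print(f"{total} candidates examined, {refuted} refutable overlaps")
```

A count of zero here reflects only the retrieved top-K candidates, which is why the report treats it as a signal within a limited search scope rather than as proof of novelty.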

Based on the limited literature search of thirty candidates, DataMind appears to occupy a sparsely populated research direction with no clear prior work overlap detected. The taxonomy structure confirms that scalable training for generalist data-analytic agents remains less crowded than adjacent areas like computer-use or game-based agents. These signals suggest meaningful novelty within the examined scope, though the analysis does not claim exhaustive coverage of all potentially relevant prior work in data analytics or agent training.

Taxonomy

Research Landscape Overview

Core task: Building generalist data-analytic agents through scalable training.

The field of generalist agents has diversified into several distinct branches, each addressing different interaction modalities and problem settings. Data-Analytic and Query-Driven Agent Systems focus on agents that process structured and unstructured data, often interfacing with databases or analytical workflows. Computer-Use and GUI Interaction Agents tackle desktop and web environments, enabling agents to navigate graphical interfaces much like human users. Embodied and 3D World Generalist Agents extend capabilities into spatial reasoning and physical simulation, while Game-Based Generalist Agents and World Models leverage interactive game environments to train versatile policies. Generalist Agent Frameworks with Minimal Predefinition emphasize open-ended architectures that avoid task-specific engineering, and Specialized Training Pipelines and Data Synthesis explore methods for generating diverse training data at scale. Domain-Specific Applications and Infrastructure round out the taxonomy by addressing deployment challenges in real-world sectors such as retail or biology.

Within this landscape, a particularly active line of work centers on scalable training regimes that combine synthetic data generation with iterative refinement. For instance, AgentSynth[6] and AgentOhana[9] illustrate how large-scale data synthesis can bootstrap agent capabilities, while Agent S2[3] and Alita[5] explore different strategies for balancing generalization with task-specific fine-tuning. Scaling Data-Analytic Agents[0] sits squarely in the Data-Analytic and Query-Driven branch, emphasizing scalable training pipelines tailored to analytical workflows. Compared to AgentOhana[9], which also targets data-centric tasks, Scaling Data-Analytic Agents[0] places greater emphasis on end-to-end scalability and the integration of diverse data modalities. This positioning highlights an ongoing tension in the field: whether to pursue broad generalist frameworks or to specialize training pipelines for high-stakes analytical domains, a question that remains central to advancing both robustness and practical deployment.

Claimed Contributions

DATAMIND scalable data synthesis and agent training recipe

10 retrieved papers

The authors propose DATAMIND, a comprehensive pipeline that addresses key challenges in building open-source data-analytic agents through fine-grained task taxonomy, recursive task composition, knowledge-augmented trajectory sampling, dynamically adjustable training objectives combining SFT and RL losses, and a memory-frugal code-based multi-turn rollout framework.

DATAMIND-12K high-quality trajectory dataset

10 retrieved papers

The authors create DATAMIND-12K, a curated training dataset that covers diverse task categories and data file formats for data analysis tasks, enabling the training of generalist data-analytic agents.

DATAMIND-7B and DATAMIND-14B state-of-the-art models

10 retrieved papers

The authors develop DATAMIND-7B and DATAMIND-14B models trained on their curated dataset, achieving state-of-the-art performance on data analysis benchmarks and outperforming both proprietary and open-source baselines.
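The first contribution above mentions a dynamically adjustable training objective that combines SFT and RL losses. This report does not say how the two terms are weighted, so the following is only a sketch of one common way such an objective could be built: a convex combination whose SFT coefficient decays over training. The cosine schedule, the convex form, and the names `sft_weight` and `combined_loss` are all assumptions made for illustration, not the authors' method.

```python
import math


def sft_weight(step: int, total_steps: int, start: float = 1.0, end: float = 0.0) -> float:
    """Hypothetical cosine decay of the SFT coefficient from `start` to `end`.

    This schedule is an assumption for illustration; the report does not
    describe the actual adjustment rule.
    """
    progress = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * progress))


def combined_loss(sft_loss: float, rl_loss: float, step: int, total_steps: int) -> float:
    """Assumed convex combination: w * L_SFT + (1 - w) * L_RL."""
    w = sft_weight(step, total_steps)
    return w * sft_loss + (1.0 - w) * rl_loss


# Early steps are dominated by the SFT term, late steps by the RL term.
for step in (0, 50, 100):
    print(step, round(combined_loss(2.0, 4.0, step, 100), 4))
```

Under these assumptions the objective starts as pure imitation of the curated trajectories and gradually hands control to the RL signal; other schedules or an additive rather than convex form would be equally consistent with the one-line description given here.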

Core Task Comparisons

Comparisons with papers in the same taxonomy category

[9] AgentOhana: Design unified data and training pipeline for effective agent learning

Jianguo Zhang, Tian Lan, Rithesh Murthy, Zhiwei Liu, Weiran Yao, Ming Zhu, Juntao Tan, Thai Hoang, Liangwei Yang, Zuxin Liu, Yihao Feng, Tulika Manoj Awalgaonkar, Juan Carlos Niebles, Silvio Savarese, Shelby Heinecke, Huan Wang, Caiming Xiong (2024)

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: DATAMIND scalable data synthesis and agent training recipe

Candidate paper | Verdict
[6] AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents | Cannot Refute
[11] CoddLLM: Empowering Large Language Models for Data Analytics | Cannot Refute
[29] RoboCasa: Large-scale simulation of everyday tasks for generalist robots | Cannot Refute
[30] TARGA: Targeted Synthetic Data Generation for Practical Reasoning over Structured Data | Cannot Refute
[31] MAG-V: A multi-agent framework for synthetic data generation and verification | Cannot Refute
[32] On the diversity of synthetic data and its impact on training large language models | Cannot Refute
[33] Repurposing synthetic data for fine-grained search agent supervision | Cannot Refute
[34] Synthesize-on-Graph: Knowledgeable Synthetic Data Generation for Continue Pre-training of Large Language Models | Cannot Refute
[35] Synthetic Data RL: Task Definition Is All You Need | Cannot Refute
[36] UI-TARS-2 technical report: Advancing GUI agent with multi-turn reinforcement learning | Cannot Refute

Contribution: DATAMIND-12K high-quality trajectory dataset

Candidate paper | Verdict
[37] Deep learning for trajectory data management and mining: A survey and beyond | Cannot Refute
[38] Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset | Cannot Refute
[39] OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling | Cannot Refute
[40] A federated pedestrian trajectory prediction model with data privacy protection | Cannot Refute
[41] On collaborative multi-UAV trajectory planning for data collection | Cannot Refute
[42] OnSiteVRU: A High-Resolution Trajectory Dataset for High-Density Vulnerable Road Users | Cannot Refute
[43] Trajectory Data Collection with Local Differential Privacy | Cannot Refute
[44] Trajectory generation: a survey on methods and techniques | Cannot Refute
[45] Trajectory design for UAV-based Internet of Things data collection: A deep reinforcement learning approach | Cannot Refute
[46] MATRIX: multi-agent trajectory generation with diverse contexts | Cannot Refute

Contribution: DATAMIND-7B and DATAMIND-14B state-of-the-art models

Candidate paper | Verdict
[47] The cell tracking challenge: 10 years of objective benchmarking | Cannot Refute
[48] When do neural nets outperform boosted trees on tabular data? | Cannot Refute
[49] SimPO: Simple preference optimization with a reference-free reward | Cannot Refute
[50] Evaluating vision and pathology foundation models for computational pathology: a comprehensive benchmark study | Cannot Refute
[51] ChatGPT vs state-of-the-art models: a benchmarking study in keyphrase generation task | Cannot Refute
[52] UltraFeedback: Boosting language models with high-quality feedback | Cannot Refute
[53] Evaluating the Performance of Large Language Models on GAOKAO Benchmark | Cannot Refute
[54] QLoRA: Efficient finetuning of quantized LLMs | Cannot Refute
[55] Exploring large language models for qualitative data analysis | Cannot Refute
[56] Predictive performance of presence-only species distribution models: a benchmark study with reproducible code | Cannot Refute
