Scaling Generalist Data-Analytic Agents

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Data Analysis, LLM Agents, Agent Training
Abstract:

Data-analytic agents are emerging as a key catalyst for automated scientific discovery and for the vision of Innovating AI. Current approaches, however, rely heavily on prompt engineering over proprietary models, while open-source models struggle to handle the diverse-format, large-scale data files and the long-horizon, multi-step reasoning that real-world analytics demands. This paper introduces DataMind, a scalable data synthesis and agent training recipe designed to build generalist data-analytic agents. DataMind tackles three key challenges in building open-source data-analytic agents: insufficient data resources, improper training strategies, and unstable code-based multi-turn rollout. Concretely, DataMind applies 1) a fine-grained task taxonomy and a recursive easy-to-hard task composition mechanism to increase the diversity and difficulty of synthesized queries; 2) a knowledge-augmented trajectory sampling strategy followed by model-based and rule-based filtering; 3) a dynamically adjustable training objective combining both SFT and RL losses; 4) a memory-frugal and stable code-based multi-turn rollout framework. Built on DataMind, we curate DataMind-12K, a high-quality trajectory set spanning diverse domains, task categories, and data file formats for data-analytic tasks. Trained on DataMind-12K, our DataMind-14B achieves state-of-the-art performance, with an average score of 71.16% across multiple data analysis benchmarks, outperforming the strongest proprietary baselines, DeepSeek-V3.1 and GPT-5. Our DataMind-7B also performs best among all open-source models, with a score of 68.10%. We also incorporate empirical insights gained from our exploratory trials into the analysis experiments, aiming to provide actionable guidance on agentic training for the community. We will release DataMind-12K, DataMind-7B, and DataMind-14B for the community's future research.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

DataMind proposes a scalable data synthesis and agent training recipe for building generalist data-analytic agents, addressing challenges in open-source model development for diverse-format data and long-horizon reasoning. The paper resides in the 'Generalist Data-Analytic Agents with Scalable Training' leaf, which contains only two papers total. This sparse population suggests the specific combination of scalable training pipelines and generalist data analytics remains relatively underexplored, positioning the work in an emerging rather than saturated research direction within the broader taxonomy of agent systems.

The taxonomy reveals neighboring research directions that contextualize DataMind's positioning. Database-Centric Analytics focuses on SQL and structured query processing, while Agentic Data Systems emphasizes autonomous analysis of heterogeneous sources. Computer-Use Agents tackle GUI interaction rather than data analytics, and Specialized Training Pipelines explore progressive difficulty enhancement for web or research agents. DataMind bridges scalable training methodology with data-analytic capabilities, diverging from database-specific systems by targeting diverse data formats and from computer-use agents by emphasizing analytical reasoning over interface manipulation.

Among thirty candidates examined, none clearly refuted any of DataMind's three core contributions: the scalable training recipe, the DataMind-12K trajectory dataset, or the resulting 7B/14B models. Each contribution was assessed against ten candidates with zero refutable overlaps identified. This suggests that within the limited search scope, the specific combination of fine-grained task taxonomy, knowledge-augmented trajectory sampling, hybrid SFT-RL training, and memory-frugal rollout appears relatively novel. However, the analysis explicitly covers top-K semantic matches rather than exhaustive prior work, leaving open the possibility of relevant work outside this candidate pool.
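The counts in this paragraph can be cross-checked with a small tally. The sketch below is illustrative only: the dictionary keys are shorthand labels introduced here for the three contributions, and the numbers are those stated in this report (ten candidates per contribution, zero refutable overlaps).

```python
# Per-contribution figures as stated in this report.
# The keys are shorthand labels chosen for this example, not official names.
candidates_per_contribution = {
    "training_recipe": 10,
    "datamind_12k_dataset": 10,
    "datamind_7b_14b_models": 10,
}
refutable_per_contribution = {name: 0 for name in candidates_per_contribution}

# Totals that the summary statistics should agree with.
total_candidates = sum(candidates_per_contribution.values())
total_refutable = sum(refutable_per_contribution.values())

print(f"contributions: {len(candidates_per_contribution)}")
print(f"candidates compared: {total_candidates}")
print(f"refutable: {total_refutable}")
```

The totals (three contributions, thirty candidates, zero refutable) agree with the summary figures reported under Taxonomy.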

Based on the limited literature search of thirty candidates, DataMind appears to occupy a sparsely populated research direction with no clear prior work overlap detected. The taxonomy structure confirms that scalable training for generalist data-analytic agents remains less crowded than adjacent areas like computer-use or game-based agents. These signals suggest meaningful novelty within the examined scope, though the analysis does not claim exhaustive coverage of all potentially relevant prior work in data analytics or agent training.

Taxonomy

Core-task Taxonomy Papers: 28
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Building generalist data-analytic agents through scalable training.

The field of generalist agents has diversified into several distinct branches, each addressing different interaction modalities and problem settings. Data-Analytic and Query-Driven Agent Systems focus on agents that process structured and unstructured data, often interfacing with databases or analytical workflows. Computer-Use and GUI Interaction Agents tackle desktop and web environments, enabling agents to navigate graphical interfaces much like human users. Embodied and 3D World Generalist Agents extend capabilities into spatial reasoning and physical simulation, while Game-Based Generalist Agents and World Models leverage interactive game environments to train versatile policies. Generalist Agent Frameworks with Minimal Predefinition emphasize open-ended architectures that avoid task-specific engineering, and Specialized Training Pipelines and Data Synthesis explore methods for generating diverse training data at scale. Domain-Specific Applications and Infrastructure round out the taxonomy by addressing deployment challenges in real-world sectors such as retail or biology.

Within this landscape, a particularly active line of work centers on scalable training regimes that combine synthetic data generation with iterative refinement. For instance, AgentSynth[6] and Agentohana[9] illustrate how large-scale data synthesis can bootstrap agent capabilities, while Agent s2[3] and Alita[5] explore different strategies for balancing generalization with task-specific fine-tuning. Scaling Data-Analytic Agents[0] sits squarely in the Data-Analytic and Query-Driven branch, emphasizing scalable training pipelines tailored to analytical workflows. Compared to Agentohana[9], which also targets data-centric tasks, Scaling Data-Analytic Agents[0] places greater emphasis on end-to-end scalability and the integration of diverse data modalities. This positioning highlights an ongoing tension in the field: whether to pursue broad generalist frameworks or to specialize training pipelines for high-stakes analytical domains, a question that remains central to advancing both robustness and practical deployment.

Claimed Contributions

DATAMIND scalable data synthesis and agent training recipe

The authors propose DATAMIND, a comprehensive pipeline that addresses key challenges in building open-source data-analytic agents through fine-grained task taxonomy, recursive task composition, knowledge-augmented trajectory sampling, dynamically adjustable training objectives combining SFT and RL losses, and a memory-frugal code-based multi-turn rollout framework.
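To make the idea of a "dynamically adjustable training objective combining SFT and RL losses" concrete, here is a minimal sketch. It is not the authors' formulation: this report does not state how the balance is adjusted, so the cosine decay schedule, the additive form of the combination, and the names `sft_weight` and `combined_loss` are all assumptions made for illustration.

```python
import math


def sft_weight(step: int, total_steps: int, start: float = 1.0, end: float = 0.0) -> float:
    """Hypothetical schedule: cosine decay of the SFT coefficient from `start` to `end`."""
    # Clamp training progress to [0, 1] so out-of-range steps stay well defined.
    progress = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * progress))


def combined_loss(sft_loss: float, rl_loss: float, step: int, total_steps: int) -> float:
    """Assumed combination: RL loss plus a time-varying multiple of the SFT loss."""
    return sft_weight(step, total_steps) * sft_loss + rl_loss


# Early in training the SFT term dominates; by the end only the RL term remains.
for step in (0, 50, 100):
    print(step, round(combined_loss(sft_loss=2.0, rl_loss=1.0, step=step, total_steps=100), 4))
```

Under these assumptions, the SFT term acts as a strong imitation signal early on and fades as training proceeds; other schedules (for example, adapting the coefficient to a training statistic) would fit the same description equally well.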

10 retrieved papers

DATAMIND-12K high-quality trajectory dataset

The authors create DATAMIND-12K, a curated training dataset that covers diverse task categories and data file formats for data analysis tasks, enabling the training of generalist data-analytic agents.

10 retrieved papers

DATAMIND-7B and DATAMIND-14B state-of-the-art models

The authors develop DATAMIND-7B and DATAMIND-14B models trained on their curated dataset, achieving state-of-the-art performance on data analysis benchmarks and outperforming both proprietary and open-source baselines.

10 retrieved papers
