AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: agents, evaluation, benchmarks, scientific research
Abstract:

AI agents hold the potential to revolutionize scientific productivity by automating literature reviews, replicating experiments, analyzing data, and even proposing new directions of inquiry; indeed, there are now many such agents, ranging from general-purpose "deep research" systems to specialized science-specific agents, such as AI Scientist and AIGS. Rigorous evaluation of these agents is critical for progress. Yet existing benchmarks fall short on several fronts: they often (1) lack reproducible agent tools necessary for a controlled comparison of core agentic capabilities; (2) do not account for confounding variables such as model cost and tool access; (3) do not provide standardized interfaces for quick agent prototyping and evaluation; (4) fail to provide holistic, product-informed measures of real-world use cases such as science research; and (5) lack comprehensive baseline agents necessary to identify true advances. In response, we define principles and tooling for more rigorously benchmarking agents. Using these, we present AstaBench, a suite that provides a holistic measure of agentic ability to perform scientific research, comprising 2400+ problems spanning the entire scientific discovery process and multiple scientific domains, and including many problems inspired by actual user requests to deployed Asta agents. Our suite comes with the first scientific research environment with production-grade search tools that enable controlled, reproducible evaluation, better accounting for confounders. Alongside, we provide a comprehensive suite of nine science-optimized classes of Asta agents and numerous baselines. Our extensive evaluation of 57 agents across 22 agent classes reveals several interesting findings, most importantly that despite meaningful progress on certain individual aspects, AI remains far from solving the challenge of science research assistance.

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and claimed contributions against retrieved prior work. While the system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes AstaBench, a comprehensive benchmark suite for evaluating AI agents across the full scientific research workflow, comprising 2400+ problems. It resides in the 'Comprehensive Multi-Task Research Benchmarks' leaf alongside six sibling papers, including SciEval, MLR-Bench, and ScienceAgentBench. This leaf represents a moderately populated research direction within a taxonomy of 50 papers across 21 leaf nodes, indicating active but not overcrowded interest in holistic, multi-task evaluation frameworks that assess end-to-end research capabilities rather than isolated subtasks.

The taxonomy reveals neighboring leaves focused on 'Specialized Task Benchmarks' (targeting specific subtasks like code reproduction or hypothesis validation) and 'Domain-Specific Scientific Benchmarks' (evaluating agents within particular scientific domains). AstaBench's positioning emphasizes breadth across research stages—literature review, experimental design, data analysis—distinguishing it from narrower efforts like SciReplicate-Bench (experimental reproducibility) or domain-focused benchmarks such as NewtonBench. The taxonomy structure shows an ongoing tension between generalist multi-task evaluations and specialist assessments, with AstaBench aligning with the former approach.

Across the 29 candidate papers examined, the 'AstaBench benchmark suite' contribution was compared against nine of them, one of which was judged able to refute it, suggesting some overlap with prior work on comprehensive research benchmarking. The 'Asta Environment with production-grade search tools' contribution was compared against 10 candidates, none of which clearly refuted it, indicating relative novelty in providing standardized, reproducible agent tooling. The 'agent-eval Toolkit and comprehensive agents suite' contribution was likewise compared against 10 candidates with no clear refutations, suggesting that this infrastructure component addresses a less-explored gap in baseline agent provision and rapid prototyping interfaces.

Based on this limited search scope of 29 semantically similar papers, the work appears to offer incremental advances in benchmark comprehensiveness and tooling standardization within an active research area. The analysis covers top-K semantic matches and does not constitute an exhaustive literature review, leaving open the possibility of additional relevant prior work in adjacent communities or recent preprints not captured by the search strategy.
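The "top-K semantic matches" referred to above amount to ranking candidate papers by embedding similarity to the claimed contribution and keeping only the closest few for detailed comparison. The sketch below illustrates that retrieval step in Python; it is a minimal, assumed reconstruction rather than the actual WisPaper pipeline, and the choice of embedding model, K, and scoring is left to the caller.

# Minimal sketch of top-K semantic matching for candidate retrieval.
# The embedding source and K are illustrative assumptions, not the
# actual WisPaper configuration.
import numpy as np

def top_k_candidates(claim_vec: np.ndarray,
                     paper_vecs: np.ndarray,
                     paper_ids: list[str],
                     k: int = 10) -> list[tuple[str, float]]:
    """Return the k papers whose embeddings are most similar to the claim."""
    # Cosine similarity between the claimed contribution and every candidate.
    claim = claim_vec / np.linalg.norm(claim_vec)
    papers = paper_vecs / np.linalg.norm(paper_vecs, axis=1, keepdims=True)
    sims = papers @ claim
    order = np.argsort(-sims)[:k]
    return [(paper_ids[i], float(sims[i])) for i in order]

In a full pipeline, the contribution statement and candidate abstracts would first be embedded with a text-embedding model, and the surviving top-K matches would then be passed to an LLM judge that decides whether each one refutes the claim.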

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 1

Research Landscape Overview

Core task: benchmarking AI agents for scientific research assistance. The field has evolved into a rich ecosystem organized around eight major branches. Benchmark Design and Evaluation Frameworks focuses on creating comprehensive multi-task testbeds that assess agents across diverse research activities, from literature review to experimental design, as exemplified by works like AstaBench[0] and SciEval[6]. AI Agent Architectures and Systems explores the underlying technical implementations, including multi-agent collaboration and tool integration strategies seen in systems such as DeepResearcher[5] and AI Co-Scientist[4]. Human-AI Collaboration and Interaction examines how researchers and AI systems work together, while Adoption, Usage, and Impact Studies tracks real-world deployment patterns. Research Quality and Evaluation Methodologies addresses the challenge of assessing scientific output validity, and Domain-Specific Applications spans areas from drug discovery to materials science. Educational and Training Applications considers how these tools support learning, and Foundational Concepts provides theoretical grounding for agent capabilities and limitations.

A particularly active tension runs between holistic benchmarks that test end-to-end research workflows and narrower evaluations targeting specific subtasks like literature synthesis or experimental replication. Works such as MLR-Bench[11] and ScienceAgentBench[14] illustrate this spectrum, with some emphasizing breadth across research stages and others drilling into reproducibility or domain expertise.

AstaBench[0] sits within the Comprehensive Multi-Task Research Benchmarks cluster, sharing with neighbors like SciEval[6] and MLGym[12] an emphasis on evaluating agents across multiple interconnected research activities rather than isolated skills. Compared to more specialized efforts like SciReplicate-Bench[8], which targets experimental reproducibility, or domain-focused benchmarks such as NewtonBench[9], AstaBench[0] adopts a broader scope that mirrors the multifaceted nature of real scientific inquiry. This positioning reflects an ongoing debate about whether generalist or specialist evaluation paradigms better capture the capabilities needed for meaningful research assistance.

Claimed Contributions

AstaBench benchmark suite for scientific research agents

The authors introduce AstaBench, a comprehensive benchmark suite designed to holistically evaluate AI agents' capabilities in scientific research. It includes over 2400 problems covering the full research pipeline across multiple domains, with many tasks inspired by real user requests from deployed Asta agents.

9 retrieved papers (1 can refute)
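As a rough illustration of what one of the 2400+ problems might look like as data, the hypothetical record below captures the attributes the description above emphasizes: research stage, scientific domain, and whether the task was derived from a real user request. The field names and types are assumptions made for exposition, not the released AstaBench schema.

# Purely illustrative sketch of an AstaBench-style problem record;
# the fields are assumptions, not the actual benchmark format.
from dataclasses import dataclass, field

@dataclass
class ResearchTask:
    task_id: str                  # unique identifier within the suite
    domain: str                   # e.g. "computer science", "biomedicine"
    stage: str                    # research stage: literature review, data analysis, ...
    prompt: str                   # the problem statement given to the agent
    reference: str | None = None  # gold answer or grading rubric, if available
    source: str = "synthetic"     # e.g. "user-request" for tasks derived from real Asta usage
    tools_allowed: list[str] = field(default_factory=list)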
Asta Environment with production-grade search tools

The authors develop the Asta Environment, which provides the first realistic and reproducible scientific research environment for agents. It features production-grade search tools with date-restricted access to scientific literature, enabling controlled comparison of agents while accounting for confounding variables.

10 retrieved papers (none can refute)
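A minimal sketch of the date-restriction idea described above: search results published after a task-specific cutoff are hidden, so every agent sees the same snapshot of the literature regardless of when it is evaluated. The function signature and record fields below are assumptions for illustration, not the actual Asta Environment API.

# Hedged sketch of date-restricted literature search. The caller supplies the
# backend search function; result records are assumed to carry a "published" date.
from datetime import date
from typing import Callable

def date_restricted_search(search_fn: Callable[[str, int], list[dict]],
                           query: str,
                           cutoff: date,
                           k: int = 20) -> list[dict]:
    """Run `search_fn`, then hide anything published after `cutoff`."""
    hits = search_fn(query, 5 * k)   # over-fetch, since some hits will be filtered out
    visible = [h for h in hits if h["published"] <= cutoff]
    return visible[:k]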
agent-eval Toolkit and comprehensive agents suite

The authors present the agent-eval toolkit for standardized agent evaluation with time-invariant cost tracking, alongside the agent-baselines suite containing nine science-optimized Asta agent classes and numerous baselines. This represents the most comprehensive standardized agents suite for scientific research tasks.

10 retrieved papers (none can refute)
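The "time-invariant cost tracking" mentioned above can be read as pricing token usage against a frozen price table shipped with the toolkit, so that a run's reported cost does not change when providers later adjust their prices. The sketch below shows that accounting scheme under assumed, placeholder prices; it is not the agent-eval implementation.

# Sketch of "time-invariant" cost accounting with a frozen price table.
# The prices are placeholders, not figures used by agent-eval.
FROZEN_PRICES_PER_MTOKEN = {
    # model name: (input $/M tokens, output $/M tokens) -- illustrative only
    "example-model-small": (0.50, 1.50),
    "example-model-large": (5.00, 15.00),
}

def run_cost(usage: dict[str, tuple[int, int]]) -> float:
    """Sum cost over {model: (input_tokens, output_tokens)} using frozen prices."""
    total = 0.0
    for model, (tokens_in, tokens_out) in usage.items():
        p_in, p_out = FROZEN_PRICES_PER_MTOKEN[model]
        total += tokens_in / 1e6 * p_in + tokens_out / 1e6 * p_out
    return total

The intent of freezing prices this way is to keep cost-versus-score comparisons stable across evaluations run at different times.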

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

