AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: agents, evaluation, benchmarks, scientific research
Abstract:

AI agents hold the potential to revolutionize scientific productivity by automating literature reviews, replicating experiments, analyzing data, and even proposing new directions of inquiry; indeed, there are now many such agents, ranging from general-purpose "deep research" systems to specialized science-specific agents, such as AI Scientist and AIGS. Rigorous evaluation of these agents is critical for progress. Yet existing benchmarks fall short on several fronts: they often (1) lack reproducible agent tools necessary for a controlled comparison of core agentic capabilities; (2) do not account for confounding variables such as model cost and tool access; (3) do not provide standardized interfaces for quick agent prototyping and evaluation; (4) fail to provide holistic, product-informed measures of real-world use cases such as science research; and (5) lack comprehensive baseline agents necessary to identify true advances. In response, we define principles and tooling for more rigorously benchmarking agents. Using these, we present AstaBench, a suite that provides a holistic measure of agentic ability to perform scientific research, comprising 2400+ problems spanning the entire scientific discovery process and multiple scientific domains, and including many problems inspired by actual user requests to deployed Asta agents. Our suite comes with the first scientific research environment with production-grade search tools that enable controlled, reproducible evaluation, better accounting for confounders. Alongside, we provide a comprehensive suite of nine science-optimized classes of Asta agents and numerous baselines. Our extensive evaluation of 57 agents across 22 agent classes reveals several interesting findings, most importantly that despite meaningful progress on certain individual aspects, AI remains far from solving the challenge of science research assistance.

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and claimed contributions against retrieved prior work. While the system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes AstaBench, a comprehensive benchmark suite for evaluating AI agents across the full scientific research workflow, comprising 2400+ problems. It resides in the 'Comprehensive Multi-Task Research Benchmarks' leaf alongside six sibling papers, including SciEval, MLR-Bench, and ScienceAgentBench. This leaf represents a moderately populated research direction within a taxonomy of 50 papers across 21 leaf nodes, indicating active but not overcrowded interest in holistic, multi-task evaluation frameworks that assess end-to-end research capabilities rather than isolated subtasks.

The taxonomy reveals neighboring leaves focused on 'Specialized Task Benchmarks' (targeting specific subtasks like code reproduction or hypothesis validation) and 'Domain-Specific Scientific Benchmarks' (evaluating agents within particular scientific domains). AstaBench's positioning emphasizes breadth across research stages—literature review, experimental design, data analysis—distinguishing it from narrower efforts like SciReplicate-Bench (experimental reproducibility) or domain-focused benchmarks such as NewtonBench. The taxonomy structure shows an ongoing tension between generalist multi-task evaluations and specialist assessments, with AstaBench aligning with the former approach.

Across the 29 candidate papers examined, the 'AstaBench benchmark suite' contribution was compared against nine of them, one of which was judged able to refute it, suggesting some overlap with prior work on comprehensive research benchmarking. The 'Asta Environment with production-grade search tools' contribution was compared against 10 candidates, none of which clearly refuted it, indicating relative novelty in providing standardized, reproducible agent tooling. The 'agent-eval Toolkit and comprehensive agents suite' contribution was likewise compared against 10 candidates with no clear refutations, suggesting that this infrastructure component addresses a less-explored gap in baseline agent provision and rapid prototyping interfaces.

Based on this limited search scope of 29 semantically similar papers, the work appears to offer incremental advances in benchmark comprehensiveness and tooling standardization within an active research area. The analysis covers top-K semantic matches and does not constitute an exhaustive literature review, leaving open the possibility of additional relevant prior work in adjacent communities or recent preprints not captured by the search strategy.
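The "top-K semantic matches" referred to above amount to ranking candidate papers by embedding similarity to the claimed contribution and keeping only the closest few for detailed comparison. The sketch below illustrates that retrieval step in Python; it is a minimal, assumed reconstruction rather than the actual WisPaper pipeline, and the choice of embedding model, K, and scoring is left to the caller.

# Minimal sketch of top-K semantic matching for candidate retrieval.
# The embedding source and K are illustrative assumptions, not the
# actual WisPaper configuration.
import numpy as np

def top_k_candidates(claim_vec: np.ndarray,
                     paper_vecs: np.ndarray,
                     paper_ids: list[str],
                     k: int = 10) -> list[tuple[str, float]]:
    """Return the k papers whose embeddings are most similar to the claim."""
    # Cosine similarity between the claimed contribution and every candidate.
    claim = claim_vec / np.linalg.norm(claim_vec)
    papers = paper_vecs / np.linalg.norm(paper_vecs, axis=1, keepdims=True)
    sims = papers @ claim
    order = np.argsort(-sims)[:k]
    return [(paper_ids[i], float(sims[i])) for i in order]

In a full pipeline, the contribution statement and candidate abstracts would first be embedded with a text-embedding model, and the surviving top-K matches would then be passed to an LLM judge that decides whether each one refutes the claim.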

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 1

Research Landscape Overview

Core task: benchmarking AI agents for scientific research assistance. The field has evolved into a rich ecosystem organized around eight major branches. Benchmark Design and Evaluation Frameworks focuses on creating comprehensive multi-task testbeds that assess agents across diverse research activities, from literature review to experimental design, as exemplified by works like AstaBench[0] and SciEval[6]. AI Agent Architectures and Systems explores the underlying technical implementations, including multi-agent collaboration and tool integration strategies seen in systems such as DeepResearcher[5] and AI Co-Scientist[4]. Human-AI Collaboration and Interaction examines how researchers and AI systems work together, while Adoption, Usage, and Impact Studies tracks real-world deployment patterns. Research Quality and Evaluation Methodologies addresses the challenge of assessing scientific output validity, and Domain-Specific Applications spans areas from drug discovery to materials science. Educational and Training Applications considers how these tools support learning, and Foundational Concepts provides theoretical grounding for agent capabilities and limitations.

A particularly active tension runs between holistic benchmarks that test end-to-end research workflows and narrower evaluations targeting specific subtasks like literature synthesis or experimental replication. Works such as MLR-Bench[11] and ScienceAgentBench[14] illustrate this spectrum, with some emphasizing breadth across research stages and others drilling into reproducibility or domain expertise.

AstaBench[0] sits within the Comprehensive Multi-Task Research Benchmarks cluster, sharing with neighbors like SciEval[6] and MLGym[12] an emphasis on evaluating agents across multiple interconnected research activities rather than isolated skills. Compared to more specialized efforts like SciReplicate-Bench[8], which targets experimental reproducibility, or domain-focused benchmarks such as NewtonBench[9], AstaBench[0] adopts a broader scope that mirrors the multifaceted nature of real scientific inquiry. This positioning reflects an ongoing debate about whether generalist or specialist evaluation paradigms better capture the capabilities needed for meaningful research assistance.

Claimed Contributions

AstaBench benchmark suite for scientific research agents

The authors introduce AstaBench, a comprehensive benchmark suite designed to holistically evaluate AI agents' capabilities in scientific research. It includes over 2400 problems covering the full research pipeline across multiple domains, with many tasks inspired by real user requests from deployed Asta agents.

9 retrieved papers (1 can refute)
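As a rough illustration of what one of the 2400+ problems might look like as data, the hypothetical record below captures the attributes the description above emphasizes: research stage, scientific domain, and whether the task was derived from a real user request. The field names and types are assumptions made for exposition, not the released AstaBench schema.

# Purely illustrative sketch of an AstaBench-style problem record;
# the fields are assumptions, not the actual benchmark format.
from dataclasses import dataclass, field

@dataclass
class ResearchTask:
    task_id: str                  # unique identifier within the suite
    domain: str                   # e.g. "computer science", "biomedicine"
    stage: str                    # research stage: literature review, data analysis, ...
    prompt: str                   # the problem statement given to the agent
    reference: str | None = None  # gold answer or grading rubric, if available
    source: str = "synthetic"     # e.g. "user-request" for tasks derived from real Asta usage
    tools_allowed: list[str] = field(default_factory=list)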
Asta Environment with production-grade search tools

The authors develop the Asta Environment, which provides the first realistic and reproducible scientific research environment for agents. It features production-grade search tools with date-restricted access to scientific literature, enabling controlled comparison of agents while accounting for confounding variables.

10 retrieved papers (none can refute)
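A minimal sketch of the date-restriction idea described above: search results published after a task-specific cutoff are hidden, so every agent sees the same snapshot of the literature regardless of when it is evaluated. The function signature and record fields below are assumptions for illustration, not the actual Asta Environment API.

# Hedged sketch of date-restricted literature search. The caller supplies the
# backend search function; result records are assumed to carry a "published" date.
from datetime import date
from typing import Callable

def date_restricted_search(search_fn: Callable[[str, int], list[dict]],
                           query: str,
                           cutoff: date,
                           k: int = 20) -> list[dict]:
    """Run `search_fn`, then hide anything published after `cutoff`."""
    hits = search_fn(query, 5 * k)   # over-fetch, since some hits will be filtered out
    visible = [h for h in hits if h["published"] <= cutoff]
    return visible[:k]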
agent-eval Toolkit and comprehensive agents suite

The authors present the agent-eval toolkit for standardized agent evaluation with time-invariant cost tracking, alongside the agent-baselines suite containing nine science-optimized Asta agent classes and numerous baselines. This represents the most comprehensive standardized agents suite for scientific research tasks.

10 retrieved papers (none can refute)
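The "time-invariant cost tracking" mentioned above can be read as pricing token usage against a frozen price table shipped with the toolkit, so that a run's reported cost does not change when providers later adjust their prices. The sketch below shows that accounting scheme under assumed, placeholder prices; it is not the agent-eval implementation.

# Sketch of "time-invariant" cost accounting with a frozen price table.
# The prices are placeholders, not figures used by agent-eval.
FROZEN_PRICES_PER_MTOKEN = {
    # model name: (input $/M tokens, output $/M tokens) -- illustrative only
    "example-model-small": (0.50, 1.50),
    "example-model-large": (5.00, 15.00),
}

def run_cost(usage: dict[str, tuple[int, int]]) -> float:
    """Sum cost over {model: (input_tokens, output_tokens)} using frozen prices."""
    total = 0.0
    for model, (tokens_in, tokens_out) in usage.items():
        p_in, p_out = FROZEN_PRICES_PER_MTOKEN[model]
        total += tokens_in / 1e6 * p_in + tokens_out / 1e6 * p_out
    return total

The intent of freezing prices this way is to keep cost-versus-score comparisons stable across evaluations run at different times.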

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

