Abstract:

Automating AI research holds immense potential for accelerating scientific progress, yet current AI agents struggle with the complexities of rigorous, end-to-end experimentation. We introduce EXP-Bench, a novel benchmark designed to systematically evaluate AI agents on complete research experiments sourced from influential AI publications. Given a research question and incomplete starter code, EXP-Bench challenges AI agents to formulate hypotheses, design and implement experimental procedures, execute them, and analyze results. To enable the creation of such intricate and authentic tasks with high fidelity, we design a semi-autonomous pipeline that extracts and structures crucial experimental details from these research papers and their associated open-source code. Using this pipeline, we curated 461 AI research tasks from 51 top-tier AI research papers for EXP-Bench. Evaluations of leading AI agents, such as OpenHands and IterativeAgent, on EXP-Bench demonstrate only partial capabilities: while scores on individual experimental aspects such as design or implementation correctness reach 20-35%, the success rate for complete, executable experiments is a mere 0.5%. By identifying these bottlenecks and providing realistic step-by-step experiment procedures, EXP-Bench serves as a vital tool for improving future AI agents' ability to conduct AI research experiments.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes each paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

EXP-Bench introduces a benchmark for evaluating AI agents on complete machine learning research experiments, from hypothesis formulation through result analysis. The paper resides in the 'Machine Learning Research Experimentation' leaf, which contains seven papers total, indicating a moderately populated research direction. This leaf sits within the broader 'End-to-End Research Experiment Automation' branch, distinguishing itself from partial automation approaches by requiring agents to handle the full research lifecycle rather than isolated subtasks like code execution or literature review.

The taxonomy reveals several neighboring research directions that contextualize this work. The sibling leaf 'Research Paper Replication and Reproduction' focuses on recreating published results rather than conducting novel experiments, while 'Autonomous Research Discovery and Hypothesis Generation' emphasizes open-ended scientific discovery without predefined workflows. Adjacent branches include 'Research Task Decomposition and Workflow Automation,' which examines how agents break down complex research into manageable subtasks, and 'Research Agent Evaluation Frameworks,' which provides standardized assessment methodologies. EXP-Bench bridges these areas by requiring both complete experimental execution and systematic evaluation across multiple capability dimensions.

Among the thirty candidate papers examined (ten per contribution), the core benchmark contribution shows some prior overlap: one of its ten candidates appears to provide refutable prior work, suggesting existing benchmarks in this space. For the semi-automated pipeline that extracts research tasks from papers, none of the ten candidates clearly refutes the approach, indicating relative methodological novelty. Similarly, none of the ten candidates for the multi-metric evaluation framework offers a clear refutation. These statistics suggest that while the benchmark concept has precedent within the limited search scope, the specific extraction pipeline and evaluation methodology may represent the more distinctive contributions within the examined literature.

Based on the top-thirty semantic matches examined, the work appears to build incrementally on established benchmark paradigms while introducing methodological refinements in task extraction and evaluation granularity. The analysis does not cover the full breadth of machine learning benchmarking literature, and the single refutable candidate for the core contribution warrants closer examination to understand the precise overlap. The relatively crowded taxonomy leaf suggests active research interest in this direction, though the specific combination of contributions may still offer value to the community.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Paper: 1

Research Landscape Overview

Core task: evaluating AI agents on conducting complete AI research experiments. The field has organized itself into several major branches that reflect different scopes and emphases. End-to-End Research Experiment Automation focuses on systems that handle the full lifecycle of research tasks, from hypothesis generation through experimental execution to result interpretation, as seen in works like MLAgentBench[2] and MLR-Bench[6]. Domain-Specific Research Automation targets particular scientific areas such as biomedical discovery (BioDiscoveryAgent[12], Empowering biomedical discovery with[8]) or traffic modeling (Automating Traffic Model Enhancement[14]), tailoring evaluation to specialized workflows. Research Task Decomposition and Workflow Automation examines how agents break down complex research into manageable subtasks, often leveraging multi-agent collaboration (Chain-of-Agents[33]) or structured pipelines (AFlow[26]). Research Agent Evaluation Frameworks and Methodologies provide standardized testbeds and metrics, including benchmarks like PaperBench[3] and ResearcherBench[34]. General-Purpose AI Agent Systems and Platforms explore broader architectures that can adapt across diverse research contexts, while Specialized Benchmarks and Evaluation Domains offer targeted assessments in areas like code generation (SciCode[32]) or web navigation (WebVoyager[20]).

Within the End-to-End Research Experiment Automation branch, a particularly active line of work centers on machine learning research experimentation, where agents must navigate the full cycle of dataset selection, model training, hyperparameter tuning, and result analysis. EXP-Bench[0] situates itself in this cluster alongside MLAgentBench[2] and MLR-Bench[6], but emphasizes comprehensive evaluation across multiple dimensions of experimental competence. Compared to MLAgentBench[2], which pioneered agent-driven ML workflows, EXP-Bench[0] appears to broaden the scope of tasks and evaluation criteria. Meanwhile, works like AIDE[38] and MLGym[22] explore complementary angles: AIDE[38] focuses on iterative debugging and refinement, and MLGym[22] on interactive learning environments. A central tension across these efforts involves balancing task realism with reproducibility: fully open-ended research scenarios can be difficult to score objectively, while overly constrained benchmarks may not capture the creative problem-solving that defines genuine research.

Claimed Contributions

EXP-Bench benchmark for evaluating AI agents on complete research experiments

The authors present EXP-Bench, a benchmark that evaluates AI agents on their ability to conduct end-to-end AI research experimentation. Given a research question and incomplete starter code, agents must formulate hypotheses, design and implement experimental procedures, execute them, and analyze results. The benchmark comprises 461 tasks from 51 top-tier AI papers.

10 retrieved papers
Can Refute
Semi-automated pipeline for extracting research tasks from papers and code

The authors develop a semi-automated dataset curation pipeline that systematically extracts and structures experimental tasks from research papers and their codebases. The pipeline combines multi-modal extraction (from papers, supplementary materials, and code) with implementation extraction and execution-based validation, enabling scalable construction of high-fidelity research tasks.

10 retrieved papers
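The staged structure described above (multi-modal extraction followed by execution-based validation) can be sketched as a simple filter pipeline. This is a minimal illustration, not the authors' actual implementation: every name here (`TaskCandidate`, `extract_from_paper`, `validate_by_execution`, the dictionary fields) is hypothetical, and real execution-based validation would run the reference code rather than perform the trivial check used below.

```python
from dataclasses import dataclass

@dataclass
class TaskCandidate:
    question: str        # research question mined from the paper text
    starter_code: str    # incomplete code sliced from the repository
    ground_truth: str    # expected result used for grading

def extract_from_paper(paper: dict) -> list[TaskCandidate]:
    """Stage 1 (hypothetical): extract candidate tasks from a paper record."""
    return [TaskCandidate(q, paper["code_stub"], paper["result"])
            for q in paper["questions"]]

def validate_by_execution(candidate: TaskCandidate) -> bool:
    """Stage 2 (hypothetical): keep only tasks that could actually be run.
    A real pipeline would execute the reference solution; here we only
    check that the required pieces are present."""
    return bool(candidate.starter_code) and bool(candidate.ground_truth)

def curate(papers: list[dict]) -> list[TaskCandidate]:
    """Chain extraction and validation, discarding non-executable tasks."""
    tasks = []
    for paper in papers:
        for cand in extract_from_paper(paper):
            if validate_by_execution(cand):
                tasks.append(cand)
    return tasks

# Toy input standing in for a parsed paper + repository.
papers = [{"questions": ["Does ablation X hurt accuracy?"],
           "code_stub": "def train(): ...",
           "result": "accuracy drops 2.1%"}]
print(len(curate(papers)))  # → 1
```

The key design point this sketch captures is that validation acts as a hard filter between extraction and inclusion, which is what makes the resulting tasks "high-fidelity" in the sense the contribution claims.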
Multi-metric evaluation framework for assessing experimental capabilities

The authors introduce a comprehensive evaluation framework that assesses AI agents across multiple dimensions of the research process, including experimental design correctness, implementation completeness, code execution success, and conclusion validity. This conjunctive evaluation approach reveals specific bottlenecks in agent capabilities and provides fine-grained feedback for improvement.

10 retrieved papers
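The conjunctive nature of this evaluation explains why moderate per-aspect scores can coexist with a near-zero full-success rate. The following sketch is purely illustrative (the class, field names, and toy data are assumptions, not the benchmark's actual scoring code): an attempt counts as a full success only when every aspect passes, so the conjunctive rate is bounded above by the weakest per-aspect rate and is typically far lower.

```python
from dataclasses import dataclass

@dataclass
class AttemptScores:
    design_correct: bool          # experimental design matches the target
    implementation_complete: bool # required code was fully written
    execution_succeeded: bool     # the experiment actually ran
    conclusion_valid: bool        # the drawn conclusion is supported

    def full_success(self) -> bool:
        # Conjunctive criterion: all aspects must hold simultaneously.
        return (self.design_correct
                and self.implementation_complete
                and self.execution_succeeded
                and self.conclusion_valid)

# Toy attempts: individual aspects often pass, but rarely all at once.
attempts = [
    AttemptScores(True, True, False, True),
    AttemptScores(True, False, False, False),
    AttemptScores(True, True, True, True),
    AttemptScores(False, True, False, True),
]

design_rate = sum(a.design_correct for a in attempts) / len(attempts)
full_rate = sum(a.full_success() for a in attempts) / len(attempts)
print(design_rate, full_rate)  # → 0.75 0.25
```

The same arithmetic, at benchmark scale, mirrors the gap the paper reports between 20-35% per-aspect scores and the 0.5% rate for complete, executable experiments.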

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

EXP-Bench benchmark for evaluating AI agents on complete research experiments

The authors present EXP-Bench, a benchmark that evaluates AI agents on their ability to conduct end-to-end AI research experimentation. Given a research question and incomplete starter code, agents must formulate hypotheses, design and implement experimental procedures, execute them, and analyze results. The benchmark comprises 461 tasks from 51 top-tier AI papers.

Contribution

Semi-automated pipeline for extracting research tasks from papers and code

The authors develop a semi-automated dataset curation pipeline that systematically extracts and structures experimental tasks from research papers and their codebases. The pipeline combines multi-modal extraction (from papers, supplementary materials, and code) with implementation extraction and execution-based validation, enabling scalable construction of high-fidelity research tasks.

Contribution

Multi-metric evaluation framework for assessing experimental capabilities

The authors introduce a comprehensive evaluation framework that assesses AI agents across multiple dimensions of the research process, including experimental design correctness, implementation completeness, code execution success, and conclusion validity. This conjunctive evaluation approach reveals specific bottlenecks in agent capabilities and provides fine-grained feedback for improvement.