EXP-Bench: Can AI Conduct AI Research Experiments?
Overview
Overall Novelty Assessment
EXP-Bench introduces a benchmark for evaluating AI agents on complete machine learning research experiments, from hypothesis formulation through result analysis. The paper resides in the 'Machine Learning Research Experimentation' leaf, which contains seven papers in total, indicating a moderately populated research direction. The leaf sits within the broader 'End-to-End Research Experiment Automation' branch; the work distinguishes itself from partial-automation approaches by requiring agents to handle the full research lifecycle rather than isolated subtasks such as code execution or literature review.
The taxonomy reveals several neighboring research directions that contextualize this work. The sibling leaf 'Research Paper Replication and Reproduction' focuses on recreating published results rather than conducting novel experiments, while 'Autonomous Research Discovery and Hypothesis Generation' emphasizes open-ended scientific discovery without predefined workflows. Adjacent branches include 'Research Task Decomposition and Workflow Automation,' which examines how agents break down complex research into manageable subtasks, and 'Research Agent Evaluation Frameworks,' which provides standardized assessment methodologies. EXP-Bench bridges these areas by requiring both complete experimental execution and systematic evaluation across multiple capability dimensions.
Across the thirty candidates examined (ten per contribution), the core benchmark contribution shows some prior overlap: one of its ten candidates appears to constitute refuting prior work, suggesting existing benchmarks already occupy this space. For the semi-automated pipeline that extracts research tasks from papers, none of the ten candidates clearly refutes the approach, indicating relative methodological novelty. Likewise, none of the ten candidates for the multi-metric evaluation framework offers a clear refutation. These statistics suggest that while the benchmark concept has precedent within the limited search scope, the extraction pipeline and evaluation methodology may be the more distinctive contributions in the examined literature.
Based on the top thirty semantic matches examined, the work appears to build incrementally on established benchmark paradigms while introducing methodological refinements in task extraction and evaluation granularity. The analysis does not cover the full breadth of the machine learning benchmarking literature, and the single potentially refuting candidate for the core contribution warrants closer examination to understand the precise overlap. The moderately populated taxonomy leaf signals active research interest in this direction, though the specific combination of contributions may still offer value to the community.
Claimed Contributions
The authors present EXP-Bench, a benchmark that evaluates AI agents on their ability to conduct end-to-end AI research experimentation. Given a research question and incomplete starter code, agents must formulate hypotheses, design and implement experimental procedures, execute them, and analyze results. The benchmark comprises 461 tasks from 51 top-tier AI papers.
The authors develop a semi-automated dataset curation pipeline that systematically extracts and structures experimental tasks from research papers and their codebases. The pipeline combines multi-modal extraction (from papers, supplementary materials, and code) with implementation extraction and execution-based validation, enabling scalable construction of high-fidelity research tasks.
The authors introduce a comprehensive evaluation framework that assesses AI agents across multiple dimensions of the research process, including experimental design correctness, implementation completeness, code execution success, and conclusion validity. This conjunctive evaluation approach reveals specific bottlenecks in agent capabilities and provides fine-grained feedback for improvement.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[1] Benchmarking Large Language Models as AI Research Agents
[2] MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation
[6] MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research
[22] MLGym: A New Framework and Benchmark for Advancing AI Research Agents
[40] AI Research Agents for Machine Learning: Search, Exploration, and Generalization in MLE-bench
[49] ML Research Benchmark
Contribution Analysis
Detailed comparisons for each claimed contribution
EXP-Bench benchmark for evaluating AI agents on complete research experiments
The authors present EXP-Bench, a benchmark that evaluates AI agents on their ability to conduct end-to-end AI research experimentation. Given a research question and incomplete starter code, agents must formulate hypotheses, design and implement experimental procedures, execute them, and analyze results. The benchmark comprises 461 tasks from 51 top-tier AI papers.
[3] PaperBench: Evaluating AI's Ability to Replicate AI Research
[2] MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation
[17] Deep Research Bench: Evaluating AI Web Research Agents
[22] MLGym: A New Framework and Benchmark for Advancing AI Research Agents
[25] AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite
[27] ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows
[51] GAIA: A Benchmark for General AI Assistants
[52] AgentClinic: A Multimodal Agent Benchmark to Evaluate AI in Simulated Clinical Environments
[53] BEARCUBS: A Benchmark for Computer-Using Web Agents
[54] Finance Agent Benchmark: Benchmarking LLMs on Real-World Financial Research Tasks
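To make the task setup concrete, a benchmark entry of the kind described above pairs a research question and an incomplete codebase with hidden ground truth for grading. The following sketch is illustrative only: the class and field names are assumptions, not EXP-Bench's published schema.

```python
from dataclasses import dataclass

@dataclass
class ResearchTask:
    """One benchmark task (illustrative fields, not the actual schema)."""
    paper_id: str                  # source paper the task was extracted from
    research_question: str         # what the agent must investigate
    starter_code: dict[str, str]   # incomplete codebase: path -> source
    # Ground truth used for grading, hidden from the agent:
    reference_design: str          # the paper's experimental procedure
    reference_conclusion: str      # the result the paper reports

@dataclass
class AgentSubmission:
    """What the agent must produce for each task."""
    hypothesis: str                 # formulated hypothesis
    design: str                     # proposed experimental procedure
    implementation: dict[str, str]  # completed codebase: path -> source
    conclusion: str                 # analysis of the executed results
```

Each of the agent-facing fields maps onto one of the graded dimensions (design, implementation, execution, conclusion) discussed below.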
Semi-automated pipeline for extracting research tasks from papers and code
The authors develop a semi-automated dataset curation pipeline that systematically extracts and structures experimental tasks from research papers and their codebases. The pipeline combines multi-modal extraction (from papers, supplementary materials, and code) with implementation extraction and execution-based validation, enabling scalable construction of high-fidelity research tasks.
[65] Round Trip: An Automated Pipeline for Experimental Design, Execution, and Analysis
[66] CodeScientist: End-to-End Semi-Automated Scientific Discovery with Code-Based Experimentation
[67] FREEDA: An Automated Computational Pipeline Guides Experimental Testing of Protein Innovation
[68] ChemActor: Enhancing Automated Extraction of Chemical Synthesis Actions with LLM-Generated Data
[69] Automating Data Lineage and Pipeline Extraction
[70] Automated Extraction of Chemical Synthesis Actions from Experimental Procedures
[71] Reproducible Experiments for Generating Pre-Processing Pipelines for AutoETL
[72] Introducing RELAX (the Reduction of Electroencephalographic Artifacts): A Fully Automated Pre-Processing Pipeline for Cleaning EEG Data, Part 1: Algorithm and …
[73] Experiment Specification, Capture and Laboratory Automation Technology (ESCALATE): A Software Pipeline for Automated Chemical Experimentation and Data …
[74] Enteroflow: Automated Pipeline for In Silico Characterization of
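The extract-then-validate shape of the curation pipeline can be sketched as two stages: candidate extraction from paper text and code, followed by an execution-based filter that keeps only tasks whose reference implementation actually runs. Everything below is an assumption-laden toy (the function names, the keyword heuristic, and the `runner` stand-in for a sandbox are all invented for illustration), not EXP-Bench's implementation.

```python
def extract_candidate_tasks(paper_text: str,
                            code_files: dict[str, str]) -> list[dict]:
    """Multi-modal extraction: pair experiment mentions in the paper with
    runnable scripts in the codebase (here, a trivial keyword match)."""
    tasks = []
    for line in paper_text.splitlines():
        if "experiment" in line.lower():
            for path, src in code_files.items():
                if "def main" in src:  # crude proxy for "runnable entry point"
                    tasks.append({"question": line.strip(), "entry_point": path})
    return tasks

def validate_by_execution(task: dict, runner) -> bool:
    """Execution-based validation: keep only tasks whose reference
    implementation runs without error (`runner` stands in for a sandbox)."""
    try:
        runner(task["entry_point"])
        return True
    except Exception:
        return False

def build_dataset(paper_text, code_files, runner) -> list[dict]:
    """Semi-automated curation: extract candidates, then filter by execution.
    (The 'semi' part -- human review of survivors -- is omitted here.)"""
    return [t for t in extract_candidate_tasks(paper_text, code_files)
            if validate_by_execution(t, runner)]
```

The design point the sketch captures is that validation is behavioral rather than textual: a candidate task survives only if its reference code executes, which is what gives the resulting tasks their fidelity.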
Multi-metric evaluation framework for assessing experimental capabilities
The authors introduce a comprehensive evaluation framework that assesses AI agents across multiple dimensions of the research process, including experimental design correctness, implementation completeness, code execution success, and conclusion validity. This conjunctive evaluation approach reveals specific bottlenecks in agent capabilities and provides fine-grained feedback for improvement.