Abstract:

Automating AI research holds immense potential for accelerating scientific progress, yet current AI agents struggle with the complexities of rigorous, end-to-end experimentation. We introduce EXP-Bench, a novel benchmark designed to systematically evaluate AI agents on complete research experiments sourced from influential AI publications. Given a research question and incomplete starter code, EXP-Bench challenges AI agents to formulate hypotheses, design and implement experimental procedures, execute them, and analyze results. To enable the creation of such intricate and authentic tasks with high fidelity, we design a semi-autonomous pipeline that extracts and structures crucial experimental details from these research papers and their associated open-source code. Using this pipeline, we curated 461 AI research tasks from 51 top-tier AI research papers for EXP-Bench. Evaluations of leading AI agents, such as OpenHands and IterativeAgent, on EXP-Bench demonstrate only partial capabilities: while scores on individual experimental aspects such as design or implementation correctness reach 20-35%, the success rate for complete, executable experiments is a mere 0.5%. By identifying these bottlenecks and providing realistic step-by-step experiment procedures, EXP-Bench serves as a vital tool for improving future AI agents' ability to conduct AI research experiments.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes each paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

EXP-Bench introduces a benchmark for evaluating AI agents on complete machine learning research experiments, from hypothesis formulation through result analysis. The paper resides in the 'Machine Learning Research Experimentation' leaf, which contains seven papers total, indicating a moderately populated research direction. This leaf sits within the broader 'End-to-End Research Experiment Automation' branch, distinguishing itself from partial automation approaches by requiring agents to handle the full research lifecycle rather than isolated subtasks like code execution or literature review.

The taxonomy reveals several neighboring research directions that contextualize this work. The sibling leaf 'Research Paper Replication and Reproduction' focuses on recreating published results rather than conducting novel experiments, while 'Autonomous Research Discovery and Hypothesis Generation' emphasizes open-ended scientific discovery without predefined workflows. Adjacent branches include 'Research Task Decomposition and Workflow Automation,' which examines how agents break down complex research into manageable subtasks, and 'Research Agent Evaluation Frameworks,' which provides standardized assessment methodologies. EXP-Bench bridges these areas by requiring both complete experimental execution and systematic evaluation across multiple capability dimensions.

Among the thirty candidate papers examined (ten per contribution), the core benchmark contribution shows some prior overlap: one of its ten candidates appears to provide refutable prior work, suggesting existing benchmarks in this space. For the semi-automated pipeline that extracts research tasks from papers, none of the ten candidates clearly refutes the approach, indicating relative methodological novelty. Similarly, none of the ten candidates for the multi-metric evaluation framework offers a clear refutation. These statistics suggest that while the benchmark concept has precedent within the limited search scope, the specific extraction pipeline and evaluation methodology may represent the more distinctive contributions within the examined literature.

Based on the top-thirty semantic matches examined, the work appears to build incrementally on established benchmark paradigms while introducing methodological refinements in task extraction and evaluation granularity. The analysis does not cover the full breadth of machine learning benchmarking literature, and the single refutable candidate for the core contribution warrants closer examination to understand the precise overlap. The relatively crowded taxonomy leaf suggests active research interest in this direction, though the specific combination of contributions may still offer value to the community.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Paper: 1

Research Landscape Overview

Core task: evaluating AI agents on conducting complete AI research experiments. The field has organized itself into several major branches that reflect different scopes and emphases. End-to-End Research Experiment Automation focuses on systems that handle the full lifecycle of research tasks, from hypothesis generation through experimental execution to result interpretation, as seen in works like MLAgentBench[2] and MLR-Bench[6]. Domain-Specific Research Automation targets particular scientific areas such as biomedical discovery (BioDiscoveryAgent[12], Empowering biomedical discovery with[8]) or traffic modeling (Automating Traffic Model Enhancement[14]), tailoring evaluation to specialized workflows. Research Task Decomposition and Workflow Automation examines how agents break down complex research into manageable subtasks, often leveraging multi-agent collaboration (Chain-of-Agents[33]) or structured pipelines (AFlow[26]). Research Agent Evaluation Frameworks and Methodologies provide standardized testbeds and metrics, including benchmarks like PaperBench[3] and ResearcherBench[34]. General-Purpose AI Agent Systems and Platforms explore broader architectures that can adapt across diverse research contexts, while Specialized Benchmarks and Evaluation Domains offer targeted assessments in areas like code generation (SciCode[32]) or web navigation (WebVoyager[20]).

Within the End-to-End Research Experiment Automation branch, a particularly active line of work centers on machine learning research experimentation, where agents must navigate the full cycle of dataset selection, model training, hyperparameter tuning, and result analysis. EXP-Bench[0] situates itself in this cluster alongside MLAgentBench[2] and MLR-Bench[6], but emphasizes comprehensive evaluation across multiple dimensions of experimental competence. Compared to MLAgentBench[2], which pioneered agent-driven ML workflows, EXP-Bench[0] appears to broaden the scope of tasks and evaluation criteria. Meanwhile, works like AIDE[38] and MLGym[22] explore complementary angles: AIDE[38] focuses on iterative debugging and refinement, and MLGym[22] on interactive learning environments. A central tension across these efforts involves balancing task realism with reproducibility: fully open-ended research scenarios can be difficult to score objectively, while overly constrained benchmarks may not capture the creative problem-solving that defines genuine research.

Claimed Contributions

EXP-Bench benchmark for evaluating AI agents on complete research experiments

The authors present EXP-Bench, a benchmark that evaluates AI agents on their ability to conduct end-to-end AI research experimentation. Given a research question and incomplete starter code, agents must formulate hypotheses, design and implement experimental procedures, execute them, and analyze results. The benchmark comprises 461 tasks from 51 top-tier AI papers.

10 retrieved papers
Can Refute
Semi-automated pipeline for extracting research tasks from papers and code

The authors develop a semi-automated dataset curation pipeline that systematically extracts and structures experimental tasks from research papers and their codebases. The pipeline combines multi-modal extraction (from papers, supplementary materials, and code) with implementation extraction and execution-based validation, enabling scalable construction of high-fidelity research tasks.

10 retrieved papers
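The staged structure described above (multi-modal extraction followed by execution-based validation) can be sketched as a simple filter pipeline. This is a minimal illustration, not the authors' actual implementation: every name here (`TaskCandidate`, `extract_from_paper`, `validate_by_execution`, the dictionary fields) is hypothetical, and real execution-based validation would run the reference code rather than perform the trivial check used below.

```python
from dataclasses import dataclass

@dataclass
class TaskCandidate:
    question: str        # research question mined from the paper text
    starter_code: str    # incomplete code sliced from the repository
    ground_truth: str    # expected result used for grading

def extract_from_paper(paper: dict) -> list[TaskCandidate]:
    """Stage 1 (hypothetical): extract candidate tasks from a paper record."""
    return [TaskCandidate(q, paper["code_stub"], paper["result"])
            for q in paper["questions"]]

def validate_by_execution(candidate: TaskCandidate) -> bool:
    """Stage 2 (hypothetical): keep only tasks that could actually be run.
    A real pipeline would execute the reference solution; here we only
    check that the required pieces are present."""
    return bool(candidate.starter_code) and bool(candidate.ground_truth)

def curate(papers: list[dict]) -> list[TaskCandidate]:
    """Chain extraction and validation, discarding non-executable tasks."""
    tasks = []
    for paper in papers:
        for cand in extract_from_paper(paper):
            if validate_by_execution(cand):
                tasks.append(cand)
    return tasks

# Toy input standing in for a parsed paper + repository.
papers = [{"questions": ["Does ablation X hurt accuracy?"],
           "code_stub": "def train(): ...",
           "result": "accuracy drops 2.1%"}]
print(len(curate(papers)))  # → 1
```

The key design point this sketch captures is that validation acts as a hard filter between extraction and inclusion, which is what makes the resulting tasks "high-fidelity" in the sense the contribution claims.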
Multi-metric evaluation framework for assessing experimental capabilities

The authors introduce a comprehensive evaluation framework that assesses AI agents across multiple dimensions of the research process, including experimental design correctness, implementation completeness, code execution success, and conclusion validity. This conjunctive evaluation approach reveals specific bottlenecks in agent capabilities and provides fine-grained feedback for improvement.

10 retrieved papers
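The conjunctive nature of this evaluation explains why moderate per-aspect scores can coexist with a near-zero full-success rate. The following sketch is purely illustrative (the class, field names, and toy data are assumptions, not the benchmark's actual scoring code): an attempt counts as a full success only when every aspect passes, so the conjunctive rate is bounded above by the weakest per-aspect rate and is typically far lower.

```python
from dataclasses import dataclass

@dataclass
class AttemptScores:
    design_correct: bool          # experimental design matches the target
    implementation_complete: bool # required code was fully written
    execution_succeeded: bool     # the experiment actually ran
    conclusion_valid: bool        # the drawn conclusion is supported

    def full_success(self) -> bool:
        # Conjunctive criterion: all aspects must hold simultaneously.
        return (self.design_correct
                and self.implementation_complete
                and self.execution_succeeded
                and self.conclusion_valid)

# Toy attempts: individual aspects often pass, but rarely all at once.
attempts = [
    AttemptScores(True, True, False, True),
    AttemptScores(True, False, False, False),
    AttemptScores(True, True, True, True),
    AttemptScores(False, True, False, True),
]

design_rate = sum(a.design_correct for a in attempts) / len(attempts)
full_rate = sum(a.full_success() for a in attempts) / len(attempts)
print(design_rate, full_rate)  # → 0.75 0.25
```

The same arithmetic, at benchmark scale, mirrors the gap the paper reports between 20-35% per-aspect scores and the 0.5% rate for complete, executable experiments.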

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

EXP-Bench benchmark for evaluating AI agents on complete research experiments

The authors present EXP-Bench, a benchmark that evaluates AI agents on their ability to conduct end-to-end AI research experimentation. Given a research question and incomplete starter code, agents must formulate hypotheses, design and implement experimental procedures, execute them, and analyze results. The benchmark comprises 461 tasks from 51 top-tier AI papers.

Contribution

Semi-automated pipeline for extracting research tasks from papers and code

The authors develop a semi-automated dataset curation pipeline that systematically extracts and structures experimental tasks from research papers and their codebases. The pipeline combines multi-modal extraction (from papers, supplementary materials, and code) with implementation extraction and execution-based validation, enabling scalable construction of high-fidelity research tasks.

Contribution

Multi-metric evaluation framework for assessing experimental capabilities

The authors introduce a comprehensive evaluation framework that assesses AI agents across multiple dimensions of the research process, including experimental design correctness, implementation completeness, code execution success, and conclusion validity. This conjunctive evaluation approach reveals specific bottlenecks in agent capabilities and provides fine-grained feedback for improvement.