Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning
Overview
Overall Novelty Assessment
PaperCoder introduces a multi-agent LLM framework for transforming machine learning papers into executable code repositories through planning, analysis, and generation stages. The taxonomy places this work in the 'Multi-Agent Framework Approaches' leaf under 'Scientific Paper-to-Code Synthesis', which contains only two papers total. This represents a relatively sparse research direction within the broader field of code generation, suggesting the specific combination of multi-agent architecture and scientific paper parsing remains an emerging area with limited prior exploration.
The taxonomy reveals that PaperCoder sits within a specialized branch distinct from broader 'Natural Language-to-Code Synthesis' approaches that handle general programming tasks. Neighboring leaves include 'Algorithm Reproduction and Replication' (three papers focused on benchmarking reproducibility) and 'Autonomous Scientific Discovery' (one paper on fully automated research). The scope note for the multi-agent leaf explicitly excludes single-model approaches, positioning PaperCoder's orchestrated agent collaboration as a defining architectural choice that differentiates it from monolithic end-to-end systems in adjacent research directions.
Of ten candidate papers examined per contribution (drawn from thirty semantically similar candidates overall), the multi-agent framework contribution shows one refutable candidate and the three-stage approach shows two, while the Paper2Code benchmark shows none. These counts reflect a limited semantic-search scope rather than exhaustive coverage: within the top thirty most similar works, the benchmark component encounters less direct prior overlap than the architectural contributions, which face modest but non-negligible precedent in the examined literature.
Based on the limited search scope of thirty semantically similar candidates, PaperCoder appears to occupy a sparsely populated research niche combining multi-agent orchestration with scientific paper parsing. The benchmark contribution shows stronger novelty signals than the architectural components within the examined set, though the analysis does not cover the full breadth of related work in code generation or scientific automation.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose PaperCoder, a framework that automatically generates complete code repositories from machine learning papers. It operates through three stages: planning (roadmap and architecture design), analysis (interpreting implementation details), and generation (producing modular code), using specialized agents that collaborate across the pipeline.
The authors introduce Paper2CodeBench, a benchmark consisting of recent machine learning papers from top-tier conferences (ICLR, ICML, NeurIPS) for evaluating automated code generation from scientific papers.
The authors develop a structured three-stage approach that mimics how human developers write repository-level code: planning creates roadmaps and architecture diagrams, analysis interprets implementation details, and generation synthesizes the complete codebase based on execution order and prior artifacts.
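The staged pipeline described above can be sketched as a sequence of agent calls, where each stage consumes the artifacts of the previous one. This is a minimal illustrative sketch only: the agent functions, file names, and return structures are hypothetical placeholders, not PaperCoder's actual implementation, and a real system would back each agent with an LLM call.

```python
# Hypothetical stand-ins for the three specialized agents; in a real
# system each function would prompt an LLM with the paper and prior artifacts.

def plan_agent(paper_text: str) -> dict:
    """Planning stage: derive a roadmap and a file-level architecture."""
    return {
        "roadmap": ["parse paper", "design modules", "write code"],
        "files": ["model.py", "train.py"],  # execution order of the codebase
    }

def analysis_agent(paper_text: str, plan: dict) -> dict:
    """Analysis stage: attach implementation notes to each planned file."""
    return {f: f"implementation details for {f} drawn from the paper"
            for f in plan["files"]}

def generation_agent(plan: dict, analyses: dict) -> dict:
    """Generation stage: emit code per file, following the plan's order
    and conditioning on the analysis artifacts."""
    return {f: f"# {analyses[f]}\n" for f in plan["files"]}

def paper_to_repo(paper_text: str) -> dict:
    """Chain the three stages: planning -> analysis -> generation."""
    plan = plan_agent(paper_text)
    analyses = analysis_agent(paper_text, plan)
    return generation_agent(plan, analyses)

repo = paper_to_repo("...paper text...")
print(sorted(repo))  # one generated stub per planned file
```

The key design point the sketch captures is that later stages are conditioned on earlier artifacts (the generation agent sees both the plan and the analyses), which is what distinguishes this staged, collaborative setup from a monolithic single-prompt approach.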
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[13] DeepCode: Open Agentic Coding
Contribution Analysis
Detailed comparisons for each claimed contribution
PaperCoder multi-agent LLM framework
The authors propose PaperCoder, a framework that automatically generates complete code repositories from machine learning papers. It operates through three stages: planning (roadmap and architecture design), analysis (interpreting implementation details), and generation (producing modular code), using specialized agents that collaborate across the pipeline.
[64] AutoP2C: An LLM-Based Agent Framework for Code Repository Generation from Multimodal Content in Academic Papers
[61] Hydra-Reviewer: A holistic multi-agent system for automatic code review comment generation
[62] AutoMisty: A Multi-Agent LLM Framework for Automated Code Generation in the Misty Social Robot
[63] MAGE: A Multi-Agent Engine for Automated RTL Code Generation
[65] Blueprint2Code: a multi-agent pipeline for reliable code generation via blueprint planning and repair
[66] Anncoder: A multi-agent-based code generation and optimization model
[67] SEW: Self-Evolving Agentic Workflows for Automated Code Generation
[68] AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing
[69] CodeCoR: An LLM-based self-reflective multi-agent framework for code generation
[70] Marco: A multi-agent system for optimizing HPC code generation using large language models
Paper2Code benchmark (Paper2CodeBench)
The authors introduce Paper2CodeBench, a benchmark consisting of recent machine learning papers from top-tier conferences (ICLR, ICML, NeurIPS) for evaluating automated code generation from scientific papers.
[51] A survey on large language models for code generation
[52] CoderEval: A benchmark of pragmatic code generation with generative pre-trained models
[53] Benchmarks and metrics for evaluations of code generation: A critical review
[54] Evaluating large language models in class-level code generation
[55] CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
[56] Assessing and advancing benchmarks for evaluating large language models in software engineering tasks
[57] BioCoder: a benchmark for bioinformatics code generation with large language models
[58] CodeT: Code generation with generated tests
[59] BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
[60] PCEBench: A multi-dimensional benchmark for evaluating large language models in parallel code generation
Three-stage structured code generation approach
The authors develop a structured three-stage approach that mimics how human developers write repository-level code: planning creates roadmaps and architecture diagrams, analysis interprets implementation details, and generation synthesizes the complete codebase based on execution order and prior artifacts.