Unleashing Scientific Reasoning for Bio-experimental Protocol Generation via Structured Component-based Reward Mechanism
Overview
Overall Novelty Assessment
The paper introduces Thoth, a system for generating executable biological protocols from natural language queries, alongside the SciRecipe dataset spanning 27 biological subfields. Within the taxonomy, it resides in the 'LLM-Based Protocol Generation from Natural Language' leaf, which contains only three papers total. This is a relatively sparse research direction compared to broader branches like 'Autonomous Laboratory Systems' (11 papers) or 'Self-Driving Laboratory Infrastructures' (4 papers). The two sibling papers—BioPlanner and a survey on automating biomedical discovery—focus on planning and broader automation strategies, suggesting limited direct competition in reward-driven protocol refinement.
The taxonomy reveals that neighboring leaves address complementary challenges: 'Hardware-Specific Robotic Scripting' targets platform-specific execution (3 papers), while 'Protocol Formalization and Standardization Languages' emphasizes machine-readable representations (3 papers). The paper's Sketch-and-Fill paradigm bridges these concerns by decomposing protocol generation into analysis, structuring, and expression phases. Unlike the 'Expert-Level Protocol Translation Systems' branch (2 papers), which assumes human-readable inputs, this work starts from natural language queries. The structured reward mechanism connects to 'Experimental Design Optimization' themes but remains distinct by focusing on protocol correctness rather than parameter inference or hypothesis testing.
Across the 29 candidate papers examined, the Sketch-and-Fill paradigm drew one refutation from its 10 closest matches, indicating some prior work on decomposed reasoning approaches. The SciRecipe dataset (0 refutations from 10 candidates) and the SCORE mechanism (0 from 9) appear more novel within this limited search scope. These statistics suggest that while the dataset and reward framework may represent fresh contributions, the staged reasoning approach has at least one overlapping precedent among the top-30 semantic matches. Because the examination covers only 29 papers, these findings reflect proximity within a focused literature sample, not exhaustive coverage.
Given the sparse population of the LLM-based protocol generation leaf and the limited search scope, the work appears to occupy a relatively open niche. The combination of a large-scale dataset, decomposed reasoning, and component-based rewards distinguishes it from sibling papers focused on planning or surveys. However, the single refutation for Sketch-and-Fill warrants attention to how the staged decomposition differs from prior hierarchical or modular approaches. The analysis covers top-30 semantic matches and does not claim completeness across all protocol generation literature.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors curate SciRecipe, a large-scale multi-task dataset containing over 12,000 structured experimental protocols across 27 biological subfields. The dataset covers both Protocol-Comprehension tasks (overview and specific analysis) and Problem-Solving tasks (retrieval, planning, troubleshooting, constraint, scaling, and safety), designed to serve as a foundation for training and evaluating protocol generation systems.
The authors introduce a structured reasoning framework that decomposes protocol generation into three stages: reasoning (think), structuring key information into machine-readable JSON format (key), and expressing steps in natural language (orc). This paradigm ensures that each experimental step is explicit, verifiable, and maintains one-to-one correspondence between structured and natural language representations.
The authors propose the Structured COmponent-based REward (SCORE) mechanism, which provides both a training reward signal and evaluation framework. SCORE jointly measures three dimensions: step granularity (controlling scale and avoiding redundancy), action ordering (ensuring logically consistent sequences), and semantic fidelity (verifying alignment of actions, objects, and parameters), moving beyond conventional text-based metrics to assess experimental executability.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
Contribution Analysis
Detailed comparisons for each claimed contribution
SciRecipe dataset for protocol generation
The authors curate SciRecipe, a large-scale multi-task dataset containing over 12,000 structured experimental protocols across 27 biological subfields. The dataset covers both Protocol-Comprehension tasks (overview and specific analysis) and Problem-Solving tasks (retrieval, planning, troubleshooting, constraint, scaling, and safety), designed to serve as a foundation for training and evaluating protocol generation systems.
[10] AI Agents in Drug Discovery
[19] BioPlanner: automatic evaluation of LLMs on protocol planning in biology
[39] BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning
[40] SciKnowEval: A Comprehensive Dataset for Evaluating Scientific Knowledge of Large Language Models
[41] BioAutoMATED: an end-to-end automated machine learning tool for explanation and design of biological sequences
[42] Deep learning for biology
[43] Towards expert-level autonomous carotid ultrasonography with large-scale learning-based robotic system
[44] A large and consistent phylogenomic dataset supports sponges as the sister group to all other animals
[45] BioInformatics Agent (BIA): Unleashing the Power of Large Language Models to Reshape Bioinformatics Workflow
[46] Lock3DFace: A large-scale database of low-cost Kinect 3D faces
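To make the dataset's task taxonomy concrete, the following is a hypothetical sketch of what a SciRecipe-style record might look like. The schema, field names, and example values are illustrative assumptions; the source specifies only the two task families and their eight task types, not a record format.

```python
# Hypothetical SciRecipe-style record; field names are assumptions for
# illustration. Only the task families below come from the paper's description.

TASK_TYPES = {
    "protocol_comprehension": ["overview", "specific_analysis"],
    "problem_solving": ["retrieval", "planning", "troubleshooting",
                        "constraint", "scaling", "safety"],
}

record = {
    "subfield": "molecular_cloning",   # one of the 27 biological subfields
    "task_type": "troubleshooting",    # drawn from the Problem-Solving family
    "query": "My ligation yields no colonies; how should I adjust the protocol?",
    "protocol_steps": [
        "Verify the insert-to-vector molar ratio (a 3:1 ratio is a common starting point).",
        "Repeat the ligation at 16 C overnight instead of at room temperature.",
    ],
}

# A well-formed record's task type must belong to one of the two families.
assert record["task_type"] in TASK_TYPES["problem_solving"]
```

The eight task types across the two families match the breakdown given in the contribution description above.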
Sketch-and-Fill reasoning paradigm
The authors introduce a structured reasoning framework that decomposes protocol generation into three stages: reasoning (think), structuring key information into machine-readable JSON format (key), and expressing steps in natural language (orc). This paradigm ensures that each experimental step is explicit, verifiable, and maintains one-to-one correspondence between structured and natural language representations.
[47] StructGPT: A General Framework for Large Language Model to Reason over Structured Data
[48] Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning
[49] Elevating Legal LLM Responses: Harnessing Trainable Logical Structures and Semantic Knowledge with Legal Reasoning
[50] Structured path guidance for logical coherence in large language model generation
[51] Generating Structured Plan Representation of Procedures with LLMs
[52] HiRA: A Hierarchical Reasoning Framework for Decoupled Planning and Execution in Deep Search
[53] Structured prompting and feedback-guided reasoning with LLMs for data interpretation
[54] RATT: A Thought Structure for Coherent and Correct LLM Reasoning
[55] Continuum-interaction-driven intelligence: Human-aligned neural architecture via crystallized reasoning and fluid generation
[56] A Retrieve-and-Edit Framework for Predicting Structured Outputs
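The one-to-one correspondence that the paradigm enforces between the structured (key) and natural-language (orc) representations can be sketched as a validation check. The field names (`think`, `key`, `orc`, `action`, `object`, `params`) and the PCR example are illustrative assumptions, not the paper's actual schema.

```python
# Illustrative sketch of the Sketch-and-Fill three-stage output structure.
# Field names and validation logic are assumptions for illustration.

def validate_sketch_and_fill(output: dict) -> bool:
    """Check one-to-one correspondence between the structured (key) and
    natural-language (orc) representations of a protocol."""
    key_steps = output["key"]   # machine-readable JSON steps
    orc_steps = output["orc"]   # natural-language step descriptions
    if len(key_steps) != len(orc_steps):
        return False            # correspondence must be one-to-one
    # Every structured step must make its action, object, and parameters explicit.
    return all({"action", "object", "params"} <= step.keys() for step in key_steps)

protocol = {
    "think": "PCR amplification requires denaturation before annealing.",
    "key": [
        {"action": "denature", "object": "template DNA",
         "params": {"temp_C": 95, "time_s": 30}},
        {"action": "anneal", "object": "primers",
         "params": {"temp_C": 55, "time_s": 30}},
    ],
    "orc": [
        "Denature the template DNA at 95 C for 30 seconds.",
        "Anneal the primers at 55 C for 30 seconds.",
    ],
}
print(validate_sketch_and_fill(protocol))  # True
```

A mismatched step count (for example, two key steps but one orc sentence) fails the check, which is what makes each step verifiable in isolation.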
SCORE mechanism for protocol evaluation and training
The authors propose the Structured COmponent-based REward (SCORE) mechanism, which provides both a training reward signal and evaluation framework. SCORE jointly measures three dimensions: step granularity (controlling scale and avoiding redundancy), action ordering (ensuring logically consistent sequences), and semantic fidelity (verifying alignment of actions, objects, and parameters), moving beyond conventional text-based metrics to assess experimental executability.
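The three dimensions can be sketched as a composite reward. The sub-score definitions, weights, and matching heuristics below are illustrative assumptions in the spirit of SCORE, not the paper's actual formulation.

```python
# Hedged sketch of a component-based reward: step granularity, action
# ordering, and semantic fidelity. Weights and heuristics are assumptions.

def granularity_score(pred_steps, gold_steps):
    """Penalize over- or under-segmentation relative to the reference."""
    return min(len(pred_steps), len(gold_steps)) / max(len(pred_steps), len(gold_steps))

def ordering_score(pred_actions, gold_actions):
    """Fraction of shared action pairs appearing in the same relative order."""
    shared = [a for a in gold_actions if a in pred_actions]
    if len(shared) < 2:
        return 1.0
    pairs = [(shared[i], shared[j])
             for i in range(len(shared)) for j in range(i + 1, len(shared))]
    concordant = sum(pred_actions.index(a) < pred_actions.index(b) for a, b in pairs)
    return concordant / len(pairs)

def fidelity_score(pred_steps, gold_steps):
    """Fraction of reference (action, object) pairs recovered in the prediction."""
    pred = {(s["action"], s["object"]) for s in pred_steps}
    gold = {(s["action"], s["object"]) for s in gold_steps}
    return len(pred & gold) / len(gold)

def score_reward(pred_steps, gold_steps, weights=(0.3, 0.3, 0.4)):
    """Combine the three dimensions into a single scalar reward signal."""
    w_g, w_o, w_f = weights
    pred_actions = [s["action"] for s in pred_steps]
    gold_actions = [s["action"] for s in gold_steps]
    return (w_g * granularity_score(pred_steps, gold_steps)
            + w_o * ordering_score(pred_actions, gold_actions)
            + w_f * fidelity_score(pred_steps, gold_steps))

gold = [{"action": "denature", "object": "template DNA"},
        {"action": "anneal", "object": "primers"},
        {"action": "extend", "object": "strands"}]
swapped = [gold[1], gold[0], gold[2]]  # correct steps, wrong order
print(score_reward(gold, gold))        # perfect match scores 1.0
print(score_reward(swapped, gold))     # ordering penalty lowers the score
```

The point the sketch illustrates is that a text-similarity metric would score the swapped protocol nearly identically to the gold one, while a component-based reward penalizes the broken action order directly.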