Unleashing Scientific Reasoning for Bio-experimental Protocol Generation via Structured Component-based Reward Mechanism

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: AI for Biology, Protocol Generation, Scientific Reasoning, Large Language Model, Reinforcement Learning
Abstract:

The foundation of reproducible science lies in protocols that are precise, logically ordered, and executable. The autonomous generation of these protocols through natural language queries could greatly improve the efficiency of the reproduction process. However, current leading large language models (LLMs) often generate incomplete or inconsistent protocols, limiting their utility. To address this limitation, we first introduce SciRecipe, a large-scale dataset of over 12K structured protocols spanning 27 biological subfields and encompassing both comprehension and problem-solving tasks. To further improve protocol generation, we propose the "Sketch-and-Fill" paradigm, which separates analysis, structuring, and expression to ensure each step is explicit and verifiable. Complementing this, the structured component-based reward mechanism evaluates step granularity, action order, and semantic fidelity, aligning model optimization with experimental reliability. Building on these components, we develop Thoth, trained through a staged Knowledge-to-Action process that progresses from knowledge acquisition to operational reasoning and ultimately to robust, executable protocol generation. Across multiple benchmarks, Thoth consistently surpasses both proprietary and open-source LLMs, achieving significant improvements in step alignment, logical sequencing, and semantic accuracy. Our approach paves the way for reliable scientific assistants that bridge knowledge with experimental execution.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Thoth, a system for generating executable biological protocols from natural language queries, alongside the SciRecipe dataset spanning 27 biological subfields. Within the taxonomy, it resides in the 'LLM-Based Protocol Generation from Natural Language' leaf, which contains only three papers total. This is a relatively sparse research direction compared to broader branches like 'Autonomous Laboratory Systems' (11 papers) or 'Self-Driving Laboratory Infrastructures' (4 papers). The two sibling papers—BioPlanner and a survey on automating biomedical discovery—focus on planning and broader automation strategies, suggesting limited direct competition in reward-driven protocol refinement.

The taxonomy reveals that neighboring leaves address complementary challenges: 'Hardware-Specific Robotic Scripting' targets platform-specific execution (3 papers), while 'Protocol Formalization and Standardization Languages' emphasizes machine-readable representations (3 papers). The paper's Sketch-and-Fill paradigm bridges these concerns by decomposing protocol generation into analysis, structuring, and expression phases. Unlike the 'Expert-Level Protocol Translation Systems' branch (2 papers), which assumes human-readable inputs, this work starts from natural language queries. The structured reward mechanism connects to 'Experimental Design Optimization' themes but remains distinct by focusing on protocol correctness rather than parameter inference or hypothesis testing.

Among the 29 candidate papers examined, the Sketch-and-Fill paradigm has one refutable candidate among its 10 retrieved matches, indicating some prior work on decomposed reasoning approaches. The SciRecipe dataset (0 refutations from 10 candidates) and the SCORE mechanism (0 from 9 candidates) appear more novel within this limited search scope. These statistics suggest that while the dataset and reward framework may represent fresh contributions, the staged reasoning approach has at least one overlapping precedent among the top-30 semantic matches. The scale of examination (29 papers across the entire field) means these findings reflect proximity within a focused literature sample, not exhaustive coverage.

Given the sparse population of the LLM-based protocol generation leaf and the limited search scope, the work appears to occupy a relatively open niche. The combination of a large-scale dataset, decomposed reasoning, and component-based rewards distinguishes it from sibling papers focused on planning or surveys. However, the single refutable candidate for Sketch-and-Fill warrants attention to how the staged decomposition differs from prior hierarchical or modular approaches. The analysis covers the top-30 semantic matches and does not claim completeness across the protocol generation literature.

Taxonomy

Core-task Taxonomy Papers: 29
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 1

Research Landscape Overview

Core task: autonomous generation of executable biological experimental protocols. The field encompasses efforts to translate high-level scientific goals into machine-readable instructions that robotic systems can execute.

The taxonomy reveals five main branches. AI-Driven Protocol Generation and Translation focuses on leveraging large language models and structured representations to convert natural language descriptions into formal protocols, with works like BioPlanner[19] and Automating Biomedical Discovery[22] exemplifying LLM-based approaches. Autonomous Laboratory Systems and Platforms addresses the physical infrastructure and integration challenges, including liquid handling automation (Autonomous Liquid Handling[4]) and end-to-end self-driving labs (Self-Driving Labs Review[2], BioLab[3]). Experimental Design Optimization and Decision-Making explores adaptive strategies for selecting experiments, often using reinforcement learning or Bayesian methods (Deep RL Experimental Design[14], Autonomous Reaction Development[15]). Data Infrastructure and Knowledge Management tackles the challenge of organizing protocols and experimental metadata into reusable repositories (Genesis-DB[17], Biological Protocol Language[18]). Cross-Cutting Reviews and Methodological Surveys provide broader perspectives on autonomous discovery (Autonomous Molecular Discovery[6], AI Drug Discovery[10]).

Recent activity highlights a tension between end-to-end automation and modular, human-in-the-loop designs. Some systems aim for fully autonomous cycles (AI-Native Biomolecular Lab[28], Modular Autonomous Experimentation[20]), while others emphasize expert-guided translation (Expert Protocol Translation[16]) or hierarchical decomposition (Hierarchical Protocol Design[12]).
Structured Component Reward[0] sits within the LLM-Based Protocol Generation cluster alongside BioPlanner[19] and Automating Biomedical Discovery[22], but distinguishes itself by introducing a reward-based framework to refine protocol generation, addressing the challenge of ensuring executability and correctness. Compared to BioPlanner[19], which focuses on planning from natural language, and Automating Biomedical Discovery[22], which surveys broader automation strategies, Structured Component Reward[0] emphasizes learning from feedback to improve protocol quality, bridging the gap between initial generation and reliable execution.

Claimed Contributions

SciRecipe dataset for protocol generation

The authors curate SciRecipe, a large-scale multi-task dataset containing over 12,000 structured experimental protocols across 27 biological subfields. The dataset covers both Protocol-Comprehension tasks (overview and specific analysis) and Problem-Solving tasks (retrieval, planning, troubleshooting, constraint, scaling, and safety), designed to serve as a foundation for training and evaluating protocol generation systems.

10 retrieved papers
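To make the dataset's multi-task structure concrete, the sketch below shows what a single SciRecipe-style record might look like. The field names, example content, and task-type vocabulary are illustrative assumptions based on the task categories listed above, not the dataset's actual schema.

```python
import json

# Hypothetical SciRecipe-style record; field names are assumptions for
# illustration, not the dataset's published format.
record = {
    "subfield": "molecular cloning",
    "task_type": "troubleshooting",
    "query": "My ligation reaction yields no colonies after transformation. "
             "What should I check?",
    "protocol": [
        "Verify the insert:vector molar ratio (commonly around 3:1).",
        "Confirm the ligase buffer contains fresh ATP.",
        "Re-transform with an uncut vector control to test the competent cells.",
    ],
}

# The paper splits tasks into comprehension and problem-solving categories.
COMPREHENSION = {"overview", "specific-analysis"}
PROBLEM_SOLVING = {"retrieval", "planning", "troubleshooting",
                   "constraint", "scaling", "safety"}

assert record["task_type"] in COMPREHENSION | PROBLEM_SOLVING
print(json.dumps(record, indent=2))
```

A record of this shape supports both training (query as input, protocol as target) and evaluation (task type selects the scoring rubric).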
Sketch-and-Fill reasoning paradigm

The authors introduce a structured reasoning framework that decomposes protocol generation into three stages: reasoning (think), structuring key information into machine-readable JSON format (key), and expressing steps in natural language (orc). This paradigm ensures that each experimental step is explicit, verifiable, and maintains one-to-one correspondence between structured and natural language representations.

10 retrieved papers
Can Refute
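The three-stage output described above can be sketched as a single structured response. The stage names (think, key, orc) follow the description; the schema of the key-stage entries (action/objects/parameters fields) and the PCR example content are assumptions for illustration.

```python
import json

# Illustrative Sketch-and-Fill output for one query. Stage names follow the
# paper's description; the key-stage schema shown here is an assumption.
response = {
    "think": "PCR amplification needs template, primers, and polymerase; "
             "set up the mix, then denature before cycling.",
    "key": [
        {"step": 1, "action": "combine",
         "objects": ["template DNA", "primers", "polymerase", "buffer"],
         "parameters": {"final_volume": "50 uL"}},
        {"step": 2, "action": "denature",
         "objects": ["reaction mix"],
         "parameters": {"temperature": "95 C", "time": "30 s"}},
    ],
    "orc": [
        "Combine template DNA, primers, polymerase, and buffer to a final "
        "volume of 50 uL.",
        "Denature the reaction mix at 95 C for 30 s.",
    ],
}

# The paradigm's core invariant: one-to-one correspondence between the
# structured ("key") and natural-language ("orc") representations.
assert len(response["key"]) == len(response["orc"])
print(json.dumps(response["key"], indent=2))
```

Because each key entry is machine-readable JSON, the correspondence check above is exactly the kind of per-step verification the paradigm is designed to enable.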
SCORE mechanism for protocol evaluation and training

The authors propose the Structured COmponent-based REward (SCORE) mechanism, which provides both a training reward signal and evaluation framework. SCORE jointly measures three dimensions: step granularity (controlling scale and avoiding redundancy), action ordering (ensuring logically consistent sequences), and semantic fidelity (verifying alignment of actions, objects, and parameters), moving beyond conventional text-based metrics to assess experimental executability.

9 retrieved papers
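The three SCORE dimensions can be sketched as a toy reward function over predicted and reference step lists. The greedy string-similarity matching (via difflib), the specific formulas for each component, and the equal weighting are all assumptions for illustration, not the paper's actual SCORE implementation.

```python
from difflib import SequenceMatcher

def step_alignment(pred_steps, ref_steps, threshold=0.5):
    """Greedily match each predicted step to an unused reference step
    by surface text similarity."""
    matches, used = [], set()
    for i, p in enumerate(pred_steps):
        best, best_j = 0.0, None
        for j, r in enumerate(ref_steps):
            if j in used:
                continue
            s = SequenceMatcher(None, p.lower(), r.lower()).ratio()
            if s > best:
                best, best_j = s, j
        if best_j is not None and best >= threshold:
            matches.append((i, best_j, best))
            used.add(best_j)
    return matches

def score_reward(pred_steps, ref_steps, weights=(1/3, 1/3, 1/3)):
    """Toy SCORE-style reward combining granularity, ordering, and fidelity."""
    matches = step_alignment(pred_steps, ref_steps)
    if not matches:
        return 0.0
    # Step granularity: penalize over- or under-segmentation.
    granularity = min(len(pred_steps), len(ref_steps)) / \
                  max(len(pred_steps), len(ref_steps))
    # Action ordering: fraction of matched step pairs in the same
    # relative order as the reference.
    ref_order = [j for _, j, _ in matches]  # already sorted by pred index
    total_pairs = len(ref_order) * (len(ref_order) - 1) // 2
    concordant = sum(1 for a in range(len(ref_order))
                     for b in range(a + 1, len(ref_order))
                     if ref_order[a] < ref_order[b])
    ordering = concordant / total_pairs if total_pairs else 1.0
    # Semantic fidelity: mean similarity of matched steps, scaled by
    # coverage of the reference protocol.
    fidelity = sum(s for _, _, s in matches) / len(ref_steps)
    w_g, w_o, w_f = weights
    return w_g * granularity + w_o * ordering + w_f * fidelity

ref = ["Add 10 uL of lysis buffer to the sample",
       "Incubate at 37 C for 30 minutes",
       "Centrifuge at 5000 g for 5 minutes"]
print(score_reward(ref, ref))                        # perfect match
print(score_reward([ref[2], ref[0], ref[1]], ref))   # reordered steps
```

Under this sketch, a perfect protocol scores 1.0, while reordering or dropping steps lowers only the affected component, which is what makes a component-wise reward more diagnostic than a single text-overlap metric.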

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

SciRecipe dataset for protocol generation

The authors curate SciRecipe, a large-scale multi-task dataset containing over 12,000 structured experimental protocols across 27 biological subfields. The dataset covers both Protocol-Comprehension tasks (overview and specific analysis) and Problem-Solving tasks (retrieval, planning, troubleshooting, constraint, scaling, and safety), designed to serve as a foundation for training and evaluating protocol generation systems.

Contribution

Sketch-and-Fill reasoning paradigm

The authors introduce a structured reasoning framework that decomposes protocol generation into three stages: reasoning (think), structuring key information into machine-readable JSON format (key), and expressing steps in natural language (orc). This paradigm ensures that each experimental step is explicit, verifiable, and maintains one-to-one correspondence between structured and natural language representations.

Contribution

SCORE mechanism for protocol evaluation and training

The authors propose the Structured COmponent-based REward (SCORE) mechanism, which provides both a training reward signal and evaluation framework. SCORE jointly measures three dimensions: step granularity (controlling scale and avoiding redundancy), action ordering (ensuring logically consistent sequences), and semantic fidelity (verifying alignment of actions, objects, and parameters), moving beyond conventional text-based metrics to assess experimental executability.
