AutoBio: A Simulation and Benchmark for Robotic Automation in Digital Biology Laboratory

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: robotics, robot learning, vision-language-action model, biology experimental operation, AI for science
Abstract:

Vision-language-action (VLA) models have shown promise as generalist robotic policies by jointly leveraging visual, linguistic, and proprioceptive modalities to generate action trajectories. While recent benchmarks have advanced VLA research in domestic tasks, professional science-oriented domains remain underexplored. We introduce AutoBio, a simulation framework and benchmark designed to evaluate robotic automation in biology laboratory environments—an application domain that combines structured protocols with demanding precision and multimodal interaction. AutoBio extends existing simulation capabilities through a pipeline for digitizing real-world laboratory instruments, specialized physics plugins for mechanisms ubiquitous in laboratory workflows, and a rendering stack that supports dynamic instrument interfaces and transparent materials through physically based rendering. Our benchmark comprises biologically grounded tasks spanning three difficulty levels, enabling standardized evaluation of language-guided robotic manipulation in experimental protocols. We provide infrastructure for demonstration generation and seamless integration with VLA models. Baseline evaluations with state-of-the-art VLA models reveal significant gaps in precision manipulation, visual reasoning, and instruction following in scientific workflows. By releasing AutoBio, we aim to catalyze research on generalist robotic systems for complex, high-precision, and multimodal professional environments.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces AutoBio, a simulation framework and benchmark for evaluating vision-language-action models in biology laboratory automation. Within the taxonomy, it occupies the 'Simulation and Benchmarking Frameworks' leaf under 'Software and Computational Infrastructure'. Notably, this leaf contains only one paper—AutoBio itself—indicating a sparse research direction. The broader parent branch includes six papers on AI/ML for laboratory automation and six on workflow orchestration, but no other work explicitly focuses on simulation-based benchmarking for robotic biology tasks.

The taxonomy reveals that most related work concentrates on physical robotic platforms (eight general-purpose systems, four specialized systems) or AI-driven workflow tools (six papers applying LLMs and machine learning to protocol generation). AutoBio bridges these areas by providing a virtual testbed for evaluating VLA models before physical deployment. Its closest conceptual neighbors are AI/ML papers like 'LLMs Robotic Scripts' and 'Design Build Test Learn', which explore computational approaches to laboratory automation but do not offer standardized simulation environments or benchmarks for systematic evaluation.

Among thirty candidates examined across three contributions, none were identified as clearly refuting AutoBio's claims. The simulator contribution examined ten candidates with zero refutable overlaps; the benchmark contribution similarly found no prior work providing biologically grounded VLA evaluation tasks in simulated laboratory settings; and the systematic VLA evaluation examined ten candidates without encountering existing assessments of vision-language-action models in scientific domains. This suggests that within the limited search scope, the combination of biology-specific simulation infrastructure, standardized benchmarking tasks, and VLA model evaluation represents a relatively unexplored intersection.

The analysis reflects a top-30 semantic search plus citation expansion, not an exhaustive literature review. While the taxonomy shows active research in physical laboratory robotics and AI-driven protocol generation, the specific niche of simulation-based benchmarking for VLA models in biology appears underrepresented. The absence of sibling papers in the same taxonomy leaf and the lack of refutable candidates across all contributions suggest novelty within the examined scope, though broader searches in robotics simulation or general VLA benchmarking domains may reveal additional context.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 30
Refutable papers: 0

Research Landscape Overview

Core task: robotic automation in digital biology laboratory environments.

The field encompasses a diverse set of concerns spanning physical hardware, from liquid handlers and mobile robots to specialized single-cell manipulators, and the software infrastructure needed to orchestrate these systems. The taxonomy reflects this breadth through five main branches: Robotic Hardware and Physical Automation Systems addresses mechanical platforms and instrumentation (e.g., Physical Laboratory Automation[1], Modular Robotic Platform[14]); Software and Computational Infrastructure covers middleware, simulation tools, and data management layers (e.g., Property Graph Metadata[12], Genesis DB[24]); Application Domains and Experimental Workflows examines domain-specific deployments in synthetic biology, drug discovery, and diagnostics (e.g., Cell Free Biosensors[20], Accelerating Drug Discovery[28]); Sociotechnical and Organizational Perspectives explores human factors and adoption barriers (e.g., Researchers Perceptions Automation[27]); and Cross-Cutting Reviews and Surveys synthesizes overarching trends (e.g., Self-Driving Labs[32], Robotics Revolutionizing Research[2]).

A complementary line of work concerns simulation and benchmarking frameworks that let researchers prototype and validate automated workflows before deploying them on expensive physical equipment. AutoBio[0] sits squarely within the Software and Computational Infrastructure branch, providing a simulation environment for digital biology experiments. Its emphasis on virtual testbeds contrasts with hardware-centric efforts like RoboCulture[3] and Mobile Robots Workflows[4], which prioritize physical integration and real-world deployment. Meanwhile, works such as LLMs Robotic Scripts[8] and Design Build Test Learn[9] explore how computational tools, including large language models and closed-loop optimization, can streamline protocol generation and experimental iteration.

The interplay between simulation platforms like AutoBio[0] and these emerging AI-driven approaches highlights an open question: how to balance the fidelity of virtual models against the practical constraints of wet-lab execution, so that insights from simulation translate reliably into reproducible biological discoveries.

Claimed Contributions

AutoBio simulator for biology laboratory environments

The authors develop a specialized simulation framework that extends existing capabilities through a pipeline for digitizing real-world laboratory instruments using 3D Gaussian Splatting, custom physics plugins for laboratory-specific mechanisms (threads, detents, eccentric mechanisms, and quasi-static liquids), and a rendering stack that supports dynamic instrument interfaces and transparent materials via physically based rendering. An illustrative sketch of what one such mechanism plugin might compute appears below.

10 retrieved papers
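
The report does not include plugin code, so as a rough, hypothetical illustration of what one laboratory-mechanism plugin might compute, the Python sketch below models a detent dial (the click-stop mechanism on a pipette's volume knob) as a periodic restoring torque, together with the thread coupling that converts screw rotation into linear travel. All names, signatures, and constants here are assumptions for illustration, not AutoBio's actual API.

    import math

    def detent_torque(angle, angular_velocity, n_notches=8, stiffness=0.5, damping=0.02):
        # Periodic restoring torque that snaps a dial toward its nearest notch,
        # producing the click-stop feel of a detent mechanism. Hypothetical model.
        pitch = 2.0 * math.pi / n_notches             # angular spacing between notches
        nearest_notch = round(angle / pitch) * pitch  # closest stable notch angle
        return -stiffness * (angle - nearest_notch) - damping * angular_velocity

    def thread_travel(angle, thread_pitch_mm=0.5):
        # Kinematic thread (screw) coupling: one full revolution advances the
        # screw by one thread pitch, e.g. when tightening a tube cap.
        return thread_pitch_mm * angle / (2.0 * math.pi)

    print(detent_torque(angle=0.3, angular_velocity=0.0))  # -0.15: pulls dial back to notch 0
    print(thread_travel(angle=4.0 * math.pi))              # 1.0 mm of travel after two turns

In an engine such as MuJoCo, a force model of this kind would typically be registered as a passive-force callback evaluated at every simulation step; the eccentric-mechanism and quasi-static-liquid plugins described above would presumably follow the same pattern with different force models.
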
AutoBio benchmark with biologically grounded tasks

The authors introduce a benchmark consisting of 16 tasks across three difficulty levels (easy, medium, hard) that evaluate robotic automation in laboratory protocols. The benchmark includes infrastructure for demonstration generation and seamless integration with VLA models, enabling standardized evaluation of precision control, instruction following, and visual reasoning in scientific workflows. An illustrative sketch of a possible task specification and evaluation loop appears below.

10 retrieved papers
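
The task suite itself is not reproduced in this report; the sketch below shows, under assumed interfaces, how such a benchmark is commonly exposed: a declarative task specification plus an evaluation loop that scores a language-conditioned policy by per-task success rate. TaskSpec, env_factory, policy.act, and env.step are hypothetical names, not AutoBio's published API.

    from dataclasses import dataclass

    @dataclass
    class TaskSpec:
        task_id: str      # e.g. "pipette_liquid_transfer" (hypothetical name)
        difficulty: str   # "easy" | "medium" | "hard"
        instruction: str  # natural-language protocol step given to the policy
        max_steps: int    # episode horizon

    def evaluate(policy, env_factory, tasks, episodes_per_task=20):
        # Roll the policy out on every task and report per-task success rates.
        results = {}
        for task in tasks:
            successes = 0
            for _ in range(episodes_per_task):
                env = env_factory(task)              # fresh, randomized scene
                obs = env.reset()
                success = False
                for _ in range(task.max_steps):
                    action = policy.act(obs, task.instruction)
                    obs, done, success = env.step(action)
                    if done:
                        break
                successes += int(success)
            results[task.task_id] = successes / episodes_per_task
        return results
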
Systematic evaluation revealing VLA model limitations in scientific settings

The authors conduct comprehensive evaluations of state-of-the-art VLA models (π0, π0.5, and RDT) on the AutoBio benchmark, systematically identifying critical limitations of current approaches, including cross-modal grounding errors, visual reasoning failures, and a lack of closed-loop recovery in contact-rich tasks, and suggesting directions for future improvements in model architecture and training methodology. An illustrative sketch of per-difficulty result aggregation appears below.

10 retrieved papers
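
Comparing models such as π0, π0.5, and RDT then amounts to running an evaluation loop like the sketch above once per model and aggregating scores by difficulty. The helper below (again hypothetical, reusing the TaskSpec fields assumed earlier) averages per-task success rates within each difficulty level, the kind of summary that localizes where precision control and instruction following break down.

    from collections import defaultdict

    def summarize_by_difficulty(results, tasks):
        # Average per-task success rates within each difficulty level.
        buckets = defaultdict(list)
        for task in tasks:
            buckets[task.difficulty].append(results[task.task_id])
        return {level: sum(rates) / len(rates) for level, rates in buckets.items()}

    # Usage (names hypothetical): one evaluate() pass per model, then tabulate.
    # for name, policy in {"pi0": ..., "pi0.5": ..., "rdt": ...}.items():
    #     print(name, summarize_by_difficulty(evaluate(policy, make_env, tasks), tasks))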

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: AutoBio simulator for biology laboratory environments. Ten candidate papers were examined; none were found to refute the claim (see the full description under Claimed Contributions above).

Contribution 2: AutoBio benchmark with biologically grounded tasks. Ten candidate papers were examined; none provides biologically grounded VLA evaluation tasks in simulated laboratory settings.

Contribution 3: Systematic evaluation revealing VLA model limitations in scientific settings. Ten candidate papers were examined; none offers an existing assessment of vision-language-action models in scientific domains.
