Abstract:

We introduce MedAgentGym, a scalable and interactive training environment designed to enhance coding-based biomedical reasoning capabilities in large language model (LLM) agents. MedAgentGym comprises 72,413 task instances across 129 categories derived from 12 authentic real-world biomedical scenarios. Tasks are encapsulated within executable sandbox environments, each featuring detailed task specifications, interactive feedback mechanisms, verifiable ground truth annotations, and scalable training trajectory generation. Extensive benchmarking of 29 LLMs reveals substantial performance disparities in biomedical data science between commercial and open-source LLMs. Leveraging efficient multi-threaded and multi-turn trajectory sampling in MedAgentGym, Med-Copilot achieves performance gains of +43.02% and +45.28% from offline and online reinforcement learning, respectively, demonstrating that MedAgentGym is an effective training ground and establishing Med-Copilot as a cost-effective, privacy-preserving alternative competitive with proprietary LLMs (e.g., gpt-4o). By offering a unified execution environment with a comprehensive benchmark and accessible, extensible training resources, MedAgentGym delivers an integrated platform for developing LLM-based coding assistants for advanced biomedical data science.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

MedAgentGym introduces a scalable training environment for coding-based biomedical reasoning, comprising 72,413 task instances across 129 categories from 12 real-world scenarios. The taxonomy places this work in the 'Scalable Agentic Training Platforms' leaf, which contains only two papers including the original. This represents a relatively sparse research direction within the broader field of interactive training environments, suggesting the work addresses an emerging need for comprehensive, multi-scenario agent training platforms rather than entering a crowded space of established solutions.

The taxonomy reveals that MedAgentGym sits within the 'Interactive Training Environments and Agent Frameworks' branch, which also includes multi-agent collaboration architectures and general-purpose biomedical AI agents. Neighboring branches focus on clinical coding systems, code-driven EHR analysis, and clinical decision support protocols. The scope notes indicate MedAgentGym's emphasis on executable sandboxes and trajectory generation distinguishes it from static benchmarking approaches and from single-task frameworks that lack scalability. This positioning suggests the work bridges agent training infrastructure with practical biomedical coding applications.

Among 30 candidates examined across the three contributions, none were identified as clearly refuting the work's novelty. For each contribution (the MedAgentGym training environment, the Med-Copilot coding agent, and the unified execution environment) 10 candidates were examined, and no refutable overlaps were found. The limited search scope means these statistics reflect top-K semantic matches rather than exhaustive coverage. The absence of refutable candidates across all contributions suggests that, within the examined literature, the combination of scale, interactivity, and biomedical domain focus appears distinctive, though broader searches might reveal additional related work.

Based on the limited 30-candidate search, the work appears to occupy a relatively novel position combining large-scale interactive training with biomedical coding tasks. The sparse population of its taxonomy leaf and the lack of refutable candidates among examined papers suggest meaningful differentiation from existing approaches. However, the analysis covers semantic neighbors rather than comprehensive field coverage, and the true novelty assessment would benefit from examining additional agent training platforms and biomedical benchmarking systems beyond the top-K matches.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: coding-based biomedical reasoning in large language models. This field explores how LLMs can leverage executable code, structured representations, and computational tools to enhance reasoning over medical data and clinical knowledge. The taxonomy reveals several complementary directions: Interactive Training Environments and Agent Frameworks focus on building scalable platforms where agents learn through interaction with medical tasks; Clinical Coding and Medical Classification address automated assignment of diagnostic codes and structured labels; Code-Driven Reasoning and Execution emphasize using executable programs to perform multi-step inference; Clinical Reasoning and Decision Support target diagnostic workflows and treatment planning; Domain-Specific Biomedical Reasoning tackles specialized problems in genomics, pathology, and other subfields; Training Optimization and Model Improvement develop techniques to refine model capabilities; and Supporting Tools and Infrastructure provide foundational resources such as benchmarks and knowledge bases.

Representative works like BioReason[1] and Ehragent[3] illustrate how code generation and agent-based architectures can be combined to handle complex clinical scenarios. A particularly active line of work centers on scalable agentic training platforms, where systems like MedAgentGym[0] and Medagentgym[2] create rich interactive environments for training agents on diverse medical reasoning tasks. These platforms contrast with more narrowly scoped clinical coding systems that focus on ICD assignment, or with code-driven reasoning approaches that emphasize symbolic execution over free-form interaction.
MedAgentGym[0] sits squarely within the Interactive Training Environments branch, emphasizing large-scale agent training across varied biomedical scenarios. This distinguishes it from works like Ehragent[3], which targets specific EHR-based decision support, or Agentic AI Framework[5], which may prioritize general-purpose agent architectures over domain-specific medical training. The central tension across these branches involves balancing the generality of training environments against the precision required for clinical deployment, and determining whether code-based reasoning should be tightly integrated into agent learning loops or treated as a separate inference module.

Claimed Contributions

MedAgentGym training environment

The authors present MedAgentGym as a comprehensive platform comprising 72,413 task instances across 129 categories from 12 real-world biomedical scenarios. Each task is encapsulated in executable sandbox environments with detailed specifications, interactive feedback mechanisms, verifiable ground truth annotations, and scalable training trajectory generation capabilities.
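The sandbox mechanism described above can be sketched in miniature. The snippet below is a hypothetical illustration, not the paper's actual interface: a task carries a specification and a verifiable ground-truth answer, agent-submitted code runs in a subprocess, and the environment returns either an error trace (interactive feedback the agent can use to revise its code) or the output together with a verifiable reward. The class name `SandboxTask` and the example task are assumptions for illustration only.

```python
import subprocess
import sys
import tempfile
import textwrap

class SandboxTask:
    """Toy sketch of an executable sandbox task with verifiable reward."""

    def __init__(self, spec: str, ground_truth: str):
        self.spec = spec
        self.ground_truth = ground_truth

    def step(self, agent_code: str) -> tuple[str, float]:
        """Execute the agent's code; return (feedback, reward)."""
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(textwrap.dedent(agent_code))
            path = f.name
        result = subprocess.run([sys.executable, path],
                                capture_output=True, text=True, timeout=30)
        if result.returncode != 0:
            # Interactive feedback: the agent sees the traceback and can retry.
            return result.stderr, 0.0
        answer = result.stdout.strip()
        # Verifiable ground truth: reward is checked, not judged by a model.
        reward = 1.0 if answer == self.ground_truth else 0.0
        return answer, reward

task = SandboxTask(spec="Compute the mean heart rate of the two patients.",
                   ground_truth="72.5")
feedback, reward = task.step("print((70 + 75) / 2)")
```

A multi-turn episode would simply call `step` repeatedly, feeding the returned feedback back into the agent's next code attempt.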

10 retrieved papers
Med-Copilot coding agent

The authors develop Med-Copilot, an LLM-based coding assistant trained using MedAgentGym through both offline and online reinforcement learning methods. The system demonstrates substantial performance improvements and achieves competitive results with proprietary models while maintaining privacy and cost-effectiveness.
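The abstract's "multi-threaded trajectory sampling" feeding offline RL can be illustrated with a minimal sketch, assuming a rejection-sampling-style pipeline: roll out many episodes in parallel and keep only trajectories with positive verifiable reward as fine-tuning data. The `run_episode` stub and its reward rule are placeholders, not the paper's method.

```python
from concurrent.futures import ThreadPoolExecutor

def run_episode(seed: int) -> dict:
    # Stand-in for an agent/sandbox rollout; even seeds "solve" the task.
    reward = 1.0 if seed % 2 == 0 else 0.0
    return {"seed": seed, "actions": [f"code_v{seed}"], "reward": reward}

def sample_trajectories(n: int, workers: int = 8) -> list[dict]:
    # Multi-threaded sampling: rollouts are I/O-bound (sandbox execution),
    # so a thread pool keeps many episodes in flight at once.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        rollouts = list(pool.map(run_episode, range(n)))
    # Offline RL here reduces to fine-tuning on the successful subset.
    return [t for t in rollouts if t["reward"] > 0]

dataset = sample_trajectories(10)
```

An online variant would interleave sampling with policy updates instead of filtering a fixed batch, but the parallel rollout loop stays the same.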

10 retrieved papers
Unified execution environment with comprehensive benchmark

The authors provide an integrated platform that combines executable environments, comprehensive benchmarking capabilities, and extensible training resources. This unified framework supports the development and evaluation of LLM-based coding assistants specifically designed for biomedical data science applications.
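A unified benchmark over many task categories ultimately reduces to aggregating per-category pass/fail results into success rates. The sketch below shows that aggregation step; the category names and results are invented for illustration and do not come from the paper.

```python
from collections import defaultdict

def success_rates(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Aggregate (category, passed) records into per-category success rates."""
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # [passed, run]
    for category, passed in results:
        totals[category][0] += int(passed)
        totals[category][1] += 1
    return {cat: passed / run for cat, (passed, run) in totals.items()}

rates = success_rates([("ehr_sql", True), ("ehr_sql", False),
                       ("bioinformatics", True)])
# rates == {"ehr_sql": 0.5, "bioinformatics": 1.0}
```

Reporting rates per category, rather than one global number, is what lets a benchmark of this breadth expose the commercial-versus-open-source disparities the abstract mentions.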

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

MedAgentGym training environment

The authors present MedAgentGym as a comprehensive platform comprising 72,413 task instances across 129 categories from 12 real-world biomedical scenarios. Each task is encapsulated in executable sandbox environments with detailed specifications, interactive feedback mechanisms, verifiable ground truth annotations, and scalable training trajectory generation capabilities.

Contribution

Med-Copilot coding agent

The authors develop Med-Copilot, an LLM-based coding assistant trained using MedAgentGym through both offline and online reinforcement learning methods. The system demonstrates substantial performance improvements and achieves competitive results with proprietary models while maintaining privacy and cost-effectiveness.

Contribution

Unified execution environment with comprehensive benchmark

The authors provide an integrated platform that combines executable environments, comprehensive benchmarking capabilities, and extensible training resources. This unified framework supports the development and evaluation of LLM-based coding assistants specifically designed for biomedical data science applications.