MMedAgent-RL: Optimizing Multi-Agent Collaboration for Multimodal Medical Reasoning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: med-vlm, multi-agent collaboration, multimodal medical reasoning, medical vqa, reinforcement learning
Abstract:

Medical Large Vision-Language Models (Med-LVLMs) have shown strong potential in multimodal diagnostic tasks. However, existing single-agent models struggle to generalize across diverse medical specialties, limiting their performance. Recent efforts introduce multi-agent collaboration frameworks inspired by clinical workflows, where general practitioners (GPs) and specialists interact in a fixed sequence. Despite improvements, these static pipelines lack flexibility and adaptability in reasoning. To address this, we propose MMedAgent-RL, a reinforcement learning (RL)-based multi-agent framework that enables dynamic, optimized collaboration among medical agents. Specifically, we train two GP agents based on Qwen2.5-VL via RL: the triage doctor learns to assign patients to appropriate specialties, while the attending physician integrates the judgments from multi-specialists and its own knowledge to make final decisions. To address the inconsistency in specialist outputs, we introduce a curriculum learning (CL)-guided RL strategy with dynamic entropy regulation, progressively teaching the attending physician to balance between imitating specialists and correcting their mistakes. Experiments on five medical VQA benchmarks demonstrate that MMedAgent-RL outperforms both open-source and proprietary Med-LVLMs. Notably, it achieves an average performance gain of 23.6% over strong baselines.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces MMedAgent-RL, a reinforcement learning-based framework for multi-agent medical reasoning that trains GP agents (triage doctor and attending physician) to coordinate with specialists. It resides in the Specialist-GP Collaboration Models leaf, which contains five papers total, indicating a moderately populated research direction. This leaf focuses specifically on hierarchical GP-specialist architectures, distinguishing it from peer-based or mediator-guided models elsewhere in the taxonomy. The framework's core novelty lies in applying RL to optimize dynamic collaboration patterns rather than relying on fixed clinical workflows.

The taxonomy reveals neighboring approaches in Role-Specialized Agent Frameworks, including Multidisciplinary Team Simulation (four papers simulating oncologists, radiologists, and nurses) and Mediator-Guided Agent Coordination (three papers using meta-agents for orchestration). Adjacent branches explore Dynamic Collaboration Strategies, where one paper applies RL-based optimization and two others focus on adaptive task-driven collaboration. MMedAgent-RL bridges these areas by combining role specialization with dynamic optimization, diverging from static pipelines in sibling papers while sharing the GP-specialist hierarchy. The taxonomy's scope notes clarify that this category excludes peer-based models and generic multi-agent systems without clinical role differentiation.

Fifteen candidate papers were examined in total. For the MMedAgent-RL framework contribution, none of the ten papers reviewed clearly refutes the claim, suggesting relative novelty in applying RL to GP-specialist coordination. For the curriculum learning strategy with dynamic entropy regulation, two candidates were examined and one was flagged as refutable, indicating some overlap with prior curriculum-based training methods. For the theoretical analysis contribution, three candidates were examined and none was flagged. These statistics reflect a limited semantic search scope, not exhaustive coverage, so unexamined literature may contain additional relevant work. Within this search window, the framework contribution appears more distinctive than the curriculum learning component.

Based on the top fifteen semantic matches, the work occupies a moderately explored niche that combines RL optimization with hierarchical medical agent roles. The taxonomy places it at the intersection of established role-based frameworks and emerging dynamic collaboration strategies. The limited search scope means the assessment captures nearby prior work but cannot confirm novelty against the broader literature. Within the examined candidates, the curriculum learning component shows clearer precedent than the overall framework design.
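
The candidate counts quoted in this assessment can be cross-checked with a short tally. The numbers below are copied from the report text; the dictionary keys are shorthand labels introduced here for illustration and do not appear in the report.

```python
# Candidate papers examined per claimed contribution (figures from the report).
candidates = {"framework": 10, "curriculum_entropy": 2, "theory": 3}

# Candidates flagged as refutable per contribution (figures from the report).
refutable = {"framework": 0, "curriculum_entropy": 1, "theory": 0}

total = sum(candidates.values())
total_refutable = sum(refutable.values())

print(total, total_refutable)  # prints "15 1"
```

The totals agree with the fifteen candidates compared and the single refutable paper reported in the summary statistics.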

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 15
Refutable Paper: 1

Research Landscape Overview

Core task: Optimizing multi-agent collaboration for multimodal medical reasoning. The field is organized around five main branches that together address how multiple AI agents can work together to interpret diverse clinical data and support medical decision-making. Multi-Agent Collaboration Architectures and Frameworks explores different organizational patterns, including role-specialized designs where agents mimic clinical teams with distinct expertise (e.g., specialist-GP models). Multimodal Reasoning and Evidence Integration focuses on methods for combining imaging, text, and structured data to form coherent clinical assessments. Clinical Reasoning and Decision Support Mechanisms examines how agents replicate diagnostic workflows, differential reasoning, and treatment planning. Domain-Specific Applications and Benchmarks provides testbeds and real-world use cases across radiology, oncology, and other specialties, while Surveys, Reviews, and Theoretical Foundations offer broader perspectives on agent design principles and ethical considerations.

Recent work reveals contrasting strategies in how agents divide labor and integrate evidence. Some frameworks emphasize hierarchical coordination with mediator agents orchestrating specialist consultations (e.g., Mediator-guided Collaboration[5], ColaCare[26]), while others favor more modular or peer-to-peer interactions where agents iteratively refine shared hypotheses (e.g., MAM Modular[9], Inquire Interact Integrate[6]). MMedAgent-RL[0] sits within the specialist-GP collaboration cluster, employing role-specialized agents that mirror real clinical hierarchies to tackle multimodal diagnostic tasks. Compared to approaches like Proactive Agent Medical[1], which emphasizes proactive information gathering, or MCM TCM[3], which integrates traditional Chinese medicine perspectives, MMedAgent-RL[0] focuses on reinforcement learning to optimize agent interactions and evidence synthesis.

Open questions remain around balancing agent autonomy with interpretability, scaling collaboration to larger teams, and ensuring robust performance across diverse clinical contexts.

Claimed Contributions

MMedAgent-RL framework for multi-agent medical reasoning

The authors introduce a reinforcement learning-driven multi-agent system that simulates clinical workflows (GP → Specialist → GP) for multimodal medical diagnosis. Unlike prior static multi-agent systems, this framework uses RL to train two GP agents (triage doctor and attending physician) to dynamically collaborate with specialist agents.

10 retrieved papers
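
The GP → Specialist → GP workflow described for this contribution can be pictured with a minimal sketch. Everything below is hypothetical and for illustration only: the `Case` type, the function names, the keyword-based routing, and the majority vote are stand-ins, not the authors' implementation. In the paper both GP roles are RL-trained Qwen2.5-VL agents, and the attending physician also draws on its own knowledge rather than simply voting.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical type: a specialist maps a question to an answer string.
Specialist = Callable[[str], str]


@dataclass
class Case:
    question: str
    image_path: str  # placeholder; the stub agents below never read the image


def triage(case: Case, specialties: List[str]) -> str:
    """Stand-in for the triage doctor: route by keyword match (assumption)."""
    text = case.question.lower()
    for name in specialties:
        if name in text:
            return name
    return specialties[0]  # fall back to the first listed specialty


def attend(opinions: List[str]) -> str:
    """Stand-in for the attending physician: plain majority vote (assumption)."""
    return Counter(opinions).most_common(1)[0][0]


def run_pipeline(case: Case, panels: Dict[str, List[Specialist]]) -> str:
    """GP (triage) -> specialists -> GP (attending)."""
    specialty = triage(case, list(panels))
    opinions = [specialist(case.question) for specialist in panels[specialty]]
    return attend(opinions)


# Toy panels with fixed answers, used only to exercise the control flow.
panels = {
    "radiology": [lambda q: "A", lambda q: "A", lambda q: "B"],
    "dermatology": [lambda q: "C"],
}
case = Case("Radiology: which finding is visible?", "scan.png")
print(run_pipeline(case, panels))  # prints "A"
```

The sketch only fixes the order of the three steps. A learned triage policy would replace `triage`, and a learned aggregation policy would replace `attend`.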
Curriculum learning strategy with dynamic entropy regulation

The authors propose a curriculum-based RL approach that categorizes training samples by specialist accuracy (easy, medium, hard) and uses dynamic entropy coefficients to control exploration-exploitation trade-offs. This enables the attending physician agent to learn when to trust specialist consensus and when to challenge incorrect advice.

2 retrieved papers
Can Refute
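
The idea of bucketing samples by specialist accuracy and varying the entropy coefficient per stage can be sketched as follows. This is a rough illustration under stated assumptions: the thresholds, the coefficient values, the choice to use a larger coefficient on harder stages, and all function names are invented here. The report does not give the paper's actual values or schedule.

```python
from typing import Dict, List

# Assumed thresholds and coefficients, chosen only for illustration.
EASY_MIN = 0.8   # specialist accuracy at or above this counts as "easy"
HARD_MAX = 0.3   # specialist accuracy at or below this counts as "hard"
ENTROPY_COEF = {"easy": 0.001, "medium": 0.01, "hard": 0.05}
STAGE_ORDER = ["easy", "medium", "hard"]


def specialist_accuracy(opinions: List[str], gold: str) -> float:
    """Fraction of specialist answers that match the reference answer."""
    if not opinions:
        return 0.0
    return sum(o == gold for o in opinions) / len(opinions)


def difficulty(accuracy: float) -> str:
    """Map specialist accuracy to a curriculum stage."""
    if accuracy >= EASY_MIN:
        return "easy"
    if accuracy <= HARD_MAX:
        return "hard"
    return "medium"


def build_curriculum(samples: List[dict]) -> Dict[str, List[dict]]:
    """Group samples into stages, to be visited in STAGE_ORDER."""
    stages: Dict[str, List[dict]] = {name: [] for name in STAGE_ORDER}
    for sample in samples:
        acc = specialist_accuracy(sample["opinions"], sample["gold"])
        stages[difficulty(acc)].append(sample)
    return stages


def regularized_objective(reward: float, entropy: float, stage: str) -> float:
    """Reward plus a stage-dependent entropy bonus (encourages exploration)."""
    return reward + ENTROPY_COEF[stage] * entropy


samples = [
    {"opinions": ["A", "A", "A"], "gold": "A"},  # specialists all correct
    {"opinions": ["A", "B", "A"], "gold": "A"},  # specialists partly correct
    {"opinions": ["B", "B", "C"], "gold": "A"},  # specialists all wrong
]
curriculum = build_curriculum(samples)
print({stage: len(items) for stage, items in curriculum.items()})
```

On "easy" samples the specialists mostly agree with the reference answer, so following them is rewarded. On "hard" samples they are mostly wrong, so the assumed larger entropy bonus leaves the policy more room to depart from their advice.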
Theoretical analysis of curriculum learning for policy optimization

The authors establish formal convergence guarantees (Theorem 4.1) showing that curriculum learning decomposes a challenging optimization problem into tractable sub-problems, providing warm starts between stages and achieving better convergence than standard SGD under specified conditions.

3 retrieved papers
