CARE: Towards Clinical Accountability in Multi-Modal Medical Reasoning with an Evidence-Grounded Agentic Framework

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: Multi-modal Large Language Agent, Medical Visual Question Answering, Visually Grounded Reasoning, Reinforcement Learning with Verifiable Reward
Abstract:

Large vision-language models (VLMs) have shown strong multi-modal medical reasoning ability, but most operate as end-to-end black boxes, diverging from clinicians’ evidence-based, staged workflows and hindering clinical accountability. Complementarily, expert visual grounding models can accurately localize regions of interest (ROIs), providing explicit, reliable evidence that improves both reasoning accuracy and trust. In this paper, we introduce CARE, advancing Clinical Accountability in multi-modal medical Reasoning with an Evidence-grounded agentic framework. Unlike existing approaches that couple grounding and reasoning within a single generalist model, CARE decomposes the task into coordinated sub-modules to reduce shortcut learning and hallucination: a compact VLM proposes relevant medical entities; an expert entity-referring segmentation model produces pixel-level ROI evidence; and a grounded VLM reasons over the full image augmented by ROI hints. The VLMs are optimized with reinforcement learning with verifiable rewards to align answers with supporting evidence. Furthermore, a VLM coordinator plans tool invocation and reviews evidence-answer consistency, providing agentic control and final verification. Evaluated on standard medical VQA benchmarks, our CARE-Flow (coordinator-free) improves average accuracy by 10.9% over the same-size (10B) state-of-the-art (SOTA) model. With dynamic planning and answer review, our CARE-Coord yields a further gain, outperforming the heavily pre-trained SOTA by 5.2%. Our experiments demonstrate that an agentic framework that emulates clinical workflows, incorporating decoupled specialized models and explicit evidence, yields more accurate and accountable medical AI.

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces CARE, an agentic framework that decomposes medical visual question answering into coordinated sub-modules: entity proposal, pixel-level segmentation, and grounded reasoning. Within the taxonomy, it resides in the 'Evidence-Grounded Agentic Frameworks' leaf under 'Grounding and Localization', alongside only one sibling paper (MedAgent-Pro). This leaf represents a sparse, emerging research direction focused on multi-stage, evidence-driven workflows rather than end-to-end models. The framework's emphasis on clinical accountability and verifiable evidence positions it at the frontier of interpretable medical VQA.

The taxonomy reveals that most prior work clusters in adjacent branches: 'Visual Grounding and Region Localization' addresses spatial anchoring without agentic decomposition, while 'Reasoning and Interpretability' focuses on explainability through rationale generation or structured reasoning. Neighboring leaves like 'Retrieval-Augmented Approaches' and 'Large Language Model Integration' explore external knowledge and generative models but typically lack the explicit, pixel-level evidence grounding that CARE emphasizes. The scope notes clarify that agentic frameworks explicitly decompose tasks into specialized modules with evidence verification, distinguishing them from single-model grounding or retrieval-only methods.

Among the thirty candidates examined, none clearly refutes the three core contributions. For the agentic-framework contribution, ten candidates were examined with zero refutable overlaps, suggesting limited prior work on coordinated sub-module decomposition for clinical accountability. Similarly, the region-grounded reasoning workflow and the reinforcement learning with verifiable rewards contributions were each compared against ten candidates without refutation. This pattern indicates that while related concepts exist in retrieval-augmented or grounding-focused methods, the specific combination of agentic coordination, pixel-level evidence, and RL-based alignment appears underexplored within the examined scope.

Based on the limited search of thirty semantically similar papers, the work appears to occupy a relatively novel position, particularly in its integration of agentic decomposition with pixel-level grounding and verifiable reward optimization. However, the analysis does not cover exhaustive literature beyond top-K semantic matches and citation expansion, leaving open the possibility of relevant work in adjacent domains or recent preprints not captured by this search strategy.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Evidence-grounded multi-modal medical visual question answering. The field has evolved from early fusion-based architectures and pre-training strategies to more sophisticated approaches that integrate external knowledge, retrieval mechanisms, and interpretable reasoning. The taxonomy reflects this progression through branches such as Model Architecture and Fusion Mechanisms, which explore how to combine visual and textual modalities (e.g., BPI-MVQA[1], CLIP Medical VQA[2]); Pre-Training and Transfer Learning, which leverage large-scale data to improve generalization; and Data and Knowledge Enhancement, which incorporate domain-specific medical knowledge graphs and structured information (e.g., MKGF[5]). Retrieval-Augmented Approaches bring external evidence into the answering process, while Reasoning and Interpretability focus on making model decisions transparent and logically sound. Grounding and Localization emphasize spatial or evidence-based anchoring of answers, and Large Language Model Integration explores how recent foundation models can be adapted for medical VQA. Specialized Applications target specific clinical domains (e.g., pathology, radiology), and Benchmarks and Evaluation provide standardized datasets like PathVQA[18] and PMC-VQA[11] to measure progress.

Recent work has increasingly emphasized agentic and evidence-grounded frameworks that move beyond simple answer generation to include explicit retrieval and reasoning steps. Within the Grounding and Localization branch, CARE[0] exemplifies this trend by proposing an agentic framework that iteratively retrieves and grounds evidence before answering, closely aligning with MedAgent-Pro[31], which similarly employs multi-step reasoning and tool use. These approaches contrast with earlier methods that relied primarily on end-to-end fusion or static knowledge bases, such as LaPA[3] or MOTOR[4], which integrate pre-defined knowledge graphs or localized attention without dynamic retrieval.
The shift toward agentic systems reflects broader questions about how to balance interpretability, computational cost, and the need for verifiable evidence in high-stakes medical settings, positioning CARE[0] among a small but growing cluster of works that treat medical VQA as a multi-stage, evidence-driven process.

Claimed Contributions

CARE agentic framework for clinical accountability in medical reasoning

The authors propose CARE, an agentic framework that decomposes medical visual question answering into coordinated specialist submodules (entity proposal, referring segmentation, and evidence-grounded VQA) with a dynamic coordinator that plans tool invocation and reviews evidence-answer consistency, emulating clinical workflows to improve accuracy and accountability.

10 retrieved papers
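The staged decomposition this contribution describes (entity proposal, referring segmentation, then grounded answering) can be sketched with hypothetical stand-in functions. Everything here is an illustrative assumption, not the paper's actual models: `propose_entities`, `segment_entity`, `grounded_vqa`, the keyword vocabulary, and the toy nested-list "image" are all placeholders for the compact VLM, the expert segmentation model, and the grounded VLM.

```python
from dataclasses import dataclass

@dataclass
class ROIEvidence:
    entity: str
    mask: list     # binary mask over a toy "image" grid (nested lists)
    present: bool  # global indicator: was the entity found anywhere?

def propose_entities(question: str) -> list:
    """Compact-VLM stand-in: propose medical entities relevant to the question."""
    vocab = {"effusion": ["pleural effusion"], "cardiomegaly": ["heart"]}
    return [e for key, ents in vocab.items() if key in question.lower() for e in ents]

def segment_entity(image, entity: str) -> ROIEvidence:
    """Referring-segmentation stand-in: pixel-level ROI for one entity."""
    mask = [[1 if px > 0.5 else 0 for px in row] for row in image]
    present = any(any(row) for row in mask)
    return ROIEvidence(entity, mask, present)

def grounded_vqa(image, question: str, evidence: list) -> str:
    """Grounded-VLM stand-in: answer using the image plus ROI hints."""
    found = [ev.entity for ev in evidence if ev.present]
    return f"yes ({', '.join(found)} localized)" if found else "no"

def care_flow(image, question: str) -> str:
    """Coordinator-free pipeline: fixed propose -> segment -> reason sequence."""
    entities = propose_entities(question)
    evidence = [segment_entity(image, e) for e in entities]
    return grounded_vqa(image, question, evidence)
```

A coordinator variant would, per the claim, additionally decide which of these stand-ins to invoke and re-check that the returned answer is consistent with the collected `ROIEvidence` before emitting it.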
Region-grounded reasoning workflow with pixel-level evidence

The authors design a workflow where an expert referring-segmentation model produces pixel-level ROI evidence in three forms (zoom-in crops, binary masks, or global indicators), which is then fed back into the VQA model to support evidence-based reasoning and improve both accuracy and accountability.

10 retrieved papers
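The three evidence forms named in this contribution (zoom-in crops, binary masks, and global indicators) can be illustrated on toy nested-list images. The helper names and the bounding-box cropping rule are assumptions for illustration only, not the paper's implementation.

```python
def roi_bbox(mask):
    """Bounding box (r0, c0, r1, c1) of the nonzero pixels in a binary mask."""
    rows = [i for i, row in enumerate(mask) if any(row)]
    cols = [j for j in range(len(mask[0])) if any(row[j] for row in mask)]
    if not rows:
        return None
    return (rows[0], cols[0], rows[-1] + 1, cols[-1] + 1)

def zoom_in_crop(image, mask):
    """Evidence form 1: crop the image to the ROI's bounding box."""
    box = roi_bbox(mask)
    if box is None:
        return []
    r0, c0, r1, c1 = box
    return [row[c0:c1] for row in image[r0:r1]]

def masked_overlay(image, mask):
    """Evidence form 2: zero out every pixel outside the binary mask."""
    return [[px if m else 0 for px, m in zip(irow, mrow)]
            for irow, mrow in zip(image, mask)]

def global_indicator(mask):
    """Evidence form 3: a scalar flag -- is the entity present anywhere?"""
    return any(any(row) for row in mask)
```

Feeding any of these three representations back to the VQA model alongside the full image is what distinguishes this workflow from answer-only pipelines.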
Reinforcement learning with verifiable rewards for evidence-consistent proposals

The authors optimize their VLMs using reinforcement learning with verifiable rewards (RLVR), including an embedding-similarity reward for entity proposals and task-specific rewards for evidence-grounded VQA, to improve performance and ensure answers align with supporting visual evidence.

10 retrieved papers
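A minimal sketch of what such verifiable rewards might look like, using a toy bag-of-words cosine similarity in place of a real text-embedding model. All function names, the best-match averaging rule, and the exact-match VQA reward are illustrative assumptions, not the paper's reward definitions.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would use a text encoder."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def entity_proposal_reward(proposed: list, reference: list) -> float:
    """Embedding-similarity reward: mean best-match similarity of each
    proposed entity against the reference entity set."""
    if not proposed:
        return 0.0
    return sum(max(cosine(embed(p), embed(r)) for r in reference)
               for p in proposed) / len(proposed)

def vqa_reward(answer: str, gold: str) -> float:
    """Task-specific verifiable reward for grounded VQA: exact match."""
    return 1.0 if answer.strip().lower() == gold.strip().lower() else 0.0
```

Because both rewards are computed mechanically from references rather than from a learned judge, they are "verifiable" in the RLVR sense and can drive policy optimization of the proposal and VQA models.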

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

CARE agentic framework for clinical accountability in medical reasoning

The authors propose CARE, an agentic framework that decomposes medical visual question answering into coordinated specialist submodules (entity proposal, referring segmentation, and evidence-grounded VQA) with a dynamic coordinator that plans tool invocation and reviews evidence-answer consistency, emulating clinical workflows to improve accuracy and accountability.

Contribution

Region-grounded reasoning workflow with pixel-level evidence

The authors design a workflow where an expert referring-segmentation model produces pixel-level ROI evidence in three forms (zoom-in crops, binary masks, or global indicators), which is then fed back into the VQA model to support evidence-based reasoning and improve both accuracy and accountability.

Contribution

Reinforcement learning with verifiable rewards for evidence-consistent proposals

The authors optimize their VLMs using reinforcement learning with verifiable rewards (RLVR), including an embedding-similarity reward for entity proposals and task-specific rewards for evidence-grounded VQA, to improve performance and ensure answers align with supporting visual evidence.