CARE: Towards Clinical Accountability in Multi-Modal Medical Reasoning with an Evidence-Grounded Agentic Framework

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: Multi-modal Large Language Agent, Medical Visual Question Answering, Visually Grounded Reasoning, Reinforcement Learning with Verifiable Reward
Abstract:

Large vision-language models (VLMs) have shown strong multi-modal medical reasoning ability, but most operate as end-to-end black boxes, diverging from clinicians’ evidence-based, staged workflows and hindering clinical accountability. Complementarily, expert visual grounding models can accurately localize regions of interest (ROIs), providing explicit, reliable evidence that improves both reasoning accuracy and trust. In this paper, we introduce CARE, advancing Clinical Accountability in multi-modal medical Reasoning with an Evidence-grounded agentic framework. Unlike existing approaches that couple grounding and reasoning within a single generalist model, CARE decomposes the task into coordinated sub-modules to reduce shortcut learning and hallucination: a compact VLM proposes relevant medical entities; an expert entity-referring segmentation model produces pixel-level ROI evidence; and a grounded VLM reasons over the full image augmented by ROI hints. The VLMs are optimized with reinforcement learning with verifiable rewards to align answers with supporting evidence. Furthermore, a VLM coordinator plans tool invocation and reviews evidence-answer consistency, providing agentic control and final verification. Evaluated on standard medical VQA benchmarks, our CARE-Flow (coordinator-free) improves average accuracy by 10.9% over the same-size (10B) state-of-the-art (SOTA) model. With dynamic planning and answer review, our CARE-Coord yields a further gain, outperforming the heavily pre-trained SOTA by 5.2%. Our experiments demonstrate that an agentic framework that emulates clinical workflows, incorporating decoupled specialized models and explicit evidence, yields more accurate and accountable medical AI.

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces CARE, an agentic framework that decomposes medical visual question answering into coordinated sub-modules: entity proposal, pixel-level segmentation, and grounded reasoning. Within the taxonomy, it resides in the 'Evidence-Grounded Agentic Frameworks' leaf under 'Grounding and Localization', alongside only one sibling paper (MedAgent-Pro). This leaf represents a sparse, emerging research direction focused on multi-stage, evidence-driven workflows rather than end-to-end models. The framework's emphasis on clinical accountability and verifiable evidence positions it at the frontier of interpretable medical VQA.

The taxonomy reveals that most prior work clusters in adjacent branches: 'Visual Grounding and Region Localization' addresses spatial anchoring without agentic decomposition, while 'Reasoning and Interpretability' focuses on explainability through rationale generation or structured reasoning. Neighboring leaves like 'Retrieval-Augmented Approaches' and 'Large Language Model Integration' explore external knowledge and generative models but typically lack the explicit, pixel-level evidence grounding that CARE emphasizes. The scope notes clarify that agentic frameworks explicitly decompose tasks into specialized modules with evidence verification, distinguishing them from single-model grounding or retrieval-only methods.

Among the thirty candidates examined, none clearly refutes the three core contributions. For the agentic-framework contribution, ten candidates were examined with zero refutable overlaps, suggesting limited prior work on coordinated sub-module decomposition for clinical accountability. Similarly, the region-grounded reasoning workflow and the reinforcement learning with verifiable rewards contributions were each compared against ten candidates without refutation. This pattern indicates that while related concepts exist in retrieval-augmented or grounding-focused methods, the specific combination of agentic coordination, pixel-level evidence, and RL-based alignment appears underexplored within the examined scope.

Based on the limited search of thirty semantically similar papers, the work appears to occupy a relatively novel position, particularly in its integration of agentic decomposition with pixel-level grounding and verifiable reward optimization. However, the analysis does not cover exhaustive literature beyond top-K semantic matches and citation expansion, leaving open the possibility of relevant work in adjacent domains or recent preprints not captured by this search strategy.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Evidence-grounded multi-modal medical visual question answering. The field has evolved from early fusion-based architectures and pre-training strategies to more sophisticated approaches that integrate external knowledge, retrieval mechanisms, and interpretable reasoning. The taxonomy reflects this progression through branches such as Model Architecture and Fusion Mechanisms, which explore how to combine visual and textual modalities (e.g., BPI-MVQA[1], CLIP Medical VQA[2]); Pre-Training and Transfer Learning, which leverage large-scale data to improve generalization; and Data and Knowledge Enhancement, which incorporate domain-specific medical knowledge graphs and structured information (e.g., MKGF[5]). Retrieval-Augmented Approaches bring external evidence into the answering process, while Reasoning and Interpretability focus on making model decisions transparent and logically sound. Grounding and Localization emphasize spatial or evidence-based anchoring of answers, and Large Language Model Integration explores how recent foundation models can be adapted for medical VQA. Specialized Applications target specific clinical domains (e.g., pathology, radiology), and Benchmarks and Evaluation provide standardized datasets like PathVQA[18] and PMC-VQA[11] to measure progress.

Recent work has increasingly emphasized agentic and evidence-grounded frameworks that move beyond simple answer generation to include explicit retrieval and reasoning steps. Within the Grounding and Localization branch, CARE[0] exemplifies this trend by proposing an agentic framework that iteratively retrieves and grounds evidence before answering, closely aligning with MedAgent-Pro[31], which similarly employs multi-step reasoning and tool use. These approaches contrast with earlier methods that relied primarily on end-to-end fusion or static knowledge bases, such as LaPA[3] or MOTOR[4], which integrate pre-defined knowledge graphs or localized attention without dynamic retrieval.
The shift toward agentic systems reflects broader questions about how to balance interpretability, computational cost, and the need for verifiable evidence in high-stakes medical settings, positioning CARE[0] among a small but growing cluster of works that treat medical VQA as a multi-stage, evidence-driven process.

Claimed Contributions

CARE agentic framework for clinical accountability in medical reasoning

The authors propose CARE, an agentic framework that decomposes medical visual question answering into coordinated specialist submodules (entity proposal, referring segmentation, and evidence-grounded VQA) with a dynamic coordinator that plans tool invocation and reviews evidence-answer consistency, emulating clinical workflows to improve accuracy and accountability.

10 retrieved papers
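The staged decomposition this contribution describes (entity proposal, referring segmentation, then grounded answering) can be sketched with hypothetical stand-in functions. Everything here is an illustrative assumption, not the paper's actual models: `propose_entities`, `segment_entity`, `grounded_vqa`, the keyword vocabulary, and the toy nested-list "image" are all placeholders for the compact VLM, the expert segmentation model, and the grounded VLM.

```python
from dataclasses import dataclass

@dataclass
class ROIEvidence:
    entity: str
    mask: list     # binary mask over a toy "image" grid (nested lists)
    present: bool  # global indicator: was the entity found anywhere?

def propose_entities(question: str) -> list:
    """Compact-VLM stand-in: propose medical entities relevant to the question."""
    vocab = {"effusion": ["pleural effusion"], "cardiomegaly": ["heart"]}
    return [e for key, ents in vocab.items() if key in question.lower() for e in ents]

def segment_entity(image, entity: str) -> ROIEvidence:
    """Referring-segmentation stand-in: pixel-level ROI for one entity."""
    mask = [[1 if px > 0.5 else 0 for px in row] for row in image]
    present = any(any(row) for row in mask)
    return ROIEvidence(entity, mask, present)

def grounded_vqa(image, question: str, evidence: list) -> str:
    """Grounded-VLM stand-in: answer using the image plus ROI hints."""
    found = [ev.entity for ev in evidence if ev.present]
    return f"yes ({', '.join(found)} localized)" if found else "no"

def care_flow(image, question: str) -> str:
    """Coordinator-free pipeline: fixed propose -> segment -> reason sequence."""
    entities = propose_entities(question)
    evidence = [segment_entity(image, e) for e in entities]
    return grounded_vqa(image, question, evidence)
```

A coordinator variant would, per the claim, additionally decide which of these stand-ins to invoke and re-check that the returned answer is consistent with the collected `ROIEvidence` before emitting it.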
Region-grounded reasoning workflow with pixel-level evidence

The authors design a workflow where an expert referring-segmentation model produces pixel-level ROI evidence in three forms (zoom-in crops, binary masks, or global indicators), which is then fed back into the VQA model to support evidence-based reasoning and improve both accuracy and accountability.

10 retrieved papers
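The three evidence forms named in this contribution (zoom-in crops, binary masks, and global indicators) can be illustrated on toy nested-list images. The helper names and the bounding-box cropping rule are assumptions for illustration only, not the paper's implementation.

```python
def roi_bbox(mask):
    """Bounding box (r0, c0, r1, c1) of the nonzero pixels in a binary mask."""
    rows = [i for i, row in enumerate(mask) if any(row)]
    cols = [j for j in range(len(mask[0])) if any(row[j] for row in mask)]
    if not rows:
        return None
    return (rows[0], cols[0], rows[-1] + 1, cols[-1] + 1)

def zoom_in_crop(image, mask):
    """Evidence form 1: crop the image to the ROI's bounding box."""
    box = roi_bbox(mask)
    if box is None:
        return []
    r0, c0, r1, c1 = box
    return [row[c0:c1] for row in image[r0:r1]]

def masked_overlay(image, mask):
    """Evidence form 2: zero out every pixel outside the binary mask."""
    return [[px if m else 0 for px, m in zip(irow, mrow)]
            for irow, mrow in zip(image, mask)]

def global_indicator(mask):
    """Evidence form 3: a scalar flag -- is the entity present anywhere?"""
    return any(any(row) for row in mask)
```

Feeding any of these three representations back to the VQA model alongside the full image is what distinguishes this workflow from answer-only pipelines.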
Reinforcement learning with verifiable rewards for evidence-consistent proposals

The authors optimize their VLMs using reinforcement learning with verifiable rewards (RLVR), including an embedding-similarity reward for entity proposals and task-specific rewards for evidence-grounded VQA, to improve performance and ensure answers align with supporting visual evidence.

10 retrieved papers
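A minimal sketch of what such verifiable rewards might look like, using a toy bag-of-words cosine similarity in place of a real text-embedding model. All function names, the best-match averaging rule, and the exact-match VQA reward are illustrative assumptions, not the paper's reward definitions.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would use a text encoder."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def entity_proposal_reward(proposed: list, reference: list) -> float:
    """Embedding-similarity reward: mean best-match similarity of each
    proposed entity against the reference entity set."""
    if not proposed:
        return 0.0
    return sum(max(cosine(embed(p), embed(r)) for r in reference)
               for p in proposed) / len(proposed)

def vqa_reward(answer: str, gold: str) -> float:
    """Task-specific verifiable reward for grounded VQA: exact match."""
    return 1.0 if answer.strip().lower() == gold.strip().lower() else 0.0
```

Because both rewards are computed mechanically from references rather than from a learned judge, they are "verifiable" in the RLVR sense and can drive policy optimization of the proposal and VQA models.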

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

CARE agentic framework for clinical accountability in medical reasoning

The authors propose CARE, an agentic framework that decomposes medical visual question answering into coordinated specialist submodules (entity proposal, referring segmentation, and evidence-grounded VQA) with a dynamic coordinator that plans tool invocation and reviews evidence-answer consistency, emulating clinical workflows to improve accuracy and accountability.

Contribution

Region-grounded reasoning workflow with pixel-level evidence

The authors design a workflow where an expert referring-segmentation model produces pixel-level ROI evidence in three forms (zoom-in crops, binary masks, or global indicators), which is then fed back into the VQA model to support evidence-based reasoning and improve both accuracy and accountability.

Contribution

Reinforcement learning with verifiable rewards for evidence-consistent proposals

The authors optimize their VLMs using reinforcement learning with verifiable rewards (RLVR), including an embedding-similarity reward for entity proposals and task-specific rewards for evidence-grounded VQA, to improve performance and ensure answers align with supporting visual evidence.