Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: vision-language models, benchmark dataset, medical AI evaluation, reasoning-intensive tasks
Abstract:

Recent advances in vision-language models (VLMs) have achieved remarkable performance on standard medical benchmarks, yet their true clinical reasoning ability remains unclear. Existing datasets predominantly emphasize classification accuracy, creating an evaluation illusion in which models appear proficient while still failing at high-stakes diagnostic reasoning. We introduce Neural-MedBench, a compact yet reasoning-intensive benchmark specifically designed to probe the limits of multimodal clinical reasoning in neurology. Neural-MedBench integrates multi-sequence MRI scans, structured electronic health records, and clinical notes, and encompasses three core task families: differential diagnosis, lesion recognition, and rationale generation. To ensure reliable evaluation, we develop a hybrid scoring pipeline that combines LLM-based graders, clinician validation, and semantic similarity metrics. Through systematic evaluation of state-of-the-art VLMs, including GPT-4o, Claude-4, and MedGemma, we observe a sharp performance drop compared to conventional datasets. Error analysis shows that reasoning failures, rather than perceptual errors, dominate model shortcomings. Our findings highlight the necessity of a Two-Axis Evaluation Framework: breadth-oriented large datasets for statistical generalization, and depth-oriented, compact benchmarks such as Neural-MedBench for reasoning fidelity. We release Neural-MedBench as an open and extensible diagnostic testbed, which guides the expansion of future benchmarks and enables rigorous yet cost-effective assessment of clinically trustworthy AI.
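The abstract does not specify how the hybrid scoring pipeline combines its components. As a purely illustrative sketch (the bag-of-words cosine stand-in, the `hybrid_score` function name, and the 0.6/0.4 weighting are all assumptions, not the paper's actual method), a blend of an LLM-grader score with a semantic-similarity score might look like:

```python
from collections import Counter
import math

def bow_cosine(a: str, b: str) -> float:
    """Toy semantic-similarity stand-in: cosine over bag-of-words counts.
    A real pipeline would use sentence embeddings instead."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_score(model_rationale: str, reference_rationale: str,
                 llm_grade: float, w_llm: float = 0.6) -> float:
    """Blend an LLM-grader score (0..1) with a similarity score (0..1).
    Clinician validation would gate or audit these scores separately."""
    sim = bow_cosine(model_rationale, reference_rationale)
    return w_llm * llm_grade + (1.0 - w_llm) * sim
```

In such a design the weight `w_llm` would typically be tuned against clinician-validated gradings rather than fixed a priori.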

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Neural-MedBench, a reasoning-intensive benchmark for evaluating multimodal clinical reasoning in neurology, integrating multi-sequence MRI, electronic health records, and clinical notes across differential diagnosis, lesion recognition, and rationale generation tasks. It resides in the 'Benchmark Development and Reasoning Assessment' leaf, which contains only a single paper: the submission itself ('Beyond Classification Accuracy'). This makes it a relatively sparse research direction within the broader taxonomy, suggesting that rigorous reasoning-focused benchmarks in neurology remain underexplored compared to the more crowded architectural-development and disease-specific-application branches.

The taxonomy reveals that neighboring leaves focus on vision-language model evaluation in medical diagnosis and multimodal generative AI in clinical diagnostics, both emphasizing model performance assessment but not necessarily reasoning depth. The parent category 'Clinical Reasoning and Diagnostic Evaluation in Neurology' sits alongside branches addressing AI architectures, data fusion methodologies, and clinical monitoring systems. The scope note for the benchmark leaf explicitly excludes general diagnostic models and clinical monitoring studies, positioning this work as distinct from the numerous disease-specific applications (e.g., MS diagnosis, stroke imaging) and architectural innovations (e.g., CNN-RNN fusion, graph neural networks) that populate other taxonomy branches.

Among the three contributions analyzed, each examined ten candidate papers from the limited search scope of thirty total candidates, with zero refutable pairs identified across all contributions. The two-axis evaluation framework, the Neural-MedBench benchmark itself, and the empirical evidence of breadth-depth disconnect in VLM evaluation all appear to lack substantial overlapping prior work within the examined candidate set. This suggests that the specific combination of reasoning-intensive tasks, hybrid scoring pipelines, and systematic VLM evaluation in neurology may represent a relatively novel configuration, though the limited search scope (top-K semantic search plus citation expansion) means this assessment cannot claim exhaustiveness.
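The retrieval procedure mentioned above (top-K semantic search plus citation expansion) can be sketched as follows. This is an illustrative reconstruction, not the system's actual implementation: the Jaccard word-overlap is a toy stand-in for an embedding-based ranker, and the data shapes (`papers`, `citations` dicts) are assumptions.

```python
def top_k_with_citation_expansion(query: str, papers: dict,
                                  citations: dict, k: int = 10):
    """Rank candidate papers against a query, keep the top-k seeds,
    then add every paper the seeds cite (citation expansion)."""
    def jaccard(a: str, b: str) -> float:
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / len(ta | tb) if (ta or tb) else 0.0

    # Sort paper ids by similarity to the query, best first.
    ranked = sorted(papers, key=lambda pid: jaccard(query, papers[pid]),
                    reverse=True)
    seeds = ranked[:k]
    expanded = set(seeds)
    for pid in seeds:
        expanded.update(citations.get(pid, []))
    return seeds, sorted(expanded)
```

One consequence of this design, noted in the report itself, is that coverage is bounded by the seed ranking: a relevant prior work that is neither lexically/semantically close to the query nor cited by a seed paper will never enter the candidate pool.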

Based on the thirty candidates examined, the work appears to occupy a distinct position at the intersection of benchmark development and clinical reasoning assessment in neurology. The absence of refutable pairs across all contributions, combined with the sparse population of the taxonomy leaf, suggests potential novelty, though the limited search scope and the existence of related work in adjacent leaves (VLM evaluation, generative AI diagnostics) warrant cautious interpretation. The analysis does not cover the full landscape of medical AI benchmarking or reasoning evaluation beyond the examined candidates.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Multimodal clinical reasoning in neurology. The field encompasses a diverse set of approaches that combine imaging, clinical data, and computational methods to support neurological diagnosis and patient care. At the highest level, the taxonomy reveals several major branches: some focus on AI architectures tailored for neurological diagnosis, others on data integration and fusion methodologies that merge heterogeneous sources, and still others on clinical reasoning frameworks, prognostic systems in neurocritical care, disease-specific applications, specialized imaging modalities, educational approaches, and computational modeling. Works such as Integration of multimodal imaging[1] and Integration of multimodal data[2] illustrate efforts to harmonize disparate data streams, while A comprehensive review on[3] and Multimodal machine learning in[15] highlight the growing role of machine learning in synthesizing complex clinical information. Meanwhile, branches addressing clinical monitoring and prognostic systems emphasize real-time decision support in acute settings, and educational frameworks explore how multimodal strategies can enhance training and clinical practice.

Within this landscape, a particularly active line of work centers on benchmark development and reasoning assessment, where researchers seek to move beyond simple classification accuracy toward more nuanced evaluation of diagnostic reasoning. Beyond Classification Accuracy[0] sits squarely in this cluster, emphasizing the need for metrics that capture the quality and interpretability of clinical inferences rather than raw performance alone. This contrasts with neighboring efforts such as A multi-agent approach to[9], which explores collaborative reasoning architectures, and with disease-specific tools like Multimodal classification of Alzheimer's[10] or Early Detection of Parkinson's[39], which prioritize predictive power for particular conditions.
The original paper's focus on rigorous reasoning assessment reflects broader concerns about transparency and clinical validity, themes echoed in works on explainability like SeruNet Smart Explainable Platform[7] and Explainable graph neural network[23]. Open questions remain about how to balance model complexity with interpretability and how to design benchmarks that truly reflect the multifaceted nature of neurological diagnosis.

Claimed Contributions

Two-Axis Evaluation Framework for medical AI

The authors introduce a conceptual framework that distinguishes two independent dimensions for evaluating medical AI systems: breadth-oriented evaluation for statistical generalization across populations, and depth-oriented evaluation for reasoning fidelity and clinical trustworthiness. They argue both axes are necessary for complete assessment of model readiness.

10 retrieved papers
Neural-MedBench benchmark

The authors develop a compact, reasoning-intensive benchmark for neurology that integrates multi-sequence MRI scans, electronic health records, and clinical notes. It encompasses three task families: differential diagnosis, lesion recognition, and rationale generation, designed to probe clinical reasoning rather than classification accuracy.

10 retrieved papers
Empirical evidence of breadth-depth disconnect in VLM evaluation

Through systematic evaluation of leading VLMs including GPT-4o, Claude-4, and MedGemma, the authors demonstrate that models excelling on breadth-oriented benchmarks exhibit sharp performance drops on Neural-MedBench. Error analysis reveals failures stem from reasoning breakdowns rather than perceptual errors, supporting the independence of the two evaluation axes.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Two-Axis Evaluation Framework for medical AI

The authors introduce a conceptual framework that distinguishes two independent dimensions for evaluating medical AI systems: breadth-oriented evaluation for statistical generalization across populations, and depth-oriented evaluation for reasoning fidelity and clinical trustworthiness. They argue both axes are necessary for complete assessment of model readiness.

Contribution

Neural-MedBench benchmark

The authors develop a compact, reasoning-intensive benchmark for neurology that integrates multi-sequence MRI scans, electronic health records, and clinical notes. It encompasses three task families: differential diagnosis, lesion recognition, and rationale generation, designed to probe clinical reasoning rather than classification accuracy.

Contribution

Empirical evidence of breadth-depth disconnect in VLM evaluation

Through systematic evaluation of leading VLMs including GPT-4o, Claude-4, and MedGemma, the authors demonstrate that models excelling on breadth-oriented benchmarks exhibit sharp performance drops on Neural-MedBench. Error analysis reveals failures stem from reasoning breakdowns rather than perceptual errors, supporting the independence of the two evaluation axes.