Abstract:

Medical question answering (QA) benchmarks often focus on multiple-choice or fact-based tasks, leaving open-ended answers to real patient questions underexplored. This gap is particularly critical in mental health, where patient questions often mix symptoms, treatment concerns, and emotional needs, requiring answers that balance clinical caution with contextual sensitivity. We present CounselBench, a large-scale benchmark developed with 100 mental health professionals to evaluate and stress-test large language models (LLMs) in realistic help-seeking scenarios. The first component, CounselBench-Eval, contains 2,000 expert evaluations of answers from GPT-4, LLaMA 3, Gemini, and online human therapists on patient questions from the public forum CounselChat. Each answer is rated across six clinically grounded dimensions, with span-level annotations and written rationales. Expert evaluations show that while LLMs achieve high scores on several dimensions, they also exhibit recurring issues, including unconstructive feedback, overgeneralization, and limited personalization or relevance. Responses are also frequently flagged for safety risks, most notably unauthorized medical advice. Follow-up experiments show that LLM judges systematically overrate model responses and overlook safety concerns identified by human experts. To probe failure modes more directly, we construct CounselBench-Adv, an adversarial dataset of 120 expert-authored mental health questions designed to trigger specific model issues. Expert evaluation of 1,080 responses from nine LLMs reveals consistent, model-specific failure patterns. Together, CounselBench establishes a clinically grounded framework for benchmarking LLMs in mental health QA.
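To make the LLM-as-judge finding concrete, one way to quantify "systematic overrating" is the mean judge-minus-expert score gap per dimension. The sketch below is illustrative only: the field names, the 1-5 scale, and the toy records are assumptions, not the paper's released schema or reported numbers.

```python
from statistics import mean

# Six rated dimensions named in the paper; the key names here are assumptions.
DIMENSIONS = ["overall_quality", "empathy", "specificity",
              "medical_advice", "factual_consistency", "toxicity"]

def judge_expert_gap(records):
    """Mean (LLM-judge score minus expert score) per dimension.

    A positive gap on a dimension means the LLM judge rates responses more
    favorably than the human experts do, i.e. systematic overrating.
    """
    gaps = {}
    for dim in DIMENSIONS:
        diffs = [r["judge_scores"][dim] - r["expert_scores"][dim] for r in records]
        gaps[dim] = mean(diffs)
    return gaps

# Two toy records with purely illustrative values.
records = [
    {"expert_scores": {d: 3 for d in DIMENSIONS},
     "judge_scores": {d: 4 for d in DIMENSIONS}},
    {"expert_scores": {d: 2 for d in DIMENSIONS},
     "judge_scores": {d: 4 for d in DIMENSIONS}},
]
print(judge_expert_gap(records))  # every dimension shows a positive gap here
```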

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces CounselBench, a large-scale benchmark for evaluating LLM responses to mental health questions using expert clinician ratings across six dimensions. It resides in the 'Multi-Dimensional Expert Evaluation Frameworks' leaf, which contains only two papers total (including this one). This places the work in a relatively sparse research direction within the broader taxonomy of 24 papers across 14 leaf nodes, suggesting that comprehensive, multi-dimensional expert evaluation frameworks for mental health QA remain underexplored compared to narrower safety-focused or task-specific assessments.

The taxonomy reveals neighboring work in adjacent leaves: 'Specialized Clinical Task Evaluation' focuses on narrower assessments like care planning or conversational tasks, while 'Clinical Safety and Risk Assessment Evaluation' emphasizes crisis response and suicide risk detection. CounselBench bridges these concerns by incorporating safety as one of six dimensions rather than isolating it. The 'Evidence-Based Content and Knowledge Integration' branch pursues citation-backed responses, whereas this work evaluates broader therapeutic communication skills. This positioning reflects the field's tension between narrow clinical rigor and holistic practitioner perspectives on response quality.

Among 27 candidates examined through limited semantic search, none clearly refute the three core contributions. The expert evaluation dataset (10 candidates examined, 0 refutable) and adversarial benchmark (7 candidates, 0 refutable) appear novel in their scale and multi-dimensional scope. The six evaluation dimensions (10 candidates, 0 refutable) show no direct overlap with prior frameworks, though related work uses different dimension sets. The sibling paper in this leaf focuses on adversarial testing rather than baseline evaluation, suggesting complementary rather than overlapping contributions within this sparse research direction.

Based on the limited search scope of 27 semantically similar papers, the work appears to occupy a relatively novel position in combining large-scale expert annotation, multi-dimensional assessment, and adversarial testing for mental health QA. However, this analysis reflects top-K semantic matches rather than exhaustive field coverage, and the sparse taxonomy leaf (2 papers) may indicate either genuine novelty or incomplete literature mapping in this emerging subfield.

Taxonomy

Core-task Taxonomy Papers: 23
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Papers: 0

Research Landscape Overview

Core task: mental health question answering evaluation with expert clinicians. The field has organized itself around six major branches that reflect different priorities in deploying AI for mental health support. Clinical Safety and Risk Assessment Evaluation focuses on detecting and managing high-stakes scenarios such as suicide risk, with works like Suicide Risk Alignment[1] and Clinical Safety LLMs[13] examining whether models can appropriately handle crisis situations. Quality and Clinical Utility Assessment encompasses multi-dimensional frameworks that evaluate response quality from practitioner perspectives, including efforts like CounselBench[0] and MedExpert[16]. Clinical Decision-Making and Reasoning Tasks address diagnostic and triage capabilities, exemplified by Depression Triage Questions[10] and Real World Mental Tasks[5]. Evidence-Based Content and Knowledge Integration emphasizes grounding responses in clinical literature, as seen in Evidence Mental Health QA[3]. Specialized Clinical Applications and Augmentation explores domain-specific tools such as Virtual Patient Training[19] and Counseling Session Summarization[7]. Finally, Evaluation Methodology and Validation Studies develops rigorous assessment protocols, with works like Mental Health Assessment Design[15] establishing standards for expert-driven validation.

A central tension across these branches involves balancing clinical rigor with practical accessibility: some studies prioritize safety guardrails and evidence alignment, while others emphasize conversational fluency and user engagement. CounselBench[0] sits within the Quality and Clinical Utility Assessment branch, specifically in multi-dimensional expert evaluation frameworks, where it shares methodological kinship with CounselBench Adversarial[2], which extends the evaluation to stress-test model robustness under challenging inputs. Compared to Evidence Mental Health QA[3], which emphasizes citation-backed responses, CounselBench[0] takes a broader view of clinical utility by incorporating multiple quality dimensions beyond factual accuracy. This positioning reflects an ongoing debate in the field: whether expert evaluation should focus narrowly on evidence fidelity or encompass the full spectrum of therapeutic communication skills that clinicians value in real-world practice.

Claimed Contributions

CounselBench-Eval: Large-scale expert evaluation dataset

A benchmark dataset containing 2,000 expert evaluations from 100 mental health professionals rating responses from GPT-4, LLaMA-3, Gemini, and online human therapists across six clinically grounded dimensions, with span-level annotations and written rationales for each evaluation.

Retrieved candidate papers: 9
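As a rough illustration of how one of the 2,000 evaluations could be represented, the sketch below defines a minimal record type; the field names, types, and labels are assumptions for illustration, not the paper's released data format.

```python
from dataclasses import dataclass, field

@dataclass
class SpanAnnotation:
    """A flagged span inside the answer text, e.g. unauthorized medical advice."""
    start: int   # character offset where the flagged span begins
    end: int     # character offset where it ends
    issue: str   # e.g. "unauthorized_medical_advice" (label name is hypothetical)

@dataclass
class ExpertEvaluation:
    """One expert evaluation of a single answer (illustrative schema only)."""
    question_id: str                 # CounselChat question being answered
    responder: str                   # "gpt-4", "llama-3", "gemini", or "human_therapist"
    answer_text: str
    ratings: dict                    # six clinically grounded dimensions -> numeric rating
    span_annotations: list = field(default_factory=list)  # list of SpanAnnotation
    rationale: str = ""              # the evaluator's written justification
```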
CounselBench-Adv: Adversarial benchmark for failure mode detection

An adversarial dataset of 120 mental health questions authored by 10 clinicians to deliberately trigger specific model failure modes identified in CounselBench-Eval, paired with 1,080 expert-annotated responses from nine LLMs to enable targeted probing of model vulnerabilities.

Retrieved candidate papers: 6
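The reported scale follows directly from the design: 120 clinician-authored questions, each answered by nine LLMs, yields 120 × 9 = 1,080 responses for expert annotation. Below is a minimal sketch of that collection grid; the question IDs, model names, and generate() stub are placeholders, not the paper's actual pipeline.

```python
# Hypothetical collection loop for CounselBench-Adv: every adversarial
# question is posed to every model, yielding 120 * 9 = 1,080 responses.
adversarial_questions = [f"q_{i:03d}" for i in range(120)]  # placeholder question IDs
models = [f"model_{j}" for j in range(9)]                    # nine LLMs (placeholders)

def generate(model, question):
    """Placeholder for an actual model API call."""
    return f"{model} response to {question}"

responses = [
    {"question": q, "model": m, "answer": generate(m, q)}
    for q in adversarial_questions
    for m in models
]
assert len(responses) == 120 * 9 == 1080  # matches the reported 1,080 responses
```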
Six clinically grounded evaluation dimensions for mental health QA

A multi-dimensional evaluation rubric developed through clinical psychology literature and expert consultation, comprising six dimensions: overall quality, empathy, specificity, medical advice, factual consistency, and toxicity, designed to assess both quality and safety in mental health question answering.

Retrieved candidate papers: 10
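A hedged sketch of how this rubric might be encoded for annotation tooling is shown below. The six dimension names come from the paper; the 1-5 scale, the quality/safety grouping, and the validation helper are assumptions for illustration only.

```python
# Illustrative encoding of the six CounselBench dimensions. The dimension
# names come from the paper; the scales and groupings are assumptions.
RUBRIC = {
    "overall_quality":     {"group": "quality", "scale": (1, 5)},
    "empathy":             {"group": "quality", "scale": (1, 5)},
    "specificity":         {"group": "quality", "scale": (1, 5)},
    "medical_advice":      {"group": "safety",  "scale": (1, 5)},  # e.g. unauthorized advice
    "factual_consistency": {"group": "safety",  "scale": (1, 5)},
    "toxicity":            {"group": "safety",  "scale": (1, 5)},
}

def validate_ratings(ratings: dict) -> None:
    """Check that a rating dict covers all six dimensions and stays in range."""
    for dim, spec in RUBRIC.items():
        lo, hi = spec["scale"]
        if dim not in ratings:
            raise ValueError(f"missing rating for dimension: {dim}")
        if not lo <= ratings[dim] <= hi:
            raise ValueError(f"rating for {dim} out of range [{lo}, {hi}]")
```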

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper sits in a sparsely populated leaf: its only sibling is CounselBench Adversarial[2], an adversarial extension of the same benchmark, and there are no cousin branches under the same grandparent topic. In this retrieved landscape, the work appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: CounselBench-Eval, the large-scale expert evaluation dataset. Described in full under Claimed Contributions above; none of the retrieved candidate papers refutes it.

Contribution 2: CounselBench-Adv, the adversarial benchmark for failure mode detection. Described above; none of the retrieved candidate papers refutes it.

Contribution 3: The six clinically grounded evaluation dimensions for mental health QA. Described above; none of the retrieved candidate papers refutes it.