Abstract:

Medical question answering (QA) benchmarks often focus on multiple-choice or fact-based tasks, leaving open-ended answers to real patient questions underexplored. This gap is particularly critical in mental health, where patient questions often mix symptoms, treatment concerns, and emotional needs, requiring answers that balance clinical caution with contextual sensitivity. We present CounselBench, a large-scale benchmark developed with 100 mental health professionals to evaluate and stress-test large language models (LLMs) in realistic help-seeking scenarios. The first component, CounselBench-Eval, contains 2,000 expert evaluations of answers from GPT-4, LLaMA 3, Gemini, and online human therapists on patient questions from the public forum CounselChat. Each answer is rated across six clinically grounded dimensions, with span-level annotations and written rationales. Expert evaluations show that while LLMs achieve high scores on several dimensions, they also exhibit recurring issues, including unconstructive feedback, overgeneralization, and limited personalization or relevance. Responses are also frequently flagged for safety risks, most notably unauthorized medical advice. Follow-up experiments show that LLM judges systematically overrate model responses and overlook safety concerns identified by human experts. To probe failure modes more directly, we construct CounselBench-Adv, an adversarial dataset of 120 expert-authored mental health questions designed to trigger specific model issues. Expert evaluation of 1,080 responses from nine LLMs reveals consistent, model-specific failure patterns. Together, CounselBench establishes a clinically grounded framework for benchmarking LLMs in mental health QA.
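To make the LLM-as-judge finding concrete, one way to quantify "systematic overrating" is the mean judge-minus-expert score gap per dimension. The sketch below is illustrative only: the field names, the 1-5 scale, and the toy records are assumptions, not the paper's released schema or reported numbers.

```python
from statistics import mean

# Six rated dimensions named in the paper; the key names here are assumptions.
DIMENSIONS = ["overall_quality", "empathy", "specificity",
              "medical_advice", "factual_consistency", "toxicity"]

def judge_expert_gap(records):
    """Mean (LLM-judge score minus expert score) per dimension.

    A positive gap on a dimension means the LLM judge rates responses more
    favorably than the human experts do, i.e. systematic overrating.
    """
    gaps = {}
    for dim in DIMENSIONS:
        diffs = [r["judge_scores"][dim] - r["expert_scores"][dim] for r in records]
        gaps[dim] = mean(diffs)
    return gaps

# Two toy records with purely illustrative values.
records = [
    {"expert_scores": {d: 3 for d in DIMENSIONS},
     "judge_scores": {d: 4 for d in DIMENSIONS}},
    {"expert_scores": {d: 2 for d in DIMENSIONS},
     "judge_scores": {d: 4 for d in DIMENSIONS}},
]
print(judge_expert_gap(records))  # every dimension shows a positive gap here
```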

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces CounselBench, a large-scale benchmark for evaluating LLM responses to mental health questions using expert clinician ratings across six dimensions. It resides in the 'Multi-Dimensional Expert Evaluation Frameworks' leaf, which contains only two papers total (including this one). This places the work in a relatively sparse research direction within the broader taxonomy of 24 papers across 14 leaf nodes, suggesting that comprehensive, multi-dimensional expert evaluation frameworks for mental health QA remain underexplored compared to narrower safety-focused or task-specific assessments.

The taxonomy reveals neighboring work in adjacent leaves: 'Specialized Clinical Task Evaluation' focuses on narrower assessments like care planning or conversational tasks, while 'Clinical Safety and Risk Assessment Evaluation' emphasizes crisis response and suicide risk detection. CounselBench bridges these concerns by incorporating safety as one of six dimensions rather than isolating it. The 'Evidence-Based Content and Knowledge Integration' branch pursues citation-backed responses, whereas this work evaluates broader therapeutic communication skills. This positioning reflects the field's tension between narrow clinical rigor and holistic practitioner perspectives on response quality.

Among 27 candidates examined through limited semantic search, none clearly refute the three core contributions. The expert evaluation dataset (10 candidates examined, 0 refutable) and adversarial benchmark (7 candidates, 0 refutable) appear novel in their scale and multi-dimensional scope. The six evaluation dimensions (10 candidates, 0 refutable) show no direct overlap with prior frameworks, though related work uses different dimension sets. The sibling paper in this leaf focuses on adversarial testing rather than baseline evaluation, suggesting complementary rather than overlapping contributions within this sparse research direction.

Based on the limited search scope of 27 semantically similar papers, the work appears to occupy a relatively novel position in combining large-scale expert annotation, multi-dimensional assessment, and adversarial testing for mental health QA. However, this analysis reflects top-K semantic matches rather than exhaustive field coverage, and the sparse taxonomy leaf (2 papers) may indicate either genuine novelty or incomplete literature mapping in this emerging subfield.

Taxonomy

Core-task Taxonomy Papers: 23
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Papers: 0

Research Landscape Overview

Core task: mental health question answering evaluation with expert clinicians. The field has organized itself around six major branches that reflect different priorities in deploying AI for mental health support. Clinical Safety and Risk Assessment Evaluation focuses on detecting and managing high-stakes scenarios such as suicide risk, with works like Suicide Risk Alignment[1] and Clinical Safety LLMs[13] examining whether models can appropriately handle crisis situations. Quality and Clinical Utility Assessment encompasses multi-dimensional frameworks that evaluate response quality from practitioner perspectives, including efforts like CounselBench[0] and MedExpert[16]. Clinical Decision-Making and Reasoning Tasks address diagnostic and triage capabilities, exemplified by Depression Triage Questions[10] and Real World Mental Tasks[5]. Evidence-Based Content and Knowledge Integration emphasizes grounding responses in clinical literature, as seen in Evidence Mental Health QA[3]. Specialized Clinical Applications and Augmentation explores domain-specific tools such as Virtual Patient Training[19] and Counseling Session Summarization[7]. Finally, Evaluation Methodology and Validation Studies develops rigorous assessment protocols, with works like Mental Health Assessment Design[15] establishing standards for expert-driven validation.

A central tension across these branches involves balancing clinical rigor with practical accessibility: some studies prioritize safety guardrails and evidence alignment, while others emphasize conversational fluency and user engagement. CounselBench[0] sits within the Quality and Clinical Utility Assessment branch, specifically in multi-dimensional expert evaluation frameworks, where it shares methodological kinship with CounselBench Adversarial[2], which extends the evaluation to stress-test model robustness under challenging inputs. Compared to Evidence Mental Health QA[3], which emphasizes citation-backed responses, CounselBench[0] takes a broader view of clinical utility by incorporating multiple quality dimensions beyond factual accuracy. This positioning reflects an ongoing debate in the field: whether expert evaluation should focus narrowly on evidence fidelity or encompass the full spectrum of therapeutic communication skills that clinicians value in real-world practice.

Claimed Contributions

CounselBench-Eval: Large-scale expert evaluation dataset

A benchmark dataset containing 2,000 expert evaluations from 100 mental health professionals rating responses from GPT-4, LLaMA-3, Gemini, and online human therapists across six clinically grounded dimensions, with span-level annotations and written rationales for each evaluation.

Retrieved candidate papers: 9
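As a rough illustration of how one of the 2,000 evaluations could be represented, the sketch below defines a minimal record type; the field names, types, and labels are assumptions for illustration, not the paper's released data format.

```python
from dataclasses import dataclass, field

@dataclass
class SpanAnnotation:
    """A flagged span inside the answer text, e.g. unauthorized medical advice."""
    start: int   # character offset where the flagged span begins
    end: int     # character offset where it ends
    issue: str   # e.g. "unauthorized_medical_advice" (label name is hypothetical)

@dataclass
class ExpertEvaluation:
    """One expert evaluation of a single answer (illustrative schema only)."""
    question_id: str                 # CounselChat question being answered
    responder: str                   # "gpt-4", "llama-3", "gemini", or "human_therapist"
    answer_text: str
    ratings: dict                    # six clinically grounded dimensions -> numeric rating
    span_annotations: list = field(default_factory=list)  # list of SpanAnnotation
    rationale: str = ""              # the evaluator's written justification
```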
CounselBench-Adv: Adversarial benchmark for failure mode detection

An adversarial dataset of 120 mental health questions authored by 10 clinicians to deliberately trigger specific model failure modes identified in CounselBench-Eval, paired with 1,080 expert-annotated responses from nine LLMs to enable targeted probing of model vulnerabilities.

Retrieved candidate papers: 6
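The reported scale follows directly from the design: 120 clinician-authored questions, each answered by nine LLMs, yields 120 × 9 = 1,080 responses for expert annotation. Below is a minimal sketch of that collection grid; the question IDs, model names, and generate() stub are placeholders, not the paper's actual pipeline.

```python
# Hypothetical collection loop for CounselBench-Adv: every adversarial
# question is posed to every model, yielding 120 * 9 = 1,080 responses.
adversarial_questions = [f"q_{i:03d}" for i in range(120)]  # placeholder question IDs
models = [f"model_{j}" for j in range(9)]                    # nine LLMs (placeholders)

def generate(model, question):
    """Placeholder for an actual model API call."""
    return f"{model} response to {question}"

responses = [
    {"question": q, "model": m, "answer": generate(m, q)}
    for q in adversarial_questions
    for m in models
]
assert len(responses) == 120 * 9 == 1080  # matches the reported 1,080 responses
```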
Six clinically grounded evaluation dimensions for mental health QA

A multi-dimensional evaluation rubric developed through clinical psychology literature and expert consultation, comprising six dimensions: overall quality, empathy, specificity, medical advice, factual consistency, and toxicity, designed to assess both quality and safety in mental health question answering.

Retrieved candidate papers: 10
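A hedged sketch of how this rubric might be encoded for annotation tooling is shown below. The six dimension names come from the paper; the 1-5 scale, the quality/safety grouping, and the validation helper are assumptions for illustration only.

```python
# Illustrative encoding of the six CounselBench dimensions. The dimension
# names come from the paper; the scales and groupings are assumptions.
RUBRIC = {
    "overall_quality":     {"group": "quality", "scale": (1, 5)},
    "empathy":             {"group": "quality", "scale": (1, 5)},
    "specificity":         {"group": "quality", "scale": (1, 5)},
    "medical_advice":      {"group": "safety",  "scale": (1, 5)},  # e.g. unauthorized advice
    "factual_consistency": {"group": "safety",  "scale": (1, 5)},
    "toxicity":            {"group": "safety",  "scale": (1, 5)},
}

def validate_ratings(ratings: dict) -> None:
    """Check that a rating dict covers all six dimensions and stays in range."""
    for dim, spec in RUBRIC.items():
        lo, hi = spec["scale"]
        if dim not in ratings:
            raise ValueError(f"missing rating for dimension: {dim}")
        if not lo <= ratings[dim] <= hi:
            raise ValueError(f"rating for {dim} out of range [{lo}, {hi}]")
```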

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper sits in a sparsely populated leaf: its only sibling is CounselBench Adversarial[2], an adversarial extension of the same benchmark, and there are no cousin branches under the same grandparent topic. In this retrieved landscape, the work appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: CounselBench-Eval, the large-scale expert evaluation dataset. Described in full under Claimed Contributions above; none of the retrieved candidate papers refutes it.

Contribution 2: CounselBench-Adv, the adversarial benchmark for failure mode detection. Described above; none of the retrieved candidate papers refutes it.

Contribution 3: The six clinically grounded evaluation dimensions for mental health QA. Described above; none of the retrieved candidate papers refutes it.