CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmarking of Large Language Models in Mental Health Question Answering
Overview
Overall Novelty Assessment
The paper introduces CounselBench, a large-scale benchmark for evaluating LLM responses to mental health questions using expert clinician ratings across six dimensions. It resides in the 'Multi-Dimensional Expert Evaluation Frameworks' leaf, which contains only two papers total (including this one). This places the work in a relatively sparse research direction within the broader taxonomy of 24 papers across 14 leaf nodes, suggesting that comprehensive, multi-dimensional expert evaluation frameworks for mental health QA remain underexplored compared to narrower safety-focused or task-specific assessments.
The taxonomy reveals neighboring work in adjacent leaves: 'Specialized Clinical Task Evaluation' focuses on narrower assessments like care planning or conversational tasks, while 'Clinical Safety and Risk Assessment Evaluation' emphasizes crisis response and suicide risk detection. CounselBench bridges these concerns by incorporating safety as one of six dimensions rather than isolating it. The 'Evidence-Based Content and Knowledge Integration' branch pursues citation-backed responses, whereas this work evaluates broader therapeutic communication skills. This positioning reflects the field's tension between narrow clinical rigor and holistic practitioner perspectives on response quality.
Among the 27 candidates examined through limited semantic search, none clearly refutes the novelty of the three core contributions. The expert evaluation dataset (10 candidates examined, 0 refuting) and the adversarial benchmark (7 candidates, 0 refuting) appear novel in their scale and multi-dimensional scope. The six evaluation dimensions (10 candidates, 0 refuting) show no direct overlap with prior frameworks, though related work uses different dimension sets. The sibling paper in this leaf focuses on adversarial testing rather than baseline evaluation, suggesting complementary rather than overlapping contributions within this sparse research direction.
Based on the limited search scope of 27 semantically similar papers, the work appears to occupy a relatively novel position in combining large-scale expert annotation, multi-dimensional assessment, and adversarial testing for mental health QA. However, this analysis reflects top-K semantic matches rather than exhaustive field coverage, and the sparse taxonomy leaf (2 papers) may indicate either genuine novelty or incomplete literature mapping in this emerging subfield.
Taxonomy
Research Landscape Overview
Claimed Contributions
A benchmark dataset containing 2,000 expert evaluations from 100 mental health professionals rating responses from GPT-4, LLaMA-3, Gemini, and online human therapists across six clinically grounded dimensions, with span-level annotations and written rationales for each evaluation.
An adversarial dataset of 120 mental health questions authored by 10 clinicians to deliberately trigger specific model failure modes identified in CounselBench-Eval, paired with 1,080 expert-annotated responses from nine LLMs to enable targeted probing of model vulnerabilities.
A multi-dimensional evaluation rubric developed through clinical psychology literature and expert consultation, comprising six dimensions: overall quality, empathy, specificity, medical advice, factual consistency, and toxicity, designed to assess both quality and safety in mental health question answering.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
Contribution Analysis
Detailed comparisons for each claimed contribution
CounselBench-Eval: Large-scale expert evaluation dataset
A benchmark dataset containing 2,000 expert evaluations from 100 mental health professionals rating responses from GPT-4, LLaMA-3, Gemini, and online human therapists across six clinically grounded dimensions, with span-level annotations and written rationales for each evaluation.
[4] Moving beyond medical exam questions: A clinician-annotated dataset of real-world tasks and ambiguity in mental healthcare
[10] Proknow: Process knowledge for safety constrained and explainable question generation for mental health diagnostic assistance
[24] ChatCounselor: A large language models for mental health support
[25] AraHealthQA 2025: The First Shared Task on Arabic Health Question Answering
[26] CBT-LLM: A Chinese Large Language Model for Cognitive Behavioral Therapy-based Mental Health Question Answering
[27] A layered multi-expert framework for long-context mental health assessments
[28] MentalChat16K: A Benchmark Dataset for Conversational Mental Health Assistance
[29] Sindbad at arahealthqa track 1: Leveraging large language models for mental health q&a
[30] MHQA: A Diverse, Knowledge Intensive Mental Health Question Answering Challenge for Language Models
CounselBench-Adv: Adversarial benchmark for failure mode detection
An adversarial dataset of 120 mental health questions authored by 10 clinicians to deliberately trigger specific model failure modes identified in CounselBench-Eval, paired with 1,080 expert-annotated responses from nine LLMs to enable targeted probing of model vulnerabilities.
[37] Adversarial Evaluation Algorithm for Detecting Extreme Behaviors of LLMs in Psychological Counseling Scenarios
[38] Generative AI in Mental Well-Being: Balancing Pros and Cons
[39] Logged, listened, and legally ignored: the mirage of privacy in AI therapy
[40] Contextualizing Clinical Benchmarks: A Tripartite Approach to Evaluating LLM-Based Tools in Mental Health Settings
[41] Detecting Algorithmic Errors and Patient Harms for AI-Enabled Medical Devices in Randomized Controlled Trials
[42] 70152 Research Tutorial-Report
Six clinically grounded evaluation dimensions for mental health QA
A multi-dimensional evaluation rubric developed through clinical psychology literature and expert consultation, comprising six dimensions: overall quality, empathy, specificity, medical advice, factual consistency, and toxicity, designed to assess both quality and safety in mental health question answering.