Can LLMs Move Beyond Short Exchanges to Realistic Therapy Conversations?

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Real-world Counseling · CBT Therapy · Mental Health
Abstract:

Recent incidents have revealed that large language models (LLMs) deployed in mental health contexts can generate unsafe guidance, including reports of chatbots encouraging self-harm. Such risks highlight the urgent need for rigorous, clinically valid evaluation before integration into care. However, existing benchmarks remain inadequate: 1) they rely on synthetic or weakly validated data, undermining clinical reliability; 2) they reduce counseling to isolated QA or single-turn tasks, overlooking the extended, adaptive nature of real interactions; and 3) they rarely capture the formal therapeutic structure of sessions, such as rapport building, guided exploration, intervention, and closure. These gaps risk overestimating LLM competence and obscuring safety-critical failures. To address this, we present CareBench-CBT, the largest clinically validated benchmark for CBT-based counseling. It unifies three components: 1) thousands of expert-curated and validated items to ensure data reliability; 2) realistic multi-turn dialogues that capture long-form therapeutic interaction; and 3) sessions aligned with CBT’s formal structure, enabling process-level evaluation of empathy, therapeutic alignment, and intervention quality. All data are anonymized, double-reviewed by 21 licensed professionals, and validated with reliability and competence metrics. Evaluating 18 state-of-the-art LLMs reveals consistent gaps: high scores on public QA degrade under expert rephrasing, vignette reasoning remains difficult, and dialogue competence falls well below that of human counselors. CareBench-CBT provides a rigorous foundation for advancing the safe and responsible integration of LLMs into mental health care. All code and data are released in the Supplementary Materials.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces CareBench-CBT, a benchmark for evaluating LLMs in CBT-based counseling through expert-curated data, multi-turn dialogues, and alignment with formal CBT session structure. It resides in the 'Comprehensive CBT Benchmarks' leaf alongside two sibling papers within the 'Evaluation Frameworks and Benchmarks' branch. This leaf represents a focused but not overcrowded research direction, with only three papers total addressing multi-level CBT evaluation. The taxonomy reveals that while evaluation frameworks constitute a substantial branch, comprehensive benchmarks remain relatively sparse compared to system design or intervention technique studies.

The taxonomy shows neighboring leaves include 'Task-Specific Evaluation Studies' focusing on isolated CBT tasks like cognitive distortion detection, and 'Comparative Effectiveness Studies' examining LLM performance against human therapists. The broader 'Evaluation Frameworks and Benchmarks' branch sits adjacent to 'LLM System Design and Development for CBT' and 'Therapeutic Interaction and Dialogue Generation', indicating that evaluation work connects closely to both system-building and dialogue quality research. CareBench-CBT's emphasis on formal session structure and multi-turn realism distinguishes it from task-specific evaluations while complementing comparative effectiveness studies by providing standardized assessment infrastructure.

Among the 25 candidates examined across the three contributions, the core benchmark contribution has one refutable candidate out of 10 examined, while the three-component framework and the dual-protocol methodology have zero refutable candidates out of 10 and 5 examined, respectively. Because of the limited search scope, these statistics reflect top-K semantic matches rather than exhaustive coverage. The benchmark contribution appears to have more substantial overlap with prior work, whereas the evaluation framework and dual-protocol methodology appear more distinctive within the examined candidate set. The single refutable pair suggests some overlap in benchmark design principles, though the extent remains unclear given the search limitations.
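To make the top-K caveat concrete, the sketch below shows what top-K semantic matching typically looks like. The WisPaper pipeline itself is not documented in this report, so the embedding representation, the cosine-similarity measure, and the value of K here are illustrative assumptions rather than its actual implementation.

```python
# Hypothetical sketch of top-K semantic candidate retrieval, as implied by
# the phrase "top-K semantic matches rather than exhaustive coverage".
# The embedding model, similarity measure, and K are assumptions.
import numpy as np

def top_k_candidates(claim_vec: np.ndarray,
                     paper_vecs: np.ndarray,
                     k: int = 10) -> np.ndarray:
    """Return indices of the k papers most similar to a claimed contribution."""
    # Cosine similarity between the claim embedding and every candidate paper.
    norms = np.linalg.norm(paper_vecs, axis=1) * np.linalg.norm(claim_vec)
    sims = paper_vecs @ claim_vec / np.clip(norms, 1e-12, None)
    # Highest-similarity papers first; only these k are ever compared,
    # which is why the refutation counts are not exhaustive.
    return np.argsort(-sims)[:k]
```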

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Papers: 1

Research Landscape Overview

Core task: Evaluating large language models in cognitive behavioral therapy counseling. The field has grown rapidly around several interconnected branches. LLM System Design and Development for CBT focuses on building specialized architectures and fine-tuning strategies, as seen in works like CBT-LLM Chinese[2] and AutoCBT Framework[6]. Evaluation Frameworks and Benchmarks establishes standardized assessments such as CBT-Bench[14] and Realistic Therapy Conversations[0], enabling systematic comparison of model performance. Therapeutic Interaction and Dialogue Generation addresses conversational quality and empathy, exemplified by ChatCounselor[8] and Empathic Reasoning Enhancement[27]. Cognitive Intervention Techniques explores specific CBT methods like cognitive restructuring and Socratic questioning, while Training and Educational Applications examines how LLMs support therapist education through simulated clients. Theoretical Foundations and Integration bridges clinical psychology with computational methods, and Accessibility and Deployment Contexts considers real-world implementation challenges across diverse populations.

A particularly active line of work examines whether LLMs can match or augment human therapists, with studies like LLMs Replace Therapists[3] and ChatGPT versus Human CBT[11] comparing clinical effectiveness. A contrasting direction emphasizes building comprehensive benchmarks that capture the nuanced demands of CBT practice. Realistic Therapy Conversations[0] sits squarely within this evaluation-focused cluster, providing a benchmark for assessing therapeutic dialogue quality in naturalistic settings. Compared to CBT-Bench[14], which offers broad coverage of CBT competencies, and Therapeutic Artificial Agents[32], which explores agent design principles, Realistic Therapy Conversations[0] emphasizes ecological validity and conversation realism.

The central tension across these branches remains balancing rigorous quantitative evaluation with the inherently subjective, relational nature of therapy, raising open questions about what metrics truly capture therapeutic effectiveness and how to ensure safety in deployment.

Claimed Contributions

CareBench-CBT benchmark for CBT-based counseling evaluation

The authors introduce CareBench-CBT, a comprehensive benchmark for evaluating large language models in cognitive behavioral therapy contexts. It targets three key gaps in existing benchmarks: unreliable data, countered by expert curation and validation; unrealistic interaction, countered by multi-turn dialogues; and missing therapeutic structure, countered by alignment with formal CBT processes.

Retrieved papers: 10 · Verdict: Can Refute
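As a concrete illustration of what expert-curated, validated data could look like at the record level, here is a minimal, hypothetical item layout. The paper (per the abstract) specifies anonymization, double review by 21 licensed professionals, and reliability validation; the field names and the agreement threshold below are assumptions for illustration only.

```python
# Hypothetical record layout for an expert-validated benchmark item.
# Field names and the 0.7 threshold are illustrative assumptions; the paper
# specifies only anonymization, double review by licensed professionals,
# and validation with reliability metrics.
from dataclasses import dataclass

@dataclass
class ValidatedItem:
    item_id: str
    text: str                      # anonymized item content
    reviewer_ids: tuple[str, str]  # double review by two licensed professionals
    reviewer_agreement: float      # e.g., agreement score across the two reviews

    def is_reliable(self, threshold: float = 0.7) -> bool:
        # Assumed policy: items below the agreement threshold would be
        # re-adjudicated rather than included as-is.
        return self.reviewer_agreement >= threshold
```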
Three-component evaluation framework spanning knowledge, reasoning, and dialogue competence

The benchmark integrates three distinct evaluation types: knowledge-based QA with professionally rephrased items; case-vignette classification requiring clinical reasoning; and complete multi-turn counseling sessions, averaging 30 turns, that follow CBT's formal therapeutic structure of rapport building, exploration, intervention, and closure.

Retrieved papers: 10 · No refutable candidates among those examined
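A minimal sketch of how these three tracks might be dispatched in an evaluation harness follows. The track names and the four CBT stages come from the contribution description above; the per-turn `judge` function and the item field names are placeholders, since the paper's actual scoring metrics are not reproduced in this report.

```python
# Minimal sketch of dispatching a model over the three evaluation tracks.
# Track names and the four CBT stages follow the contribution description;
# `judge` and the item fields are illustrative placeholders.
from typing import Callable

CBT_STAGES = ("rapport_building", "exploration", "intervention", "closure")

def judge(reply: str, gold: str) -> float:
    """Placeholder turn-level score; the benchmark's real metric is richer."""
    return float(reply.strip() == gold.strip())

def evaluate(model: Callable[[str], str], item: dict) -> dict:
    if item["track"] == "knowledge_qa":      # professionally rephrased QA items
        return {"correct": model(item["question"]).strip() == item["answer"]}
    if item["track"] == "vignette":          # clinical-reasoning classification
        return {"correct": model(item["vignette"]).strip() == item["label"]}
    if item["track"] == "dialogue":          # full multi-turn CBT session
        per_stage: dict[str, list[float]] = {s: [] for s in CBT_STAGES}
        for turn in item["turns"]:           # sessions average ~30 turns
            per_stage[turn["stage"]].append(
                judge(model(turn["context"]), turn["gold"]))
        # Process-level view: one aggregate score per formal CBT stage.
        return {s: sum(v) / len(v) for s, v in per_stage.items() if v}
    raise ValueError(f"unknown track: {item['track']}")
```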
Dual-protocol multi-turn evaluation methodology

The authors develop two evaluation protocols for multi-turn dialogues: model-based history where models build on their own generated responses to measure realistic deployment performance, and human-based history using gold-standard counselor responses to isolate per-turn competence without error accumulation.

Retrieved papers: 5 · No refutable candidates among those examined
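The distinction between the two protocols is easy to state in code. The sketch below assumes a simple turn structure (alternating client and counselor turns, with gold-standard counselor responses available); the function and field names are hypothetical, but the history-construction logic follows the description above.

```python
# Sketch of the two dialogue-history protocols described above. Names are
# assumptions; only the history-construction logic follows the report:
# "model" history feeds the model its own prior replies (errors accumulate),
# "human" history feeds gold counselor turns (isolates per-turn competence).
from typing import Callable

def run_session(model: Callable[[list[dict]], str],
                session: list[dict],
                protocol: str = "model") -> list[str]:
    history: list[dict] = []
    replies: list[str] = []
    for turn in session:  # assumed shape: {"client": ..., "gold": ...}
        history.append({"role": "client", "text": turn["client"]})
        reply = model(history)
        replies.append(reply)
        # The only difference between the protocols is what enters history:
        counselor_text = reply if protocol == "model" else turn["gold"]
        history.append({"role": "counselor", "text": counselor_text})
    return replies
```

Under the model-based protocol, early mistakes propagate into later context, which is exactly the deployment-time behavior the benchmark aims to measure; the human-based protocol removes that accumulation so each turn is scored in isolation.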

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: CareBench-CBT benchmark for CBT-based counseling evaluation (described above; 10 candidates examined, 1 refutable).

Contribution 2: Three-component evaluation framework spanning knowledge, reasoning, and dialogue competence (described above; 10 candidates examined, 0 refutable).

Contribution 3: Dual-protocol multi-turn evaluation methodology (described above; 5 candidates examined, 0 refutable).