Can LLMs Move Beyond Short Exchanges to Realistic Therapy Conversations?

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Real-world Counseling · CBT Therapy · Mental Health
Abstract:

Recent incidents have revealed that large language models (LLMs) deployed in mental health contexts can generate unsafe guidance, including reports of chatbots encouraging self-harm. Such risks highlight the urgent need for rigorous, clinically valid evaluation before integration into care. However, existing benchmarks remain inadequate: 1) they rely on synthetic or weakly validated data, undermining clinical reliability; 2) they reduce counseling to isolated QA or single-turn tasks, overlooking the extended, adaptive nature of real interactions; and 3) they rarely capture the formal therapeutic structure of sessions, such as rapport building, guided exploration, intervention, and closure. These gaps risk overestimating LLM competence and obscuring safety-critical failures. To address this, we present CareBench-CBT, the largest clinically validated benchmark for CBT-based counseling. It unifies three components: 1) thousands of expert-curated and validated items to ensure data reliability; 2) realistic multi-turn dialogues that capture long-form therapeutic interaction; and 3) sessions aligned with CBT’s formal structure, enabling process-level evaluation of empathy, therapeutic alignment, and intervention quality. All data are anonymized, double-reviewed by 21 licensed professionals, and validated with reliability and competence metrics. Evaluating 18 state-of-the-art LLMs reveals consistent gaps: high scores on public QA degrade under expert rephrasing, vignette reasoning remains difficult, and dialogue competence falls well below that of human counselors. CareBench-CBT provides a rigorous foundation for advancing the safe and responsible integration of LLMs into mental health care. All code and data are released in the Supplementary Materials.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces CareBench-CBT, a benchmark for evaluating LLMs in CBT-based counseling through expert-curated data, multi-turn dialogues, and alignment with formal CBT session structure. It resides in the 'Comprehensive CBT Benchmarks' leaf alongside two sibling papers within the 'Evaluation Frameworks and Benchmarks' branch. This leaf represents a focused but not overcrowded research direction, with only three papers total addressing multi-level CBT evaluation. The taxonomy reveals that while evaluation frameworks constitute a substantial branch, comprehensive benchmarks remain relatively sparse compared to system design or intervention technique studies.

The taxonomy shows neighboring leaves include 'Task-Specific Evaluation Studies' focusing on isolated CBT tasks like cognitive distortion detection, and 'Comparative Effectiveness Studies' examining LLM performance against human therapists. The broader 'Evaluation Frameworks and Benchmarks' branch sits adjacent to 'LLM System Design and Development for CBT' and 'Therapeutic Interaction and Dialogue Generation', indicating that evaluation work connects closely to both system-building and dialogue quality research. CareBench-CBT's emphasis on formal session structure and multi-turn realism distinguishes it from task-specific evaluations while complementing comparative effectiveness studies by providing standardized assessment infrastructure.

Among the 25 candidates examined across the three contributions, the core benchmark contribution has one refutable candidate out of 10 examined, while the three-component framework and the dual-protocol methodology have zero refutable candidates out of 10 and 5 examined, respectively. Because of the limited search scope, these statistics reflect top-K semantic matches rather than exhaustive coverage. The benchmark contribution appears to have more substantial overlap with prior work, whereas the evaluation framework and dual-protocol methodology appear more distinctive within the examined candidate set. The single refutable pair suggests some overlap in benchmark design principles, though the extent remains unclear given the search limitations.
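To make the top-K caveat concrete, the sketch below shows what top-K semantic matching typically looks like. The WisPaper pipeline itself is not documented in this report, so the embedding representation, the cosine-similarity measure, and the value of K here are illustrative assumptions rather than its actual implementation.

```python
# Hypothetical sketch of top-K semantic candidate retrieval, as implied by
# the phrase "top-K semantic matches rather than exhaustive coverage".
# The embedding model, similarity measure, and K are assumptions.
import numpy as np

def top_k_candidates(claim_vec: np.ndarray,
                     paper_vecs: np.ndarray,
                     k: int = 10) -> np.ndarray:
    """Return indices of the k papers most similar to a claimed contribution."""
    # Cosine similarity between the claim embedding and every candidate paper.
    norms = np.linalg.norm(paper_vecs, axis=1) * np.linalg.norm(claim_vec)
    sims = paper_vecs @ claim_vec / np.clip(norms, 1e-12, None)
    # Highest-similarity papers first; only these k are ever compared,
    # which is why the refutation counts are not exhaustive.
    return np.argsort(-sims)[:k]
```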

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Papers: 1

Research Landscape Overview

Core task: Evaluating large language models in cognitive behavioral therapy counseling. The field has grown rapidly around several interconnected branches. LLM System Design and Development for CBT focuses on building specialized architectures and fine-tuning strategies, as seen in works like CBT-LLM Chinese[2] and AutoCBT Framework[6]. Evaluation Frameworks and Benchmarks establishes standardized assessments such as CBT-Bench[14] and Realistic Therapy Conversations[0], enabling systematic comparison of model performance. Therapeutic Interaction and Dialogue Generation addresses conversational quality and empathy, exemplified by ChatCounselor[8] and Empathic Reasoning Enhancement[27]. Cognitive Intervention Techniques explores specific CBT methods like cognitive restructuring and Socratic questioning, while Training and Educational Applications examines how LLMs support therapist education through simulated clients. Theoretical Foundations and Integration bridges clinical psychology with computational methods, and Accessibility and Deployment Contexts considers real-world implementation challenges across diverse populations.

A particularly active line of work examines whether LLMs can match or augment human therapists, with studies like LLMs Replace Therapists[3] and ChatGPT versus Human CBT[11] comparing clinical effectiveness. A contrasting direction emphasizes building comprehensive benchmarks that capture the nuanced demands of CBT practice. Realistic Therapy Conversations[0] sits squarely within this evaluation-focused cluster, providing a benchmark for assessing therapeutic dialogue quality in naturalistic settings. Compared to CBT-Bench[14], which offers broad coverage of CBT competencies, and Therapeutic Artificial Agents[32], which explores agent design principles, Realistic Therapy Conversations[0] emphasizes ecological validity and conversation realism.

The central tension across these branches remains balancing rigorous quantitative evaluation with the inherently subjective, relational nature of therapy, raising open questions about what metrics truly capture therapeutic effectiveness and how to ensure safety in deployment.

Claimed Contributions

CareBench-CBT benchmark for CBT-based counseling evaluation

The authors introduce CareBench-CBT, a comprehensive benchmark for evaluating large language models in cognitive behavioral therapy contexts. It targets three key gaps in existing benchmarks: unreliable data, countered by expert curation and validation; unrealistic interaction, countered by multi-turn dialogues; and missing therapeutic structure, countered by alignment with formal CBT processes.

Retrieved papers: 10 · Verdict: Can Refute
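As a concrete illustration of what expert-curated, validated data could look like at the record level, here is a minimal, hypothetical item layout. The paper (per the abstract) specifies anonymization, double review by 21 licensed professionals, and reliability validation; the field names and the agreement threshold below are assumptions for illustration only.

```python
# Hypothetical record layout for an expert-validated benchmark item.
# Field names and the 0.7 threshold are illustrative assumptions; the paper
# specifies only anonymization, double review by licensed professionals,
# and validation with reliability metrics.
from dataclasses import dataclass

@dataclass
class ValidatedItem:
    item_id: str
    text: str                      # anonymized item content
    reviewer_ids: tuple[str, str]  # double review by two licensed professionals
    reviewer_agreement: float      # e.g., agreement score across the two reviews

    def is_reliable(self, threshold: float = 0.7) -> bool:
        # Assumed policy: items below the agreement threshold would be
        # re-adjudicated rather than included as-is.
        return self.reviewer_agreement >= threshold
```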
Three-component evaluation framework spanning knowledge, reasoning, and dialogue competence

The benchmark integrates three distinct evaluation types: knowledge-based QA with professionally rephrased items; case-vignette classification requiring clinical reasoning; and complete multi-turn counseling sessions, averaging 30 turns, that follow CBT's formal therapeutic structure of rapport building, exploration, intervention, and closure.

Retrieved papers: 10 · No refutable candidates among those examined
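A minimal sketch of how these three tracks might be dispatched in an evaluation harness follows. The track names and the four CBT stages come from the contribution description above; the per-turn `judge` function and the item field names are placeholders, since the paper's actual scoring metrics are not reproduced in this report.

```python
# Minimal sketch of dispatching a model over the three evaluation tracks.
# Track names and the four CBT stages follow the contribution description;
# `judge` and the item fields are illustrative placeholders.
from typing import Callable

CBT_STAGES = ("rapport_building", "exploration", "intervention", "closure")

def judge(reply: str, gold: str) -> float:
    """Placeholder turn-level score; the benchmark's real metric is richer."""
    return float(reply.strip() == gold.strip())

def evaluate(model: Callable[[str], str], item: dict) -> dict:
    if item["track"] == "knowledge_qa":      # professionally rephrased QA items
        return {"correct": model(item["question"]).strip() == item["answer"]}
    if item["track"] == "vignette":          # clinical-reasoning classification
        return {"correct": model(item["vignette"]).strip() == item["label"]}
    if item["track"] == "dialogue":          # full multi-turn CBT session
        per_stage: dict[str, list[float]] = {s: [] for s in CBT_STAGES}
        for turn in item["turns"]:           # sessions average ~30 turns
            per_stage[turn["stage"]].append(
                judge(model(turn["context"]), turn["gold"]))
        # Process-level view: one aggregate score per formal CBT stage.
        return {s: sum(v) / len(v) for s, v in per_stage.items() if v}
    raise ValueError(f"unknown track: {item['track']}")
```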
Dual-protocol multi-turn evaluation methodology

The authors develop two evaluation protocols for multi-turn dialogues: model-based history where models build on their own generated responses to measure realistic deployment performance, and human-based history using gold-standard counselor responses to isolate per-turn competence without error accumulation.

Retrieved papers: 5 · No refutable candidates among those examined
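The distinction between the two protocols is easy to state in code. The sketch below assumes a simple turn structure (alternating client and counselor turns, with gold-standard counselor responses available); the function and field names are hypothetical, but the history-construction logic follows the description above.

```python
# Sketch of the two dialogue-history protocols described above. Names are
# assumptions; only the history-construction logic follows the report:
# "model" history feeds the model its own prior replies (errors accumulate),
# "human" history feeds gold counselor turns (isolates per-turn competence).
from typing import Callable

def run_session(model: Callable[[list[dict]], str],
                session: list[dict],
                protocol: str = "model") -> list[str]:
    history: list[dict] = []
    replies: list[str] = []
    for turn in session:  # assumed shape: {"client": ..., "gold": ...}
        history.append({"role": "client", "text": turn["client"]})
        reply = model(history)
        replies.append(reply)
        # The only difference between the protocols is what enters history:
        counselor_text = reply if protocol == "model" else turn["gold"]
        history.append({"role": "counselor", "text": counselor_text})
    return replies
```

Under the model-based protocol, early mistakes propagate into later context, which is exactly the deployment-time behavior the benchmark aims to measure; the human-based protocol removes that accumulation so each turn is scored in isolation.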

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: CareBench-CBT benchmark for CBT-based counseling evaluation (described above; 10 candidates examined, 1 refutable).

Contribution 2: Three-component evaluation framework spanning knowledge, reasoning, and dialogue competence (described above; 10 candidates examined, 0 refutable).

Contribution 3: Dual-protocol multi-turn evaluation methodology (described above; 5 candidates examined, 0 refutable).