Multi-turn Evaluation of Anthropomorphic Behaviours in Large Language Models

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: anthropomorphism, human-AI interaction, social AI, multi-turn, evaluation
Abstract:

The tendency of users to anthropomorphise large language models (LLMs) is of growing societal interest. Here, we present AnthroBench: a novel empirical method and tool for evaluating anthropomorphic LLM behaviours in realistic settings. Our work introduces three key advances. First, we develop a multi-turn evaluation of 14 distinct anthropomorphic behaviours, moving beyond single-turn assessment. Second, we present a scalable, automated approach that leverages simulations of user interactions, enabling efficient and reproducible assessment. Third, we conduct an interactive, large-scale human subject study (N=1101) to empirically validate that the model behaviours we measure predict real users’ anthropomorphic perceptions. We find that all evaluated LLMs exhibit similar behaviours, primarily characterised by relationship-building (e.g., empathy and validation) and first-person pronoun use. Crucially, we observe that the majority of these anthropomorphic behaviours first occur only after multiple turns, underscoring the necessity of multi-turn evaluations for understanding complex social phenomena in human-AI interaction. Our work provides a robust empirical foundation for investigating how design choices influence anthropomorphic model behaviours and for progressing the ethical debate on the desirability of these behaviours.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces AnthroBench, a multi-turn evaluation framework for measuring anthropomorphic behaviors in LLMs through automated user simulations and empirical human validation. It resides in the 'Multi-Turn Anthropomorphic Behavior Evaluation' leaf under 'Behavioral and Social Interaction Studies', where it is currently the sole occupant. This positioning reflects a relatively sparse research direction: while the broader 'Behavioral and Social Interaction Studies' branch contains sibling leaves examining prosocial decision-making and social role misattribution, no prior work in the taxonomy explicitly targets sustained multi-turn anthropomorphic behavior assessment with human validation at scale.

The taxonomy reveals neighboring work in adjacent branches that address related but distinct concerns. 'Psychological and Personality Trait Assessment' focuses on psychometric profiling using standardized inventories, emphasizing trait stability rather than interactive behavioral dynamics. 'Human-Likeness in Language Production' examines stylistic and typographic features in generated text, excluding the multi-turn conversational context central to this paper. 'Role-Playing and Character Simulation' investigates persona consistency and character-specific dialogue generation, but does not systematically measure anthropomorphic perceptions across diverse conversational turns. The paper's emphasis on temporal emergence of behaviors across extended interactions distinguishes it from these single-turn or static assessment paradigms.

Among thirty candidates examined through semantic search and citation expansion, none clearly refute the three core contributions. The multi-turn evaluation method (ten candidates examined, zero refutable) appears novel within the limited search scope, as does the automated simulation approach (ten candidates, zero refutable) and the large-scale human validation study linking measured behaviors to user perceptions (ten candidates, zero refutable). This absence of overlapping prior work suggests the integration of multi-turn assessment, automated simulation, and empirical validation represents a distinctive methodological package, though the limited search scale means potentially relevant work outside the top-thirty matches may exist.

The analysis indicates the paper occupies a methodologically underexplored niche, combining elements from behavioral evaluation, simulation-based testing, and human-subjects research in a way not captured by existing taxonomy leaves. However, the search examined only thirty candidates from semantic neighborhoods, not an exhaustive survey of human-AI interaction or conversational AI literature. The novelty assessment reflects what is visible within this bounded scope, acknowledging that broader literature searches or domain-specific venues might reveal closer precedents.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: evaluating anthropomorphic behaviors in large language models. The field has grown into a rich taxonomy spanning twelve major branches, each addressing distinct facets of how LLMs exhibit or fail to exhibit human-like characteristics. Psychological and Personality Trait Assessment examines whether models display stable traits measurable by psychometric instruments (e.g., AI Psychometrics[7], Personality Testing Stability[35]), while Behavioral and Social Interaction Studies investigates multi-turn conversational dynamics and prosocial cues (Prosocial Behavioural Cues[18]). Cognitive Biases and Reasoning Patterns explores whether models replicate human fallacies such as mental accounting (Mental Accounting Biases[30]) or content effects (Content Effects Reasoning[4]), and Human-Likeness in Language Production scrutinizes stylistic and typographic behaviors (Typing Behaviors[44]). Domain-Specific Human-Like Behavior targets specialized contexts like driving (Drive As You Speak[1], Drive Like Human[5]) or tutoring (AI Tutor Evaluation[19]), whereas Role-Playing and Character Simulation focuses on persona consistency (CharacterGLM[16], PersonaLLM[6]). Additional branches cover memory mechanisms (Dynamic Memory Recall[11]), learning trajectories (Human-like Learning Dynamics[2]), evaluation frameworks (HLB Humanlikeness Benchmark[41]), philosophical debates (Social Misattributions[40]), multimodal embodiment (ZoomEye[24]), and strategic planning (Multi-phases Planning[43]).

Several active lines of work highlight contrasting emphases and open questions. One cluster examines whether anthropomorphism is a stable property or an emergent artifact of prompting and context, with studies like Response Biases Survey[8] and Tracing Human-like Traits[12] documenting variability across tasks. Another thread investigates the gap between surface-level mimicry and genuine cognitive alignment, as seen in debates over whether models truly understand (Do LLMs Understand[34]) or merely simulate plausible outputs (Simulating Humanoid Behavior[45]).

The original paper, Anthropomorphic Behaviours Evaluation[0], sits within the Behavioral and Social Interaction Studies branch, specifically targeting multi-turn anthropomorphic behavior evaluation. Its focus on extended conversational sequences aligns it closely with works assessing dynamic social cues and interaction realism, contrasting with single-shot psychometric approaches (LLM Respondents Psychometric[13]) or domain-specific simulations (Human-SAV Interaction[23]). By emphasizing temporal consistency and interactive authenticity, it addresses a key challenge: distinguishing transient prompt-driven responses from robust, human-like behavioral patterns across sustained exchanges.

Claimed Contributions

AnthroBench: Multi-turn evaluation method and tool for anthropomorphic LLM behaviours

The authors introduce AnthroBench, a comprehensive evaluation framework that assesses 14 distinct anthropomorphic behaviours in large language models through multi-turn dialogues. The method uses automated user simulations to generate realistic conversations and employs multiple LLM judges to detect anthropomorphic behaviours across different interaction contexts.

10 retrieved papers
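The pipeline this contribution describes (a user simulator driving multi-turn dialogues, with judges labelling each model reply) can be sketched at a high level. Everything below is a hypothetical stand-in: `simulated_user`, `target_model`, and `judge` are stub functions in place of LLM calls, and the behaviour labels are illustrative, not AnthroBench's actual taxonomy of 14 behaviours.

```python
# Minimal sketch of a multi-turn anthropomorphism evaluation loop.
# All components are stubs standing in for LLM calls.

def simulated_user(history):
    # Stand-in user simulator; in practice an LLM conditioned on a persona.
    return f"user turn {len(history) // 2 + 1}"

def target_model(history):
    # Stand-in for the model under evaluation.
    return "I understand how you feel." if len(history) >= 3 else "Hello."

def judge(reply):
    # Stand-in LLM judge: returns the set of behaviours a reply exhibits.
    flags = set()
    if reply.startswith("I") or " I " in reply:
        flags.add("first_person_pronouns")
    if "understand how you feel" in reply:
        flags.add("empathy")
    return flags

def run_dialogue(n_turns=5):
    history, per_turn = [], []
    for _ in range(n_turns):
        history.append(("user", simulated_user(history)))
        reply = target_model(history)
        history.append(("assistant", reply))
        per_turn.append(judge(reply))  # one label set per assistant turn
    return per_turn

per_turn_flags = run_dialogue()
print(per_turn_flags)
```

Keeping one judge verdict per turn, rather than a single dialogue-level score, is what makes turn-resolved analyses (such as when a behaviour first appears) possible.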
Scalable automated multi-turn evaluation approach using user simulations

The authors develop a fully automated evaluation pipeline that simulates multi-turn user interactions with AI systems, moving beyond single-turn assessments. This approach enables scalable and reproducible measurement of anthropomorphic behaviours as they emerge across extended conversations rather than isolated exchanges.

10 retrieved papers
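The finding that many behaviours first occur only after several turns suggests a simple per-turn analysis: given judge annotations for each assistant turn, record the first turn at which each behaviour appears. A minimal sketch with made-up annotations (`first_occurrence_turns` is a hypothetical helper, not from the paper):

```python
# Illustrative per-turn judge annotations (not data from the paper).
per_turn = [
    {"first_person_pronouns"},
    {"first_person_pronouns", "empathy"},
    {"first_person_pronouns", "empathy", "validation"},
    {"first_person_pronouns", "empathy"},
]

def first_occurrence_turns(per_turn_flags):
    # Map each behaviour to the 1-indexed turn where it first appears.
    first = {}
    for turn_idx, flags in enumerate(per_turn_flags, start=1):
        for behaviour in flags:
            first.setdefault(behaviour, turn_idx)
    return first

first_seen = first_occurrence_turns(per_turn)
print(first_seen)
```

A single-turn evaluation corresponds to keeping only `per_turn[0]`, which in this toy data would miss empathy and validation entirely.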
Empirical validation through large-scale human subject study

The authors validate their automated evaluation method through a controlled experiment with 1,101 human participants who interacted with AI systems exhibiting different levels of anthropomorphic behaviours. The study demonstrates that their automated measurements correlate with both explicit survey responses and implicit behavioural indicators of human anthropomorphic perceptions.

10 retrieved papers
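At a high level, such a validation correlates automated behaviour scores with mean human perception ratings across experimental conditions. A minimal sketch with invented numbers (the paper's actual statistics and analysis are more involved than a single Pearson coefficient):

```python
# Invented per-condition values for illustration only.
auto_scores = [0.1, 0.4, 0.5, 0.7, 0.9]    # automated behaviour scores
human_ratings = [1.2, 2.0, 2.4, 3.1, 3.9]  # mean survey ratings

def pearson(xs, ys):
    # Pearson's r: covariance over the product of standard deviations.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson(auto_scores, human_ratings)
print(round(r, 3))
```

A strong positive r on real data would support the claim that the automated measurements track users' anthropomorphic perceptions; the study additionally reports implicit behavioural indicators, which this sketch omits.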

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is one partial signal of novelty, though the assessment remains constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

AnthroBench: Multi-turn evaluation method and tool for anthropomorphic LLM behaviours

Ten candidate papers were examined for this contribution; none refute it.

Contribution

Scalable automated multi-turn evaluation approach using user simulations

Ten candidate papers were examined for this contribution; none refute it.

Contribution

Empirical validation through large-scale human subject study

Ten candidate papers were examined for this contribution; none refute it.

Multi-turn Evaluation of Anthropomorphic Behaviours in Large Language Models | Novelty Validation