Multi-turn Evaluation of Anthropomorphic Behaviours in Large Language Models

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: anthropomorphism, human-AI interaction, social AI, multi-turn, evaluation
Abstract:

The tendency of users to anthropomorphise large language models (LLMs) is of growing societal interest. Here, we present AnthroBench: a novel empirical method and tool for evaluating anthropomorphic LLM behaviours in realistic settings. Our work introduces three key advances. First, we develop a multi-turn evaluation of 14 distinct anthropomorphic behaviours, moving beyond single-turn assessment. Second, we present a scalable, automated approach that leverages simulations of user interactions, enabling efficient and reproducible assessment. Third, we conduct an interactive, large-scale human subject study (N=1101) to empirically validate that the model behaviours we measure predict real users’ anthropomorphic perceptions. We find that all evaluated LLMs exhibit similar behaviours, primarily characterised by relationship-building (e.g., empathy and validation) and first-person pronoun use. Crucially, we observe that the majority of these anthropomorphic behaviours first occur only after multiple turns, underscoring the necessity of multi-turn evaluations for understanding complex social phenomena in human-AI interaction. Our work provides a robust empirical foundation for investigating how design choices influence anthropomorphic model behaviours and for progressing the ethical debate on the desirability of these behaviours.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces AnthroBench, a multi-turn evaluation framework for measuring anthropomorphic behaviors in LLMs through automated user simulations and empirical human validation. It resides in the 'Multi-Turn Anthropomorphic Behavior Evaluation' leaf under 'Behavioral and Social Interaction Studies', where it is currently the sole occupant. This positioning reflects a relatively sparse research direction: while the broader 'Behavioral and Social Interaction Studies' branch contains sibling leaves examining prosocial decision-making and social role misattribution, no prior work in the taxonomy explicitly targets sustained multi-turn anthropomorphic behavior assessment with human validation at scale.

The taxonomy reveals neighboring work in adjacent branches that address related but distinct concerns. 'Psychological and Personality Trait Assessment' focuses on psychometric profiling using standardized inventories, emphasizing trait stability rather than interactive behavioral dynamics. 'Human-Likeness in Language Production' examines stylistic and typographic features in generated text, excluding the multi-turn conversational context central to this paper. 'Role-Playing and Character Simulation' investigates persona consistency and character-specific dialogue generation, but does not systematically measure anthropomorphic perceptions across diverse conversational turns. The paper's emphasis on temporal emergence of behaviors across extended interactions distinguishes it from these single-turn or static assessment paradigms.

Among thirty candidates examined through semantic search and citation expansion, none clearly refute the three core contributions. The multi-turn evaluation method (ten candidates examined, zero refutable) appears novel within the limited search scope, as does the automated simulation approach (ten candidates, zero refutable) and the large-scale human validation study linking measured behaviors to user perceptions (ten candidates, zero refutable). This absence of overlapping prior work suggests the integration of multi-turn assessment, automated simulation, and empirical validation represents a distinctive methodological package, though the limited search scale means potentially relevant work outside the top-thirty matches may exist.

The analysis indicates the paper occupies a methodologically underexplored niche, combining elements from behavioral evaluation, simulation-based testing, and human-subjects research in a way not captured by existing taxonomy leaves. However, the search examined only thirty candidates from semantic neighborhoods, not an exhaustive survey of human-AI interaction or conversational AI literature. The novelty assessment reflects what is visible within this bounded scope, acknowledging that broader literature searches or domain-specific venues might reveal closer precedents.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: evaluating anthropomorphic behaviors in large language models. The field has grown into a rich taxonomy spanning twelve major branches, each addressing distinct facets of how LLMs exhibit or fail to exhibit human-like characteristics. Psychological and Personality Trait Assessment examines whether models display stable traits measurable by psychometric instruments (e.g., AI Psychometrics[7], Personality Testing Stability[35]), while Behavioral and Social Interaction Studies investigates multi-turn conversational dynamics and prosocial cues (Prosocial Behavioural Cues[18]). Cognitive Biases and Reasoning Patterns explores whether models replicate human fallacies such as mental accounting (Mental Accounting Biases[30]) or content effects (Content Effects Reasoning[4]), and Human-Likeness in Language Production scrutinizes stylistic and typographic behaviors (Typing Behaviors[44]). Domain-Specific Human-Like Behavior targets specialized contexts like driving (Drive As You Speak[1], Drive Like Human[5]) or tutoring (AI Tutor Evaluation[19]), whereas Role-Playing and Character Simulation focuses on persona consistency (CharacterGLM[16], PersonaLLM[6]). Additional branches cover memory mechanisms (Dynamic Memory Recall[11]), learning trajectories (Human-like Learning Dynamics[2]), evaluation frameworks (HLB Humanlikeness Benchmark[41]), philosophical debates (Social Misattributions[40]), multimodal embodiment (ZoomEye[24]), and strategic planning (Multi-phases Planning[43]).

Several active lines of work highlight contrasting emphases and open questions. One cluster examines whether anthropomorphism is a stable property or an emergent artifact of prompting and context, with studies like Response Biases Survey[8] and Tracing Human-like Traits[12] documenting variability across tasks. Another thread investigates the gap between surface-level mimicry and genuine cognitive alignment, as seen in debates over whether models truly understand (Do LLMs Understand[34]) or merely simulate plausible outputs (Simulating Humanoid Behavior[45]).

The original paper, Anthropomorphic Behaviours Evaluation[0], sits within the Behavioral and Social Interaction Studies branch, specifically targeting multi-turn anthropomorphic behavior evaluation. Its focus on extended conversational sequences aligns it closely with works assessing dynamic social cues and interaction realism, contrasting with single-shot psychometric approaches (LLM Respondents Psychometric[13]) or domain-specific simulations (Human-SAV Interaction[23]). By emphasizing temporal consistency and interactive authenticity, it addresses a key challenge: distinguishing transient prompt-driven responses from robust, human-like behavioral patterns across sustained exchanges.

Claimed Contributions

AnthroBench: Multi-turn evaluation method and tool for anthropomorphic LLM behaviours

The authors introduce AnthroBench, a comprehensive evaluation framework that assesses 14 distinct anthropomorphic behaviours in large language models through multi-turn dialogues. The method uses automated user simulations to generate realistic conversations and employs multiple LLM judges to detect anthropomorphic behaviours across different interaction contexts.

10 retrieved papers
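The pipeline this contribution describes (a user simulator driving multi-turn dialogues, with judges labelling each model reply) can be sketched at a high level. Everything below is a hypothetical stand-in: `simulated_user`, `target_model`, and `judge` are stub functions in place of LLM calls, and the behaviour labels are illustrative, not AnthroBench's actual taxonomy of 14 behaviours.

```python
# Minimal sketch of a multi-turn anthropomorphism evaluation loop.
# All components are stubs standing in for LLM calls.

def simulated_user(history):
    # Stand-in user simulator; in practice an LLM conditioned on a persona.
    return f"user turn {len(history) // 2 + 1}"

def target_model(history):
    # Stand-in for the model under evaluation.
    return "I understand how you feel." if len(history) >= 3 else "Hello."

def judge(reply):
    # Stand-in LLM judge: returns the set of behaviours a reply exhibits.
    flags = set()
    if reply.startswith("I") or " I " in reply:
        flags.add("first_person_pronouns")
    if "understand how you feel" in reply:
        flags.add("empathy")
    return flags

def run_dialogue(n_turns=5):
    history, per_turn = [], []
    for _ in range(n_turns):
        history.append(("user", simulated_user(history)))
        reply = target_model(history)
        history.append(("assistant", reply))
        per_turn.append(judge(reply))  # one label set per assistant turn
    return per_turn

per_turn_flags = run_dialogue()
print(per_turn_flags)
```

Keeping one judge verdict per turn, rather than a single dialogue-level score, is what makes turn-resolved analyses (such as when a behaviour first appears) possible.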
Scalable automated multi-turn evaluation approach using user simulations

The authors develop a fully automated evaluation pipeline that simulates multi-turn user interactions with AI systems, moving beyond single-turn assessments. This approach enables scalable and reproducible measurement of anthropomorphic behaviours as they emerge across extended conversations rather than isolated exchanges.

10 retrieved papers
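The finding that many behaviours first occur only after several turns suggests a simple per-turn analysis: given judge annotations for each assistant turn, record the first turn at which each behaviour appears. A minimal sketch with made-up annotations (`first_occurrence_turns` is a hypothetical helper, not from the paper):

```python
# Illustrative per-turn judge annotations (not data from the paper).
per_turn = [
    {"first_person_pronouns"},
    {"first_person_pronouns", "empathy"},
    {"first_person_pronouns", "empathy", "validation"},
    {"first_person_pronouns", "empathy"},
]

def first_occurrence_turns(per_turn_flags):
    # Map each behaviour to the 1-indexed turn where it first appears.
    first = {}
    for turn_idx, flags in enumerate(per_turn_flags, start=1):
        for behaviour in flags:
            first.setdefault(behaviour, turn_idx)
    return first

first_seen = first_occurrence_turns(per_turn)
print(first_seen)
```

A single-turn evaluation corresponds to keeping only `per_turn[0]`, which in this toy data would miss empathy and validation entirely.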
Empirical validation through large-scale human subject study

The authors validate their automated evaluation method through a controlled experiment with 1,101 human participants who interacted with AI systems exhibiting different levels of anthropomorphic behaviours. The study demonstrates that their automated measurements correlate with both explicit survey responses and implicit behavioural indicators of human anthropomorphic perceptions.

10 retrieved papers
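At a high level, such a validation correlates automated behaviour scores with mean human perception ratings across experimental conditions. A minimal sketch with invented numbers (the paper's actual statistics and analysis are more involved than a single Pearson coefficient):

```python
# Invented per-condition values for illustration only.
auto_scores = [0.1, 0.4, 0.5, 0.7, 0.9]    # automated behaviour scores
human_ratings = [1.2, 2.0, 2.4, 3.1, 3.9]  # mean survey ratings

def pearson(xs, ys):
    # Pearson's r: covariance over the product of standard deviations.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson(auto_scores, human_ratings)
print(round(r, 3))
```

A strong positive r on real data would support the claim that the automated measurements track users' anthropomorphic perceptions; the study additionally reports implicit behavioural indicators, which this sketch omits.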

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is one partial signal of novelty, though the assessment remains constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

AnthroBench: Multi-turn evaluation method and tool for anthropomorphic LLM behaviours

Ten candidate papers were examined for this contribution; none refute it.

Contribution

Scalable automated multi-turn evaluation approach using user simulations

Ten candidate papers were examined for this contribution; none refute it.

Contribution

Empirical validation through large-scale human subject study

Ten candidate papers were examined for this contribution; none refute it.

Multi-turn Evaluation of Anthropomorphic Behaviours in Large Language Models | Novelty Validation