SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: human behavior simulation, large language models, benchmarking, computational social science, human-AI alignment, calibration, human-centered AI
Abstract:

Large language model (LLM) simulations of human behavior have the potential to revolutionize the social and behavioral sciences, if and only if they faithfully reflect real human behaviors. Current evaluations are fragmented, based on bespoke tasks and metrics, creating a patchwork of incomparable results. To address this, we introduce SimBench, the first large-scale, standardized benchmark for a robust, reproducible science of LLM simulation. By unifying 20 diverse datasets covering tasks from moral decision-making to economic choice across a large global participant pool, SimBench provides the necessary foundation to ask fundamental questions about when, how, and why LLM simulations succeed or fail. We show that, while even the best LLMs today have limited simulation ability (score: 40.80/100), performance scales log-linearly with model size. Simulation performance is not improved by increased inference-time compute. We discover an alignment-simulation trade-off: instruction-tuning improves performance on low-entropy (consensus) questions but degrades it on high-entropy (diverse) ones. Models particularly struggle when simulating specific demographic groups. Finally, we demonstrate that simulation ability correlates most strongly with deep, knowledge-intensive reasoning (MMLU-Pro, r=0.939). By making progress measurable, we aim to accelerate the development of more faithful LLM simulators.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces SimBench, a large-scale standardized benchmark unifying 20 diverse datasets to evaluate LLM simulation of human behavior across moral decision-making, economic choice, and other domains. It resides in the 'Comprehensive Benchmarks and Evaluation Frameworks' leaf, which contains only three papers total. This leaf sits within the broader 'Evaluation, Benchmarking, and Methodological Foundations' branch, indicating that systematic evaluation infrastructure for LLM behavior simulation is a relatively sparse but foundational research direction compared to more crowded application-specific or cognitive modeling areas.

The taxonomy reveals neighboring evaluation work in 'Domain-Specific Validation and Accuracy Assessment' (e.g., shopping behavior, trust tasks), which focuses on narrow contexts rather than broad coverage. Sibling papers in the same leaf include efforts targeting personalized digital replicas and general behavioral alignment. The broader field structure shows that while application domains (user behavior, driving, social networks) and cognitive modeling branches are well-populated, the infrastructure for standardized, cross-domain evaluation remains underdeveloped. SimBench bridges this gap by providing unified metrics where prior benchmarks addressed fragmented, bespoke tasks.

Of the 30 candidate papers examined (10 per contribution), only Contribution A (the benchmark itself) had a refutable candidate, suggesting that some prior work on standardized evaluation exists but is limited in scope. For Contributions B (the alignment-simulation trade-off) and C (the empirical characterization), 10 candidates each were examined with zero refutations, indicating these findings appear more novel within the search scope. The alignment-simulation trade-off, in which instruction-tuning helps consensus questions but harms diverse ones, and the log-linear scaling observation represent empirical discoveries not clearly anticipated by the examined literature, though the limited search scale means undiscovered prior work remains possible.

Based on top-30 semantic matches, the work appears to occupy a relatively novel position by systematically unifying evaluation across 20 datasets and uncovering the alignment-simulation trade-off. However, the search scope is modest, and the taxonomy shows active neighboring work on persona simulation, cognitive biases, and domain-specific validation that may contain relevant insights not captured here. The benchmark's standardization effort addresses a recognized gap in the field's fragmented evaluation landscape, though whether its specific design choices and empirical findings are entirely unprecedented requires broader literature coverage.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: Simulating human behaviors with large language models.

The field has grown into a rich landscape organized around several major branches. Strategic and Economic Behavior Simulation explores how LLMs replicate decision-making in game-theoretic settings such as repeated games and ultimatum bargaining (e.g., Repeated Games[1], Ultimatum Games[5]). Cognitive and Psychological Process Simulation investigates whether models can mirror human reasoning patterns, biases, and mental processes (e.g., Intuitive Behavior Biases[18], Models of Cognition[14]). Application-Specific Behavior Simulation targets domains like autonomous driving, user interaction, and healthcare (e.g., Driving Language Network[10], User Behavior Simulation[6]). Social Dynamics and Population-Level Simulation examines collective phenomena and opinion formation (e.g., Opinion Dynamics[36], Agentsociety[42]), while Persona and Role-Based Simulation focuses on creating agents with distinct identities and roles (e.g., Role Play[2], Mixture of Personas[19]). Evaluation, Benchmarking, and Methodological Foundations provides the infrastructure for rigorous assessment, and Critical Perspectives and Limitations addresses the risks and shortcomings of these approaches (e.g., Perils and Opportunities[11]).

Within this taxonomy, a particularly active line of work centers on comprehensive evaluation frameworks that systematically measure how well LLMs capture human-like behavior across diverse contexts. SimBench[0] sits squarely in this evaluation-focused cluster, offering a broad benchmark for assessing simulation fidelity. It contrasts with more narrowly scoped efforts like Digital Twins Benchmark[47], which targets personalized digital replicas, by aiming for wider coverage of behavioral dimensions. Meanwhile, works such as Choice Behavior[3] and Trust Behavior[4] probe specific facets of human decision-making, highlighting ongoing debates about whether LLMs genuinely replicate cognitive processes or merely reproduce surface-level patterns. The central tension across these branches revolves around balancing realism, generalizability, and ethical considerations, questions that SimBench[0] and its neighbors help to operationalize through systematic measurement.

Claimed Contributions

Contribution A. SIMBENCH: A large-scale standardized benchmark for LLM human behavior simulation

The authors present SIMBENCH, which unifies 20 diverse datasets covering tasks from moral decision-making to economic choice across a large global participant pool. This benchmark provides a standardized framework to rigorously measure and compare simulation fidelity across models, tasks, and populations.

Retrieved papers compared: 10. Refutable candidates: 1 (can refute).
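To make this contribution concrete, the sketch below shows one way a unified SimBench-style item and a simulation-fidelity score could be structured, assuming each item stores a question, its answer options, and the observed human response distribution, and that fidelity is scored by comparing a model's answer distribution to the human one via total variation distance. The field names, example values, and the specific distance are illustrative assumptions made here; the report does not specify SimBench's actual schema or metric.

```python
# Hedged sketch: a unified benchmark record and a 0-100 fidelity score.
# Field names and the total-variation-based score are assumptions for
# illustration; the report does not state SimBench's exact definitions.
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class SimItem:
    dataset: str                   # source dataset name (hypothetical)
    question: str                  # question text shown to participants
    options: Tuple[str, ...]       # answer options
    human_dist: Dict[str, float]   # observed human response distribution (sums to 1)

def fidelity_score(human_dist: Dict[str, float], model_dist: Dict[str, float]) -> float:
    """Return a 0-100 score; 100 means the model's answer distribution
    matches the human distribution exactly (1 - total variation distance)."""
    options = set(human_dist) | set(model_dist)
    tvd = 0.5 * sum(abs(human_dist.get(o, 0.0) - model_dist.get(o, 0.0)) for o in options)
    return 100.0 * (1.0 - tvd)

item = SimItem("ultimatum_game", "Accept a 20/80 split?", ("accept", "reject"),
               {"accept": 0.35, "reject": 0.65})
print(fidelity_score(item.human_dist, {"accept": 0.60, "reject": 0.40}))  # -> 75.0
```

Under this framing a perfect simulator scores 100, which is consistent with the 0-100 scale reported in the abstract, though not necessarily with the paper's exact scoring rule.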
Contribution B. Discovery of alignment-simulation trade-off in instruction-tuned models

The authors discover that instruction-tuning creates a fundamental trade-off where it improves simulation performance on consensus questions but degrades it on questions with diverse human responses. A causal analysis confirms this results from competing instruction-following and entropy-reduction effects.

Retrieved papers compared: 10. Refutable candidates: 0.
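As an illustration of how the consensus/diverse split behind this finding could be operationalized, the sketch below classifies questions by the Shannon entropy of their human response distributions. The normalization and the 0.5 threshold are assumptions made here for illustration, not details taken from the paper.

```python
# Hedged sketch: splitting questions into low-entropy (consensus) and
# high-entropy (diverse) items by the normalized Shannon entropy of the
# human response distribution. Threshold and normalization are assumptions.
import math
from typing import Dict

def response_entropy(dist: Dict[str, float]) -> float:
    """Normalized Shannon entropy in [0, 1]; 0 = full consensus, 1 = uniform."""
    probs = [p for p in dist.values() if p > 0]
    if len(probs) <= 1:
        return 0.0
    h = -sum(p * math.log(p) for p in probs)
    return h / math.log(len(dist))

consensus_q = {"yes": 0.95, "no": 0.05}   # most humans agree
diverse_q = {"yes": 0.50, "no": 0.50}     # humans are split
for name, dist in [("consensus", consensus_q), ("diverse", diverse_q)]:
    label = "low-entropy" if response_entropy(dist) < 0.5 else "high-entropy"
    print(name, round(response_entropy(dist), 3), label)
```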
Contribution C. Comprehensive empirical characterization of LLM simulation capabilities

The authors conduct a systematic investigation addressing six research questions about LLM simulation, establishing baselines, examining how model characteristics affect performance, exploring sources of variance, and investigating practical implications including demographic group differences and correlations with other capabilities.

Retrieved papers compared: 10. Refutable candidates: 0.
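One reported practical finding under this contribution is a strong correlation between simulation ability and knowledge-intensive reasoning (MMLU-Pro, r = 0.939). The sketch below shows how such a correlation could be computed across models; all model names and scores are made-up placeholders, not values from the paper.

```python
# Hedged sketch: correlating per-model simulation scores with another
# capability benchmark. Scores below are placeholders, not reported data.
from statistics import correlation  # Pearson correlation, Python 3.10+

sim_scores = {"model_a": 28.0, "model_b": 33.5, "model_c": 40.8}   # hypothetical
mmlu_pro   = {"model_a": 45.0, "model_b": 58.0, "model_c": 72.0}   # hypothetical

models = sorted(sim_scores)
r = correlation([sim_scores[m] for m in models], [mmlu_pro[m] for m in models])
print(f"Pearson r between simulation score and MMLU-Pro: {r:.3f}")
```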

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution A. SIMBENCH: A large-scale standardized benchmark for LLM human behavior simulation

Contribution B. Discovery of alignment-simulation trade-off in instruction-tuned models

Contribution C. Comprehensive empirical characterization of LLM simulation capabilities