SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: human behavior simulation, large language models, benchmarking, computational social science, human-AI alignment, calibration, human-centered AI
Abstract:

Large language model (LLM) simulations of human behavior have the potential to revolutionize the social and behavioral sciences, if and only if they faithfully reflect real human behaviors. Current evaluations are fragmented, based on bespoke tasks and metrics, creating a patchwork of incomparable results. To address this, we introduce SimBench, the first large-scale, standardized benchmark for a robust, reproducible science of LLM simulation. By unifying 20 diverse datasets covering tasks from moral decision-making to economic choice across a large global participant pool, SimBench provides the necessary foundation to ask fundamental questions about when, how, and why LLM simulations succeed or fail. We show that, while even the best LLMs today have limited simulation ability (score: 40.80/100), performance scales log-linearly with model size. Simulation performance is not improved by increased inference-time compute. We discover an alignment-simulation trade-off: instruction-tuning improves performance on low-entropy (consensus) questions but degrades it on high-entropy (diverse) ones. Models particularly struggle when simulating specific demographic groups. Finally, we demonstrate that simulation ability correlates most strongly with deep, knowledge-intensive reasoning (MMLU-Pro, r=0.939). By making progress measurable, we aim to accelerate the development of more faithful LLM simulators.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces SimBench, a large-scale standardized benchmark unifying 20 diverse datasets to evaluate LLM simulation of human behavior across moral decision-making, economic choice, and other domains. It resides in the 'Comprehensive Benchmarks and Evaluation Frameworks' leaf, which contains only three papers total. This leaf sits within the broader 'Evaluation, Benchmarking, and Methodological Foundations' branch, indicating that systematic evaluation infrastructure for LLM behavior simulation is a relatively sparse but foundational research direction compared to more crowded application-specific or cognitive modeling areas.

The taxonomy reveals neighboring evaluation work in 'Domain-Specific Validation and Accuracy Assessment' (e.g., shopping behavior, trust tasks), which focuses on narrow contexts rather than broad coverage. Sibling papers in the same leaf include efforts targeting personalized digital replicas and general behavioral alignment. The broader field structure shows that while application domains (user behavior, driving, social networks) and cognitive modeling branches are well-populated, the infrastructure for standardized, cross-domain evaluation remains underdeveloped. SimBench bridges this gap by providing unified metrics where prior benchmarks addressed fragmented, bespoke tasks.

Of the 30 candidate papers examined (10 per contribution), only Contribution A (the benchmark itself) had a refutable candidate, suggesting that some prior work on standardized evaluation exists but is limited in scope. For Contributions B (the alignment-simulation trade-off) and C (the empirical characterization), 10 candidates each were examined with zero refutations, indicating these findings appear more novel within the search scope. The alignment-simulation trade-off, in which instruction-tuning helps consensus questions but harms diverse ones, and the log-linear scaling observation represent empirical discoveries not clearly anticipated by the examined literature, though the limited search scale means undiscovered prior work remains possible.

Based on top-30 semantic matches, the work appears to occupy a relatively novel position by systematically unifying evaluation across 20 datasets and uncovering the alignment-simulation trade-off. However, the search scope is modest, and the taxonomy shows active neighboring work on persona simulation, cognitive biases, and domain-specific validation that may contain relevant insights not captured here. The benchmark's standardization effort addresses a recognized gap in the field's fragmented evaluation landscape, though whether its specific design choices and empirical findings are entirely unprecedented requires broader literature coverage.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: Simulating human behaviors with large language models.

The field has grown into a rich landscape organized around several major branches. Strategic and Economic Behavior Simulation explores how LLMs replicate decision-making in game-theoretic settings such as repeated games and ultimatum bargaining (e.g., Repeated Games[1], Ultimatum Games[5]). Cognitive and Psychological Process Simulation investigates whether models can mirror human reasoning patterns, biases, and mental processes (e.g., Intuitive Behavior Biases[18], Models of Cognition[14]). Application-Specific Behavior Simulation targets domains like autonomous driving, user interaction, and healthcare (e.g., Driving Language Network[10], User Behavior Simulation[6]). Social Dynamics and Population-Level Simulation examines collective phenomena and opinion formation (e.g., Opinion Dynamics[36], Agentsociety[42]), while Persona and Role-Based Simulation focuses on creating agents with distinct identities and roles (e.g., Role Play[2], Mixture of Personas[19]). Evaluation, Benchmarking, and Methodological Foundations provides the infrastructure for rigorous assessment, and Critical Perspectives and Limitations addresses the risks and shortcomings of these approaches (e.g., Perils and Opportunities[11]).

Within this taxonomy, a particularly active line of work centers on comprehensive evaluation frameworks that systematically measure how well LLMs capture human-like behavior across diverse contexts. SimBench[0] sits squarely in this evaluation-focused cluster, offering a broad benchmark for assessing simulation fidelity. It contrasts with more narrowly scoped efforts like Digital Twins Benchmark[47], which targets personalized digital replicas, by aiming for wider coverage of behavioral dimensions. Meanwhile, works such as Choice Behavior[3] and Trust Behavior[4] probe specific facets of human decision-making, highlighting ongoing debates about whether LLMs genuinely replicate cognitive processes or merely reproduce surface-level patterns. The central tension across these branches revolves around balancing realism, generalizability, and ethical considerations, questions that SimBench[0] and its neighbors help to operationalize through systematic measurement.

Claimed Contributions

Contribution A. SIMBENCH: A large-scale standardized benchmark for LLM human behavior simulation

The authors present SIMBENCH, which unifies 20 diverse datasets covering tasks from moral decision-making to economic choice across a large global participant pool. This benchmark provides a standardized framework to rigorously measure and compare simulation fidelity across models, tasks, and populations.

Retrieved papers compared: 10. Refutable candidates: 1 (can refute).
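To make this contribution concrete, the sketch below shows one way a unified SimBench-style item and a simulation-fidelity score could be structured, assuming each item stores a question, its answer options, and the observed human response distribution, and that fidelity is scored by comparing a model's answer distribution to the human one via total variation distance. The field names, example values, and the specific distance are illustrative assumptions made here; the report does not specify SimBench's actual schema or metric.

```python
# Hedged sketch: a unified benchmark record and a 0-100 fidelity score.
# Field names and the total-variation-based score are assumptions for
# illustration; the report does not state SimBench's exact definitions.
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class SimItem:
    dataset: str                   # source dataset name (hypothetical)
    question: str                  # question text shown to participants
    options: Tuple[str, ...]       # answer options
    human_dist: Dict[str, float]   # observed human response distribution (sums to 1)

def fidelity_score(human_dist: Dict[str, float], model_dist: Dict[str, float]) -> float:
    """Return a 0-100 score; 100 means the model's answer distribution
    matches the human distribution exactly (1 - total variation distance)."""
    options = set(human_dist) | set(model_dist)
    tvd = 0.5 * sum(abs(human_dist.get(o, 0.0) - model_dist.get(o, 0.0)) for o in options)
    return 100.0 * (1.0 - tvd)

item = SimItem("ultimatum_game", "Accept a 20/80 split?", ("accept", "reject"),
               {"accept": 0.35, "reject": 0.65})
print(fidelity_score(item.human_dist, {"accept": 0.60, "reject": 0.40}))  # -> 75.0
```

Under this framing a perfect simulator scores 100, which is consistent with the 0-100 scale reported in the abstract, though not necessarily with the paper's exact scoring rule.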
Contribution B. Discovery of alignment-simulation trade-off in instruction-tuned models

The authors discover that instruction-tuning creates a fundamental trade-off where it improves simulation performance on consensus questions but degrades it on questions with diverse human responses. A causal analysis confirms this results from competing instruction-following and entropy-reduction effects.

Retrieved papers compared: 10. Refutable candidates: 0.
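As an illustration of how the consensus/diverse split behind this finding could be operationalized, the sketch below classifies questions by the Shannon entropy of their human response distributions. The normalization and the 0.5 threshold are assumptions made here for illustration, not details taken from the paper.

```python
# Hedged sketch: splitting questions into low-entropy (consensus) and
# high-entropy (diverse) items by the normalized Shannon entropy of the
# human response distribution. Threshold and normalization are assumptions.
import math
from typing import Dict

def response_entropy(dist: Dict[str, float]) -> float:
    """Normalized Shannon entropy in [0, 1]; 0 = full consensus, 1 = uniform."""
    probs = [p for p in dist.values() if p > 0]
    if len(probs) <= 1:
        return 0.0
    h = -sum(p * math.log(p) for p in probs)
    return h / math.log(len(dist))

consensus_q = {"yes": 0.95, "no": 0.05}   # most humans agree
diverse_q = {"yes": 0.50, "no": 0.50}     # humans are split
for name, dist in [("consensus", consensus_q), ("diverse", diverse_q)]:
    label = "low-entropy" if response_entropy(dist) < 0.5 else "high-entropy"
    print(name, round(response_entropy(dist), 3), label)
```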
Contribution C. Comprehensive empirical characterization of LLM simulation capabilities

The authors conduct a systematic investigation addressing six research questions about LLM simulation, establishing baselines, examining how model characteristics affect performance, exploring sources of variance, and investigating practical implications including demographic group differences and correlations with other capabilities.

Retrieved papers compared: 10. Refutable candidates: 0.
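One reported practical finding under this contribution is a strong correlation between simulation ability and knowledge-intensive reasoning (MMLU-Pro, r = 0.939). The sketch below shows how such a correlation could be computed across models; all model names and scores are made-up placeholders, not values from the paper.

```python
# Hedged sketch: correlating per-model simulation scores with another
# capability benchmark. Scores below are placeholders, not reported data.
from statistics import correlation  # Pearson correlation, Python 3.10+

sim_scores = {"model_a": 28.0, "model_b": 33.5, "model_c": 40.8}   # hypothetical
mmlu_pro   = {"model_a": 45.0, "model_b": 58.0, "model_c": 72.0}   # hypothetical

models = sorted(sim_scores)
r = correlation([sim_scores[m] for m in models], [mmlu_pro[m] for m in models])
print(f"Pearson r between simulation score and MMLU-Pro: {r:.3f}")
```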

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution A. SIMBENCH: A large-scale standardized benchmark for LLM human behavior simulation

Contribution B. Discovery of alignment-simulation trade-off in instruction-tuned models

Contribution C. Comprehensive empirical characterization of LLM simulation capabilities