SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors
Overview
Overall Novelty Assessment
The paper introduces SimBench, a large-scale standardized benchmark unifying 20 diverse datasets to evaluate LLM simulation of human behavior across moral decision-making, economic choice, and other domains. It resides in the 'Comprehensive Benchmarks and Evaluation Frameworks' leaf, which contains only three papers total. This leaf sits within the broader 'Evaluation, Benchmarking, and Methodological Foundations' branch, indicating that systematic evaluation infrastructure for LLM behavior simulation is a relatively sparse but foundational research direction compared to more crowded application-specific or cognitive modeling areas.
The taxonomy reveals neighboring evaluation work in 'Domain-Specific Validation and Accuracy Assessment' (e.g., shopping behavior, trust tasks), which focuses on narrow contexts rather than broad coverage. Sibling papers in the same leaf include efforts targeting personalized digital replicas and general behavioral alignment. The broader field structure shows that while application domains (user behavior, driving, social networks) and cognitive modeling branches are well-populated, the infrastructure for standardized, cross-domain evaluation remains underdeveloped. SimBench bridges this gap by providing unified metrics where prior benchmarks addressed fragmented, bespoke tasks.
Of the 30 candidate papers examined (10 per contribution), Contribution A (the benchmark itself) yielded 1 potentially refuting candidate, suggesting some prior work on standardized evaluation exists but is limited in scope. Contributions B (the alignment-simulation trade-off) and C (the empirical characterization) yielded zero refutations, indicating these findings appear more novel within the search scope. The alignment-simulation trade-off, where instruction-tuning helps consensus questions but harms diverse ones, and the log-linear scaling observation (illustrated in the sketch after this overview) represent empirical discoveries not clearly anticipated by the examined literature, though the limited search scale means undiscovered prior work remains possible.
Based on top-30 semantic matches, the work appears to occupy a relatively novel position by systematically unifying evaluation across 20 datasets and uncovering the alignment-simulation trade-off. However, the search scope is modest, and the taxonomy shows active neighboring work on persona simulation, cognitive biases, and domain-specific validation that may contain relevant insights not captured here. The benchmark's standardization effort addresses a recognized gap in the field's fragmented evaluation landscape, though whether its specific design choices and empirical findings are entirely unprecedented requires broader literature coverage.
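The log-linear scaling observation mentioned above can be made concrete with a simple least-squares fit of benchmark score against the logarithm of model size. The parameter counts and scores below are invented for illustration, not taken from the paper; only the fitting procedure is the point.

```python
import numpy as np

# Hypothetical (invented) benchmark scores for models of increasing size;
# log-linear scaling means the score grows linearly in log(parameter count).
params = np.array([1e9, 7e9, 13e9, 70e9, 405e9])   # parameter counts
scores = np.array([0.41, 0.49, 0.52, 0.59, 0.66])  # simulation scores (made up)

# Fit score = a * log10(params) + b; the slope a is the gain per 10x params.
a, b = np.polyfit(np.log10(params), scores, deg=1)
print(f"gain per 10x parameters: {a:.3f}, intercept: {b:.3f}")
print(f"extrapolated score at 1e12 params: {a * 12 + b:.3f}")
```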
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors present SimBench, which unifies 20 diverse datasets covering tasks from moral decision-making to economic choice across a large global participant pool. This benchmark provides a standardized framework to rigorously measure and compare simulation fidelity across models, tasks, and populations.
The authors discover that instruction-tuning creates a fundamental trade-off: it improves simulation performance on consensus questions but degrades it on questions with diverse human responses (see the sketch after this list). A causal analysis confirms that this results from competing instruction-following and entropy-reduction effects.
The authors conduct a systematic investigation addressing six research questions about LLM simulation, establishing baselines, examining how model characteristics affect performance, exploring sources of variance, and investigating practical implications including demographic group differences and correlations with other capabilities.
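To make the second contribution's trade-off concrete, the sketch below scores a model's answer distribution against a human answer distribution using total variation distance. The distributions and the scoring rule are illustrative assumptions, not SimBench's actual metric; the point is that a low-entropy "instruction-tuned" distribution matches a consensus question better, and a diverse question worse, than a flatter "base" distribution.

```python
import numpy as np

def simulation_score(human: np.ndarray, model: np.ndarray) -> float:
    """1 minus total variation distance between answer distributions
    (1.0 = the model reproduces the human distribution exactly)."""
    return 1.0 - 0.5 * float(np.abs(human - model).sum())

# A consensus question: most humans pick the first option.
consensus = np.array([0.90, 0.05, 0.05])
# A diverse question: human answers are spread across options.
diverse = np.array([0.40, 0.35, 0.25])

# A base model keeps some spread; an instruction-tuned model collapses
# probability mass onto one answer (the entropy-reduction effect).
base = np.array([0.70, 0.15, 0.15])
tuned = np.array([0.98, 0.01, 0.01])

for name, human in [("consensus", consensus), ("diverse", diverse)]:
    print(f"{name}: base={simulation_score(human, base):.2f} "
          f"tuned={simulation_score(human, tuned):.2f}")
# consensus: tuned wins (0.92 vs 0.80); diverse: tuned loses (0.42 vs 0.70).
```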
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[29] Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies
[47] How Far are LLMs from Being Our Digital Twins? A Benchmark for Persona-Based Behavior Chain Simulation
Contribution Analysis
Detailed comparisons for each claimed contribution
SimBench: A large-scale standardized benchmark for LLM human behavior simulation
The authors present SimBench, which unifies 20 diverse datasets covering tasks from moral decision-making to economic choice across a large global participant pool. This benchmark provides a standardized framework to rigorously measure and compare simulation fidelity across models, tasks, and populations.
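As a rough picture of what a standardized cross-dataset harness involves, the sketch below macro-averages per-question fidelity scores within and then across datasets. The dataset format, scoring function, and aggregation scheme are assumptions for illustration, not SimBench's published design.

```python
from statistics import mean

# Hypothetical format: each dataset maps question ids to a pair of
# (human answer distribution, model answer distribution).
Datasets = dict[str, dict[str, tuple[list[float], list[float]]]]

def tv_score(human: list[float], model: list[float]) -> float:
    """1 minus total variation distance; 1.0 is a perfect match."""
    return 1.0 - 0.5 * sum(abs(h - m) for h, m in zip(human, model))

def evaluate(datasets: Datasets) -> dict[str, float]:
    """Macro-average per-question scores within each dataset, then across
    datasets, so large datasets do not dominate the overall number."""
    per_dataset = {
        name: mean(tv_score(h, m) for h, m in questions.values())
        for name, questions in datasets.items()
    }
    per_dataset["overall"] = mean(per_dataset.values())
    return per_dataset

# Toy run with two one-question datasets.
toy: Datasets = {
    "moral_choice": {"q1": ([0.8, 0.2], [0.7, 0.3])},
    "econ_games":   {"q1": ([0.5, 0.5], [0.9, 0.1])},
}
print(evaluate(toy))  # roughly: {'moral_choice': 0.9, 'econ_games': 0.6, 'overall': 0.75}
```

Macro-averaging is one defensible choice here: it weights each dataset equally regardless of question count, which matters when unifying 20 datasets of very different sizes.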
[76] Benchmarking distributional alignment of large language models
[29] Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies
[68] Inadequacies of large language model benchmarks in the era of generative artificial intelligence
[69] Towards diverse behaviors: A benchmark for imitation learning with human demonstrations
[70] How Far Are LLMs from Believable AI? A Benchmark for Evaluating the Believability of Human Behavior Simulation
[71] A publicly available benchmark for assessing large language models' ability to predict how humans balance self-interest and the interest of others
[72] Structured, flexible, and robust: benchmarking and improving large language models towards more human-like behavior in out-of-distribution reasoning tasks
[73] CogBench: a large language model walks into a psychology lab
[74] How Different AI Chatbots Behave? Benchmarking Large Language Models in Behavioral Economics Games
[75] Be.FM: Open Foundation Models for Human Behavior
Discovery of alignment-simulation trade-off in instruction-tuned models
The authors discover that instruction-tuning creates a fundamental trade-off where it improves simulation performance on consensus questions but degrades it on questions with diverse human responses. A causal analysis confirms this results from competing instruction-following and entropy-reduction effects.
[51] Fine-tuning language models to find agreement among humans with diverse preferences
[52] Can LLMs Speak for Diverse People? Tuning LLMs via Debate to Generate Controllable Controversial Statements
[53] Moral Instruction Fine Tuning for Aligning LMs with Multiple Ethical Principles
[54] MetaAlign: Align large language models with diverse preferences during inference time
[55] Evaluating and Enhancing Japanese Large Language Models for Genetic Counseling Support: Comparative Study of Domain Adaptation and the Development of an Expert-Evaluated Dataset
[56] Self-agreement: a framework for fine-tuning language models to find agreement among diverse opinions
[57] Active Instruction Tuning for Large Language Models with Reference-Free Instruction Selection
[58] Evaluating and Enhancing Japanese Large Language Models for Genetic Counseling Support: Comparative Study of Domain Adaptation and the Development of an Expert-Evaluated Dataset (Preprint)
[59] The Homogenizing Engine: AI's Role in Standardizing Culture and the Path to Policy
[60] Large Language Model based Smart Contract Auditing with LLMBugScanner
Comprehensive empirical characterization of LLM simulation capabilities
The authors conduct a systematic investigation addressing six research questions about LLM simulation, establishing baselines, examining how model characteristics affect performance, exploring sources of variance, and investigating practical implications including demographic group differences and correlations with other capabilities.