Abstract:

Current evaluation of large language models relies predominantly on technical benchmarks that fail to capture how users actually experience these systems in practice. Even the most notable human preference evaluation approaches suffer from methodological limitations including unrepresentative sampling, superficial assessment depth, and single-metric reductionism that obscures the multidimensional nature of human-AI interaction quality. We introduce DIVERSE, a rigorous evaluation framework that addresses these limitations through demographically stratified sampling, multi-turn naturalistic conversations, and assessment across five human-centric dimensions. We collected conversations from 21,352 participants stratified across 22 demographic groups in the US and UK, evaluating 27 state-of-the-art language models through pairwise comparisons. Using a robust hierarchical Bradley-Terry-Davidson model alongside post-stratified demographic adjustments to census weights, we reveal insights unavailable to existing approaches: (1) clear performance hierarchies with Gemini-2.5-Pro achieving 97% probability of ranking first for overall preference, (2) quantification of significant preference heterogeneity, identifying user age as the primary factor and revealing failures in model generalization across populations, and (3) differential discriminative power across human-centric evaluation dimensions, with Trust, Ethics & Safety showing significantly higher tie rates than task performance metrics. Our framework demonstrates that meaningful evaluation requires moving beyond aggregate preference scores to understand the complex, demographic-specific patterns that determine real-world model preference. We release our complete dataset, interactive leaderboard, and evaluation framework to catalyse further research into more rigorous and equitable evaluation of language models.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces DIVERSE, a framework for evaluating large language models through demographically stratified sampling and multi-dimensional assessment. It resides in the 'Participatory and Representative Preference Datasets' leaf alongside three sibling papers (PRISM Dataset, PRISM Alignment Dataset, PRISM Alignment Project). This leaf sits within the broader 'Preference Collection and Alignment Methodologies' branch, which contains four distinct research directions. The taxonomy reveals this is a moderately populated area focused on inclusive data collection, distinct from the more crowded bias analysis branches containing multiple sub-categories.

The framework connects to neighboring research directions through its emphasis on representative sampling and preference modeling. Adjacent leaves include 'Group-Specific and Pluralistic Alignment Techniques' (four papers on training methods) and 'Synthetic Preference Generation via LLM Personas' (three papers using simulated respondents). The taxonomy's scope notes clarify boundaries: DIVERSE belongs in participatory datasets rather than alignment techniques because it focuses on evaluation infrastructure rather than model training. Nearby branches like 'Demographic Bias Analysis' (fourteen papers across four sub-categories) examine output biases, while DIVERSE concentrates on input preference collection.

Among twenty candidates examined across three contributions, the 'Large-Scale Demographically Stratified Dataset' contribution shows one refutable candidate among the ten examined, suggesting some overlap with existing participatory data collection efforts. The 'DIVERSE Framework' contribution was examined against ten candidates with zero refutations, indicating potential methodological novelty in combining stratified sampling with multi-turn evaluation. The 'Hierarchical Bayesian Bradley-Terry-Davidson Model' contribution was not examined against any candidates. The limited search scope (twenty total candidates) means these statistics reflect top semantic matches rather than exhaustive prior work coverage.

Based on examination of twenty semantically similar candidates, the framework appears to occupy a distinct methodological position within participatory evaluation approaches. The analysis reveals moderate prior work in demographically stratified datasets but limited overlap with the specific combination of multi-dimensional assessment and hierarchical modeling. However, the search scope represents a targeted sample rather than comprehensive field coverage, particularly for statistical modeling techniques that may exist outside the semantic neighborhood of demographic evaluation literature.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Papers: 1

Research Landscape Overview

Core task: Demographically aware human preference evaluation for large language models. The field addresses how LLMs can be aligned with diverse human values while accounting for demographic variation in preferences and perceptions.

The taxonomy reveals several major branches: Preference Collection and Alignment Methodologies focuses on gathering representative datasets and developing training techniques that respect demographic diversity, including participatory approaches like PRISM Dataset[1] and optimization methods such as Group Preference Optimization[3] and Maxmin RLHF[4]. Demographic Bias Analysis in LLM Outputs examines how models exhibit differential behavior across demographic groups, studying phenomena like age bias (LLM Age Bias[2]) and name-based stereotyping (Name Based Bias[13]). User Interaction and Preference Heterogeneity Analysis investigates how individual traits and personas shape preferences (Individual Traits Preferences[8], PersonaGym[32]), while Domain-Specific Demographic Evaluation applies these concerns to specialized contexts like healthcare (DiversityMedQA[48]) and content moderation (Target Substitution Moderation[15]). Evaluation Frameworks and Methodologies develops systematic approaches for measuring fairness and representation (HumBEL[5], Diverse Perspectives Evaluation[34]).

A particularly active tension emerges between aggregating diverse preferences into unified models versus preserving pluralistic viewpoints. Works like Modeling Human Subjectivity[12] and Pluralistic Values Tradeoffs[46] grapple with whether alignment should seek consensus or maintain heterogeneity. The Diverse Framework[0] sits within the Participatory and Representative Preference Datasets cluster, emphasizing inclusive data collection alongside neighbors PRISM Alignment Dataset[21] and PRISM Alignment Project[22].
While these PRISM efforts focus on building large-scale demographically annotated corpora, Diverse Framework[0] appears to provide methodological infrastructure for evaluating how well preferences from different demographic groups are represented and respected. This contrasts with optimization-focused approaches like Group Preference Optimization[3], which assumes preferences are already collected and concentrates on algorithmic fairness during training. Open questions persist around scalability of participatory methods, the granularity of demographic categories, and whether technical solutions can adequately address fundamentally social challenges in value alignment.

Claimed Contributions

DIVERSE Framework for Demographically Aware LLM Evaluation

The authors introduce DIVERSE, a methodology for human-centric AI evaluation that addresses three validity threats in existing approaches: sampling bias through demographically stratified recruitment, assessment depth through multi-turn naturalistic conversations, and metric reductionism through multidimensional evaluation across five human-centric dimensions.
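Demographically stratified recruitment can be sketched as quota allocation proportional to population shares. The strata names, shares, and recruitment target below are illustrative placeholders, not the paper's 22 actual demographic groups:

```python
from math import floor

# Hypothetical population shares for four illustrative strata
# (the paper uses 22 demographic groups; these are placeholders).
population_shares = {"18-34": 0.30, "35-54": 0.35, "55+": 0.25, "prefer not to say": 0.10}
target_n = 1000  # illustrative recruitment target

# Proportional allocation: each stratum's quota mirrors its population share.
quotas = {g: floor(share * target_n) for g, share in population_shares.items()}

# Flooring leaves a small shortfall; top up the largest strata so quotas
# sum exactly to the target.
shortfall = target_n - sum(quotas.values())
for g in sorted(population_shares, key=population_shares.get, reverse=True)[:shortfall]:
    quotas[g] += 1

print(quotas)
```

In practice a panel provider fills each quota independently, so under-represented groups are recruited until their stratum is full rather than until an overall sample size is reached.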

Retrieved papers: 10

Large-Scale Demographically Stratified Dataset

The authors collected 119,890 multi-dimensional human judgments from 23,404 participants stratified across 22 demographic groups in the US and UK, evaluating 28 state-of-the-art models. The dataset includes structured metadata characterising conversational dynamics, task properties, and interaction outcomes.
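A judgment record with this kind of structured metadata might take a shape like the following; every field name here is hypothetical, inferred only from the categories the description mentions (conversational dynamics, task properties, interaction outcomes), not the released dataset's actual schema:

```python
from dataclasses import dataclass, asdict

@dataclass
class PairwiseJudgment:
    # All field names are illustrative, not the released dataset's schema.
    participant_id: str
    demographic_group: str   # one of the 22 strata
    model_a: str
    model_b: str
    dimension: str           # e.g. one of the five human-centric dimensions
    outcome: str             # "a", "b", or "tie"
    # Conversational dynamics / task properties / interaction outcomes:
    num_turns: int = 1
    task_type: str = "unspecified"
    completed: bool = True

j = PairwiseJudgment("p001", "US-18-34", "model-x", "model-y",
                     "Trust, Ethics & Safety", "tie")
print(asdict(j)["outcome"])
```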

Retrieved papers: 10 (1 refutable)

Hierarchical Bayesian Bradley-Terry-Davidson Model with Post-Stratification

The authors develop a hierarchical Bayesian implementation of the Bradley-Terry-Davidson model that converts pairwise comparisons into continuous skill ratings while capturing demographic heterogeneity through a factorised structure. The model uses post-stratification to census data to ensure representative population-level estimates.
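As a minimal, non-Bayesian sketch of the underlying machinery: the Davidson extension of Bradley-Terry gives each model a latent skill and adds a tie parameter ν, with P(tie between i and j) ∝ ν·√(p_i·p_j). The toy data and maximum-likelihood fit below are illustrative assumptions, not the authors' hierarchical Bayesian implementation; the last lines sketch post-stratification by reweighting hypothetical per-group estimates from sample shares to census shares:

```python
import numpy as np
from scipy.optimize import minimize

# Toy pairwise outcomes between three hypothetical models:
# (index_a, index_b, result) with result in {"a", "b", "tie"}.
data = [(0, 1, "a"), (0, 1, "a"), (0, 1, "tie"),
        (0, 2, "a"), (0, 2, "a"), (0, 2, "tie"),
        (1, 2, "a"), (1, 2, "a"), (1, 2, "b")]
n = 3

def neg_log_lik(params):
    """Davidson (1970) tie-aware Bradley-Terry likelihood.
    params[:n] are log-skills; params[n] is the log tie parameter nu."""
    beta, log_nu = params[:n], params[n]
    nll = 0.0
    for a, b, res in data:
        sa, sb = np.exp(beta[a]), np.exp(beta[b])
        tie = np.exp(log_nu) * np.sqrt(sa * sb)
        denom = sa + sb + tie
        nll -= np.log({"a": sa, "b": sb, "tie": tie}[res] / denom)
    return nll

# A small ridge on the log-skills fixes the location indeterminacy
# (only skill differences are identified).
fit = minimize(lambda p: neg_log_lik(p) + 1e-2 * np.dot(p[:n], p[:n]),
               x0=np.zeros(n + 1), method="BFGS")
skills = fit.x[:n] - fit.x[:n].mean()  # centre for readability

# Post-stratification sketch: reweight hypothetical per-group preference
# estimates from the sample's group shares to census shares.
est = np.array([0.62, 0.55, 0.48])       # P(prefer model A) per age group
sample_w = np.array([0.50, 0.30, 0.20])  # sample over-represents younger users
census_w = np.array([0.30, 0.35, 0.35])  # target population weights
print(f"raw {est @ sample_w:.3f} -> post-stratified {est @ census_w:.3f}")
```

A full hierarchical treatment would instead place priors over group-level skill offsets and sample the posterior (e.g. with MCMC), rather than maximising a penalised likelihood.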

Retrieved papers: 0

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Unpacking Human Preference for LLMs: Demographically Aware Evaluation with the Diverse Framework | Novelty Validation