EigenBench: A Comparative Behavioral Measure of Value Alignment
Overview
Overall Novelty Assessment
The paper proposes EigenBench, a black-box method for comparatively benchmarking language models' values using ensemble consensus judgments aggregated via EigenTrust. It resides in the 'Comparative and Consensus-Based Evaluation' leaf alongside three sibling papers that similarly use comparative judgments or pairwise evaluation to measure alignment without ground truth labels. This leaf sits within the broader 'Alignment Evaluation Frameworks' branch, which contains five distinct evaluation approaches across twenty papers, indicating a moderately populated research direction with active exploration of diverse evaluation methodologies.
The taxonomy reveals neighboring evaluation approaches that contextualize EigenBench's positioning. The sibling 'Domain-Specific Alignment Benchmarks' leaf focuses on ethics and mental health dimensions, while 'Cultural and National Value Alignment Assessment' addresses geographically diverse value systems through survey simulation. The 'Contextual and Scenario-Based Value Assessment' leaf emphasizes psychological theory-grounded frameworks across real-world contexts. EigenBench diverges from these by prioritizing ensemble consensus over domain specificity or cultural targeting, instead offering a general-purpose comparative framework applicable across value systems defined by constitutions.
Among the thirty candidates examined, none clearly refutes the three core contributions. For the EigenBench method itself, ten candidates were examined with zero refutable overlaps, suggesting limited prior work on constitution-guided ensemble-consensus evaluation. The low-rank Bradley-Terry-Davidson model with judge lenses likewise shows no refutation among its ten candidates, indicating potential novelty in this specific modeling approach. The validation framework, which demonstrates recovery of objective rankings without ground truth, was also examined against ten candidates without refutation. These statistics reflect a focused semantic-search scope rather than exhaustive coverage, leaving open the possibility of relevant work beyond the examined set.
Based on the limited search scope of thirty semantically similar papers, the contributions appear to occupy a relatively unexplored methodological niche within comparative evaluation. The taxonomy structure confirms that while alignment evaluation is an active area with multiple competing approaches, the specific combination of ensemble consensus, constitution-based guidance, and EigenTrust aggregation has not been prominently addressed in the examined literature. However, the analysis cannot rule out relevant work outside the top-thirty semantic matches or in adjacent research communities not captured by this search strategy.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce EigenBench, a method that uses an ensemble of models to judge each other's outputs across scenarios according to a constitution, then aggregates these judgments via EigenTrust to produce scores quantifying each model's alignment to the given value system. The method requires no ground truth labels and is designed for subjective traits where reasonable judges may disagree.
The authors develop a low-rank Bradley-Terry-Davidson model that learns vector embeddings (model dispositions and judge lenses) in a latent space rather than scalar strengths. This allows the method to capture multiple latent aspects of a constitution and how different judges interpret those aspects, enabling richer comparisons of model dispositions.
The authors show that EigenBench can recover known model rankings on the GPQA benchmark (a quantitative task with ground truth) using only peer judgments and no ground truth labels. This validation supports the viability of EigenBench for evaluating subjective values, where no ground truth exists.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[7] Aligning with human judgement: The role of pairwise preference in large language model evaluators
[17] Who validates the validators? Aligning LLM-assisted evaluation of LLM outputs with human preferences
[27] Fairer preferences elicit improved human-aligned large language model judgments
Contribution Analysis
Detailed comparisons for each claimed contribution
EigenBench: a black-box method for comparatively benchmarking language models' values
The authors introduce EigenBench, a method that uses an ensemble of models to judge each other's outputs across scenarios according to a constitution, then aggregates these judgments via EigenTrust to produce scores quantifying each model's alignment to the given value system. The method requires no ground truth labels and is designed for subjective traits where reasonable judges may disagree.
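The aggregation step can be illustrated with a minimal sketch of EigenTrust-style scoring: a nonnegative matrix of pairwise judgment weights is row-normalized and its stationary distribution is found by power iteration. This is a sketch under stated assumptions, not the paper's implementation; the `eigentrust_scores` helper, the damping term, and the toy judgment matrix are all illustrative.

```python
import numpy as np

def eigentrust_scores(judgments, alpha=0.0, tol=1e-10, max_iter=1000):
    """Aggregate pairwise judgment weights into global scores, EigenTrust-style.

    judgments[i, j] >= 0 is how favorably judge i rates model j.
    Returns the stationary distribution of the row-normalized matrix.
    """
    J = np.asarray(judgments, dtype=float)
    n = J.shape[0]
    # Row-normalize so each judge's ratings sum to 1.
    C = J / J.sum(axis=1, keepdims=True)
    # Optional damping toward a uniform prior (a PageRank-like teleport term).
    C = (1 - alpha) * C + alpha / n
    t = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        t_next = C.T @ t  # judges weighted by their own current scores
        if np.linalg.norm(t_next - t, 1) < tol:
            break
        t = t_next
    return t

# Toy example: three models judging one another (diagonal = no self-judgment).
scores = eigentrust_scores(np.array([[0.0, 2.0, 1.0],
                                     [3.0, 0.0, 1.0],
                                     [2.0, 2.0, 0.0]]))
```

Because the fixed point weights each judge's opinion by that judge's own score, models trusted by highly trusted judges rise in the ranking, which is the consensus property the method relies on.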
[51] Jailbreaking black box large language models in twenty queries
[52] RewardBench: Evaluating reward models for language modeling
[53] Universal and Transferable Adversarial Attacks on Aligned Language Models
[54] Aligning black-box language models with human judgments
[55] Misaligning Reasoning with Answers: A Framework for Assessing LLM CoT Robustness
[56] Open Sesame! Universal Black Box Jailbreaking of Large Language Models
[57] Rethinking the Role of Proxy Rewards in Language Model Alignment
[58] Black-Box Prompt Optimization: Aligning Large Language Models without Model Training
[59] Black box warning: large language models and the future of infectious diseases consultation
[60] Semantic and factual alignment for trustworthy large language model outputs
Low-rank Bradley-Terry-Davidson model with judge lenses and model dispositions
The authors develop a low-rank Bradley-Terry-Davidson model that learns vector embeddings (model dispositions and judge lenses) in a latent space rather than scalar strengths. This allows the method to capture multiple latent aspects of a constitution and how different judges interpret those aspects, enabling richer comparisons of model dispositions.
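The low-rank idea can be sketched as follows: each model carries a latent disposition vector, each judge a lens vector, and a model's judge-specific strength is their inner product, plugged into a Davidson model that also allows ties. The `btd_probs` helper, the example vectors, and the tie parameter value are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def btd_probs(d_i, d_j, lens, nu=0.5):
    """Davidson-model probabilities (i wins, tie, j wins) under one judge.

    d_i, d_j : latent disposition vectors of the two models being compared.
    lens     : the judge's lens vector; a model's strength through this
               judge is the inner product <disposition, lens>.
    nu       : Davidson tie parameter (larger -> ties more likely).
    """
    s_i = float(np.dot(d_i, lens))
    s_j = float(np.dot(d_j, lens))
    w_i, w_j = np.exp(s_i), np.exp(s_j)
    w_tie = nu * np.sqrt(w_i * w_j)  # Davidson's tie term
    z = w_i + w_j + w_tie
    return w_i / z, w_tie / z, w_j / z

# Two judges with different lenses can rank the same pair differently,
# which a scalar-strength Bradley-Terry model cannot express.
d_a, d_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
lens_1, lens_2 = np.array([2.0, 0.0]), np.array([0.0, 2.0])
p_a_under_1, _, _ = btd_probs(d_a, d_b, lens_1)  # lens 1 favors model a
p_a_under_2, _, _ = btd_probs(d_a, d_b, lens_2)  # lens 2 favors model b
```

The point of the vector form is visible in the example: with scalar strengths, every judge must agree on who is stronger, whereas lenses let judges weight different latent aspects of the constitution.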
[71] Modeling the plurality of human preferences via ideal points
[72] Safe imitation learning via fast bayesian reward inference from preferences
[73] Pairwise Comparisons without Stochastic Transitivity: Model, Theory and Applications
[74] Uncertainty Quantification for Ranking with Heterogeneous Preferences
[75] Matrix estimation by universal singular value thresholding
[76] Probabilistic latent preference analysis for collaborative filtering
[77] An intransitivity model for matchup and pairwise comparison
[78] A graph theoretic approach for preference learning with feature information
[79] On the structure of parametric tournaments with application to ranking from pairwise comparisons
[80] Landmark ordinal embedding
Validation framework demonstrating EigenBench recovers objective rankings without ground truth
The authors show that EigenBench can recover known model rankings on the GPQA benchmark (a quantitative task with ground truth) using only peer judgments and no ground truth labels. This validation supports the viability of EigenBench for evaluating subjective values, where no ground truth exists.
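A validation of this kind can be sketched as a rank-correlation check: compare the ordering induced by peer-judgment scores against the ordering induced by known benchmark accuracies. The `spearman_rho` helper and all numbers below are hypothetical illustrations, not the paper's results.

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation between two score vectors (assumes no ties)."""
    rx = np.argsort(np.argsort(x)).astype(float)  # ranks of x
    ry = np.argsort(np.argsort(y)).astype(float)  # ranks of y
    rx -= rx.mean()
    ry -= ry.mean()
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))

# Hypothetical peer-judgment scores vs. known accuracies for four models.
peer_scores   = np.array([0.31, 0.18, 0.27, 0.24])
gpqa_accuracy = np.array([0.52, 0.29, 0.46, 0.38])
rho = spearman_rho(peer_scores, gpqa_accuracy)  # 1.0: identical ordering
```

A rank correlation near 1 would indicate that peer judgments alone recover the ground-truth ordering, which is the evidence the validation framework seeks before applying the method to subjective traits.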