EigenBench: A Comparative Behavioral Measure of Value Alignment

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: value alignment, Bradley-Terry model, EigenTrust, model disposition, constitutional AI
Abstract:

Aligning AI with human values is a pressing unsolved problem. To address the lack of quantitative metrics for value alignment, we propose EigenBench: a black-box method for comparatively benchmarking language models’ values. Given an ensemble of models, a constitution describing a value system, and a dataset of scenarios, our method returns a vector of scores quantifying each model’s alignment to the given constitution. To produce these scores, each model judges the outputs of other models across many scenarios, and these judgments are aggregated with EigenTrust (Kamvar et al., 2003), yielding scores that reflect a weighted consensus judgment of the whole ensemble. EigenBench uses no ground truth labels, as it is designed to quantify subjective traits for which reasonable judges may disagree on the correct label. Hence, to validate our method, we collect human judgments on the same ensemble of models and show that EigenBench’s judgments align closely with those of human evaluators. We further demonstrate that EigenBench can recover model rankings on the GPQA benchmark without access to objective labels, supporting its viability as a framework for evaluating subjective values for which no ground truths exist.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes the paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes EigenBench, a black-box method for comparatively benchmarking language models' values using ensemble consensus judgments aggregated via EigenTrust. It resides in the 'Comparative and Consensus-Based Evaluation' leaf alongside three sibling papers that similarly use comparative judgments or pairwise evaluation to measure alignment without ground truth labels. This leaf sits within the broader 'Alignment Evaluation Frameworks' branch, which contains five distinct evaluation approaches across twenty papers, indicating a moderately populated research direction with active exploration of diverse evaluation methodologies.

The taxonomy reveals neighboring evaluation approaches that contextualize EigenBench's positioning. The sibling 'Domain-Specific Alignment Benchmarks' leaf focuses on ethics and mental health dimensions, while 'Cultural and National Value Alignment Assessment' addresses geographically diverse value systems through survey simulation. The 'Contextual and Scenario-Based Value Assessment' leaf emphasizes psychological theory-grounded frameworks across real-world contexts. EigenBench diverges from these by prioritizing ensemble consensus over domain specificity or cultural targeting, instead offering a general-purpose comparative framework applicable across value systems defined by constitutions.

Among the thirty candidates examined, none clearly refutes the three core contributions. For the EigenBench method itself, ten candidates were examined with zero refutable overlaps, suggesting limited prior work on constitution-guided ensemble consensus evaluation. The low-rank Bradley-Terry-Davidson model with judge lenses likewise shows no refutation among its ten candidates, indicating potential novelty in this specific modeling approach. The validation framework, which demonstrates recovery of objective rankings without ground truth, was also compared against ten candidates without refutation. These statistics reflect a focused semantic search scope rather than exhaustive coverage, leaving open the possibility of relevant work beyond the examined set.

Based on the limited search scope of thirty semantically similar papers, the contributions appear to occupy a relatively unexplored methodological niche within comparative evaluation. The taxonomy structure confirms that while alignment evaluation is an active area with multiple competing approaches, the specific combination of ensemble consensus, constitution-based guidance, and EigenTrust aggregation has not been prominently addressed in the examined literature. However, the analysis cannot rule out relevant work outside the top-thirty semantic matches or in adjacent research communities not captured by this search strategy.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 30
Refutable papers: 0

Research Landscape Overview

Core task: Quantifying subjective value alignment in language models. The field has evolved into a rich ecosystem of interconnected research directions. At the foundation, Preference Learning and Reward Modeling Foundations (e.g., Preference Learning Survey[5], Rethinking Reward Modeling[3]) establish how to capture human preferences, while Alignment Optimization Techniques refine training methods to steer model behavior. Diverse and Personalized Alignment addresses the challenge of heterogeneous user values (Personalizing Alignment[1], Modeling Subjectivity[10]), recognizing that alignment cannot be one-size-fits-all. Alignment Evaluation Frameworks develop metrics and benchmarks to assess whether models truly reflect intended values, complemented by Theoretical Foundations that formalize what alignment means conceptually. Specialized Alignment Applications tackle domain-specific challenges (Cultural Value Alignment[13], Mental Health Values[46]), while Advanced Alignment Methodologies and System Development translate research into practical implementations.

A central tension runs through the literature: how to evaluate alignment when preferences are inherently subjective and context-dependent. Some works focus on consensus-based metrics that aggregate judgments (Shared Human Values[6], Validating Validators[17]), while others emphasize capturing diversity and individual variation (Fairer Preferences[27], Pairwise Preference[7]).

EigenBench[0] sits within the Comparative and Consensus-Based Evaluation cluster, proposing a quantitative framework for measuring value alignment that complements neighboring approaches. Unlike Validating Validators[17], which examines the reliability of evaluation systems themselves, or Fairer Preferences[27], which addresses demographic representation in preference data, EigenBench[0] focuses on extracting stable value dimensions from subjective judgments. This positions it as a methodological contribution to understanding how collective preferences can be systematically quantified, bridging evaluation rigor with the recognition that values vary across populations.

Claimed Contributions

EigenBench: a black-box method for comparatively benchmarking language models' values

The authors introduce EigenBench, a method that uses an ensemble of models to judge each other's outputs across scenarios according to a constitution, then aggregates these judgments via EigenTrust to produce scores quantifying each model's alignment to the given value system. The method requires no ground truth labels and is designed for subjective traits where reasonable judges may disagree.

10 retrieved papers
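To make the aggregation step concrete, the sketch below shows EigenTrust-style scoring (Kamvar et al., 2003) on a toy judgment matrix. The numbers are illustrative, not data from the paper: `local[i, j]` stands for how favorably judge model i rated model j's outputs, and the global scores are the principal left eigenvector of the row-normalized matrix, found by power iteration.

```python
import numpy as np

# Hypothetical aggregate judgments: local[i, j] = how favorably judge
# model i rated model j's outputs across scenarios (illustrative values).
local = np.array([
    [0.0, 8.0, 9.0],
    [2.0, 0.0, 6.0],
    [1.0, 4.0, 0.0],
])

# Normalize each row so judge i's trust distributes over the other models.
C = local / local.sum(axis=1, keepdims=True)

# EigenTrust aggregation: the global score vector is the principal left
# eigenvector of C, computed by power iteration t <- C^T t.
t = np.full(len(C), 1.0 / len(C))
for _ in range(200):
    t = C.T @ t
    t /= t.sum()  # keep scores on the probability simplex
```

At convergence, each model's score is the trust-weighted consensus of the whole ensemble: judgments from highly scored models count for more, which is the property that distinguishes EigenTrust from a plain vote average.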
Low-rank Bradley-Terry-Davidson model with judge lenses and model dispositions

The authors develop a low-rank Bradley-Terry-Davidson model that learns vector embeddings (model dispositions and judge lenses) in a latent space rather than scalar strengths. This allows the method to capture multiple latent aspects of a constitution and how different judges interpret those aspects, enabling richer comparisons of model dispositions.

10 retrieved papers
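The description above can be sketched as follows, under stated assumptions: each model i gets a disposition vector and each judge k a lens vector, a model's strength under a judge is their inner product, and ties are handled with a Davidson term. The parameter values below are random placeholders standing in for quantities that would be fit by maximum likelihood; the function name `btd_probs` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
r = 2                    # latent rank: number of constitutional aspects
n_models, n_judges = 4, 3

# Placeholder parameters (in practice learned from pairwise judgments):
D = rng.normal(size=(n_models, r))   # model dispositions
L = rng.normal(size=(n_judges, r))   # judge lenses
nu = 0.5                              # Davidson tie parameter

def btd_probs(i, j, k):
    """P(i wins), P(j wins), P(tie) when judge k compares models i and j."""
    si, sj = L[k] @ D[i], L[k] @ D[j]    # strengths through judge k's lens
    wi, wj = np.exp(si), np.exp(sj)
    tie = nu * np.sqrt(wi * wj)          # Davidson's geometric-mean tie term
    z = wi + wj + tie
    return wi / z, wj / z, tie / z
```

Because strengths are inner products of lenses and dispositions, two judges can rank the same pair of models differently: a judge whose lens weights one latent aspect heavily will prefer the model whose disposition is strong on that aspect, which is exactly the multi-aspect behavior a scalar-strength Bradley-Terry model cannot express.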
Validation framework demonstrating EigenBench recovers objective rankings without ground truth

The authors show that EigenBench can recover known model rankings on the GPQA benchmark (a quantitative task with ground truth) using only peer judgments and no ground truth labels. This validation supports the viability of EigenBench for evaluating subjective values where no ground truths exist.

10 retrieved papers
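A minimal sketch of this validation step: compare the peer-consensus scores against known benchmark accuracies by rank correlation. The score and accuracy values below are invented for illustration (they are not the paper's results), and the tie-free Spearman formula is used for simplicity.

```python
def rankdata(x):
    # Ranks 0..n-1 by value; assumes no ties, for simplicity.
    order = sorted(range(len(x)), key=lambda i: x[i])
    ranks = [0] * len(x)
    for r, i in enumerate(order):
        ranks[i] = r
    return ranks

def spearman(a, b):
    # Tie-free Spearman rank correlation: 1 - 6*sum(d^2) / (n*(n^2-1)).
    ra, rb = rankdata(a), rankdata(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Illustrative numbers only: peer-consensus scores vs. ground-truth accuracy.
eigenbench_scores = [0.31, 0.12, 0.22, 0.35]
gpqa_accuracy     = [0.48, 0.29, 0.41, 0.55]

rho = spearman(eigenbench_scores, gpqa_accuracy)
print(rho)  # 1.0 here: the peer-judgment ranking matches the ground truth
```

A correlation near 1 on a task with known answers is what licenses the inference the contribution claims: if peer consensus recovers objective rankings where they exist, its rankings on subjective constitutions are at least plausible.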

