EigenBench: A Comparative Behavioral Measure of Value Alignment
Overview
Overall Novelty Assessment
The paper proposes EigenBench, a black-box method for comparatively benchmarking language models' values using ensemble consensus judgments aggregated via EigenTrust. It resides in the 'Comparative and Consensus-Based Evaluation' leaf alongside three sibling papers that similarly use comparative judgments or pairwise evaluation to measure alignment without ground truth labels. This leaf sits within the broader 'Alignment Evaluation Frameworks' branch, which contains five distinct evaluation approaches across twenty papers, indicating a moderately populated research direction with active exploration of diverse evaluation methodologies.
The taxonomy reveals neighboring evaluation approaches that contextualize EigenBench's positioning. The sibling 'Domain-Specific Alignment Benchmarks' leaf focuses on ethics and mental health dimensions, while 'Cultural and National Value Alignment Assessment' addresses geographically diverse value systems through survey simulation. The 'Contextual and Scenario-Based Value Assessment' leaf emphasizes psychological theory-grounded frameworks across real-world contexts. EigenBench diverges from these by prioritizing ensemble consensus over domain specificity or cultural targeting, instead offering a general-purpose comparative framework applicable across value systems defined by constitutions.
Among the thirty candidates examined, none clearly refutes the three core contributions. For the EigenBench method itself, ten candidates were examined with zero refutable overlaps, suggesting limited prior work on constitution-guided ensemble-consensus evaluation. The low-rank Bradley-Terry-Davidson model with judge lenses likewise shows no refutation among its ten candidates, indicating potential novelty in this specific modeling approach. The validation framework, which demonstrates recovery of objective rankings without ground truth, was also examined against ten candidates without refutation. These statistics reflect a focused semantic-search scope rather than exhaustive coverage, leaving open the possibility of relevant work beyond the examined set.
Based on the limited search scope of thirty semantically similar papers, the contributions appear to occupy a relatively unexplored methodological niche within comparative evaluation. The taxonomy structure confirms that while alignment evaluation is an active area with multiple competing approaches, the specific combination of ensemble consensus, constitution-based guidance, and EigenTrust aggregation has not been prominently addressed in the examined literature. However, the analysis cannot rule out relevant work outside the top-thirty semantic matches or in adjacent research communities not captured by this search strategy.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce EigenBench, a method that uses an ensemble of models to judge each other's outputs across scenarios according to a constitution, then aggregates these judgments via EigenTrust to produce scores quantifying each model's alignment to the given value system. The method requires no ground truth labels and is designed for subjective traits where reasonable judges may disagree.
The authors develop a low-rank Bradley-Terry-Davidson model that learns vector embeddings (model dispositions and judge lenses) in a latent space rather than scalar strengths. This allows the method to capture multiple latent aspects of a constitution and how different judges interpret those aspects, enabling richer comparisons of model dispositions.
The authors show that EigenBench can recover known model rankings on the GPQA benchmark (a quantitative task with ground truth) using only peer judgments and no ground truth labels. This validation supports the viability of EigenBench for evaluating subjective values, where no ground truth exists.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[7] Aligning with human judgement: The role of pairwise preference in large language model evaluators
[17] Who validates the validators? Aligning LLM-assisted evaluation of LLM outputs with human preferences
[27] Fairer preferences elicit improved human-aligned large language model judgments
Contribution Analysis
Detailed comparisons for each claimed contribution
EigenBench: a black-box method for comparatively benchmarking language models' values
The authors introduce EigenBench, a method that uses an ensemble of models to judge each other's outputs across scenarios according to a constitution, then aggregates these judgments via EigenTrust to produce scores quantifying each model's alignment to the given value system. The method requires no ground truth labels and is designed for subjective traits where reasonable judges may disagree.
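The aggregation step can be illustrated with a minimal sketch of EigenTrust-style scoring: a nonnegative matrix of pairwise judgment weights is row-normalized and its stationary distribution is found by power iteration. This is a sketch under stated assumptions, not the paper's implementation; the `eigentrust_scores` helper, the damping term, and the toy judgment matrix are all illustrative.

```python
import numpy as np

def eigentrust_scores(judgments, alpha=0.0, tol=1e-10, max_iter=1000):
    """Aggregate pairwise judgment weights into global scores, EigenTrust-style.

    judgments[i, j] >= 0 is how favorably judge i rates model j.
    Returns the stationary distribution of the row-normalized matrix.
    """
    J = np.asarray(judgments, dtype=float)
    n = J.shape[0]
    # Row-normalize so each judge's ratings sum to 1.
    C = J / J.sum(axis=1, keepdims=True)
    # Optional damping toward a uniform prior (a PageRank-like teleport term).
    C = (1 - alpha) * C + alpha / n
    t = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        t_next = C.T @ t  # judges weighted by their own current scores
        if np.linalg.norm(t_next - t, 1) < tol:
            break
        t = t_next
    return t

# Toy example: three models judging one another (diagonal = no self-judgment).
scores = eigentrust_scores(np.array([[0.0, 2.0, 1.0],
                                     [3.0, 0.0, 1.0],
                                     [2.0, 2.0, 0.0]]))
```

Because the fixed point weights each judge's opinion by that judge's own score, models trusted by highly trusted judges rise in the ranking, which is the consensus property the method relies on.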
[51] Jailbreaking black box large language models in twenty queries
[52] RewardBench: Evaluating reward models for language modeling
[53] Universal and Transferable Adversarial Attacks on Aligned Language Models
[54] Aligning black-box language models with human judgments
[55] Misaligning Reasoning with Answers: A Framework for Assessing LLM CoT Robustness
[56] Open Sesame! Universal Black Box Jailbreaking of Large Language Models
[57] Rethinking the Role of Proxy Rewards in Language Model Alignment
[58] Black-Box Prompt Optimization: Aligning Large Language Models without Model Training
[59] Black box warning: large language models and the future of infectious diseases consultation
[60] Semantic and factual alignment for trustworthy large language model outputs
Low-rank Bradley-Terry-Davidson model with judge lenses and model dispositions
The authors develop a low-rank Bradley-Terry-Davidson model that learns vector embeddings (model dispositions and judge lenses) in a latent space rather than scalar strengths. This allows the method to capture multiple latent aspects of a constitution and how different judges interpret those aspects, enabling richer comparisons of model dispositions.
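The low-rank idea can be sketched as follows: each model carries a latent disposition vector, each judge a lens vector, and a model's judge-specific strength is their inner product, plugged into a Davidson model that also allows ties. The `btd_probs` helper, the example vectors, and the tie parameter value are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def btd_probs(d_i, d_j, lens, nu=0.5):
    """Davidson-model probabilities (i wins, tie, j wins) under one judge.

    d_i, d_j : latent disposition vectors of the two models being compared.
    lens     : the judge's lens vector; a model's strength through this
               judge is the inner product <disposition, lens>.
    nu       : Davidson tie parameter (larger -> ties more likely).
    """
    s_i = float(np.dot(d_i, lens))
    s_j = float(np.dot(d_j, lens))
    w_i, w_j = np.exp(s_i), np.exp(s_j)
    w_tie = nu * np.sqrt(w_i * w_j)  # Davidson's tie term
    z = w_i + w_j + w_tie
    return w_i / z, w_tie / z, w_j / z

# Two judges with different lenses can rank the same pair differently,
# which a scalar-strength Bradley-Terry model cannot express.
d_a, d_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
lens_1, lens_2 = np.array([2.0, 0.0]), np.array([0.0, 2.0])
p_a_under_1, _, _ = btd_probs(d_a, d_b, lens_1)  # lens 1 favors model a
p_a_under_2, _, _ = btd_probs(d_a, d_b, lens_2)  # lens 2 favors model b
```

The point of the vector form is visible in the example: with scalar strengths, every judge must agree on who is stronger, whereas lenses let judges weight different latent aspects of the constitution.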
[71] Modeling the plurality of human preferences via ideal points
[72] Safe imitation learning via fast bayesian reward inference from preferences
[73] Pairwise Comparisons without Stochastic Transitivity: Model, Theory and Applications
[74] Uncertainty Quantification for Ranking with Heterogeneous Preferences
[75] Matrix estimation by universal singular value thresholding
[76] Probabilistic latent preference analysis for collaborative filtering
[77] An intransitivity model for matchup and pairwise comparison
[78] A graph theoretic approach for preference learning with feature information
[79] On the structure of parametric tournaments with application to ranking from pairwise comparisons
[80] Landmark ordinal embedding
Validation framework demonstrating EigenBench recovers objective rankings without ground truth
The authors show that EigenBench can recover known model rankings on the GPQA benchmark (a quantitative task with ground truth) using only peer judgments and no ground truth labels. This validation supports the viability of EigenBench for evaluating subjective values, where no ground truth exists.
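A validation of this kind can be sketched as a rank-correlation check: compare the ordering induced by peer-judgment scores against the ordering induced by known benchmark accuracies. The `spearman_rho` helper and all numbers below are hypothetical illustrations, not the paper's results.

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation between two score vectors (assumes no ties)."""
    rx = np.argsort(np.argsort(x)).astype(float)  # ranks of x
    ry = np.argsort(np.argsort(y)).astype(float)  # ranks of y
    rx -= rx.mean()
    ry -= ry.mean()
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))

# Hypothetical peer-judgment scores vs. known accuracies for four models.
peer_scores   = np.array([0.31, 0.18, 0.27, 0.24])
gpqa_accuracy = np.array([0.52, 0.29, 0.46, 0.38])
rho = spearman_rho(peer_scores, gpqa_accuracy)  # 1.0: identical ordering
```

A rank correlation near 1 would indicate that peer judgments alone recover the ground-truth ordering, which is the evidence the validation framework seeks before applying the method to subjective traits.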