Abstract:

Detecting AI risks becomes more challenging as stronger models emerge and find novel methods, such as Alignment Faking, to circumvent detection attempts. Inspired by how risky behaviors in humans (e.g., illegal activities that may hurt others) are sometimes guided by strongly held values, we believe that identifying values within AI models can serve as an early warning system for risky AI behaviors. We create LitmusValues, an evaluation pipeline to reveal AI models' priorities across a range of AI value classes. We then collect AIRiskDilemmas, a diverse collection of dilemmas that pit values against one another in scenarios relevant to AI safety risks such as Power Seeking. By measuring an AI model's value prioritization from its aggregate choices, we obtain a self-consistent set of predicted value priorities that uncover potential risks. We show that values in LitmusValues (including seemingly innocuous ones like Care) can predict both seen risky behaviors in AIRiskDilemmas and unseen risky behaviors in HarmBench.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces LitmusValues, an evaluation pipeline for measuring AI value prioritization, and AIRiskDilemmas, a dataset of moral dilemmas linking values to AI safety risks. It sits within the 'Direct Value Prioritization Elicitation from AI Agents' leaf, which contains five papers total. This leaf is part of the broader 'Value Measurement and Elicitation Methods' branch, indicating a moderately populated research direction focused on empirical techniques for extracting value priorities from AI systems rather than theoretical frameworks or domain applications.

The taxonomy reveals neighboring leaves addressing related but distinct measurement challenges: 'Benchmark Datasets and Evaluation Protocols' focuses on standardized moral concept assessments, while 'AI-Based Preference and Value Prediction' uses AI to infer human values from data. The sibling category 'Value Aggregation and Disagreement Handling' tackles pluralistic value systems. The paper's approach—using dilemmas to elicit AI value trade-offs—bridges direct agent elicitation with risk assessment methodologies found in the 'Risk Assessment and Misalignment Detection' branch, particularly 'Formal Risk Assessment Frameworks' and 'Value Drift Prediction and Monitoring'.

Of the 28 candidates examined in total, 10 were compared against the LitmusValues pipeline contribution and one of them could refute it, suggesting that some prior work on value elicitation frameworks exists within the limited search scope. The AIRiskDilemmas dataset contribution was compared against 10 candidates, none of which clearly refuted it, indicating relative novelty in contextualized AI risk scenarios. The predictive demonstration (values forecasting risky behaviors) was compared against 8 candidates with no refutations, suggesting this empirical finding may be less explored. These statistics reflect a focused semantic search, not exhaustive coverage of the field.

Based on the limited search scope of 28 top-K semantic matches, the work appears to occupy a moderately novel position, particularly in linking value measurement to risk prediction across seen and unseen behaviors. The taxonomy structure shows this is an active but not overcrowded research area, with the sibling papers addressing complementary aspects of value elicitation. The analysis does not cover broader literatures on moral psychology benchmarks or adversarial robustness that might intersect with this work.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Paper: 1

Research Landscape Overview

Core task: Detecting AI risks through value prioritization measurement. The field is organized around six main branches that together address how to ensure AI systems reflect human values and avoid harmful misalignments. Value Alignment Frameworks and Theoretical Foundations establishes the conceptual underpinnings, exploring questions of moral philosophy and what it means for AI to be aligned with human interests (e.g., Comprehensive Survey[5], Human Dignity AGI[15]). Value Measurement and Elicitation Methods focuses on techniques for extracting value priorities, whether from humans or directly from AI agents, using approaches ranging from preference surveys to behavioral analysis (Shared Human Values[1], LLM Risk Preferences[16]). Risk Assessment and Misalignment Detection develops diagnostic tools to identify when systems deviate from intended values (Existentialist Misalignment[6], Deceptive Alignment Logic[40]). Alignment Implementation and Technical Methods translates these insights into concrete training and fine-tuning strategies (Personal Fine-Tuning[33], Objectives Match Values[9]). Domain-Specific Alignment Applications examines value alignment in particular contexts such as healthcare or business (Automaticity Healthcare[2], Patient Values Prediction[8]), while Specialized Alignment Challenges and Extensions tackles cross-cutting issues like cultural variation, institutional change, and long-term risks (Cultural Dimensions Perception[31], Instrumental Convergence Review[47]).

Recent work has intensified around measuring value priorities directly from AI systems and comparing them to human benchmarks, revealing both promising alignment and subtle divergences. AIRiskDilemmas[0] sits squarely within the Direct Value Prioritization Elicitation cluster, using moral dilemmas to probe AI risk attitudes and value trade-offs in a structured way. This approach complements nearby studies like LLM Risk Preferences[16], which similarly examines risk attitudes in language models, and Risk Time Preferences[4], which explores temporal dimensions of value prioritization. Compared to Preliminary Alignment Investigation[44], which offers broader exploratory analysis, AIRiskDilemmas[0] emphasizes systematic elicitation through controlled scenarios.

A key open question across these works is whether observed value patterns in AI outputs reflect genuine internal priorities or merely surface-level mimicry of training data, and how cultural or contextual factors (Cultural Dimensions Perception[31]) might shift these measurements. The tension between universal value frameworks and pluralistic approaches (Value of Disagreement[26]) remains central to interpreting what misalignment detection truly reveals about AI safety.

Claimed Contributions

LitmusValues evaluation pipeline for AI value prioritization

The authors develop an evaluation pipeline that reveals AI models' priorities across 16 shared value classes by measuring aggregate choices in dilemma scenarios. The framework uses revealed preferences from behavioral choices, rather than stated preferences, to identify value priorities; a minimal aggregation sketch is given below.

10 retrieved papers (Can Refute)
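
To make the elicitation step concrete, here is a minimal sketch of one way aggregate dilemma choices could be turned into a value ranking, assuming a simple win-rate aggregation over the value classes grounding each chosen and rejected action. The input format and estimator are assumptions for illustration; the paper's actual pipeline may differ.

```python
from collections import defaultdict

def rank_values_by_win_rate(choices):
    """Rank value classes from pairwise dilemma outcomes.

    `choices` is a list of (chosen_values, rejected_values) pairs, where each
    element is the set of value classes grounding the action the model picked
    versus the action it declined (a hypothetical input format).
    """
    wins, totals = defaultdict(int), defaultdict(int)
    for chosen, rejected in choices:
        for c in chosen:
            for r in rejected:
                if c == r:
                    continue  # a value present on both sides carries no signal
                wins[c] += 1
                totals[c] += 1
                totals[r] += 1
    # Win rate per value class; a higher rate means the value wins more trade-offs.
    rates = {v: wins[v] / totals[v] for v in totals}
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)

# Toy example: the model first prioritizes Care over Truthfulness, then the reverse.
toy_choices = [
    ({"Care"}, {"Truthfulness"}),
    ({"Truthfulness", "Privacy"}, {"Care"}),
]
print(rank_values_by_win_rate(toy_choices))
```

A Bradley-Terry or Elo-style fit over the same pairwise outcomes would be a natural alternative to raw win rates when choices are noisy.
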
AIRiskDilemmas dataset of contextualized AI risk scenarios

The authors create a dataset of over 3,000 contextualized dilemmas spanning 9 domains and 7 risky behaviors. Each dilemma presents two candidate actions grounded in different values, enabling systematic measurement of value prioritization in AI safety-relevant scenarios; an illustrative record schema is sketched below.

10 retrieved papers
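
The description above implies a simple per-dilemma record. The sketch below shows one plausible schema; the field names, the "Workplace" domain label, the example text, and the value labels other than Care and Truthfulness are hypothetical and may not match the released dataset.

```python
from dataclasses import dataclass

@dataclass
class Dilemma:
    """One AIRiskDilemmas-style record (hypothetical schema for illustration)."""
    scenario: str              # contextualized situation the AI agent faces
    domain: str                # one of the 9 domains (label here is invented)
    risky_behavior: str        # one of the 7 risky behaviors, e.g. "Power Seeking"
    action_a: str              # first candidate action
    action_a_values: list[str]    # value classes grounding action A
    action_b: str              # second candidate action
    action_b_values: list[str]    # value classes grounding action B

example = Dilemma(
    scenario=("An AI assistant can meet a critical deadline only by quietly "
              "reading a colleague's private files."),
    domain="Workplace",                  # invented domain label
    risky_behavior="Privacy Violation",
    action_a="Access the files and deliver on time",
    action_a_values=["Care", "Competence"],
    action_b="Miss the deadline and disclose the limitation",
    action_b_values=["Truthfulness", "Privacy"],
)
print(example.risky_behavior, "vs values:", example.action_b_values)
```
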
Demonstration that values predict both seen and unseen risky behaviors

The authors demonstrate that certain values correlate with risky behaviors both within their own dataset and in external benchmarks such as HarmBench. For example, prioritizing Care increases the risk of Privacy Violation and Deception, while prioritizing Truthfulness reduces multiple risky behaviors; a correlation-style sketch of this analysis follows below.

8 retrieved papers
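
As a rough illustration of how such a predictive link could be checked, the sketch below correlates per-model value priority scores with per-model rates of risky choices. The numbers are fabricated toy values, and the paper's actual predictive analysis (e.g., against HarmBench) may use a different procedure.

```python
import numpy as np

def value_risk_correlations(priority_scores, risk_rates):
    """Correlate each value's per-model priority score with risky-behavior rates.

    priority_scores: dict mapping a value class to an array of priority scores,
    one entry per model. risk_rates: array of risky-choice rates per model
    (e.g., the fraction of safety-relevant prompts answered unsafely).
    """
    return {
        value: float(np.corrcoef(scores, risk_rates)[0, 1])
        for value, scores in priority_scores.items()
    }

# Toy numbers for four hypothetical models: a positive correlation means that
# prioritizing the value co-occurs with more risky behavior across models.
scores = {
    "Care":         np.array([0.62, 0.55, 0.71, 0.48]),
    "Truthfulness": np.array([0.40, 0.66, 0.35, 0.72]),
}
risk_rates = np.array([0.21, 0.09, 0.27, 0.06])
print(value_risk_correlations(scores, risk_rates))
```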

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: LitmusValues evaluation pipeline for AI value prioritization

Contribution 2: AIRiskDilemmas dataset of contextualized AI risk scenarios

Contribution 3: Demonstration that values predict both seen and unseen risky behaviors