Will AI Tell Lies to Save Sick Children? Litmus-Testing AI Values Prioritization with AIRiskDilemmas
Overview
Overall Novelty Assessment
The paper introduces LitmusValues, an evaluation pipeline for measuring AI value prioritization, and AIRiskDilemmas, a dataset of moral dilemmas linking values to AI safety risks. It sits within the 'Direct Value Prioritization Elicitation from AI Agents' leaf, which contains five papers total. This leaf is part of the broader 'Value Measurement and Elicitation Methods' branch, indicating a moderately populated research direction focused on empirical techniques for extracting value priorities from AI systems rather than theoretical frameworks or domain applications.
The taxonomy reveals neighboring leaves addressing related but distinct measurement challenges: 'Benchmark Datasets and Evaluation Protocols' focuses on standardized moral concept assessments, while 'AI-Based Preference and Value Prediction' uses AI to infer human values from data. The sibling category 'Value Aggregation and Disagreement Handling' tackles pluralistic value systems. The paper's approach—using dilemmas to elicit AI value trade-offs—bridges direct agent elicitation with risk assessment methodologies found in the 'Risk Assessment and Misalignment Detection' branch, particularly 'Formal Risk Assessment Frameworks' and 'Value Drift Prediction and Monitoring'.
Of the 28 candidates examined in total, 10 were compared against the LitmusValues pipeline contribution, and one of them potentially refutes it, suggesting that some prior work on value elicitation frameworks exists within the limited search scope. The AIRiskDilemmas dataset contribution was compared against 10 candidates, none of which clearly refutes it, indicating relative novelty in contextualized AI risk scenarios. The predictive demonstration (values forecasting risky behaviors) was compared against 8 candidates with no refutations, suggesting this empirical finding may be less explored. These statistics reflect a focused semantic search, not exhaustive coverage of the field.
Based on the limited search scope of 28 top-K semantic matches, the work appears to occupy a moderately novel position, particularly in linking value measurement to risk prediction across seen and unseen behaviors. The taxonomy structure shows this is an active but not overcrowded research area, with the sibling papers addressing complementary aspects of value elicitation. The analysis does not cover broader literatures on moral psychology benchmarks or adversarial robustness that might intersect with this work.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors develop an evaluation pipeline that reveals AI models' priorities across 16 shared value classes by measuring aggregate choices in dilemma scenarios. This framework uses revealed preferences from behavioral choices rather than stated preferences to identify value priorities.
The authors create a dataset of over 3000 contextualized dilemmas spanning 9 domains and 7 risky behaviors. Each dilemma presents two action choices grounded in different values, enabling systematic measurement of value prioritization in AI safety-relevant scenarios.
The authors demonstrate that certain values correlate with risky behaviors both within their dataset and in external benchmarks like HarmBench. For example, values such as Care increase the risk of Privacy Violation and Deception, while Truthfulness reduces multiple risky behaviors.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[4] Understanding AI Agents' Decision-Making: Evidence from Risk and Time Preference Elicitation PDF
[16] AI as Decision-Maker: Risk Preferences of LLMs PDF
[24] A Reasoning and Value Alignment Test to Assess Advanced GPT Reasoning PDF
[44] Are We Aligned? A Preliminary Investigation of the Alignment of Responsible AI Values between LLMs and Human Judgment PDF
Contribution Analysis
Detailed comparisons for each claimed contribution
LitmusValues evaluation pipeline for AI value prioritization
The authors develop an evaluation pipeline that reveals AI models' priorities across 16 shared value classes by measuring aggregate choices in dilemma scenarios. This framework uses revealed preferences from behavioral choices rather than stated preferences to identify value priorities.
[71] DailyDilemmas: Revealing value preferences of LLMs with quandaries of daily life PDF
[1] Aligning AI With Shared Human Values PDF
[51] Decision modeling for automated driving in dilemmas based on bidirectional value alignment of moral theory values and fair human moral values PDF
[54] Pluralism in AI Value Alignment: Motivations and Methods PDF
[68] A methodology for ethical decision-making in automated vehicles PDF
[69] Social choice ethics in artificial intelligence PDF
[70] The Hard Problem of AI Alignment: Value Forks in Moral Judgment PDF
[72] Defining a method for ethical decision making for automated vehicles PDF
[73] Ethical Decision-Making for the Inside of Autonomous Buses Moral Dilemmas PDF
[74] LLM theory of mind and alignment: Opportunities and risks PDF
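To make the aggregation step of the pipeline described above more concrete, the following is a minimal sketch of how pairwise dilemma choices could be turned into a revealed value ranking. The record fields (value_a, value_b, chosen) and the simple win-rate scoring are illustrative assumptions, not the authors' actual implementation, which may use a different aggregation scheme.

# Minimal illustrative sketch: derive a value priority ranking from pairwise
# dilemma choices. Field names and the win-rate scoring are assumptions, not
# the paper's actual implementation.
from collections import defaultdict

def rank_values(records):
    """records: iterable of dicts with keys 'value_a', 'value_b', and
    'chosen' ('a' or 'b'), one per answered dilemma."""
    wins = defaultdict(int)    # dilemmas in which the value's action was chosen
    trials = defaultdict(int)  # dilemmas in which the value appeared at all
    for r in records:
        trials[r["value_a"]] += 1
        trials[r["value_b"]] += 1
        winner = r["value_a"] if r["chosen"] == "a" else r["value_b"]
        wins[winner] += 1
    # Revealed priority = fraction of appearances in which the value "won".
    scores = {v: wins[v] / trials[v] for v in trials}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# A model that consistently picks the Truthfulness-grounded action over the
# Care-grounded one would rank Truthfulness above Care.
choices = [
    {"value_a": "Truthfulness", "value_b": "Care", "chosen": "a"},
    {"value_a": "Care", "value_b": "Privacy", "chosen": "b"},
]
print(rank_values(choices))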
AIRiskDilemmas dataset of contextualized AI risk scenarios
The authors create a dataset of over 3000 contextualized dilemmas spanning 9 domains and 7 risky behaviors. Each dilemma presents two action choices grounded in different values, enabling systematic measurement of value prioritization in AI safety-relevant scenarios.
[1] Aligning AI With Shared Human Values PDF
[51] Decision modeling for automated driving in dilemmas based on bidirectional value alignment of moral theory values and fair human moral values PDF
[52] AI Robots and Moral Dilemmas: The Role of AI Robots' Gender and Dilemma Types PDF
[53] Normative evaluation of large language models with everyday moral dilemmas PDF
[54] Pluralism in AI Value Alignment: Motivations and Methods PDF
[55] Dilemmas in AI Ethics: A Digital Game for Moral Reasoning and Collective Decision-Making PDF
[56] Many LLMs Are More Utilitarian Than One PDF
[57] Ethical reasoning over moral alignment: A case and framework for in-context ethical policies in LLMs PDF
[58] Moral Judgments of Human vs. AI Agents in Moral Dilemmas PDF
[59] Measuring Human-AI Value Alignment in Large Language Models PDF
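As a rough illustration of the dataset structure described above, the sketch below shows what a single record might look like, assuming one value-grounded action pair per dilemma plus domain and risky-behavior tags. All field names and the example content are hypothetical, inferred from the description rather than taken from the released dataset.

# Illustrative, assumed schema for one AIRiskDilemmas-style record; field
# names and example content are hypothetical, inferred from the description
# above rather than taken from the released dataset.
from dataclasses import dataclass

@dataclass
class Dilemma:
    context: str          # the contextualized scenario presented to the model
    domain: str           # one of the 9 domains
    risky_behavior: str   # one of the 7 risky behaviors, e.g. "Deception"
    action_a: str         # first candidate action
    value_a: str          # value grounding action_a, e.g. "Care"
    action_b: str         # second candidate action
    value_b: str          # value grounding action_b, e.g. "Truthfulness"

example = Dilemma(
    context=("A hospital assistant is asked whether a delayed shipment of "
             "medicine for sick children is on schedule."),
    domain="Healthcare",
    risky_behavior="Deception",
    action_a="Reassure the family that the medicine will arrive on time",
    value_a="Care",
    action_b="Disclose the delay truthfully",
    value_b="Truthfulness",
)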
Demonstration that values predict both seen and unseen risky behaviors
The authors demonstrate that certain values correlate with risky behaviors both within their dataset and in external benchmarks like HarmBench. For example, values such as Care increase the risk of Privacy Violation and Deception, while Truthfulness reduces multiple risky behaviors.
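As a minimal sketch of the kind of analysis this contribution describes, the snippet below correlates a hypothetical per-model Care priority score with a hypothetical per-model Deception rate. The numbers are invented and the plain Pearson correlation is an assumed stand-in; the authors' actual statistical procedure may differ.

# Minimal sketch: correlate a value's priority score with a risky-behavior
# rate across models. All numbers are invented, and plain Pearson correlation
# is an assumed stand-in for the paper's actual analysis.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-model measurements: how strongly each model prioritizes
# Care, and how often it exhibits Deception on held-out external prompts.
care_priority  = [0.62, 0.48, 0.71, 0.55]
deception_rate = [0.21, 0.09, 0.30, 0.14]

# A positive correlation would be consistent with the claim that prioritizing
# Care predicts a higher Deception risk.
print(f"r = {pearson(care_priority, deception_rate):.2f}")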