Abstract:

Detecting AI risks becomes more challenging as stronger models emerge and find novel methods, such as Alignment Faking, to circumvent detection attempts. Inspired by how risky behaviors in humans (e.g., illegal activities that may hurt others) are sometimes guided by strongly held values, we believe that identifying values within AI models can serve as an early warning system for risky AI behaviors. We create LitmusValues, an evaluation pipeline to reveal AI models' priorities across a range of AI value classes. We then collect AIRiskDilemmas, a diverse collection of dilemmas that pit values against one another in scenarios relevant to AI safety risks such as Power Seeking. By measuring an AI model's value prioritization from its aggregate choices, we obtain a self-consistent set of predicted value priorities that uncover potential risks. We show that values in LitmusValues (including seemingly innocuous ones like Care) can predict both seen risky behaviors in AIRiskDilemmas and unseen risky behaviors in HarmBench.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces LitmusValues, an evaluation pipeline for measuring AI value prioritization, and AIRiskDilemmas, a dataset of moral dilemmas linking values to AI safety risks. It sits within the 'Direct Value Prioritization Elicitation from AI Agents' leaf, which contains five papers total. This leaf is part of the broader 'Value Measurement and Elicitation Methods' branch, indicating a moderately populated research direction focused on empirical techniques for extracting value priorities from AI systems rather than theoretical frameworks or domain applications.

The taxonomy reveals neighboring leaves addressing related but distinct measurement challenges: 'Benchmark Datasets and Evaluation Protocols' focuses on standardized moral concept assessments, while 'AI-Based Preference and Value Prediction' uses AI to infer human values from data. The sibling category 'Value Aggregation and Disagreement Handling' tackles pluralistic value systems. The paper's approach—using dilemmas to elicit AI value trade-offs—bridges direct agent elicitation with risk assessment methodologies found in the 'Risk Assessment and Misalignment Detection' branch, particularly 'Formal Risk Assessment Frameworks' and 'Value Drift Prediction and Monitoring'.

Of the 28 candidates examined in total, 10 were compared against the LitmusValues pipeline contribution and one of them could refute it, suggesting that some prior work on value elicitation frameworks exists within the limited search scope. The AIRiskDilemmas dataset contribution was compared against 10 candidates, none of which clearly refuted it, indicating relative novelty in contextualized AI risk scenarios. The predictive demonstration (values forecasting risky behaviors) was compared against 8 candidates with no refutations, suggesting this empirical finding may be less explored. These statistics reflect a focused semantic search, not exhaustive coverage of the field.

Based on the limited search scope of 28 top-K semantic matches, the work appears to occupy a moderately novel position, particularly in linking value measurement to risk prediction across seen and unseen behaviors. The taxonomy structure shows this is an active but not overcrowded research area, with the sibling papers addressing complementary aspects of value elicitation. The analysis does not cover broader literatures on moral psychology benchmarks or adversarial robustness that might intersect with this work.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Paper: 1

Research Landscape Overview

Core task: Detecting AI risks through value prioritization measurement. The field is organized around six main branches that together address how to ensure AI systems reflect human values and avoid harmful misalignments. Value Alignment Frameworks and Theoretical Foundations establishes the conceptual underpinnings, exploring questions of moral philosophy and what it means for AI to be aligned with human interests (e.g., Comprehensive Survey[5], Human Dignity AGI[15]). Value Measurement and Elicitation Methods focuses on techniques for extracting value priorities, whether from humans or directly from AI agents, using approaches ranging from preference surveys to behavioral analysis (Shared Human Values[1], LLM Risk Preferences[16]). Risk Assessment and Misalignment Detection develops diagnostic tools to identify when systems deviate from intended values (Existentialist Misalignment[6], Deceptive Alignment Logic[40]). Alignment Implementation and Technical Methods translates these insights into concrete training and fine-tuning strategies (Personal Fine-Tuning[33], Objectives Match Values[9]). Domain-Specific Alignment Applications examines value alignment in particular contexts such as healthcare or business (Automaticity Healthcare[2], Patient Values Prediction[8]), while Specialized Alignment Challenges and Extensions tackles cross-cutting issues like cultural variation, institutional change, and long-term risks (Cultural Dimensions Perception[31], Instrumental Convergence Review[47]).

Recent work has intensified around measuring value priorities directly from AI systems and comparing them to human benchmarks, revealing both promising alignment and subtle divergences. AIRiskDilemmas[0] sits squarely within the Direct Value Prioritization Elicitation cluster, using moral dilemmas to probe AI risk attitudes and value trade-offs in a structured way. This approach complements nearby studies like LLM Risk Preferences[16], which similarly examines risk attitudes in language models, and Risk Time Preferences[4], which explores temporal dimensions of value prioritization. Compared to Preliminary Alignment Investigation[44], which offers broader exploratory analysis, AIRiskDilemmas[0] emphasizes systematic elicitation through controlled scenarios.

A key open question across these works is whether observed value patterns in AI outputs reflect genuine internal priorities or merely surface-level mimicry of training data, and how cultural or contextual factors (Cultural Dimensions Perception[31]) might shift these measurements. The tension between universal value frameworks and pluralistic approaches (Value of Disagreement[26]) remains central to interpreting what misalignment detection truly reveals about AI safety.

Claimed Contributions

LitmusValues evaluation pipeline for AI value prioritization

The authors develop an evaluation pipeline that reveals AI models' priorities across 16 shared value classes by measuring aggregate choices in dilemma scenarios. The framework uses revealed preferences from behavioral choices, rather than stated preferences, to identify value priorities; a minimal aggregation sketch is given below.

10 retrieved papers (Can Refute)
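
To make the elicitation step concrete, here is a minimal sketch of one way aggregate dilemma choices could be turned into a value ranking, assuming a simple win-rate aggregation over the value classes grounding each chosen and rejected action. The input format and estimator are assumptions for illustration; the paper's actual pipeline may differ.

```python
from collections import defaultdict

def rank_values_by_win_rate(choices):
    """Rank value classes from pairwise dilemma outcomes.

    `choices` is a list of (chosen_values, rejected_values) pairs, where each
    element is the set of value classes grounding the action the model picked
    versus the action it declined (a hypothetical input format).
    """
    wins, totals = defaultdict(int), defaultdict(int)
    for chosen, rejected in choices:
        for c in chosen:
            for r in rejected:
                if c == r:
                    continue  # a value present on both sides carries no signal
                wins[c] += 1
                totals[c] += 1
                totals[r] += 1
    # Win rate per value class; a higher rate means the value wins more trade-offs.
    rates = {v: wins[v] / totals[v] for v in totals}
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)

# Toy example: the model first prioritizes Care over Truthfulness, then the reverse.
toy_choices = [
    ({"Care"}, {"Truthfulness"}),
    ({"Truthfulness", "Privacy"}, {"Care"}),
]
print(rank_values_by_win_rate(toy_choices))
```

A Bradley-Terry or Elo-style fit over the same pairwise outcomes would be a natural alternative to raw win rates when choices are noisy.
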
AIRiskDilemmas dataset of contextualized AI risk scenarios

The authors create a dataset of over 3,000 contextualized dilemmas spanning 9 domains and 7 risky behaviors. Each dilemma presents two candidate actions grounded in different values, enabling systematic measurement of value prioritization in AI safety-relevant scenarios; an illustrative record schema is sketched below.

10 retrieved papers
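
The description above implies a simple per-dilemma record. The sketch below shows one plausible schema; the field names, the "Workplace" domain label, the example text, and the value labels other than Care and Truthfulness are hypothetical and may not match the released dataset.

```python
from dataclasses import dataclass

@dataclass
class Dilemma:
    """One AIRiskDilemmas-style record (hypothetical schema for illustration)."""
    scenario: str              # contextualized situation the AI agent faces
    domain: str                # one of the 9 domains (label here is invented)
    risky_behavior: str        # one of the 7 risky behaviors, e.g. "Power Seeking"
    action_a: str              # first candidate action
    action_a_values: list[str]    # value classes grounding action A
    action_b: str              # second candidate action
    action_b_values: list[str]    # value classes grounding action B

example = Dilemma(
    scenario=("An AI assistant can meet a critical deadline only by quietly "
              "reading a colleague's private files."),
    domain="Workplace",                  # invented domain label
    risky_behavior="Privacy Violation",
    action_a="Access the files and deliver on time",
    action_a_values=["Care", "Competence"],
    action_b="Miss the deadline and disclose the limitation",
    action_b_values=["Truthfulness", "Privacy"],
)
print(example.risky_behavior, "vs values:", example.action_b_values)
```
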
Demonstration that values predict both seen and unseen risky behaviors

The authors demonstrate that certain values correlate with risky behaviors both within their own dataset and in external benchmarks such as HarmBench. For example, prioritizing Care increases the risk of Privacy Violation and Deception, while prioritizing Truthfulness reduces multiple risky behaviors; a correlation-style sketch of this analysis follows below.

8 retrieved papers
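
As a rough illustration of how such a predictive link could be checked, the sketch below correlates per-model value priority scores with per-model rates of risky choices. The numbers are fabricated toy values, and the paper's actual predictive analysis (e.g., against HarmBench) may use a different procedure.

```python
import numpy as np

def value_risk_correlations(priority_scores, risk_rates):
    """Correlate each value's per-model priority score with risky-behavior rates.

    priority_scores: dict mapping a value class to an array of priority scores,
    one entry per model. risk_rates: array of risky-choice rates per model
    (e.g., the fraction of safety-relevant prompts answered unsafely).
    """
    return {
        value: float(np.corrcoef(scores, risk_rates)[0, 1])
        for value, scores in priority_scores.items()
    }

# Toy numbers for four hypothetical models: a positive correlation means that
# prioritizing the value co-occurs with more risky behavior across models.
scores = {
    "Care":         np.array([0.62, 0.55, 0.71, 0.48]),
    "Truthfulness": np.array([0.40, 0.66, 0.35, 0.72]),
}
risk_rates = np.array([0.21, 0.09, 0.27, 0.06])
print(value_risk_correlations(scores, risk_rates))
```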

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: LitmusValues evaluation pipeline for AI value prioritization

Contribution 2: AIRiskDilemmas dataset of contextualized AI risk scenarios

Contribution 3: Demonstration that values predict both seen and unseen risky behaviors