Generative Value Conflicts Reveal LLM Priorities

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: LLM alignment, value alignment, evaluation, moral dilemmas
Abstract:

Past work seeks to align large language model (LLM)-based assistants with a target set of values, but such assistants are frequently forced to make tradeoffs between values when deployed. In response to the scarcity of value conflict in existing alignment datasets, we introduce ConflictScope, an automatic pipeline to evaluate how LLMs prioritize different values. Given a user-defined value set, ConflictScope automatically generates scenarios in which a language model faces a conflict between two values sampled from the set. It then prompts target models with an LLM-written "user prompt" and evaluates their free-text responses to elicit a ranking over values in the value set. Comparing results between multiple-choice and open-ended evaluations, we find that models shift away from supporting protective values, such as harmlessness, and toward supporting personal values, such as user autonomy, in more open-ended value conflict settings. However, including detailed value orderings in models' system prompts improves alignment with a target ranking by 14%, showing that system prompting can achieve moderate success at aligning LLM behavior under value conflict. Our work demonstrates the importance of evaluating value prioritization in models and provides a foundation for future work in this area.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work, and the current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces ConflictScope, an automated pipeline for generating value conflict scenarios and evaluating how language models prioritize competing values. It sits within the 'Automated Conflict Scenario Generation' leaf of the taxonomy, which contains only two papers total. This is a relatively sparse research direction compared to more crowded areas like multi-objective reinforcement learning (three papers) or moral dilemma datasets (three papers). The work addresses a recognized gap in alignment datasets that lack sufficient value conflict scenarios, positioning itself as a methodological contribution to conflict evaluation infrastructure.

The taxonomy reveals that ConflictScope's nearest neighbors include manually curated moral dilemma datasets (AI Risk Dilemmas, DailyDilemmas, Moral Scenarios) and value prioritization evaluation protocols that measure ranking consistency and human-AI alignment. The automated generation approach contrasts with manual curation efforts, aiming for scalable coverage of diverse value combinations. The work also connects to inference-time alignment methods through its system prompting experiments, though it focuses on evaluation rather than developing new alignment techniques. The taxonomy's scope notes clarify that this leaf excludes manually curated datasets and pure evaluation protocols, emphasizing the generative automation aspect.

Among thirty candidates examined across the three contributions, none clearly refuted the work's novelty. The ConflictScope pipeline contribution was compared against ten candidates with zero refutable matches, as were the open-ended evaluation method and the value ranking elicitation methodology. This suggests that, within the limited search scope, the specific combination of automated scenario generation, open-ended response evaluation, and value ranking elicitation appears relatively unexplored. The findings that models shift from protective to personal values in open-ended settings, and that system prompting improves alignment by 14%, are empirical observations rather than methodological claims subject to direct refutation.

Based on the limited literature search of thirty semantically similar papers, the work appears to occupy a methodologically distinct position within value conflict evaluation. The analysis cannot assess whether larger-scale searches or domain-specific venues might reveal closer prior work. The taxonomy structure suggests this is an emerging research direction with room for methodological innovation, though the field overall shows substantial activity across related evaluation and alignment challenges.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: evaluating language model value prioritization under conflict. The field addresses how language models navigate situations where multiple values or objectives cannot be simultaneously satisfied. The taxonomy organizes research into several major branches: Multi-Objective Alignment Methods develop techniques for balancing competing objectives during training (e.g., Multi Objective GRPO[13], Pareto Multi Objective[14]); Value Conflict Characterization and Evaluation Frameworks define, measure, and generate scenarios where values clash (including AI Risk Dilemmas[12] and Generative Value Conflicts[0]); Domain-Specific Value Alignment examines conflicts in particular contexts such as cultural differences (Multi National Alignment[19]) or application areas; Specialized Alignment Challenges tackle issues like instruction hierarchies (Instruction Hierarchy[7]) and honesty-helpfulness trade-offs (Honesty Helpfulness Conflicts[6]); Supporting Resources provide datasets and methodologies (DailyDilemmas[22], Synthetic Moral Fables[23]); and Empirical Value Conflict Studies investigate how models actually behave when values compete (Privacy Prosocial Conflict[34], Right vs Right[35]).

Several active research directions reveal key tensions in the field. One line explores whether conflicts can be resolved through better training objectives or whether fundamental trade-offs are unavoidable (Fundamental Alignment Limitations[2], Safe RLHF[4]). Another examines how context should shape prioritization decisions (Contextual Value Alignment[9], Application Driven Alignment[3]).

Generative Value Conflicts[0] sits within the Value Conflict Dataset Construction cluster, specifically focusing on automated conflict scenario generation. It shares methodological kinship with AI Risk Dilemmas[12], which also constructs evaluative scenarios, but emphasizes generative approaches that produce diverse conflict cases at scale. Compared to manual curation efforts like DailyDilemmas[22], the automated generation strategy aims for broader coverage of the conflict space, though it faces distinct challenges in ensuring scenario realism and value representation fidelity.

Claimed Contributions

ConflictScope automated pipeline for value conflict scenario generation and evaluation

The authors present ConflictScope, an automated system that generates realistic scenarios where language models face conflicts between pairs of values from a user-defined set, then evaluates model responses in open-ended settings to elicit value rankings. The pipeline includes scenario creation, filtering, and open-ended evaluation with simulated users.

10 retrieved papers
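The generate-then-filter structure of such a pipeline can be sketched as follows. This is an illustrative sketch only, not ConflictScope's actual code: the function names, the scenario schema, and the toy `propose`/`is_valid` stand-ins (which replace real LLM calls) are all assumptions.

```python
import itertools

def build_scenarios(values, propose, is_valid, n_per_pair=2):
    """For each pair of values, propose candidate scenarios and keep valid ones."""
    scenarios = []
    for v1, v2 in itertools.combinations(values, 2):
        # In a real pipeline, `propose` would query an LLM for a scenario
        # pitting v1 against v2, and `is_valid` would filter out candidates
        # that fail realism or genuine-conflict checks.
        candidates = [propose(v1, v2) for _ in range(n_per_pair)]
        scenarios.extend(s for s in candidates if is_valid(s))
    return scenarios

# Toy stand-ins: a template "generator" and a trivial filter.
values = ["harmlessness", "helpfulness", "user autonomy"]
scenarios = build_scenarios(
    values,
    propose=lambda a, b: {"values": (a, b),
                          "text": f"A situation pitting {a} against {b}."},
    is_valid=lambda s: len(s["text"]) > 0,
)
# With 3 values there are 3 pairs, so 2 candidates per pair yields 6 scenarios.
```

The separation of generation from filtering mirrors the scenario creation and filtering stages described above, and keeps the LLM-dependent parts swappable.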
Open-ended evaluation method using simulated user interaction

The authors introduce an evaluation approach that moves beyond multiple-choice questioning by simulating realistic user interactions. An LLM generates user prompts based on scenario context, target models respond, and a judge LLM determines which value-aligned action was taken, enabling comparison of expressed versus revealed preferences.

10 retrieved papers
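The simulated-user evaluation loop described above can be sketched roughly as below. The callables `generate_user_prompt`, `target_model`, and `judge` stand in for LLM calls; every name here is an illustrative assumption, not ConflictScope's actual API.

```python
def evaluate_scenario(scenario, generate_user_prompt, target_model, judge):
    """Return which of the scenario's two conflicting values the model favored."""
    user_prompt = generate_user_prompt(scenario)  # LLM-written user turn
    response = target_model(user_prompt)          # target model's free-text reply
    # A judge LLM maps the free-text response back to one of the two values,
    # enabling comparison of expressed vs. revealed preferences.
    verdict = judge(scenario, user_prompt, response)
    assert verdict in scenario["values"]
    return verdict

# Toy stand-ins so the loop is runnable end-to-end.
scenario = {"context": "user asks for risky advice",
            "values": ("harmlessness", "user autonomy")}
verdict = evaluate_scenario(
    scenario,
    generate_user_prompt=lambda s: f"User: {s['context']}",
    target_model=lambda p: "I can help, but here are the risks...",
    judge=lambda s, p, r: s["values"][1] if "help" in r else s["values"][0],
)
```

Keeping the judge's output constrained to the scenario's own value pair is what turns free-text responses into pairwise preference data that a ranking model can consume.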
Methodology for eliciting and steering value rankings from language models

The authors develop a method to aggregate model preferences across value conflict scenarios into complete value rankings using Bradley-Terry models, and demonstrate how system prompts can steer models toward target rankings with moderate success (14% improvement in alignment).

10 retrieved papers
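Aggregating pairwise outcomes into a full ranking with a Bradley-Terry model can be done with the classic minorization-maximization update. The sketch below is a minimal, self-contained illustration of that general technique under assumed inputs (a win-count matrix), not the paper's implementation.

```python
def bradley_terry(wins, n_iters=200):
    """Fit Bradley-Terry strengths from a matrix of pairwise win counts.

    wins[i][j] counts how often value i was prioritized over value j.
    Uses the MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j).
    """
    n = len(wins)
    strengths = [1.0] * n
    for _ in range(n_iters):
        new = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = 0.0
            for j in range(n):
                if j == i:
                    continue
                n_ij = wins[i][j] + wins[j][i]  # comparisons between i and j
                if n_ij:
                    denom += n_ij / (strengths[i] + strengths[j])
            new.append(total_wins / denom if denom else strengths[i])
        total = sum(new)
        strengths = [v / total for v in new]  # normalize (identifiability)
    return strengths

# Toy example: value 0 usually wins its conflicts with values 1 and 2.
wins = [[0, 8, 9],
        [2, 0, 5],
        [1, 5, 0]]
scores = bradley_terry(wins)
ranking = sorted(range(3), key=lambda i: -scores[i])
```

Sorting values by their fitted strengths yields the complete value ranking that the steering experiments then compare against a target ordering.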

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

ConflictScope automated pipeline for value conflict scenario generation and evaluation

The authors present ConflictScope, an automated system that generates realistic scenarios where language models face conflicts between pairs of values from a user-defined set, then evaluates model responses in open-ended settings to elicit value rankings. The pipeline includes scenario creation, filtering, and open-ended evaluation with simulated users.

Contribution

Open-ended evaluation method using simulated user interaction

The authors introduce an evaluation approach that moves beyond multiple-choice questioning by simulating realistic user interactions. An LLM generates user prompts based on scenario context, target models respond, and a judge LLM determines which value-aligned action was taken, enabling comparison of expressed versus revealed preferences.

Contribution

Methodology for eliciting and steering value rankings from language models

The authors develop a method to aggregate model preferences across value conflict scenarios into complete value rankings using Bradley-Terry models, and demonstrate how system prompts can steer models toward target rankings with moderate success (14% improvement in alignment).