Using cognitive models to reveal value trade-offs in language models

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: cognitive modeling · value tradeoffs · RLHF training dynamics
Abstract:

Value trade-offs are an integral part of human decision-making and language use; however, current tools for interpreting such dynamic and multi-faceted notions of values in LLMs are limited. In cognitive science, so-called “cognitive models” provide formal accounts of such trade-offs in humans by modeling the weighting of a speaker’s competing utility functions in choosing an action or utterance. Here we use a leading cognitive model of polite speech to systematically evaluate value trade-offs in two encompassing model settings: degrees of reasoning “effort” in frontier black-box models, and RL post-training dynamics of open-source models. Our results highlight patterns of higher informational utility than social utility in reasoning models’ default behavior, and demonstrate that these patterns shift in predictable ways when models are prompted to prioritize certain goals over others. Our findings from LLMs’ training dynamics suggest large shifts in utility values early in training, with persistent effects of the choice of base model and pretraining data compared to the feedback dataset or alignment method. We show that our method is responsive to diverse aspects of the rapidly evolving LLM landscape, with insights for forming hypotheses about other social behaviors such as sycophancy and for shaping training regimes that better control trade-offs between values during model development.

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper applies cognitive models from human decision-making research to interpret value trade-offs in LLMs, specifically using a politeness model to quantify informational versus social utility. It resides in the Cognitive Model-Based Value Trade-off Interpretation leaf, which contains only two papers total. This represents a sparse research direction within the broader Behavioral Trade-off Analysis branch, suggesting the cognitive modeling approach to LLM value alignment is relatively underexplored compared to empirical benchmarking or technical intervention methods that dominate neighboring areas.

The taxonomy reveals that most behavioral trade-off work focuses on specific technical tensions—safety versus capability, accuracy versus fairness, privacy versus utility—rather than cognitive frameworks for interpreting multi-dimensional value conflicts. The paper's leaf sits alongside general Behavioral Trade-off Analysis but diverges from purely outcome-focused evaluations by emphasizing mechanistic accounts of how models weight competing utilities. Neighboring leaves like Safety-Capability Trade-offs and Accuracy-Fairness Trade-offs examine similar tensions but lack the cognitive modeling lens, while Value Alignment Assessment branches focus on measurement frameworks rather than interpretive models of decision processes.

Among thirty candidates examined across three contributions, none were identified as clearly refuting the work. The first contribution—applying cognitive models to LLMs—examined ten candidates with zero refutable matches, suggesting limited prior work directly combining cognitive science frameworks with LLM value analysis at this level of formalism. The second contribution on reasoning effort and training dynamics similarly found no refutations across ten candidates, indicating the systematic evaluation of utility shifts across model settings may be novel. The third contribution's method for hypothesis formation about social behaviors also showed no overlapping prior work among ten examined papers, though the limited search scope means exhaustive coverage cannot be claimed.

Given the sparse taxonomy position and absence of refutations within the examined candidate set, the work appears to occupy relatively unexplored methodological territory. However, the analysis is constrained by top-thirty semantic search results and does not guarantee comprehensive coverage of adjacent cognitive science or interpretability literature. The novelty assessment reflects what is visible within this limited scope rather than an exhaustive field survey.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: evaluating value trade-offs in language model behavior. The field has organized itself into several major branches that reflect different facets of alignment research. Value Alignment Assessment and Measurement focuses on benchmarking and quantifying how well models reflect human values, often through structured evaluations and cultural or ethical probes (e.g., Cultural Value Alignment[5], ValueCompass[19]). Behavioral Trade-off Analysis examines the inherent tensions that arise when models must balance competing objectives—such as accuracy versus fairness (Accuracy Fairness Tradeoff[3]), safety versus capability (Safety Capability Tradeoffs[7]), or privacy versus utility (Privacy Utility Efficiency[12]). Alignment Interventions and Training Dynamics investigates how different training regimes and fine-tuning strategies shape value priorities, while Robustness and Adversarial Challenges explores how alignment holds up under attack or distributional shift. Application-Specific Value Alignment tailors these concerns to domains like mental health or agentic systems, and Technical Infrastructure provides the methodological scaffolding—datasets, metrics, and frameworks—that underpins empirical work across all branches.

Within Behavioral Trade-off Analysis, a particularly active line of work examines how models navigate conflicting values in realistic decision scenarios, often drawing on moral dilemmas (DailyDilemmas[25], CLASH Dilemmas[39]) or game-theoretic settings (Machiavelli Rewards Ethics[1]). Another strand investigates technical trade-offs such as watermarking's impact on generation quality (Watermarking Performance Tradeoffs[10]) or the interplay between safety constraints and model expressiveness (Tunable Safety Performance[24]).

Cognitive Value Tradeoffs[0] sits within the Cognitive Model-Based Value Trade-off Interpretation cluster, emphasizing how cognitive frameworks can illuminate the internal reasoning processes that lead to particular trade-off resolutions. This approach contrasts with purely behavioral or outcome-focused methods, offering a more mechanistic lens on why models prioritize certain values over others. Nearby work like Cognitive Wolves[38] similarly explores cognitive architectures, suggesting a small but growing interest in interpretability-driven accounts of value conflict resolution.

Claimed Contributions

Application of cognitive models to reveal value trade-offs in LLMs

The authors apply a well-established cognitive model (the Rational Speech Acts model of polite speech) to interpret and quantify value trade-offs in large language models. The method is used to analyze both closed-source reasoning models and open-source models across different training stages.

Retrieved candidate papers: 10
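
To make the underlying formalism concrete, the following is a minimal sketch of the polite-speech Rational Speech Acts speaker, which scores each candidate utterance by a weighted mix of informational utility (how accurately it conveys the true state) and social utility (how good it makes the listener feel). The meaning matrix, value function, and rationality parameter `lam` below are illustrative assumptions, not values from the paper.

```python
import numpy as np

# States: true quality of the listener's work (e.g., 1-5 hearts).
states = np.arange(1, 6)

# Utterances and their literal semantics: P(utterance applies | state).
# These meaning probabilities are illustrative placeholders, not fitted values.
utterances = ["terrible", "bad", "okay", "good", "amazing"]
meanings = np.array([
    [0.95, 0.85, 0.02, 0.02, 0.02],   # "terrible"
    [0.85, 0.95, 0.02, 0.02, 0.02],   # "bad"
    [0.02, 0.25, 0.95, 0.65, 0.35],   # "okay"
    [0.02, 0.05, 0.55, 0.95, 0.93],   # "good"
    [0.02, 0.02, 0.35, 0.93, 0.99],   # "amazing"
])

def literal_listener(meanings, prior=None):
    """L0(s | u): posterior over states given a literal reading of u."""
    prior = np.ones(meanings.shape[1]) if prior is None else prior
    scores = meanings * prior
    return scores / scores.sum(axis=1, keepdims=True)

def speaker(state_idx, phi, lam=3.0):
    """S1(u | s) ∝ exp(lam * [phi * U_inf + (1 - phi) * U_soc])."""
    L0 = literal_listener(meanings)
    u_inf = np.log(L0[:, state_idx])                   # informational utility
    values = (states - states.mean()) / states.std()   # listener's value of each state
    u_soc = L0 @ values                                # expected value conveyed
    utility = phi * u_inf + (1 - phi) * u_soc
    scores = np.exp(lam * utility)
    return scores / scores.sum()

# A purely informative speaker (phi=1.0) vs. a polite one (phi=0.2),
# both describing genuinely bad work (state = 1):
for phi in (1.0, 0.2):
    probs = speaker(state_idx=0, phi=phi)
    print(f"phi={phi}:", dict(zip(utterances, probs.round(2))))
```

Sweeping `phi` and comparing the resulting utterance distributions against an LLM's actual choices is what allows a single fitted weight to summarize the model's trade-off.
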
Systematic evaluation of reasoning effort and training dynamics on utility trade-offs

The authors provide empirical findings on how reasoning budgets and goal-based prompts affect utility weightings in frontier models, and show that during RL post-training, the choice of base model and pretraining data influences utility trade-offs more than the feedback dataset or alignment method does.

Retrieved candidate papers: 10
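
One plausible way to quantify such weightings, sketched below under assumptions (this is not necessarily the authors' exact estimator): fit the weight `phi` by maximum likelihood against the utterances a model actually produces, reusing the `speaker` function from the sketch above. The `observations` format and the scipy-based optimizer are hypothetical choices for illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_phi(observations, speaker_fn):
    """Maximum-likelihood estimate of the informational-utility weight phi.

    observations: list of (state_idx, utterance_idx) pairs, each recording a
    scenario's true state and the utterance the LLM chose for it.
    """
    def neg_log_lik(phi):
        return -sum(
            np.log(speaker_fn(s, phi)[u] + 1e-12)  # small constant guards log(0)
            for s, u in observations
        )
    res = minimize_scalar(neg_log_lik, bounds=(0.0, 1.0), method="bounded")
    return res.x

# Hypothetical usage: refit per reasoning budget or per RL checkpoint, then
# plot phi over that axis to trace how the trade-off shifts.
# phi_hat = fit_phi(observations, speaker)
```
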
Method for forming hypotheses about social behaviors and shaping training regimes

The authors demonstrate that their cognitive modeling approach can be used to generate testable hypotheses about high-level social behaviors like sycophancy and to inform the design of training procedures that better manage value trade-offs in LLM development.

Retrieved candidate papers: 10
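
As one illustration of how the approach could generate a testable hypothesis (all numbers below are placeholders, not results from the paper): if sycophancy reflects an inflated social-utility weight, then fitted values of `1 - phi` should track an independent sycophancy measure across training checkpoints.

```python
import numpy as np

# Placeholder values for illustration only: fitted social weights (1 - phi)
# and a hypothetical sycophancy score at five RL checkpoints. Real values
# would come from fit_phi above plus a separate sycophancy evaluation.
social_weights = np.array([0.30, 0.42, 0.55, 0.58, 0.61])
sycophancy_scores = np.array([0.12, 0.20, 0.33, 0.31, 0.40])

# The testable prediction is a positive correlation across checkpoints.
r = np.corrcoef(social_weights, sycophancy_scores)[0, 1]
print(f"Pearson r between social weight and sycophancy: {r:.2f}")
```
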

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Application of cognitive models to reveal value trade-offs in LLMs

Contribution 2: Systematic evaluation of reasoning effort and training dynamics on utility trade-offs

Contribution 3: Method for forming hypotheses about social behaviors and shaping training regimes

Each contribution is described in full under Claimed Contributions above; of the ten candidate papers retrieved per contribution, none was identified as refuting the claim.