Using cognitive models to reveal value trade-offs in language models

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: cognitive modeling · value tradeoffs · RLHF training dynamics
Abstract:

Value trade-offs are an integral part of human decision-making and language use; however, current tools for interpreting such dynamic and multi-faceted notions of values in LLMs are limited. In cognitive science, so-called “cognitive models” provide formal accounts of such trade-offs in humans by modeling the weighting of a speaker’s competing utility functions in choosing an action or utterance. Here we use a leading cognitive model of polite speech to systematically evaluate value trade-offs in two encompassing model settings: degrees of reasoning “effort” in frontier black-box models, and RL post-training dynamics of open-source models. Our results highlight patterns of higher informational utility than social utility in reasoning models’ default behavior, and demonstrate that these patterns shift in predictable ways when models are prompted to prioritize certain goals over others. Our findings from LLMs’ training dynamics suggest large shifts in utility values early in training, with persistent effects of the choice of base model and pretraining data compared to the feedback dataset or alignment method. We show that our method is responsive to diverse aspects of the rapidly evolving LLM landscape, with insights for forming hypotheses about other social behaviors such as sycophancy and for shaping training regimes that better control trade-offs between values during model development.

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper applies cognitive models from human decision-making research to interpret value trade-offs in LLMs, specifically using a politeness model to quantify informational versus social utility. It resides in the Cognitive Model-Based Value Trade-off Interpretation leaf, which contains only two papers total. This represents a sparse research direction within the broader Behavioral Trade-off Analysis branch, suggesting the cognitive modeling approach to LLM value alignment is relatively underexplored compared to empirical benchmarking or technical intervention methods that dominate neighboring areas.

The taxonomy reveals that most behavioral trade-off work focuses on specific technical tensions—safety versus capability, accuracy versus fairness, privacy versus utility—rather than cognitive frameworks for interpreting multi-dimensional value conflicts. The paper's leaf sits alongside general Behavioral Trade-off Analysis but diverges from purely outcome-focused evaluations by emphasizing mechanistic accounts of how models weight competing utilities. Neighboring leaves like Safety-Capability Trade-offs and Accuracy-Fairness Trade-offs examine similar tensions but lack the cognitive modeling lens, while Value Alignment Assessment branches focus on measurement frameworks rather than interpretive models of decision processes.

Among thirty candidates examined across three contributions, none were identified as clearly refuting the work. The first contribution—applying cognitive models to LLMs—examined ten candidates with zero refutable matches, suggesting limited prior work directly combining cognitive science frameworks with LLM value analysis at this level of formalism. The second contribution on reasoning effort and training dynamics similarly found no refutations across ten candidates, indicating the systematic evaluation of utility shifts across model settings may be novel. The third contribution's method for hypothesis formation about social behaviors also showed no overlapping prior work among ten examined papers, though the limited search scope means exhaustive coverage cannot be claimed.

Given the sparse taxonomy position and absence of refutations within the examined candidate set, the work appears to occupy relatively unexplored methodological territory. However, the analysis is constrained by top-thirty semantic search results and does not guarantee comprehensive coverage of adjacent cognitive science or interpretability literature. The novelty assessment reflects what is visible within this limited scope rather than an exhaustive field survey.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: evaluating value trade-offs in language model behavior. The field has organized itself into several major branches that reflect different facets of alignment research. Value Alignment Assessment and Measurement focuses on benchmarking and quantifying how well models reflect human values, often through structured evaluations and cultural or ethical probes (e.g., Cultural Value Alignment[5], ValueCompass[19]). Behavioral Trade-off Analysis examines the inherent tensions that arise when models must balance competing objectives—such as accuracy versus fairness (Accuracy Fairness Tradeoff[3]), safety versus capability (Safety Capability Tradeoffs[7]), or privacy versus utility (Privacy Utility Efficiency[12]). Alignment Interventions and Training Dynamics investigates how different training regimes and fine-tuning strategies shape value priorities, while Robustness and Adversarial Challenges explores how alignment holds up under attack or distributional shift. Application-Specific Value Alignment tailors these concerns to domains like mental health or agentic systems, and Technical Infrastructure provides the methodological scaffolding—datasets, metrics, and frameworks—that underpins empirical work across all branches.

Within Behavioral Trade-off Analysis, a particularly active line of work examines how models navigate conflicting values in realistic decision scenarios, often drawing on moral dilemmas (DailyDilemmas[25], CLASH Dilemmas[39]) or game-theoretic settings (Machiavelli Rewards Ethics[1]). Another strand investigates technical trade-offs such as watermarking's impact on generation quality (Watermarking Performance Tradeoffs[10]) or the interplay between safety constraints and model expressiveness (Tunable Safety Performance[24]).

Cognitive Value Tradeoffs[0] sits within the Cognitive Model-Based Value Trade-off Interpretation cluster, emphasizing how cognitive frameworks can illuminate the internal reasoning processes that lead to particular trade-off resolutions. This approach contrasts with purely behavioral or outcome-focused methods, offering a more mechanistic lens on why models prioritize certain values over others. Nearby work like Cognitive Wolves[38] similarly explores cognitive architectures, suggesting a small but growing interest in interpretability-driven accounts of value conflict resolution.

Claimed Contributions

Application of cognitive models to reveal value trade-offs in LLMs

The authors apply a well-established cognitive model (the Rational Speech Acts model of polite speech) to interpret and quantify value trade-offs in large language models. The method is used to analyze both closed-source reasoning models and open-source models across different training stages.

Retrieved candidate papers: 10
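
To make the underlying formalism concrete, the following is a minimal sketch of the polite-speech Rational Speech Acts speaker, which scores each candidate utterance by a weighted mix of informational utility (how accurately it conveys the true state) and social utility (how good it makes the listener feel). The meaning matrix, value function, and rationality parameter `lam` below are illustrative assumptions, not values from the paper.

```python
import numpy as np

# States: true quality of the listener's work (e.g., 1-5 hearts).
states = np.arange(1, 6)

# Utterances and their literal semantics: P(utterance applies | state).
# These meaning probabilities are illustrative placeholders, not fitted values.
utterances = ["terrible", "bad", "okay", "good", "amazing"]
meanings = np.array([
    [0.95, 0.85, 0.02, 0.02, 0.02],   # "terrible"
    [0.85, 0.95, 0.02, 0.02, 0.02],   # "bad"
    [0.02, 0.25, 0.95, 0.65, 0.35],   # "okay"
    [0.02, 0.05, 0.55, 0.95, 0.93],   # "good"
    [0.02, 0.02, 0.35, 0.93, 0.99],   # "amazing"
])

def literal_listener(meanings, prior=None):
    """L0(s | u): posterior over states given a literal reading of u."""
    prior = np.ones(meanings.shape[1]) if prior is None else prior
    scores = meanings * prior
    return scores / scores.sum(axis=1, keepdims=True)

def speaker(state_idx, phi, lam=3.0):
    """S1(u | s) ∝ exp(lam * [phi * U_inf + (1 - phi) * U_soc])."""
    L0 = literal_listener(meanings)
    u_inf = np.log(L0[:, state_idx])                   # informational utility
    values = (states - states.mean()) / states.std()   # listener's value of each state
    u_soc = L0 @ values                                # expected value conveyed
    utility = phi * u_inf + (1 - phi) * u_soc
    scores = np.exp(lam * utility)
    return scores / scores.sum()

# A purely informative speaker (phi=1.0) vs. a polite one (phi=0.2),
# both describing genuinely bad work (state = 1):
for phi in (1.0, 0.2):
    probs = speaker(state_idx=0, phi=phi)
    print(f"phi={phi}:", dict(zip(utterances, probs.round(2))))
```

Sweeping `phi` and comparing the resulting utterance distributions against an LLM's actual choices is what allows a single fitted weight to summarize the model's trade-off.
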
Systematic evaluation of reasoning effort and training dynamics on utility trade-offs

The authors provide empirical findings on how reasoning budgets and goal-based prompts affect utility weightings in frontier models, and show that during RL post-training, the choice of base model and pretraining data influences utility trade-offs more than the feedback dataset or alignment method does.

Retrieved candidate papers: 10
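
One plausible way to quantify such weightings, sketched below under assumptions (this is not necessarily the authors' exact estimator): fit the weight `phi` by maximum likelihood against the utterances a model actually produces, reusing the `speaker` function from the sketch above. The `observations` format and the scipy-based optimizer are hypothetical choices for illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_phi(observations, speaker_fn):
    """Maximum-likelihood estimate of the informational-utility weight phi.

    observations: list of (state_idx, utterance_idx) pairs, each recording a
    scenario's true state and the utterance the LLM chose for it.
    """
    def neg_log_lik(phi):
        return -sum(
            np.log(speaker_fn(s, phi)[u] + 1e-12)  # small constant guards log(0)
            for s, u in observations
        )
    res = minimize_scalar(neg_log_lik, bounds=(0.0, 1.0), method="bounded")
    return res.x

# Hypothetical usage: refit per reasoning budget or per RL checkpoint, then
# plot phi over that axis to trace how the trade-off shifts.
# phi_hat = fit_phi(observations, speaker)
```
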
Method for forming hypotheses about social behaviors and shaping training regimes

The authors demonstrate that their cognitive modeling approach can be used to generate testable hypotheses about high-level social behaviors like sycophancy and to inform the design of training procedures that better manage value trade-offs in LLM development.

Retrieved candidate papers: 10
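
As one illustration of how the approach could generate a testable hypothesis (all numbers below are placeholders, not results from the paper): if sycophancy reflects an inflated social-utility weight, then fitted values of `1 - phi` should track an independent sycophancy measure across training checkpoints.

```python
import numpy as np

# Placeholder values for illustration only: fitted social weights (1 - phi)
# and a hypothetical sycophancy score at five RL checkpoints. Real values
# would come from fit_phi above plus a separate sycophancy evaluation.
social_weights = np.array([0.30, 0.42, 0.55, 0.58, 0.61])
sycophancy_scores = np.array([0.12, 0.20, 0.33, 0.31, 0.40])

# The testable prediction is a positive correlation across checkpoints.
r = np.corrcoef(social_weights, sycophancy_scores)[0, 1]
print(f"Pearson r between social weight and sycophancy: {r:.2f}")
```
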

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Application of cognitive models to reveal value trade-offs in LLMs

Contribution 2: Systematic evaluation of reasoning effort and training dynamics on utility trade-offs

Contribution 3: Method for forming hypotheses about social behaviors and shaping training regimes

Each contribution is described in full under Claimed Contributions above; of the ten candidate papers retrieved per contribution, none was identified as refuting the claim.