Reward Models Inherit Value Biases from Pretraining

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: reward models, value alignment, finetuning, preference learning, large language models, RLHF, AI safety, bias, pretraining
Abstract:

Reward models (RMs) are central to aligning large language models (LLMs) with human values but have received less attention than pre-trained and post-trained LLMs themselves. Because RMs are initialized from LLMs, they inherit representations that shape their behavior, but the nature and extent of this influence remain understudied. In a comprehensive study of 10 leading open-weight RMs using validated psycholinguistic corpora, we show that RMs exhibit significant differences along multiple dimensions of human value as a function of their base model. Using the "Big Two" psychological axes, we show a robust preference of Llama RMs for "agency" and a corresponding robust preference of Gemma RMs for "communion." This phenomenon holds even when the preference data and finetuning process are identical, and we trace it back to the logits of the respective instruction-tuned and pre-trained models. These log-probability differences themselves can be formulated as an implicit RM; we derive usable implicit reward scores and show that they exhibit the very same agency/communion difference. We run experiments training RMs with ablations for preference data source and quantity, which demonstrate that this effect is not only repeatable but surprisingly durable. Despite RMs being designed to represent human preferences, our evidence shows that their outputs are influenced by the pretrained LLMs on which they are based. This work underscores the importance of safety and alignment efforts at the pretraining stage, and makes clear that open-source developers' choice of base model is as much a consideration of values as of performance.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

This paper investigates how reward models inherit value biases from their pretrained base language models, specifically demonstrating systematic differences along psychological dimensions of agency and communion across Llama and Gemma model families. It resides in the 'Inherited Value Biases from Pretraining' leaf, which contains only three papers total, indicating a relatively sparse research direction within the broader taxonomy. The work sits within the 'Bias Characterization and Measurement in Reward Models' branch, which itself represents one of several major organizational pillars in the field alongside mitigation strategies, benchmarking frameworks, and alignment optimization methods.

The taxonomy reveals neighboring leaves examining related but distinct bias phenomena: 'Idiosyncratic and Superficial Feature Biases' focuses on length and style preferences, while 'Training-Induced and Distribution Biases' addresses biases from preference annotation and fine-tuning procedures. The paper's emphasis on upstream pretrained representations distinguishes it from these adjacent directions. Nearby branches include 'Reward Model Interpretability and Analysis' and 'Implicit Reward Models and Alternative Formulations,' suggesting potential connections between understanding inherited biases and developing alternative reward formulations. The taxonomy's scope notes clarify that biases arising from preference data belong elsewhere, positioning this work specifically at the pretraining-to-reward-model inheritance boundary.

Across the twenty-one candidates examined through limited semantic search, the implicit reward model formulation was compared against ten candidates, three of which are potentially refutable; the psycholinguistic interpretability method was compared against a single candidate, with no clear refutation; and the controlled experiments on bias replicability were compared against ten candidates, none of which provides overlapping prior work. The implicit reward formulation thus appears to have the most substantial related literature within this limited search scope, though the analysis does not claim exhaustive coverage. The experimental demonstration of bias persistence under identical preference data and fine-tuning processes appears less directly addressed in the examined candidates, suggesting potential novelty in the controlled ablation methodology.

Based on the limited search scope of twenty-one semantically similar papers, the work appears to occupy a relatively under-explored intersection between pretrained model analysis and reward model behavior. The taxonomy structure indicates this is a sparse research direction with only two sibling papers, though the implicit reward formulation connects to existing work on alternative reward model architectures. The analysis covers top-K semantic matches and does not represent comprehensive field coverage, leaving open questions about related work in adjacent communities or recent preprints.

Taxonomy

Core-task Taxonomy Papers: 30
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 3

Research Landscape Overview

Core task: value biases in reward models from pretrained language models. The field has organized itself around several complementary perspectives. One major branch focuses on characterizing and measuring biases—examining how reward models inherit or amplify unwanted preferences from their pretrained foundations. Another branch develops benchmarking and evaluation frameworks (e.g., RewardBench[2], RM-Bench[4]) to systematically assess reward model quality and fairness. Mitigation strategies form a third pillar, exploring debiasing techniques and fairness-aware reward composition (Group Fairness Rewards[6], Fair Reward Composition[10]). Additional branches address interpretability and analysis, alignment optimization methods that leverage these models, alternative formulations such as implicit rewards or Q-function approaches, safety-oriented work targeting toxicity (Mitigating Toxicity Transformers[14]), domain-specific extensions, and theoretical surveys (Foundation Models Survey[18], Reward Modeling Landscape[21]) that synthesize emerging insights.

Within the bias characterization branch, a particularly active line of inquiry examines how pretraining corpora embed value judgments that persist through reward modeling. Value Biases Pretraining[0] sits squarely in this cluster, investigating inherited biases alongside neighbors like Relative Value Encoding[7] and Biased Reinforcement Learners[11]. While Relative Value Encoding[7] explores how models encode comparative preferences, Value Biases Pretraining[0] emphasizes the upstream origins of these preferences in pretrained representations. Biased Reinforcement Learners[11] complements this by studying how such biases propagate during reinforcement learning.
Across the broader landscape, open questions persist about whether mitigation should occur at pretraining, reward modeling, or policy optimization stages, and how to balance debiasing with maintaining task performance—a tension visible in works like Causal Rewards[5] and Calibrated Self-Rewarding[1].

Claimed Contributions

RM interpretability method using psycholinguistics

The authors introduce a method that combines exhaustive token search with psycholinguistic corpora (Big Two and Moral Foundations Dictionary) to quantify value biases in reward models. This approach maps token-level rewards to psychological constructs representing dimensions of human value.

Retrieved papers: 1
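The lexicon-based scoring idea can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the lexicon entries and the `reward_fn` placeholder are hypothetical stand-ins for a validated corpus (e.g., the Big Two lexicon) and an actual reward model's scalar output.

```python
# Sketch: score every token in a psycholinguistic lexicon with a reward model,
# then compare mean rewards across value categories (agency vs. communion).
from statistics import mean

# Toy stand-in for a validated psycholinguistic corpus.
LEXICON = {
    "agency": ["ambitious", "assertive", "decisive"],
    "communion": ["caring", "friendly", "loyal"],
}

def reward_fn(token: str) -> float:
    """Placeholder for an RM's scalar reward on a single-token completion."""
    # Arbitrary deterministic stand-in so the sketch runs end to end.
    return len(token) / 10.0

def category_scores(lexicon, reward_fn):
    """Mean token-level reward per psychological category."""
    return {cat: mean(reward_fn(t) for t in toks) for cat, toks in lexicon.items()}

def value_bias(scores, a="agency", b="communion"):
    """Signed preference: positive means the RM favors `a` over `b`."""
    return scores[a] - scores[b]

scores = category_scores(LEXICON, reward_fn)
bias = value_bias(scores)
```

A real analysis would replace `reward_fn` with a forward pass through the reward model and the toy word lists with the full dictionaries, but the aggregation logic is the same.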
Implicit reward model formulation from log-probability differences

The authors formalize the difference between two language models' log probabilities as an implicit reward model and introduce a mixture-weighted log-ratio (MWLR) score to make these implicit rewards empirically usable. They demonstrate that these implicit reward scores reveal the same agency/communion biases observed in explicit reward models.

Retrieved papers: 10 · Can refute
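The log-ratio construction behind this contribution can be sketched in a few lines. The exact mixture-weighted log-ratio (MWLR) score is not reproduced here; the snippet below shows the underlying DPO-style identity that the difference of two models' sequence log-probabilities, scaled by a coefficient `beta`, behaves as an implicit reward.

```python
# Sketch: implicit reward r(x, y) = beta * (log p_tuned(y|x) - log p_base(y|x)).
# Per-token probabilities below are toy values, not real model outputs.
import math

def sequence_logprob(token_probs):
    """Sum of per-token log probabilities for one response."""
    return sum(math.log(p) for p in token_probs)

def implicit_reward(tuned_probs, base_probs, beta=1.0):
    """Log-ratio of two models' sequence probabilities, scaled by beta."""
    return beta * (sequence_logprob(tuned_probs) - sequence_logprob(base_probs))

# Toy per-token probabilities for one response under two models.
tuned = [0.9, 0.8, 0.7]   # instruction-tuned model
base = [0.5, 0.5, 0.5]    # pretrained base model

r = implicit_reward(tuned, base)  # positive: the tuned model prefers this response
```

Scoring lexicon tokens with such a log-ratio, rather than an explicit reward head, is what lets the paper compare implicit and explicit RMs on the same agency/communion axes.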
Controlled experiments demonstrating replicability and durability of inherited value biases

The authors conduct systematic experiments training reward models from different base models (Llama and Gemma) with identical hyperparameters and controlled variations in preference data source and quantity. These experiments demonstrate that value biases inherited from pretraining persist through reward modeling and require substantial preference data to mitigate.

Retrieved papers: 10
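For context, reward models in such ablations are typically trained with the standard Bradley-Terry pairwise objective; the snippet below illustrates that objective, with toy reward values rather than the paper's actual training configuration.

```python
# Sketch of the Bradley-Terry loss used for pairwise preference training:
# minimize -log sigmoid(r_chosen - r_rejected).
import math

def bt_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood that the chosen response outranks the rejected one."""
    # P(chosen > rejected) = sigmoid(r_chosen - r_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# The loss shrinks as the reward margin grows, and equals log(2) at zero margin.
wide_margin = bt_loss(2.0, -1.0)
no_margin = bt_loss(0.5, 0.5)
```

Because this objective only constrains reward *differences* on the preference pairs, any value-laden prior baked into the base model's representations can survive training unless the preference data explicitly pushes against it — which is the durability the ablations probe.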


