Reward Models Inherit Value Biases from Pretraining

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: reward models, value alignment, finetuning, preference learning, large language models, RLHF, AI safety, bias, pretraining
Abstract:

Reward models (RMs) are central to aligning large language models (LLMs) with human values but have received less attention than pre-trained and post-trained LLMs themselves. Because RMs are initialized from LLMs, they inherit representations that shape their behavior, but the nature and extent of this influence remain understudied. In a comprehensive study of 10 leading open-weight RMs using validated psycholinguistic corpora, we show that RMs exhibit significant differences along multiple dimensions of human value as a function of their base model. Using the "Big Two" psychological axes, we show a robust preference of Llama RMs for "agency" and a corresponding robust preference of Gemma RMs for "communion." This phenomenon holds even when the preference data and finetuning process are identical, and we trace it back to the logits of the respective instruction-tuned and pre-trained models. These log-probability differences themselves can be formulated as an implicit RM; we derive usable implicit reward scores and show that they exhibit the very same agency/communion difference. We run experiments training RMs with ablations for preference data source and quantity, which demonstrate that this effect is not only repeatable but surprisingly durable. Despite RMs being designed to represent human preferences, our evidence shows that their outputs are influenced by the pretrained LLMs on which they are based. This work underscores the importance of safety and alignment efforts at the pretraining stage, and makes clear that open-source developers' choice of base model is as much a consideration of values as of performance.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

This paper investigates how reward models inherit value biases from their pretrained base language models, specifically demonstrating systematic differences along psychological dimensions of agency and communion across Llama and Gemma model families. It resides in the 'Inherited Value Biases from Pretraining' leaf, which contains only three papers total, indicating a relatively sparse research direction within the broader taxonomy. The work sits within the 'Bias Characterization and Measurement in Reward Models' branch, which itself represents one of several major organizational pillars in the field alongside mitigation strategies, benchmarking frameworks, and alignment optimization methods.

The taxonomy reveals neighboring leaves examining related but distinct bias phenomena: 'Idiosyncratic and Superficial Feature Biases' focuses on length and style preferences, while 'Training-Induced and Distribution Biases' addresses biases from preference annotation and fine-tuning procedures. The paper's emphasis on upstream pretrained representations distinguishes it from these adjacent directions. Nearby branches include 'Reward Model Interpretability and Analysis' and 'Implicit Reward Models and Alternative Formulations,' suggesting potential connections between understanding inherited biases and developing alternative reward formulations. The taxonomy's scope notes clarify that biases arising from preference data belong elsewhere, positioning this work specifically at the pretraining-to-reward-model inheritance boundary.

Across the twenty-one candidates examined through limited semantic search, the implicit reward model formulation was compared against ten candidates, three of which are potentially refutable; the psycholinguistic interpretability method was compared against a single candidate, with no clear refutation; and the controlled experiments on bias replicability were compared against ten candidates, none of which provides overlapping prior work. The implicit reward formulation thus appears to have the most substantial related literature within this limited search scope, though the analysis does not claim exhaustive coverage. The experimental demonstration of bias persistence under identical preference data and fine-tuning processes appears less directly addressed in the examined candidates, suggesting potential novelty in the controlled ablation methodology.

Based on the limited search scope of twenty-one semantically similar papers, the work appears to occupy a relatively under-explored intersection between pretrained model analysis and reward model behavior. The taxonomy structure indicates this is a sparse research direction with only two sibling papers, though the implicit reward formulation connects to existing work on alternative reward model architectures. The analysis covers top-K semantic matches and does not represent comprehensive field coverage, leaving open questions about related work in adjacent communities or recent preprints.

Taxonomy

Core-task Taxonomy Papers: 30
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 3

Research Landscape Overview

Core task: value biases in reward models from pretrained language models. The field has organized itself around several complementary perspectives. One major branch focuses on characterizing and measuring biases—examining how reward models inherit or amplify unwanted preferences from their pretrained foundations. Another branch develops benchmarking and evaluation frameworks (e.g., RewardBench[2], RM-Bench[4]) to systematically assess reward model quality and fairness. Mitigation strategies form a third pillar, exploring debiasing techniques and fairness-aware reward composition (Group Fairness Rewards[6], Fair Reward Composition[10]). Additional branches address interpretability and analysis, alignment optimization methods that leverage these models, alternative formulations such as implicit rewards or Q-function approaches, safety-oriented work targeting toxicity (Mitigating Toxicity Transformers[14]), domain-specific extensions, and theoretical surveys (Foundation Models Survey[18], Reward Modeling Landscape[21]) that synthesize emerging insights.

Within the bias characterization branch, a particularly active line of inquiry examines how pretraining corpora embed value judgments that persist through reward modeling. Value Biases Pretraining[0] sits squarely in this cluster, investigating inherited biases alongside neighbors like Relative Value Encoding[7] and Biased Reinforcement Learners[11]. While Relative Value Encoding[7] explores how models encode comparative preferences, Value Biases Pretraining[0] emphasizes the upstream origins of these preferences in pretrained representations. Biased Reinforcement Learners[11] complements this by studying how such biases propagate during reinforcement learning.
Across the broader landscape, open questions persist about whether mitigation should occur at pretraining, reward modeling, or policy optimization stages, and how to balance debiasing with maintaining task performance—a tension visible in works like Causal Rewards[5] and Calibrated Self-Rewarding[1].

Claimed Contributions

RM interpretability method using psycholinguistics

The authors introduce a method that combines exhaustive token search with psycholinguistic corpora (Big Two and Moral Foundations Dictionary) to quantify value biases in reward models. This approach maps token-level rewards to psychological constructs representing dimensions of human value.

Retrieved papers: 1
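The lexicon-based scoring idea can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the lexicon entries and the `reward_fn` placeholder are hypothetical stand-ins for a validated corpus (e.g., the Big Two lexicon) and an actual reward model's scalar output.

```python
# Sketch: score every token in a psycholinguistic lexicon with a reward model,
# then compare mean rewards across value categories (agency vs. communion).
from statistics import mean

# Toy stand-in for a validated psycholinguistic corpus.
LEXICON = {
    "agency": ["ambitious", "assertive", "decisive"],
    "communion": ["caring", "friendly", "loyal"],
}

def reward_fn(token: str) -> float:
    """Placeholder for an RM's scalar reward on a single-token completion."""
    # Arbitrary deterministic stand-in so the sketch runs end to end.
    return len(token) / 10.0

def category_scores(lexicon, reward_fn):
    """Mean token-level reward per psychological category."""
    return {cat: mean(reward_fn(t) for t in toks) for cat, toks in lexicon.items()}

def value_bias(scores, a="agency", b="communion"):
    """Signed preference: positive means the RM favors `a` over `b`."""
    return scores[a] - scores[b]

scores = category_scores(LEXICON, reward_fn)
bias = value_bias(scores)
```

A real analysis would replace `reward_fn` with a forward pass through the reward model and the toy word lists with the full dictionaries, but the aggregation logic is the same.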
Implicit reward model formulation from log-probability differences

The authors formalize the difference between two language models' log probabilities as an implicit reward model and introduce a mixture-weighted log-ratio (MWLR) score to make these implicit rewards empirically usable. They demonstrate that these implicit reward scores reveal the same agency/communion biases observed in explicit reward models.

Retrieved papers: 10 · Can refute
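The log-ratio construction behind this contribution can be sketched in a few lines. The exact mixture-weighted log-ratio (MWLR) score is not reproduced here; the snippet below shows the underlying DPO-style identity that the difference of two models' sequence log-probabilities, scaled by a coefficient `beta`, behaves as an implicit reward.

```python
# Sketch: implicit reward r(x, y) = beta * (log p_tuned(y|x) - log p_base(y|x)).
# Per-token probabilities below are toy values, not real model outputs.
import math

def sequence_logprob(token_probs):
    """Sum of per-token log probabilities for one response."""
    return sum(math.log(p) for p in token_probs)

def implicit_reward(tuned_probs, base_probs, beta=1.0):
    """Log-ratio of two models' sequence probabilities, scaled by beta."""
    return beta * (sequence_logprob(tuned_probs) - sequence_logprob(base_probs))

# Toy per-token probabilities for one response under two models.
tuned = [0.9, 0.8, 0.7]   # instruction-tuned model
base = [0.5, 0.5, 0.5]    # pretrained base model

r = implicit_reward(tuned, base)  # positive: the tuned model prefers this response
```

Scoring lexicon tokens with such a log-ratio, rather than an explicit reward head, is what lets the paper compare implicit and explicit RMs on the same agency/communion axes.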
Controlled experiments demonstrating replicability and durability of inherited value biases

The authors conduct systematic experiments training reward models from different base models (Llama and Gemma) with identical hyperparameters and controlled variations in preference data source and quantity. These experiments demonstrate that value biases inherited from pretraining persist through reward modeling and require substantial preference data to mitigate.

Retrieved papers: 10
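For context, reward models in such ablations are typically trained with the standard Bradley-Terry pairwise objective; the snippet below illustrates that objective, with toy reward values rather than the paper's actual training configuration.

```python
# Sketch of the Bradley-Terry loss used for pairwise preference training:
# minimize -log sigmoid(r_chosen - r_rejected).
import math

def bt_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood that the chosen response outranks the rejected one."""
    # P(chosen > rejected) = sigmoid(r_chosen - r_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# The loss shrinks as the reward margin grows, and equals log(2) at zero margin.
wide_margin = bt_loss(2.0, -1.0)
no_margin = bt_loss(0.5, 0.5)
```

Because this objective only constrains reward *differences* on the preference pairs, any value-laden prior baked into the base model's representations can survive training unless the preference data explicitly pushes against it — which is the durability the ablations probe.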


