Abstract:

Reward models (RMs) are crucial in reinforcement learning from human feedback (RLHF) for aligning large language models (LLMs) with human values. However, RM training data is commonly recognized as low-quality, often containing preference conflicts and inductive biases, such as response length or speaking style, which can easily lead to reward overfitting and reward hacking. Recent RM debiasing methods either target only a single type of preference bias or address only simple linear bias relations, such as Pearson correlations. To mitigate more complicated inductive biases in reward modeling, we draw inspiration from the information bottleneck and introduce a novel information-theoretic debiasing method, Debiasing via Information optimization for RM (DIR). More specifically, our method trains RMs by maximizing the mutual information (MI) between preference predictions and input response pairs while minimizing the MI between RM outputs and biased attributes of the preference inputs. Grounded in information theory, DIR can handle different types of bias with more comprehensive non-linear correlations, broadening its real-world applicability. In experiments, we verify the effectiveness of DIR on three types of inductive bias: response length, sycophancy, and format. The numerical results show that DIR not only effectively diminishes the targeted inductive biases but also improves RLHF performance on various benchmarks with better generalization.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces DIR, an information-theoretic framework for mitigating multiple types of inductive bias in reward models through mutual information optimization. It resides in the Information-Theoretic Debiasing leaf, which contains only two papers including this one. This represents a relatively sparse research direction within the broader taxonomy of 20 papers across multiple bias mitigation strategies. The sibling paper in this leaf shares the information-theoretic foundation, suggesting this is an emerging but not yet crowded approach to the general problem of reward model debiasing.

The taxonomy reveals that DIR sits within Multi-Attribute and General Inductive Bias Mitigation, distinguishing it from the more populated Length Bias Mitigation branch which addresses single-attribute confounders. Neighboring leaves include Reward Shaping Techniques and LLM-Based Reward Shaping, which tackle bias through different mechanisms rather than information-theoretic optimization. The scope note explicitly positions information-theoretic methods as handling nonlinear correlations across multiple bias types, while excluding approaches not grounded in information theory. This boundary suggests DIR occupies a methodologically distinct niche compared to adjacent reward shaping or disentanglement strategies found in other branches.

Among 24 candidates examined across three contributions, the analysis found limited prior work overlap. The core information-theoretic framework examined 10 candidates with 1 appearing refutable, while the dual-bound optimization strategy examined 4 candidates with none refutable. The explicit information-theoretic debiasing contribution examined 10 candidates with 2 potentially overlapping works. These statistics indicate that within the top-24 semantic matches, most contributions face minimal direct precedent, though the framework-level contribution shows some prior exploration. The dual-bound optimization appears most distinctive among the examined candidates, while the broader information-theoretic framing has more established antecedents.

Based on this limited search scope of 24 candidates, the work appears to occupy a methodologically distinct position emphasizing information theory for multi-attribute bias mitigation. The sparse population of its taxonomy leaf and low refutation rates suggest relative novelty, though the analysis cannot claim exhaustiveness beyond top-ranked semantic matches. The framework's generality to nonlinear correlations differentiates it from single-bias methods, but the extent of this advantage over the examined prior work remains an open empirical question.

Taxonomy

Core-task Taxonomy Papers: 20
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 3

Research Landscape Overview

Core task: Mitigating inductive bias in reward models for reinforcement learning from human feedback. The field addresses systematic distortions that arise when reward models learn spurious correlations from human preferences rather than true alignment objectives. The taxonomy reveals a diverse landscape organized around several major themes.

Length Bias Mitigation tackles the well-known tendency of models to favor longer responses regardless of quality, while Multi-Attribute and General Inductive Bias Mitigation encompasses broader debiasing strategies that address multiple confounding factors simultaneously. Human Feedback Quality and Heterogeneity examines challenges stemming from noisy or inconsistent annotator signals, and Algorithmic and Optimization Biases focuses on distortions introduced by training procedures themselves. Self-Alignment and Minimal Supervision explores methods that reduce reliance on extensive human labeling, Domain-Specific and Application-Driven Methods adapt techniques to particular use cases, Human Cognitive Inductive Biases in RL investigates how human decision-making patterns influence feedback, and Prior-Based Guidance for Feedback Reduction leverages existing knowledge to improve sample efficiency.

Representative works such as Adaptive Length Bias[4] and Reward Shaping Mitigation[5] illustrate targeted interventions, while approaches like SALMON[10] and Heterogeneous Feedback[11] address broader structural issues. A particularly active line of inquiry centers on information-theoretic and model-centric debiasing strategies that aim to disentangle true preferences from spurious attributes. Eliminating Inductive Bias[0] sits within this cluster, employing information-theoretic principles to isolate and remove confounding signals in reward learning.
This approach contrasts with neighboring work such as Reward Overoptimization Diffusion[8], which addresses bias through diffusion-based regularization, and Model Inductive Bias[3], which examines how architectural choices themselves introduce systematic distortions. While many methods in Length Bias Mitigation and Algorithmic Bias RLHF[19] target specific known confounders, the information-theoretic branch seeks more general frameworks that can handle multiple or unknown bias sources simultaneously. Open questions remain about the trade-offs between targeted interventions and general debiasing, the extent to which human cognitive patterns should be modeled or corrected, and how to balance feedback efficiency with robustness across diverse deployment contexts.

Claimed Contributions

Information-theoretic debiasing framework for reward models (DIR)

The authors propose DIR, a framework that trains reward models by maximizing the mutual information between preference predictions and input response pairs while minimizing the mutual information between RM outputs and biased attributes. Grounded in information-theoretic principles, the approach handles multiple types of bias, including non-linear correlations between bias attributes and preferences.

10 retrieved papers (Can Refute)
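The stated objective admits a compact information-theoretic sketch. The formalization below is an illustrative reconstruction, not the paper's own equation: the prediction, response-pair, reward, bias-attribute, and trade-off symbols are all assumed notation.

```latex
% Hypothetical formalization of the DIR training objective (notation assumed):
%   \hat{y}_\theta : the RM's preference prediction
%   (x, y_w, y_l) : prompt with chosen and rejected responses
%   r_\theta      : the scalar reward output
%   b             : a bias attribute (e.g., response length)
%   \beta         : an assumed trade-off weight
\max_{\theta}\; I\big(\hat{y}_{\theta};\ (x, y_w, y_l)\big) \;-\; \beta\, I\big(r_{\theta};\ b\big)
```

This mirrors the information-bottleneck trade-off the abstract alludes to: the first term preserves preference-relevant information, while the second suppresses information about the bias attribute.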
Dual-bound optimization strategy with comparative regularizer

The authors design a practical implementation using variational bounds (BA lower bound and CLUB upper bound) combined with a comparative regularizer that operates on relative bias attributes between response pairs rather than absolute values, enabling robust handling of diverse biases without distorting the reward landscape.

4 retrieved papers
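To make the CLUB side of the dual-bound strategy concrete: assuming CLUB here refers to the Contrastive Log-ratio Upper Bound of Cheng et al. (2020), the sketch below estimates it from samples using a least-squares Gaussian variational approximation. Everything here (the function name, the Gaussian choice of q, the synthetic data) is an illustrative assumption rather than the paper's implementation; in DIR's comparative setting, x and y would be replaced by quantities such as the reward difference and the relative bias attribute of a response pair.

```python
import numpy as np

def club_upper_bound(x, y):
    """Sample-based CLUB estimate of I(X; Y):
    E_p(x,y)[log q(y|x)] - E_p(x)p(y)[log q(y|x)],
    with q(y|x) = N(a*x + c, s2) fitted by least squares (illustrative only)."""
    a, c = np.polyfit(x, y, 1)                 # mean of q(y|x): linear fit
    s2 = (y - (a * x + c)).var() + 1e-8        # fitted conditional variance

    def log_q(y_val, x_val):                   # log-density of q(y|x)
        return -0.5 * (((y_val - (a * x_val + c)) ** 2) / s2
                       + np.log(2 * np.pi * s2))

    positive = log_q(y, x).mean()                     # matched (x_i, y_i) pairs
    negative = log_q(y[None, :], x[:, None]).mean()   # all mismatched pairs
    return positive - negative                 # upper-bounds I(X; Y) for a good q

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y_dep = x + 0.5 * rng.normal(size=2000)        # depends on x: high MI
y_ind = rng.normal(size=2000)                  # independent of x: MI near 0
```

Minimizing such an estimate of I(reward; bias attribute) during training is the upper-bound half of the strategy; the BA (Barber-Agakov) lower bound would play the complementary role of keeping the information between predictions and inputs high.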
Explicit information-theoretic framework for targeted debiasing

The authors introduce a principled framework that directly optimizes mutual information objectives to explicitly disentangle bias signals from preference signals, offering theoretical guarantees and a more targeted solution than indirect or heuristic approaches.

10 retrieved papers (Can Refute)

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Information-theoretic debiasing framework for reward models (DIR)

The authors propose DIR, a framework that trains reward models by maximizing the mutual information between preference predictions and input response pairs while minimizing the mutual information between RM outputs and biased attributes. Grounded in information-theoretic principles, the approach handles multiple types of bias, including non-linear correlations between bias attributes and preferences.

Contribution

Dual-bound optimization strategy with comparative regularizer

The authors design a practical implementation using variational bounds (BA lower bound and CLUB upper bound) combined with a comparative regularizer that operates on relative bias attributes between response pairs rather than absolute values, enabling robust handling of diverse biases without distorting the reward landscape.

Contribution

Explicit information-theoretic framework for targeted debiasing

The authors introduce a principled framework that directly optimizes mutual information objectives to explicitly disentangle bias signals from preference signals, offering theoretical guarantees and a more targeted solution than indirect or heuristic approaches.