Preference Leakage: A Contamination Problem in LLM-as-a-judge

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: LLM-as-a-judge, Preference Leakage, Data Contamination
Abstract:

Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods in model development. While their combination significantly improves the efficiency of model training and evaluation, little attention has been paid to the potential contamination introduced by this new development paradigm. In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between synthetic data generators and LLM-based evaluators. To study this issue, we first define three common types of relatedness between the data generator LLM and the judge LLM: being the same model, having an inheritance relationship, and belonging to the same model family. Through extensive experiments, we empirically confirm that judges are biased toward their related student models across multiple LLM baselines and benchmarks, a bias caused by preference leakage. Further analysis suggests that preference leakage is a pervasive, real-world problem that is harder to detect than previously identified biases in LLM-as-a-judge scenarios. Together, these findings indicate that preference leakage is a widespread and challenging problem in the area of LLM-as-a-judge.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 0
Refutable Papers: 0

Research Landscape Overview

Core task: Bias detection in LLM-based evaluation systems. The field has organized itself around several major branches that reflect different facets of bias. One branch examines bias in LLM content generation and outputs, focusing on how models produce biased text across domains such as code generation, recommendations, and general language tasks. A second branch targets bias in LLM-as-evaluator systems, where models serve as judges or scorers of other outputs, introducing concerns about self-preference, positional effects, and inconsistent scoring. A third branch addresses evaluation methodology and benchmark quality, questioning whether existing test suites adequately capture bias phenomena or inadvertently perpetuate flawed assumptions. Comprehensive LLM evaluation studies provide broad empirical assessments that cut across multiple bias types, while tangentially related topics touch on fairness in adjacent areas like information retrieval or mental health applications.

Representative works such as Not Fair Evaluators[3] and Self-Preference Bias[37] illustrate how evaluator-specific biases can distort automated assessment, whereas surveys like Bias Fairness Survey[1] and Fairness Survey[2] map the broader landscape of bias concerns in LLM deployment.

Particularly active lines of work explore the tension between using LLMs as convenient evaluators and the risk that these models favor their own outputs or exhibit systematic preferences tied to model architecture or training lineage. Studies on self-preference and model relatedness biases reveal that evaluators may assign higher scores to responses generated by themselves or by closely related models, undermining the objectivity of automated judgments. Preference Leakage[0] sits squarely within this cluster, investigating how subtle cues in generated text can inadvertently signal a model's identity to an evaluator, thereby triggering preferential scoring.
This work complements Self-Preference Bias[37], which documents the phenomenon more broadly, and contrasts with efforts like FairEval[20] that propose mitigation strategies to reduce evaluator partiality. Together, these studies highlight an open question: whether LLM-based evaluation can be made sufficiently robust to replace human judgment, or whether inherent biases will always require careful calibration and transparency measures.

Claimed Contributions

Preference leakage problem definition

The authors formally define preference leakage as a new contamination issue that arises when the LLM used for synthetic data generation and the LLM used as an evaluator are related, causing systematic bias in evaluation scores. They identify three types of relatedness: being the same model, having an inheritance relationship, and belonging to the same model family.

0 retrieved papers
Empirical validation of preference leakage bias

The authors perform comprehensive experiments using multiple LLM baselines and benchmarks (Arena-Hard and AlpacaEval 2.0) to empirically confirm that judge LLMs exhibit systematic bias toward their related student models. They introduce the preference leakage score metric to quantify this bias across different scenarios.

0 retrieved papers
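One natural reading of a preference leakage score is the gap between a student's win rate under its related judge and its average win rate under unrelated judges. The sketch below implements that gap-based reading; it is a plausible formulation, not necessarily the paper's exact formula, and the `win_rates`/`related` inputs and model names are invented for illustration.

```python
def preference_leakage_score(win_rates, related):
    """Average, over judges, of (win rate the judge assigns to its
    related student) minus (that student's mean win rate under all
    other judges). Positive values indicate bias toward related students.

    win_rates[judge][student]: fraction of pairwise comparisons the
    student wins when scored by that judge.
    related[judge]: the student trained on that judge's related
    data-generator model.
    """
    gaps = []
    for judge, student in related.items():
        own = win_rates[judge][student]
        others = [win_rates[j][student] for j in win_rates if j != judge]
        gaps.append(own - sum(others) / len(others))
    return sum(gaps) / len(gaps)


# toy numbers (illustrative only): each judge favors its own student
win_rates = {
    "judge_A": {"student_A": 0.62, "student_B": 0.38},
    "judge_B": {"student_A": 0.45, "student_B": 0.55},
}
related = {"judge_A": "student_A", "judge_B": "student_B"}
print(round(preference_leakage_score(win_rates, related), 3))  # → 0.17
```

A score near zero would suggest no systematic preference; the toy gap of 0.17 illustrates the kind of skew the contribution claims to measure.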
Analysis of preference leakage mechanisms and characteristics

The authors investigate the underlying mechanisms of preference leakage through recognition experiments and category analyses. They demonstrate that preference leakage is particularly hard to detect, especially affecting subjective questions and judgment dimensions, and that judge LLMs cannot reliably recognize their related student models' generations.

0 retrieved papers
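The recognition experiment described above can be sketched as a simple probe: show the judge a batch of responses, ask it to guess whether each came from its related student, and compare against ground truth. The harness below is a hypothetical reconstruction of that protocol, not the paper's code; the sample data and the trivial guesser are invented.

```python
def recognition_accuracy(judge_guess, samples):
    """Fraction of responses for which the judge's authorship guess
    (related student or not) matches the ground-truth label.

    judge_guess: callable mapping a response string to a bool guess.
    samples: iterable of (response_text, is_from_related_student).
    """
    correct = sum(judge_guess(text) == is_related
                  for text, is_related in samples)
    return correct / len(samples)


# toy data: half the responses come from the related student
samples = [(f"response {i}", i % 2 == 0) for i in range(100)]

# a judge that never recognizes its student lands at chance level here
acc = recognition_accuracy(lambda text: False, samples)
print(acc)  # → 0.5
```

Accuracy near the chance baseline, as in this degenerate example, is the pattern the contribution reports: judge LLMs cannot reliably identify their related students' generations even while favoring them.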

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Preference leakage problem definition

The authors formally define preference leakage as a new contamination issue that arises when the LLM used for synthetic data generation and the LLM used as an evaluator are related, causing systematic bias in evaluation scores. They identify three types of relatedness: being the same model, having an inheritance relationship, and belonging to the same model family.

Contribution

Empirical validation of preference leakage bias

The authors perform comprehensive experiments using multiple LLM baselines and benchmarks (Arena-Hard and AlpacaEval 2.0) to empirically confirm that judge LLMs exhibit systematic bias toward their related student models. They introduce the preference leakage score metric to quantify this bias across different scenarios.

Contribution

Analysis of preference leakage mechanisms and characteristics

The authors investigate the underlying mechanisms of preference leakage through recognition experiments and category analyses. They demonstrate that preference leakage is particularly hard to detect, especially affecting subjective questions and judgment dimensions, and that judge LLMs cannot reliably recognize their related student models' generations.