TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: LLM-as-a-Judge, LLM Evaluation, Large Language Models
Abstract:

The adoption of Large Language Models (LLMs) as automated evaluators (LLM-as-a-judge) has revealed critical inconsistencies in current evaluation frameworks. We identify two fundamental types of inconsistency: (1) Score-Comparison Inconsistency, where lower-rated responses outperform higher-scored ones in pairwise comparisons, and (2) Pairwise Transitivity Inconsistency, manifested through circular preference chains (A > B > C > A) and equivalence contradictions (A = B = C ≠ A). We argue that these issues stem from information loss in discrete rating systems and ambiguous tie judgments during pairwise evaluation. We propose TrustJudge, a probabilistic framework that addresses these limitations through two key innovations: 1) distribution-sensitive scoring, which computes continuous expectations from discrete rating probabilities, preserving information entropy for more precise scoring, and 2) likelihood-aware aggregation, which resolves transitivity violations using bidirectional preference probabilities or perplexity. We also formalize the theoretical limitations of current LLM-as-a-judge frameworks and demonstrate how TrustJudge's components overcome them. When evaluated with Llama-3.1-70B-Instruct as judge on our dataset, TrustJudge reduces Score-Comparison inconsistency by 8.43 percentage points (from 23.32% to 14.89%) and Pairwise Transitivity inconsistency by 10.82 percentage points (from 15.22% to 4.40%), while maintaining higher evaluation accuracy. Our work provides the first systematic analysis of evaluation framework inconsistencies in LLM-as-a-judge paradigms, offering both theoretical insights and practical solutions for reliable automated assessment. The framework demonstrates consistent improvements across various model architectures and scales, enabling more trustworthy LLM evaluation without requiring additional training or human annotations.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces TrustJudge, a probabilistic framework addressing score-comparison and pairwise transitivity inconsistencies in LLM-as-a-judge systems. It resides in the 'Score Variability and Intra-Rater Reliability' leaf, which contains only three papers total, indicating a relatively sparse research direction within the broader inconsistency literature. This leaf focuses specifically on rating variance and reproducibility issues, distinguishing it from the adjacent 'Logical and Transitivity Violations' category that examines non-transitive preferences more generally. The small sibling count suggests this particular angle—combining score variance with transitivity concerns—remains underexplored.

The taxonomy reveals that inconsistency research divides into score variability studies and logical violation analyses, with TrustJudge bridging both. Neighboring leaves address positional biases, self-preference effects, and task-specific inconsistencies, but the core reliability branch remains compact compared to the heavily populated bias characterization subtopics. The 'Mitigation and Improvement Methods' branch offers calibration and structured frameworks, yet none explicitly target the dual inconsistency types formalized here. This positioning suggests the work occupies a niche intersection between reliability measurement and mitigation, rather than purely characterizing known biases.

Among thirty candidates examined, the first contribution (formalizing two inconsistency types) found zero refutable papers across ten candidates, while the second (TrustJudge framework) similarly encountered no clear prior implementations in ten candidates. The third contribution (theoretical analysis of information loss) identified one potentially overlapping work among ten examined. These statistics indicate that within the limited search scope, the formalization and framework appear relatively novel, though the theoretical grounding has at least one candidate offering comparable analysis. The modest search scale means undiscovered prior work remains possible, particularly in adjacent reliability or calibration literatures.

Given the sparse taxonomy leaf and limited refutation evidence across thirty candidates, the work appears to address a recognized but underserved problem space. The dual focus on score-comparison and transitivity inconsistencies, combined with a probabilistic resolution mechanism, differentiates it from existing variance studies. However, the search scope constraints and the single refutable candidate for theoretical contributions suggest cautious interpretation—comprehensive novelty claims would require broader literature coverage, especially in calibration and statistical correction methods where overlapping ideas may exist.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: Inconsistencies in LLM-as-a-judge evaluation frameworks. The field has organized itself around six major branches that collectively address the challenges of using large language models to evaluate other models or outputs. Bias Characterization and Measurement focuses on identifying systematic preferences such as position bias, self-preference, and various domain-specific biases that skew judgments. Inconsistency and Reliability Issues examines score variability, intra-rater reliability, and the fundamental question of whether LLM judges produce stable verdicts across repeated trials. Empirical Validation and Benchmarking develops datasets and meta-evaluation protocols to assess judge quality, often comparing LLM verdicts against human annotations or gold standards. Mitigation and Improvement Methods proposes techniques ranging from prompt engineering and ensemble approaches to fine-tuning specialized judge models. Specialized Judge Applications and Contexts explores how LLM judges perform in narrow domains like code evaluation, moral reasoning, or clinical assessment. Finally, Meta-Evaluation and Theoretical Foundations investigates the deeper principles governing judge behavior, including consistency metrics and the validity of using AI in high-stakes adjudication.

A particularly active line of work centers on score variability and the troubling lack of intra-rater reliability. Studies such as Rating Roulette[9] and Unreliable Judges[14] document how the same judge can assign different scores to identical inputs under minor prompt or sampling variations, raising questions about the trustworthiness of automated evaluation. TrustJudge[0] situates itself squarely within this cluster, emphasizing mechanisms to detect and quantify inconsistency rather than merely cataloging biases.
In contrast, neighboring efforts like Correctly Report[5] and Right Answer Wrong Score[18] highlight cases where judges produce correct verdicts for the wrong reasons or fail to align scores with actual quality, suggesting that reliability issues extend beyond simple variance to deeper misalignments in reasoning. These contrasting themes underscore an open question: whether inconsistency stems primarily from stochastic sampling artifacts or from fundamental limitations in how LLMs represent evaluative criteria.

Claimed Contributions

Identification and formalization of two fundamental inconsistencies in LLM-as-a-judge frameworks

The authors systematically identify and formally define two critical inconsistencies in current LLM-as-a-judge evaluation frameworks: Score-Comparison Inconsistency (conflicts between single-score and pairwise evaluations) and Pairwise Transitivity Inconsistency (circular preferences and equivalence contradictions). They provide mathematical definitions and quantitative metrics for measuring these inconsistencies.
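Both inconsistency types can be checked mechanically once single-score and pairwise verdicts are collected. A minimal sketch follows; the dictionary formats (`scores`, `pairwise`, `pref`) are hypothetical illustrations, not the paper's actual data structures or metrics:

```python
from itertools import permutations

def score_comparison_conflicts(scores, pairwise):
    """Count pairs where the pairwise winner received the LOWER single score.

    scores:   {response_id: single-rating score}          (assumed format)
    pairwise: {(a, b): winner_id, or None for a tie}      (assumed format)
    """
    conflicts = 0
    for (a, b), winner in pairwise.items():
        if winner is None:  # ties cannot conflict with a score ordering here
            continue
        loser = b if winner == a else a
        if scores[loser] > scores[winner]:
            conflicts += 1
    return conflicts

def has_preference_cycle(pref):
    """Detect a circular chain A > B > C > A among pairwise verdicts.

    pref: {(a, b): '>', '<' or '='} with each unordered pair judged once.
    """
    def beats(x, y):
        if (x, y) in pref:
            return pref[(x, y)] == '>'
        return pref.get((y, x)) == '<'

    items = {i for pair in pref for i in pair}
    return any(beats(a, b) and beats(b, c) and beats(c, a)
               for a, b, c in permutations(items, 3))
```

For example, a judge that scores A at 8 and B at 6 yet picks B in the head-to-head comparison produces one score-comparison conflict, and verdicts A > B, B > C, C > A trigger the cycle detector.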

10 retrieved papers
TrustJudge probabilistic evaluation framework

The authors introduce TrustJudge, a novel probabilistic framework with two main components: distribution-sensitive scoring that preserves judgment entropy by computing continuous expected values from probability distributions over fine-grained scales, and likelihood-aware aggregation methods (PPL-based and bidirectional probability-based) that resolve transitivity violations in pairwise comparisons.
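The two components can be sketched as follows. Distribution-sensitive scoring reads the judge's probability mass over all rating tokens (assuming the inference API exposes per-token log-probabilities) and reports the expectation rather than the single emitted rating; the simple order-averaged aggregation shown is an illustrative stand-in for the paper's likelihood-aware methods, not their exact implementation:

```python
import math

def distribution_sensitive_score(token_logprobs):
    """Expected rating from the judge's probability mass over score tokens.

    token_logprobs: {rating: log-probability of that rating token}, as read
    from an inference API exposing per-token log-probs (an assumption here).
    """
    probs = {r: math.exp(lp) for r, lp in token_logprobs.items()}
    z = sum(probs.values())  # renormalize over the rating vocabulary
    return sum(r * p / z for r, p in probs.items())

def aggregate_preference(p_a_wins_ab, p_b_wins_ba):
    """Bidirectional aggregation (illustrative): probability that A beats B,
    averaged over both presentation orders to cancel position effects.

    p_a_wins_ab: P(A wins) when A is shown first.
    p_b_wins_ba: P(B wins) when B is shown first.
    """
    return 0.5 * (p_a_wins_ab + (1.0 - p_b_wins_ba))
```

A judge that puts 60% of its mass on a rating of 7 and 40% on 8 yields a continuous score of 7.4 instead of a flat 7, so the residual uncertainty is no longer discarded.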

10 retrieved papers
Theoretical analysis of information loss and uncertainty reduction

The authors provide formal theoretical analysis proving that discrete scoring systems suffer from information loss (Theorem 3.1) where distinct probability distributions can yield identical scores, and that TrustJudge's distribution-sensitive scoring preserves this information. They also prove that the PPL-based method reduces uncertainty in ambiguous cases (Proposition 3.2).
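The information-loss argument behind Theorem 3.1 can be illustrated numerically: two distinct rating distributions collapse to the same discrete verdict (the mode) while their expectations still differ. The distributions below are made-up illustrative numbers, not figures from the paper:

```python
def discrete_score(dist):
    """What a discrete verdict reports: the single most likely rating."""
    return max(dist, key=dist.get)

def expected_score(dist):
    """Distribution-sensitive alternative: expectation over the ratings."""
    return sum(r * p for r, p in dist.items())

# Two distinct judge distributions over part of a 1-10 scale (illustrative).
p1 = {6: 0.5, 7: 0.3, 8: 0.2}  # residual mass lies above the mode
p2 = {6: 0.5, 5: 0.3, 4: 0.2}  # residual mass lies below the mode

# Discrete scoring cannot tell them apart; the expectation can.
assert discrete_score(p1) == discrete_score(p2) == 6
assert abs(expected_score(p1) - 6.7) < 1e-9
assert abs(expected_score(p2) - 5.3) < 1e-9
```

Here both distributions produce an identical discrete score of 6, yet their expectations (6.7 versus 5.3) recover the information the rounding step destroyed.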

10 retrieved papers
Can Refute: 1 paper

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Identification and formalization of two fundamental inconsistencies in LLM-as-a-judge frameworks

The authors systematically identify and formally define two critical inconsistencies in current LLM-as-a-judge evaluation frameworks: Score-Comparison Inconsistency (conflicts between single-score and pairwise evaluations) and Pairwise Transitivity Inconsistency (circular preferences and equivalence contradictions). They provide mathematical definitions and quantitative metrics for measuring these inconsistencies.

Contribution

TrustJudge probabilistic evaluation framework

The authors introduce TrustJudge, a novel probabilistic framework with two main components: distribution-sensitive scoring that preserves judgment entropy by computing continuous expected values from probability distributions over fine-grained scales, and likelihood-aware aggregation methods (PPL-based and bidirectional probability-based) that resolve transitivity violations in pairwise comparisons.

Contribution

Theoretical analysis of information loss and uncertainty reduction

The authors provide formal theoretical analysis proving that discrete scoring systems suffer from information loss (Theorem 3.1) where distinct probability distributions can yield identical scores, and that TrustJudge's distribution-sensitive scoring preserves this information. They also prove that the PPL-based method reduces uncertainty in ambiguous cases (Proposition 3.2).