TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: LLM-as-a-Judge, LLM Evaluation, Large Language Models
Abstract:

The adoption of Large Language Models (LLMs) as automated evaluators (LLM-as-a-judge) has revealed critical inconsistencies in current evaluation frameworks. We identify two fundamental types of inconsistency: (1) Score-Comparison Inconsistency, where lower-rated responses outperform higher-scored ones in pairwise comparisons, and (2) Pairwise Transitivity Inconsistency, manifested through circular preference chains (A > B > C > A) and equivalence contradictions (A = B = C ≠ A). We argue that these issues stem from information loss in discrete rating systems and ambiguous tie judgments during pairwise evaluation. We propose TrustJudge, a probabilistic framework that addresses these limitations through two key innovations: 1) distribution-sensitive scoring, which computes continuous expectations from discrete rating probabilities, preserving information entropy for more precise scoring, and 2) likelihood-aware aggregation, which resolves transitivity violations using bidirectional preference probabilities or perplexity. We also formalize the theoretical limitations of current LLM-as-a-judge frameworks and demonstrate how TrustJudge's components overcome them. When evaluated with Llama-3.1-70B-Instruct as judge on our dataset, TrustJudge reduces Score-Comparison inconsistency by 8.43 percentage points (from 23.32% to 14.89%) and Pairwise Transitivity inconsistency by 10.82 percentage points (from 15.22% to 4.40%), while maintaining higher evaluation accuracy. Our work provides the first systematic analysis of evaluation framework inconsistencies in LLM-as-a-judge paradigms, offering both theoretical insights and practical solutions for reliable automated assessment. The framework demonstrates consistent improvements across various model architectures and scales, enabling more trustworthy LLM evaluation without requiring additional training or human annotations.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces TrustJudge, a probabilistic framework addressing score-comparison and pairwise transitivity inconsistencies in LLM-as-a-judge systems. It resides in the 'Score Variability and Intra-Rater Reliability' leaf, which contains only three papers total, indicating a relatively sparse research direction within the broader inconsistency literature. This leaf focuses specifically on rating variance and reproducibility issues, distinguishing it from the adjacent 'Logical and Transitivity Violations' category that examines non-transitive preferences more generally. The small sibling count suggests this particular angle—combining score variance with transitivity concerns—remains underexplored.

The taxonomy reveals that inconsistency research divides into score variability studies and logical violation analyses, with TrustJudge bridging both. Neighboring leaves address positional biases, self-preference effects, and task-specific inconsistencies, but the core reliability branch remains compact compared to the heavily populated bias characterization subtopics. The 'Mitigation and Improvement Methods' branch offers calibration and structured frameworks, yet none explicitly target the dual inconsistency types formalized here. This positioning suggests the work occupies a niche intersection between reliability measurement and mitigation, rather than purely characterizing known biases.

Among thirty candidates examined, the first contribution (formalizing two inconsistency types) found zero refutable papers across ten candidates, while the second (TrustJudge framework) similarly encountered no clear prior implementations in ten candidates. The third contribution (theoretical analysis of information loss) identified one potentially overlapping work among ten examined. These statistics indicate that within the limited search scope, the formalization and framework appear relatively novel, though the theoretical grounding has at least one candidate offering comparable analysis. The modest search scale means undiscovered prior work remains possible, particularly in adjacent reliability or calibration literatures.

Given the sparse taxonomy leaf and limited refutation evidence across thirty candidates, the work appears to address a recognized but underserved problem space. The dual focus on score-comparison and transitivity inconsistencies, combined with a probabilistic resolution mechanism, differentiates it from existing variance studies. However, the search scope constraints and the single refutable candidate for theoretical contributions suggest cautious interpretation—comprehensive novelty claims would require broader literature coverage, especially in calibration and statistical correction methods where overlapping ideas may exist.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: Inconsistencies in LLM-as-a-judge evaluation frameworks. The field has organized itself around six major branches that collectively address the challenges of using large language models to evaluate other models or outputs. Bias Characterization and Measurement focuses on identifying systematic preferences such as position bias, self-preference, and various domain-specific biases that skew judgments. Inconsistency and Reliability Issues examines score variability, intra-rater reliability, and the fundamental question of whether LLM judges produce stable verdicts across repeated trials. Empirical Validation and Benchmarking develops datasets and meta-evaluation protocols to assess judge quality, often comparing LLM verdicts against human annotations or gold standards. Mitigation and Improvement Methods proposes techniques ranging from prompt engineering and ensemble approaches to fine-tuning specialized judge models. Specialized Judge Applications and Contexts explores how LLM judges perform in narrow domains like code evaluation, moral reasoning, or clinical assessment. Finally, Meta-Evaluation and Theoretical Foundations investigates the deeper principles governing judge behavior, including consistency metrics and the validity of using AI in high-stakes adjudication.

A particularly active line of work centers on score variability and the troubling lack of intra-rater reliability. Studies such as Rating Roulette[9] and Unreliable Judges[14] document how the same judge can assign different scores to identical inputs under minor prompt or sampling variations, raising questions about the trustworthiness of automated evaluation. TrustJudge[0] situates itself squarely within this cluster, emphasizing mechanisms to detect and quantify inconsistency rather than merely cataloging biases.
In contrast, neighboring efforts like Correctly Report[5] and Right Answer Wrong Score[18] highlight cases where judges produce correct verdicts for the wrong reasons or fail to align scores with actual quality, suggesting that reliability issues extend beyond simple variance to deeper misalignments in reasoning. These contrasting themes underscore an open question: whether inconsistency stems primarily from stochastic sampling artifacts or from fundamental limitations in how LLMs represent evaluative criteria.

Claimed Contributions

Identification and formalization of two fundamental inconsistencies in LLM-as-a-judge frameworks

The authors systematically identify and formally define two critical inconsistencies in current LLM-as-a-judge evaluation frameworks: Score-Comparison Inconsistency (conflicts between single-score and pairwise evaluations) and Pairwise Transitivity Inconsistency (circular preferences and equivalence contradictions). They provide mathematical definitions and quantitative metrics for measuring these inconsistencies.
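Both inconsistency types can be checked mechanically once single-score and pairwise verdicts are collected. A minimal sketch follows; the dictionary formats (`scores`, `pairwise`, `pref`) are hypothetical illustrations, not the paper's actual data structures or metrics:

```python
from itertools import permutations

def score_comparison_conflicts(scores, pairwise):
    """Count pairs where the pairwise winner received the LOWER single score.

    scores:   {response_id: single-rating score}          (assumed format)
    pairwise: {(a, b): winner_id, or None for a tie}      (assumed format)
    """
    conflicts = 0
    for (a, b), winner in pairwise.items():
        if winner is None:  # ties cannot conflict with a score ordering here
            continue
        loser = b if winner == a else a
        if scores[loser] > scores[winner]:
            conflicts += 1
    return conflicts

def has_preference_cycle(pref):
    """Detect a circular chain A > B > C > A among pairwise verdicts.

    pref: {(a, b): '>', '<' or '='} with each unordered pair judged once.
    """
    def beats(x, y):
        if (x, y) in pref:
            return pref[(x, y)] == '>'
        return pref.get((y, x)) == '<'

    items = {i for pair in pref for i in pair}
    return any(beats(a, b) and beats(b, c) and beats(c, a)
               for a, b, c in permutations(items, 3))
```

For example, a judge that scores A at 8 and B at 6 yet picks B in the head-to-head comparison produces one score-comparison conflict, and verdicts A > B, B > C, C > A trigger the cycle detector.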

10 retrieved papers
TrustJudge probabilistic evaluation framework

The authors introduce TrustJudge, a novel probabilistic framework with two main components: distribution-sensitive scoring that preserves judgment entropy by computing continuous expected values from probability distributions over fine-grained scales, and likelihood-aware aggregation methods (PPL-based and bidirectional probability-based) that resolve transitivity violations in pairwise comparisons.
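The two components can be sketched as follows. Distribution-sensitive scoring reads the judge's probability mass over all rating tokens (assuming the inference API exposes per-token log-probabilities) and reports the expectation rather than the single emitted rating; the simple order-averaged aggregation shown is an illustrative stand-in for the paper's likelihood-aware methods, not their exact implementation:

```python
import math

def distribution_sensitive_score(token_logprobs):
    """Expected rating from the judge's probability mass over score tokens.

    token_logprobs: {rating: log-probability of that rating token}, as read
    from an inference API exposing per-token log-probs (an assumption here).
    """
    probs = {r: math.exp(lp) for r, lp in token_logprobs.items()}
    z = sum(probs.values())  # renormalize over the rating vocabulary
    return sum(r * p / z for r, p in probs.items())

def aggregate_preference(p_a_wins_ab, p_b_wins_ba):
    """Bidirectional aggregation (illustrative): probability that A beats B,
    averaged over both presentation orders to cancel position effects.

    p_a_wins_ab: P(A wins) when A is shown first.
    p_b_wins_ba: P(B wins) when B is shown first.
    """
    return 0.5 * (p_a_wins_ab + (1.0 - p_b_wins_ba))
```

A judge that puts 60% of its mass on a rating of 7 and 40% on 8 yields a continuous score of 7.4 instead of a flat 7, so the residual uncertainty is no longer discarded.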

10 retrieved papers
Theoretical analysis of information loss and uncertainty reduction

The authors provide formal theoretical analysis proving that discrete scoring systems suffer from information loss (Theorem 3.1) where distinct probability distributions can yield identical scores, and that TrustJudge's distribution-sensitive scoring preserves this information. They also prove that the PPL-based method reduces uncertainty in ambiguous cases (Proposition 3.2).
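The information-loss argument behind Theorem 3.1 can be illustrated numerically: two distinct rating distributions collapse to the same discrete verdict (the mode) while their expectations still differ. The distributions below are made-up illustrative numbers, not figures from the paper:

```python
def discrete_score(dist):
    """What a discrete verdict reports: the single most likely rating."""
    return max(dist, key=dist.get)

def expected_score(dist):
    """Distribution-sensitive alternative: expectation over the ratings."""
    return sum(r * p for r, p in dist.items())

# Two distinct judge distributions over part of a 1-10 scale (illustrative).
p1 = {6: 0.5, 7: 0.3, 8: 0.2}  # residual mass lies above the mode
p2 = {6: 0.5, 5: 0.3, 4: 0.2}  # residual mass lies below the mode

# Discrete scoring cannot tell them apart; the expectation can.
assert discrete_score(p1) == discrete_score(p2) == 6
assert abs(expected_score(p1) - 6.7) < 1e-9
assert abs(expected_score(p2) - 5.3) < 1e-9
```

Here both distributions produce an identical discrete score of 6, yet their expectations (6.7 versus 5.3) recover the information the rounding step destroyed.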

10 retrieved papers
Can Refute: 1 paper

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Identification and formalization of two fundamental inconsistencies in LLM-as-a-judge frameworks

The authors systematically identify and formally define two critical inconsistencies in current LLM-as-a-judge evaluation frameworks: Score-Comparison Inconsistency (conflicts between single-score and pairwise evaluations) and Pairwise Transitivity Inconsistency (circular preferences and equivalence contradictions). They provide mathematical definitions and quantitative metrics for measuring these inconsistencies.

Contribution

TrustJudge probabilistic evaluation framework

The authors introduce TrustJudge, a novel probabilistic framework with two main components: distribution-sensitive scoring that preserves judgment entropy by computing continuous expected values from probability distributions over fine-grained scales, and likelihood-aware aggregation methods (PPL-based and bidirectional probability-based) that resolve transitivity violations in pairwise comparisons.

Contribution

Theoretical analysis of information loss and uncertainty reduction

The authors provide formal theoretical analysis proving that discrete scoring systems suffer from information loss (Theorem 3.1) where distinct probability distributions can yield identical scores, and that TrustJudge's distribution-sensitive scoring preserves this information. They also prove that the PPL-based method reduces uncertainty in ambiguous cases (Proposition 3.2).