Doubly-Robust LLM-as-a-Judge: Externally Valid Estimation with Imperfect Personas

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: external validity, LLM-as-a-Judge, large language models, evaluation, personas, causal inference, doubly-robust estimation
Abstract:

As Generative AI (GenAI) systems see growing adoption, a key concern is the external validity of evaluations: the extent to which they generalize from lab-based to real-world deployment conditions. Threats to external validity arise when the source sample of human raters and system outputs used to estimate system quality differs from the target distribution at deployment time. In this work, we propose a doubly-robust estimation framework designed to address this evaluation sampling bias. Key to our approach is the use of synthetic "persona" ratings, produced by prompting an LLM evaluator (i.e., an LLM-as-a-judge) to behave as a human rater with specific sociodemographic characteristics. Our doubly-robust framework combines these informative yet imperfect persona ratings with human ratings obtained under evaluation sampling bias to produce statistically valid system quality estimates. In particular, we show that our approach yields valid estimates when either (i) a model trained to predict human ratings from persona ratings and the biased source data, or (ii) a reweighting model that corrects for the sampling bias, is of sufficient quality. We validate our framework theoretically and via a novel Persona Simulation Framework (PSF) designed to systematically manipulate persona quality and the degree of evaluation sampling bias present in source data. Our work provides a principled foundation for combining imperfect persona ratings with human ratings observed under sampling bias to obtain valid system quality estimates.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work, and the current automated pipeline does not reliably align or distinguish these cases; human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a doubly-robust estimation framework to correct evaluation sampling bias in generative AI systems, combining synthetic persona ratings from LLM-as-a-judge with human ratings obtained under biased conditions. It resides in the 'Statistical Bias Correction in Evaluation' leaf, which contains only two papers total. This represents a sparse research direction within the broader taxonomy of 17 papers across 11 leaf nodes, suggesting the specific intersection of doubly-robust causal inference methods and GenAI evaluation remains relatively unexplored compared to adjacent areas like synthetic data generation or comprehensive quality assessment frameworks.

The taxonomy reveals neighboring work in generalizability assessment across clinical settings and active cost-aware evaluation, but these address different aspects of the external validity problem. Clinical generalizability studies focus on empirical performance measurement across deployment contexts without statistical bias correction machinery, while cost-aware evaluation optimizes resource allocation rather than correcting for sampling discrepancies. The paper's use of persona simulation connects conceptually to synthetic data generation branches, yet differs fundamentally: those methods augment training data or create evaluation samples, whereas this work uses synthetic ratings as auxiliary information within a causal inference framework to debias estimates from real human ratings.

Among 23 candidates examined across three contributions, none were identified as clearly refuting the proposed methods. The core doubly-robust framework examined 10 candidates with zero refutable matches, the M-estimation generalization examined 3 candidates with zero refutations, and the persona simulation framework examined 10 candidates with zero refutations. This suggests that within the limited search scope, the specific combination of doubly-robust estimation theory, LLM-generated persona ratings, and GenAI system quality assessment appears relatively novel. The statistical machinery itself draws from established causal inference literature, but its application to this evaluation context with synthetic persona ratings as surrogates represents a methodological contribution not directly anticipated by the examined prior work.

Based on the top-23 semantic matches and citation expansion, the work appears to occupy a methodologically distinct position at the intersection of causal inference and GenAI evaluation. The limited search scope means potentially relevant work in broader causal inference or survey methodology literatures may not be fully represented. The taxonomy structure indicates this statistical correction approach is less crowded than synthetic data generation or comprehensive quality assessment directions, though the sibling paper on doubly-robust LLM judges suggests emerging interest in applying rigorous statistical frameworks to address evaluation validity challenges in generative AI systems.

Taxonomy

Core-task Taxonomy Papers: 17
Claimed Contributions: 3
Contribution Candidate Papers Compared: 23
Refutable Papers: 0

Research Landscape Overview

Core task: externally valid generative AI system quality estimation under evaluation sampling bias. The field addresses a fundamental challenge in assessing generative AI systems when evaluation data may not represent real-world deployment conditions.

The taxonomy organizes research into several complementary directions. Bias Correction and External Validity Methods focus on statistical techniques to adjust for sampling discrepancies and ensure estimates generalize beyond test sets. Synthetic Data Generation and Augmentation explores how to create representative evaluation data when natural samples are limited or skewed. Model Architecture and Quality Enhancement examines design choices that improve robustness and output quality. Evaluation Frameworks and Quality Assessment develops metrics and protocols for measuring system performance, while Bias Analysis in Generative Models investigates how biases emerge and propagate. User Perception and Identification studies how humans interpret and detect AI-generated content, which influences practical deployment considerations.

Several active lines of work reveal key tensions in the field. One cluster emphasizes rigorous statistical correction: Doubly Robust LLM Judge[0] develops causal inference methods to handle evaluation bias, while Cost Optimal AI Evaluation[1] balances estimation accuracy against resource constraints. Another thread focuses on domain-specific challenges where distribution shift is acute, such as AI Radiology Generalizability[3] and Generative Diffusion MRI[4], which must ensure medical imaging models perform reliably across hospitals and patient populations. A third direction examines fairness and bias propagation, exemplified by Generative AI Fairness[5] and Bias Unconditional Generative[9], highlighting how sampling bias can amplify demographic disparities.
The original paper sits squarely within the statistical correction branch, sharing methodological kinship with Robust Generative Classifiers[11] but applying modern causal inference tools to large language model evaluation. Its emphasis on doubly robust estimation distinguishes it from simpler reweighting approaches, addressing scenarios where both outcome models and propensity scores may be misspecified.

Claimed Contributions

Doubly-robust estimation framework for GenAI system quality under evaluation sampling bias

The authors introduce a statistical framework that combines persona ratings from LLM-as-a-judge with human ratings observed under sampling bias to produce valid system quality estimates. The framework handles both covariate shift and selection bias, providing valid estimates when either the rating prediction model or the reweighting model is sufficiently accurate.

10 retrieved papers
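To make the doubly-robust construction concrete, here is a minimal, hypothetical sketch (not the paper's implementation) of an AIPW-style estimator of mean system quality: an outcome model's predictions, e.g., fit on persona ratings, supply a plug-in term over the target sample, and importance weights correct its bias using residuals on the biased source sample. All names and the toy inputs are illustrative.

```python
import numpy as np

def dr_quality_estimate(y_src, yhat_src, yhat_tgt, weights):
    """AIPW-style doubly-robust estimate of mean system quality.

    y_src    -- human ratings observed in the biased source sample
    yhat_src -- outcome-model predictions on the source sample
    yhat_tgt -- outcome-model predictions on the target sample
    weights  -- importance weights correcting source-to-target bias
    """
    plug_in = yhat_tgt.mean()            # outcome-model term on the target
    residuals = y_src - yhat_src         # where the outcome model errs
    correction = np.average(residuals, weights=weights)
    return plug_in + correction
```

If the outcome model is accurate, the residuals vanish and the plug-in term carries the estimate; if instead the weights are accurate, the correction term removes the plug-in's bias in expectation. Validity under either condition is the doubly-robust property this contribution claims.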
Generalization of doubly-robust estimation theory to M-estimation with surrogate ratings

The authors extend existing doubly-robust estimation theory to handle M-estimation problems that incorporate surrogate persona ratings. This theoretical advancement enables estimating a richer set of system quality parameters beyond means, such as rating variance and quantiles, while maintaining valid coverage under evaluation sampling bias.

3 retrieved papers
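The M-estimation extension can be pictured as solving a doubly-robust estimating equation rather than averaging. The sketch below is a hypothetical illustration under the same setup as above, not the paper's derivation: psi is the estimating function, and the root is located by plain bisection.

```python
import numpy as np

def dr_m_estimate(psi, y_src, yhat_src, yhat_tgt, weights, lo, hi, iters=200):
    """Solve the doubly-robust estimating equation for theta:

        mean(psi(yhat_tgt, theta))
          + weighted_mean(psi(y_src, theta) - psi(yhat_src, theta)) = 0

    psi(y, theta) = y - theta recovers the DR mean; other choices of psi
    target richer parameters such as quantiles. Assumes the score changes
    sign on [lo, hi].
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()

    def score(theta):
        plug_in = psi(yhat_tgt, theta).mean()
        correction = np.sum(w * (psi(y_src, theta) - psi(yhat_src, theta)))
        return plug_in + correction

    f_lo = score(lo)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) * f_lo > 0:   # same sign as lo: root is above mid
            lo, f_lo = mid, score(mid)
        else:                       # sign change: root is below mid
            hi = mid
    return 0.5 * (lo + hi)
```

With psi(y, t) = y - t this reduces to a DR mean; swapping in psi(y, t) = tau - (y <= t) targets the tau-quantile, the kind of beyond-the-mean parameter the M-estimation generalization is claimed to cover.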
Persona Simulation Framework for systematic evaluation under sampling bias

The authors develop a framework that enables controlled manipulation of covariate shift, selection bias, and persona quality across synthetic, semi-synthetic, and real-world datasets. This framework provides a reusable resource for validating estimation methods designed to address sampling bias in GenAI evaluation.

10 retrieved papers
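A toy version of such a simulation setup might look like the following. This is an illustrative sketch, not the paper's PSF; all distributions, coefficients, and knob names are assumptions. A single rater covariate drives both the true human rating and the probability of appearing in the source sample, and a noise scale stands in for a persona-quality knob.

```python
import numpy as np

def simulate_biased_eval(n=10_000, selection_strength=2.0,
                         persona_noise=0.5, seed=0):
    """Toy generator for studying estimators under sampling bias.

    Returns:
      x         -- rater covariate (e.g., a sociodemographic score)
      y         -- true human rating, increasing in x
      selected  -- membership in the biased source sample
      y_persona -- LLM persona rating: y plus controllable error
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = 3.0 + 0.8 * x + rng.normal(scale=0.3, size=n)
    # Selection bias: raters with larger x are over-represented.
    p_select = 1.0 / (1.0 + np.exp(-selection_strength * x))
    selected = rng.random(n) < p_select
    # Persona-quality knob: larger noise means a worse synthetic judge.
    y_persona = y + rng.normal(scale=persona_noise, size=n)
    return x, y, selected, y_persona
```

Because selection favors large x and ratings increase with x, the naive source-sample mean y[selected].mean() overshoots the target mean y.mean(); dialing selection_strength and persona_noise charts how an estimator degrades as bias grows and persona quality falls, which is the kind of controlled manipulation this contribution describes.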

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution
Doubly-robust estimation framework for GenAI system quality under evaluation sampling bias

Contribution
Generalization of doubly-robust estimation theory to M-estimation with surrogate ratings

Contribution
Persona Simulation Framework for systematic evaluation under sampling bias