Doubly-Robust LLM-as-a-Judge: Externally Valid Estimation with Imperfect Personas

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: external validity, LLM-as-a-Judge, large language models, evaluation, personas, causal inference, doubly-robust estimation
Abstract:

As Generative AI (GenAI) systems see growing adoption, a key concern is the external validity of evaluations: the extent to which they generalize from lab-based to real-world deployment conditions. Threats to external validity arise when the source sample of human raters and system outputs used to estimate system quality differs from the target distribution at deployment time. In this work, we propose a doubly-robust estimation framework designed to address this evaluation sampling bias. Key to our approach is the use of synthetic "persona" ratings, produced by prompting an LLM evaluator (i.e., an LLM-as-a-judge) to behave as a human rater with specific sociodemographic characteristics. Our doubly-robust framework combines these informative yet imperfect persona ratings with human ratings obtained under evaluation sampling bias to produce statistically valid system quality estimates. In particular, we show that our approach yields valid estimates when either (i) a model trained to predict human ratings from persona ratings and the biased source data, or (ii) a reweighting model that corrects for the sampling bias, is of sufficient quality. We validate our framework theoretically and via a novel Persona Simulation Framework (PSF) designed to systematically manipulate persona quality and the degree of evaluation sampling bias present in source data. Our work provides a principled foundation for combining imperfect persona ratings with human ratings observed under sampling bias to obtain valid system quality estimates.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work, and the current automated pipeline does not reliably align or distinguish these cases; human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a doubly-robust estimation framework to correct evaluation sampling bias in generative AI systems, combining synthetic persona ratings from LLM-as-a-judge with human ratings obtained under biased conditions. It resides in the 'Statistical Bias Correction in Evaluation' leaf, which contains only two papers total. This represents a sparse research direction within the broader taxonomy of 17 papers across 11 leaf nodes, suggesting the specific intersection of doubly-robust causal inference methods and GenAI evaluation remains relatively unexplored compared to adjacent areas like synthetic data generation or comprehensive quality assessment frameworks.

The taxonomy reveals neighboring work in generalizability assessment across clinical settings and active cost-aware evaluation, but these address different aspects of the external validity problem. Clinical generalizability studies focus on empirical performance measurement across deployment contexts without statistical bias correction machinery, while cost-aware evaluation optimizes resource allocation rather than correcting for sampling discrepancies. The paper's use of persona simulation connects conceptually to synthetic data generation branches, yet differs fundamentally: those methods augment training data or create evaluation samples, whereas this work uses synthetic ratings as auxiliary information within a causal inference framework to debias estimates from real human ratings.

Among 23 candidates examined across three contributions, none were identified as clearly refuting the proposed methods. The core doubly-robust framework examined 10 candidates with zero refutable matches, the M-estimation generalization examined 3 candidates with zero refutations, and the persona simulation framework examined 10 candidates with zero refutations. This suggests that within the limited search scope, the specific combination of doubly-robust estimation theory, LLM-generated persona ratings, and GenAI system quality assessment appears relatively novel. The statistical machinery itself draws from established causal inference literature, but its application to this evaluation context with synthetic persona ratings as surrogates represents a methodological contribution not directly anticipated by the examined prior work.

Based on the top-23 semantic matches and citation expansion, the work appears to occupy a methodologically distinct position at the intersection of causal inference and GenAI evaluation. The limited search scope means potentially relevant work in broader causal inference or survey methodology literatures may not be fully represented. The taxonomy structure indicates this statistical correction approach is less crowded than synthetic data generation or comprehensive quality assessment directions, though the sibling paper on doubly-robust LLM judges suggests emerging interest in applying rigorous statistical frameworks to address evaluation validity challenges in generative AI systems.

Taxonomy

Core-task Taxonomy Papers: 17
Claimed Contributions: 3
Contribution Candidate Papers Compared: 23
Refutable Papers: 0

Research Landscape Overview

Core task: externally valid generative AI system quality estimation under evaluation sampling bias. The field addresses a fundamental challenge in assessing generative AI systems when evaluation data may not represent real-world deployment conditions.

The taxonomy organizes research into several complementary directions. Bias Correction and External Validity Methods focus on statistical techniques to adjust for sampling discrepancies and ensure estimates generalize beyond test sets. Synthetic Data Generation and Augmentation explores how to create representative evaluation data when natural samples are limited or skewed. Model Architecture and Quality Enhancement examines design choices that improve robustness and output quality. Evaluation Frameworks and Quality Assessment develops metrics and protocols for measuring system performance, while Bias Analysis in Generative Models investigates how biases emerge and propagate. User Perception and Identification studies how humans interpret and detect AI-generated content, which influences practical deployment considerations.

Several active lines of work reveal key tensions in the field. One cluster emphasizes rigorous statistical correction: Doubly Robust LLM Judge[0] develops causal inference methods to handle evaluation bias, while Cost Optimal AI Evaluation[1] balances estimation accuracy against resource constraints. Another thread focuses on domain-specific challenges where distribution shift is acute, such as AI Radiology Generalizability[3] and Generative Diffusion MRI[4], which must ensure medical imaging models perform reliably across hospitals and patient populations. A third direction examines fairness and bias propagation, exemplified by Generative AI Fairness[5] and Bias Unconditional Generative[9], highlighting how sampling bias can amplify demographic disparities.
The original paper sits squarely within the statistical correction branch, sharing methodological kinship with Robust Generative Classifiers[11] but applying modern causal inference tools to large language model evaluation. Its emphasis on doubly robust estimation distinguishes it from simpler reweighting approaches, addressing scenarios where both outcome models and propensity scores may be misspecified.

Claimed Contributions

Doubly-robust estimation framework for GenAI system quality under evaluation sampling bias

The authors introduce a statistical framework that combines persona ratings from LLM-as-a-judge with human ratings observed under sampling bias to produce valid system quality estimates. The framework handles both covariate shift and selection bias, providing valid estimates when either the rating prediction model or the reweighting model is sufficiently accurate.

10 retrieved papers
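To make the doubly-robust construction concrete, here is a minimal, hypothetical sketch (not the paper's implementation) of an AIPW-style estimator of mean system quality: an outcome model's predictions, e.g., fit on persona ratings, supply a plug-in term over the target sample, and importance weights correct its bias using residuals on the biased source sample. All names and the toy inputs are illustrative.

```python
import numpy as np

def dr_quality_estimate(y_src, yhat_src, yhat_tgt, weights):
    """AIPW-style doubly-robust estimate of mean system quality.

    y_src    -- human ratings observed in the biased source sample
    yhat_src -- outcome-model predictions on the source sample
    yhat_tgt -- outcome-model predictions on the target sample
    weights  -- importance weights correcting source-to-target bias
    """
    plug_in = yhat_tgt.mean()            # outcome-model term on the target
    residuals = y_src - yhat_src         # where the outcome model errs
    correction = np.average(residuals, weights=weights)
    return plug_in + correction
```

If the outcome model is accurate, the residuals vanish and the plug-in term carries the estimate; if instead the weights are accurate, the correction term removes the plug-in's bias in expectation. Validity under either condition is the doubly-robust property this contribution claims.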
Generalization of doubly-robust estimation theory to M-estimation with surrogate ratings

The authors extend existing doubly-robust estimation theory to handle M-estimation problems that incorporate surrogate persona ratings. This theoretical advancement enables estimating a richer set of system quality parameters beyond means, such as rating variance and quantiles, while maintaining valid coverage under evaluation sampling bias.

3 retrieved papers
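The M-estimation extension can be pictured as solving a doubly-robust estimating equation rather than averaging. The sketch below is a hypothetical illustration under the same setup as above, not the paper's derivation: psi is the estimating function, and the root is located by plain bisection.

```python
import numpy as np

def dr_m_estimate(psi, y_src, yhat_src, yhat_tgt, weights, lo, hi, iters=200):
    """Solve the doubly-robust estimating equation for theta:

        mean(psi(yhat_tgt, theta))
          + weighted_mean(psi(y_src, theta) - psi(yhat_src, theta)) = 0

    psi(y, theta) = y - theta recovers the DR mean; other choices of psi
    target richer parameters such as quantiles. Assumes the score changes
    sign on [lo, hi].
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()

    def score(theta):
        plug_in = psi(yhat_tgt, theta).mean()
        correction = np.sum(w * (psi(y_src, theta) - psi(yhat_src, theta)))
        return plug_in + correction

    f_lo = score(lo)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) * f_lo > 0:   # same sign as lo: root is above mid
            lo, f_lo = mid, score(mid)
        else:                       # sign change: root is below mid
            hi = mid
    return 0.5 * (lo + hi)
```

With psi(y, t) = y - t this reduces to a DR mean; swapping in psi(y, t) = tau - (y <= t) targets the tau-quantile, the kind of beyond-the-mean parameter the M-estimation generalization is claimed to cover.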
Persona Simulation Framework for systematic evaluation under sampling bias

The authors develop a framework that enables controlled manipulation of covariate shift, selection bias, and persona quality across synthetic, semi-synthetic, and real-world datasets. This framework provides a reusable resource for validating estimation methods designed to address sampling bias in GenAI evaluation.

10 retrieved papers
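A toy version of such a simulation setup might look like the following. This is an illustrative sketch, not the paper's PSF; all distributions, coefficients, and knob names are assumptions. A single rater covariate drives both the true human rating and the probability of appearing in the source sample, and a noise scale stands in for a persona-quality knob.

```python
import numpy as np

def simulate_biased_eval(n=10_000, selection_strength=2.0,
                         persona_noise=0.5, seed=0):
    """Toy generator for studying estimators under sampling bias.

    Returns:
      x         -- rater covariate (e.g., a sociodemographic score)
      y         -- true human rating, increasing in x
      selected  -- membership in the biased source sample
      y_persona -- LLM persona rating: y plus controllable error
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = 3.0 + 0.8 * x + rng.normal(scale=0.3, size=n)
    # Selection bias: raters with larger x are over-represented.
    p_select = 1.0 / (1.0 + np.exp(-selection_strength * x))
    selected = rng.random(n) < p_select
    # Persona-quality knob: larger noise means a worse synthetic judge.
    y_persona = y + rng.normal(scale=persona_noise, size=n)
    return x, y, selected, y_persona
```

Because selection favors large x and ratings increase with x, the naive source-sample mean y[selected].mean() overshoots the target mean y.mean(); dialing selection_strength and persona_noise charts how an estimator degrades as bias grows and persona quality falls, which is the kind of controlled manipulation this contribution describes.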

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution
Doubly-robust estimation framework for GenAI system quality under evaluation sampling bias

Contribution
Generalization of doubly-robust estimation theory to M-estimation with surrogate ratings

Contribution
Persona Simulation Framework for systematic evaluation under sampling bias