Truthfulness Despite Weak Supervision: Evaluating and Training LLMs Using Peer Prediction

ICLR 2026 Conference Submission · Anonymous Authors
Language Model Evaluation, AI Alignment, AI Truthfulness and Deception, Large Language Models
Abstract:

The evaluation and post-training of large language models (LLMs) rely on supervision, but strong supervision for difficult tasks is often unavailable, especially when evaluating strong models. In such cases, models have been shown to exploit evaluation schemes built on imperfect supervision, yielding deceptive results.

A wealth of mechanism design research, however, remains underutilized in LLM research: it focuses on game-theoretic incentive compatibility, that is, eliciting honest and informative answers even under weak supervision. Drawing on this literature, we introduce the peer prediction method for model evaluation and post-training. It rewards honest and informative answers over deceptive and uninformative ones, using a metric based on mutual predictability and requiring no ground-truth labels.

We demonstrate the method's effectiveness and resistance to deception, with both theoretical guarantees and empirical validation on models with up to 405B parameters. We show that training an 8B model with peer prediction-based reward recovers most of the drop in truthfulness due to prior malicious finetuning, even when the reward is produced by a 0.135B language model with no finetuning.

On the evaluation front, in contrast to LLM-as-a-Judge, which requires strong and trusted judges, we discover an inverse scaling property in peer prediction: surprisingly, resistance to deception strengthens as the capability gap between the experts and participants widens, enabling reliable evaluation of strong models with weak supervision. In particular, LLM-as-a-Judge becomes worse than random guessing when facing deceptive models 5-20× the judge's size, while peer prediction thrives when such gaps are large, including in cases with over 100× size differences.
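To make the mutual-predictability metric concrete, the snippet below is a minimal, self-contained sketch of a PMI-style peer prediction score. It is an illustration under simplifying assumptions, not the paper's implementation: answers here are discrete and probabilities are estimated empirically from the batch, whereas the paper scores free-form answers with a reference language model's log-probabilities.

```python
# Minimal sketch of a PMI-style peer prediction score (illustrative only;
# not the paper's implementation). We assume discrete answers and estimate
# probabilities empirically, rather than scoring free-form text with a
# reference language model.

import math
from collections import Counter

def pmi_scores(participant_answers, expert_answers, smoothing=1e-9):
    """Score each participant answer by its pointwise mutual information
    with the paired expert answer: log p(a, b) - log p(a) - log p(b).
    Answers that are informative about the experts' answers score > 0;
    uninformative strategies (e.g., a constant reply) score ~0."""
    n = len(participant_answers)
    joint = Counter(zip(participant_answers, expert_answers))
    part_counts = Counter(participant_answers)
    exp_counts = Counter(expert_answers)
    scores = []
    for a, b in zip(participant_answers, expert_answers):
        p_ab = joint[(a, b)] / n
        p_a = part_counts[a] / n
        p_b = exp_counts[b] / n
        scores.append(math.log((p_ab + smoothing) / (p_a * p_b + smoothing)))
    return scores

# Toy demo on 8 yes/no questions: an honest participant correlates with a
# weak-but-honest expert; an uninformative participant always answers "yes".
expert = ["yes", "no", "yes", "yes", "no", "no", "yes", "no"]
honest = ["yes", "no", "yes", "no", "no", "no", "yes", "no"]
uninformative = ["yes"] * 8

print(sum(pmi_scores(honest, expert)) / 8)         # positive (~0.38)
print(sum(pmi_scores(uninformative, expert)) / 8)  # exactly 0 here
```

The property the sketch preserves is the one the abstract relies on: a participant cannot earn reward with uninformative answers, while answers that are mutually predictable with honest peers score positively, all without any ground-truth labels.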

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces peer prediction mechanisms for evaluating and training large language models without ground truth labels, leveraging game-theoretic incentive compatibility to elicit truthful responses. It resides in the 'Theoretical Foundations and Core Mechanisms' leaf, which contains only two papers total, indicating a relatively sparse research direction within the broader taxonomy. This leaf focuses specifically on foundational peer prediction methods with theoretical guarantees, distinguishing it from application-oriented branches like multi-source synthesis or policy learning.

The taxonomy reveals three main branches: peer prediction mechanisms, model-based evaluation paradigms, and policy learning with imperfect supervision. The paper's leaf sits within the first branch, adjacent to 'Multi-Source Information Synthesis' and separate from 'LLM-as-Judge' approaches that use models as reviewers without game-theoretic incentives. The taxonomy's scope notes clarify that this work emphasizes theoretical guarantees for truthful elicitation, whereas neighboring branches address practical peer review architectures or reinforcement learning with noisy rewards, suggesting the paper occupies a distinct methodological niche focused on mechanism design rather than empirical evaluation frameworks.

Among nineteen candidates examined, the core contribution of applying peer prediction to LLM evaluation and training appears relatively novel, with zero refutable candidates found across ten examined papers. However, the inverse scaling property for deception resistance shows clear prior work, with one refutable candidate identified from a single examined paper. Theoretical guarantees under prior disagreement also face overlap, with two refutable candidates among eight examined papers. The limited search scope—nineteen total candidates—means these findings reflect top semantic matches rather than exhaustive coverage, suggesting the core methodological contribution may be more novel than the specific theoretical claims.

Based on the limited literature search, the work appears to occupy a sparsely populated research direction, with its main novelty lying in the application of peer prediction to LLM contexts rather than the underlying theoretical mechanisms. The analysis covers top semantic matches and does not claim exhaustive field coverage, leaving open the possibility of additional relevant work in adjacent game theory or mechanism design literatures not captured by LLM-focused search strategies.

Taxonomy

Core-task Taxonomy Papers: 4
Claimed Contributions: 3
Contribution Candidate Papers Compared: 19
Refutable Papers: 3

Research Landscape Overview

Core task: Evaluating and training language models with weak supervision using peer prediction. The field addresses the challenge of improving LLMs when ground-truth labels are scarce or expensive, organizing itself around three main branches. The first branch, Peer Prediction Mechanisms for LLM Evaluation and Training, develops theoretical foundations that leverage strategic reporting and information elicitation to extract truthful signals from models or human evaluators without direct access to correct answers. The second branch, Model-Based and Peer Review Evaluation Paradigms, explores practical frameworks where models evaluate each other or where peer review processes substitute for expert annotation, as seen in works like Pre Peer Review[1]. The third branch, Policy Learning with Imperfect Supervision, focuses on reinforcement learning and policy optimization when reward signals are noisy or biased, exemplified by approaches such as Policy Weak Supervision[2]. Together, these branches reflect a shift from traditional supervised learning toward mechanisms that can bootstrap quality from strategic interactions or imperfect feedback.

Recent work has concentrated on reconciling theoretical guarantees with practical deployment, particularly around incentive alignment and truthfulness elicitation. Incentive Aligned Summaries[3] illustrates efforts to design reward structures that encourage honest reporting in summarization tasks, while Truthfulness Without Supervision[4] explores methods to induce truthful behavior without external labels.

Within this landscape, Truthfulness Weak Supervision[0] sits squarely in the theoretical foundations cluster, emphasizing peer prediction mechanisms that provably elicit truthful responses under weak supervision. Compared to Truthfulness Without Supervision[4], which may rely on unsupervised consistency checks, Truthfulness Weak Supervision[0] appears to formalize game-theoretic incentives more explicitly. Its focus on mechanism design distinguishes it from the more empirical, application-driven approaches in the peer review branch, positioning it as a bridge between foundational theory and the practical need for scalable, label-efficient training.

Claimed Contributions

Peer prediction method for LLM evaluation and training

The authors adapt peer prediction mechanisms from mechanism design literature to evaluate and train large language models. The method measures mutual predictability of model answers without ground truth labels, rewarding honest and informative responses while resisting deception through game-theoretic incentive compatibility.

Retrieved papers: 10

Inverse scaling property for resistance to deception

The authors identify a counterintuitive scaling behavior where peer prediction becomes more resistant to deception as the capability gap between expert and participant models increases. This enables reliable evaluation of strong models using weak supervision, even with over 100× size differences.

Retrieved papers: 1 · Can Refute

Theoretical guarantees under prior disagreement

The authors extend theoretical guarantees of peer prediction beyond the unrealistic shared-prior assumption. They prove that with sufficiently large and diverse pools of experts and participants, the method remains approximately incentive compatible even when agents hold different worldviews or priors.

Retrieved papers: 8 · Can Refute
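For orientation, approximate incentive compatibility is usually stated as an ε-best-response condition; the LaTeX below gives the generic textbook form, not necessarily the paper's exact formalization (the symbols R_i, \theta_i, r_{-i}, \sigma, and \varepsilon are notation assumed for this sketch).

```latex
% Generic \varepsilon-incentive-compatibility condition (standard form,
% not necessarily the paper's exact statement): for every agent i and
% every misreporting strategy \sigma, honest reporting loses at most
% \varepsilon in expected reward.
\[
  \mathbb{E}\!\left[ R_i\big(\theta_i, r_{-i}\big) \right]
  \;\ge\;
  \mathbb{E}\!\left[ R_i\big(\sigma(\theta_i), r_{-i}\big) \right]
  - \varepsilon
\]
% Here \theta_i is agent i's honest answer, r_{-i} are the peers' reports,
% and R_i is the peer prediction reward. The claimed guarantee is that
% \varepsilon shrinks as the pools of experts and participants grow larger
% and more diverse, even without a shared prior.
```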
