Scalable Oversight via Partitioned Human Supervision

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: LLM, scalable oversight, weak supervision, agentic systems
Abstract:

As artificial intelligence (AI) systems approach and surpass expert human performance across a broad range of tasks, obtaining high-quality human supervision for evaluation and training becomes increasingly challenging. We focus on tasks that require deep knowledge and skills across multiple domains. Unfortunately, even the best human experts are knowledgeable only in a single narrow area and cannot evaluate the correctness of advanced AI systems on such superhuman tasks. However, based on their narrow expertise, humans can still provide a weak signal: a complementary label indicating an option that is incorrect. For example, a cardiologist could state that "this is not related to cardiology," even if they cannot identify the true disease. Building on this weak signal, we propose a scalable oversight framework that evaluates frontier AI systems without requiring ground-truth labels. We derive an unbiased estimator of top-1 accuracy from complementary labels and quantify how many complementary labels are needed to match the variance of ordinary labels. We further introduce two estimators that combine scarce ordinary labels with abundant complementary labels, and we provide finite-sample deviation guarantees for both the complementary-only and mixed estimators. Empirically, we show that complementary labels suffice to evaluate the outputs of large language models without ground truth. We further show that such weak signals can drive training: an agentic AI system can be designed automatically, and improved, using this partitioned human supervision.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 14
Refutable Paper: 1

Research Landscape Overview

Core task: evaluating and training AI systems using complementary labels from domain experts. The field organizes around five main branches that reflect different facets of expert-driven supervision. Complementary-Label Learning Frameworks and Theory develops formal methods for learning from partial or indirect annotations, addressing scalability and bias when experts cannot label every instance exhaustively. Medical Imaging and Diagnostic AI Evaluation focuses on radiology, pathology, and clinical imaging tasks where expert radiologists or pathologists provide ground truth or quality assessments, as seen in works like ARTHUR DIANA Ultrasound[4] and Follicular Lymphoma Grading[9]. Clinical Decision Support and Language Model Evaluation examines how large language models can be assessed or guided by clinicians, for instance through LLM Medical Graders[3] or consensus protocols like LLM Clinical Consensus[34]. General AI Evaluation and Annotation Quality tackles cross-domain challenges such as handling inconsistent annotations (Inconsistent Human Annotations[26]), defining what constitutes expertise (Defining Expert[27]), and ensuring robust evaluation pipelines (AI Ground Truth[5]). Domain-Specific Applications and Specialized Tasks spans diverse settings from wildlife monitoring (Wildlife Camera Traps[20]) to surgical skill assessment (Surgical Expertise Assessment[10]), illustrating how expert feedback adapts to varied problem contexts.

Several active lines of work explore trade-offs between annotation cost, label quality, and model performance. One recurring theme is how to aggregate or reconcile disagreements among multiple experts (Multiple Expert Annotators[22], Heterogeneous Expert Consistency[7]) while preserving the nuanced information that partial or complementary labels provide.
Another thread investigates hybrid supervision strategies that combine weak labels, semi-automated tools, and targeted expert input (Hybrid Supervision Radiographs[40], Semi-Automated Quality Assurance[42]). Partitioned Human Supervision[0] sits within the Scalable Oversight and Unbiased Estimation cluster, emphasizing methods that partition the supervision task to reduce expert burden while maintaining unbiased learning guarantees. This approach contrasts with works like Recycling Weak Labels[39], which reuse noisy annotations more opportunistically, and aligns closely with efforts such as Measurement Error Correction[32] that formally account for imperfect or incomplete expert signals. The central question across these directions is how to design training and evaluation protocols that respect expert time constraints without sacrificing the reliability or interpretability that domain knowledge brings.

Claimed Contributions

Scalable oversight framework via partitioned human supervision using complementary labels

The authors introduce a framework that exploits partitioned human expertise to collect complementary labels (indicating incorrect options) at scale for superhuman tasks. This enables evaluation and training of AI systems without requiring full ground truth or comprehensive expert verification.

3 retrieved papers
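To make the partitioned-supervision setup concrete, here is a minimal toy sketch (not from the paper; domain names and function names are illustrative) of how a narrow expert emits a complementary label: the expert can only rule out the option inside their own specialty, and abstains when the question actually falls in their domain.

```python
import random

K = 4  # number of answer options (illustrative)
# One narrow domain per option; names are hypothetical, not from the paper.
DOMAINS = ["cardiology", "neurology", "oncology", "dermatology"]

def complementary_label(true_label, expert_domain):
    """A narrow expert can only rule out the option in their own domain.

    If the expert's domain is not the true answer, they can confidently
    report "the answer is not my domain" -- a complementary label.
    If it is the true answer, they cannot rule it out and abstain (None).
    """
    return expert_domain if expert_domain != true_label else None

# Asking a uniformly random expert and discarding abstentions yields
# complementary labels that are uniform over the K-1 incorrect options.
random.seed(0)
collected = []
for _ in range(100):
    y = random.randrange(K)        # unknown true answer
    expert = random.randrange(K)   # whichever narrow expert we reached
    c = complementary_label(y, expert)
    if c is not None:
        collected.append((expert, c))
```

Under this sampling scheme each retained complementary label is uniform over the incorrect options, which is the assumption the linear-correction estimator described below appears to rely on.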
Unbiased estimator of top-1 accuracy from complementary labels with variance analysis and mixture estimators

The authors derive an unbiased linear correction estimator for accuracy using only complementary labels, analyze its variance properties, and propose two mixture estimators (inverse-variance weighted and maximum-likelihood) that combine ordinary and complementary labels with finite-sample deviation guarantees.

1 retrieved paper
Can Refute
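The report does not reproduce the paper's derivation, but the claimed "unbiased linear correction" admits a standard sketch: if each complementary label c is uniform over the K-1 incorrect classes, then E[1{c = prediction}] = (1 - p)/(K - 1) for true accuracy p, and inverting that identity gives an unbiased estimator. The following hedged sketch implements that correction plus a plug-in inverse-variance combination of ordinary and complementary estimates; function names and the plug-in choice are illustrative, not the paper's.

```python
import numpy as np

def acc_from_complementary(preds, comp_labels, K):
    """Unbiased linear-correction estimate of top-1 accuracy.

    Assumes each complementary label is drawn uniformly from the K-1
    incorrect classes, so E[1{c == pred}] = (1 - p) / (K - 1), where p
    is the true accuracy. Inverting gives the estimator below.
    """
    hit = np.mean(np.asarray(preds) == np.asarray(comp_labels))
    return 1.0 - (K - 1) * hit

def acc_mixture(p_ord, n_ord, p_comp, n_comp, K):
    """Inverse-variance-weighted combination of the two estimates.

    Plug-in Bernoulli variances: ordinary labels give var p(1-p)/n_ord;
    the complementary estimator gives var (K-1)^2 * q(1-q)/n_comp with
    q = (1-p)/(K-1). Variances are evaluated at a shared plug-in value.
    """
    p = np.clip(0.5 * (p_ord + p_comp), 1e-6, 1 - 1e-6)
    v_ord = p * (1 - p) / n_ord
    q = (1 - p) / (K - 1)
    v_comp = (K - 1) ** 2 * q * (1 - q) / n_comp
    w_ord, w_comp = 1.0 / v_ord, 1.0 / v_comp
    return (w_ord * p_ord + w_comp * p_comp) / (w_ord + w_comp)
```

The variance expression also makes the paper's sample-size question visible: the complementary estimator's variance carries a (K-1)^2 factor, so matching the precision of ordinary labels requires roughly that factor more complementary labels (up to the q(1-q) vs. p(1-p) terms).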
Demonstration of evaluation and agentic training using complementary labels

The authors empirically validate that their estimators enable both evaluation of large language models without ground truth and training of agentic AI systems by using complementary labels as fitness signals in agent search pipelines, demonstrating improved downstream performance.

10 retrieved papers
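The report gives no detail on the paper's agent-search pipeline, but the idea of a complementary-label fitness signal can be illustrated with a toy selection step: score each candidate agent by its complementary-only accuracy estimate and keep the best, never consulting ground truth. All names below are illustrative.

```python
def estimated_accuracy(preds, comp_labels, K):
    # Linear-correction estimate: assumes each complementary label is
    # uniform over the K-1 incorrect options; no ground truth consulted.
    hit_rate = sum(p == c for p, c in zip(preds, comp_labels)) / len(preds)
    return 1.0 - (K - 1) * hit_rate

def select_best_agent(candidates, questions, comp_labels, K):
    """Rank candidate agents by estimated accuracy; return the best.

    Each candidate maps a question to a predicted option index. The
    fitness signal is the complementary-label estimate alone, so the
    search loop needs no ground-truth answers.
    """
    def fitness(agent):
        preds = [agent(q) for q in questions]
        return estimated_accuracy(preds, comp_labels, K)
    return max(candidates, key=fitness)
```

In a realistic pipeline this selection step would sit inside a mutation/evaluation loop over agent configurations; the sketch only shows why the estimator is usable as a fitness function.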

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Scalable oversight framework via partitioned human supervision using complementary labels

The authors introduce a framework that exploits partitioned human expertise to collect complementary labels (indicating incorrect options) at scale for superhuman tasks. This enables evaluation and training of AI systems without requiring full ground truth or comprehensive expert verification.

Contribution

Unbiased estimator of top-1 accuracy from complementary labels with variance analysis and mixture estimators

The authors derive an unbiased linear correction estimator for accuracy using only complementary labels, analyze its variance properties, and propose two mixture estimators (inverse-variance weighted and maximum-likelihood) that combine ordinary and complementary labels with finite-sample deviation guarantees.

Contribution

Demonstration of evaluation and agentic training using complementary labels

The authors empirically validate that their estimators enable both evaluation of large language models without ground truth and training of agentic AI systems by using complementary labels as fitness signals in agent search pipelines, demonstrating improved downstream performance.