Scalable Oversight via Partitioned Human Supervision

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: LLM, scalable oversight, weak supervision, agentic systems
Abstract:

As artificial intelligence (AI) systems approach and surpass expert human performance across a broad range of tasks, obtaining high-quality human supervision for evaluation and training becomes increasingly challenging. We focus on tasks that require deep knowledge and skills across multiple domains. Unfortunately, even the best human experts are knowledgeable only in a single narrow area and cannot evaluate the correctness of advanced AI systems on such superhuman tasks. However, based on their narrow expertise, humans can still provide a weak signal: a complementary label indicating an option that is incorrect. For example, a cardiologist could state that "this is not related to cardiology," even if they cannot identify the true disease. Building on this weak signal, we propose a scalable oversight framework that evaluates frontier AI systems without requiring ground-truth labels. We derive an unbiased estimator of top-1 accuracy from complementary labels and quantify how many complementary labels are needed to match the variance of ordinary labels. We further introduce two estimators that combine scarce ordinary labels with abundant complementary labels, and we provide finite-sample deviation guarantees for both the complementary-only and mixed estimators. Empirically, we show that complementary labels suffice to evaluate the outputs of large language models without ground truth. We further show that such weak signals can drive training: an agentic AI system can be designed automatically, and improved, using this partitioned human supervision.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 14
Refutable Paper: 1

Research Landscape Overview

Core task: evaluating and training AI systems using complementary labels from domain experts. The field organizes around five main branches that reflect different facets of expert-driven supervision. Complementary-Label Learning Frameworks and Theory develops formal methods for learning from partial or indirect annotations, addressing scalability and bias when experts cannot label every instance exhaustively. Medical Imaging and Diagnostic AI Evaluation focuses on radiology, pathology, and clinical imaging tasks where expert radiologists or pathologists provide ground truth or quality assessments, as seen in works like ARTHUR DIANA Ultrasound[4] and Follicular Lymphoma Grading[9]. Clinical Decision Support and Language Model Evaluation examines how large language models can be assessed or guided by clinicians, for instance through LLM Medical Graders[3] or consensus protocols like LLM Clinical Consensus[34]. General AI Evaluation and Annotation Quality tackles cross-domain challenges such as handling inconsistent annotations (Inconsistent Human Annotations[26]), defining what constitutes expertise (Defining Expert[27]), and ensuring robust evaluation pipelines (AI Ground Truth[5]). Domain-Specific Applications and Specialized Tasks spans diverse settings from wildlife monitoring (Wildlife Camera Traps[20]) to surgical skill assessment (Surgical Expertise Assessment[10]), illustrating how expert feedback adapts to varied problem contexts.

Several active lines of work explore trade-offs between annotation cost, label quality, and model performance. One recurring theme is how to aggregate or reconcile disagreements among multiple experts (Multiple Expert Annotators[22], Heterogeneous Expert Consistency[7]) while preserving the nuanced information that partial or complementary labels provide.
Another thread investigates hybrid supervision strategies that combine weak labels, semi-automated tools, and targeted expert input (Hybrid Supervision Radiographs[40], Semi-Automated Quality Assurance[42]). Partitioned Human Supervision[0] sits within the Scalable Oversight and Unbiased Estimation cluster, emphasizing methods that partition the supervision task to reduce expert burden while maintaining unbiased learning guarantees. This approach contrasts with works like Recycling Weak Labels[39], which reuse noisy annotations more opportunistically, and aligns closely with efforts such as Measurement Error Correction[32] that formally account for imperfect or incomplete expert signals. The central question across these directions is how to design training and evaluation protocols that respect expert time constraints without sacrificing the reliability or interpretability that domain knowledge brings.

Claimed Contributions

Scalable oversight framework via partitioned human supervision using complementary labels

The authors introduce a framework that exploits partitioned human expertise to collect complementary labels (indicating incorrect options) at scale for superhuman tasks. This enables evaluation and training of AI systems without requiring full ground truth or comprehensive expert verification.

3 retrieved papers
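To make the partitioned-supervision setup concrete, here is a minimal toy sketch (not from the paper; domain names and function names are illustrative) of how a narrow expert emits a complementary label: the expert can only rule out the option inside their own specialty, and abstains when the question actually falls in their domain.

```python
import random

K = 4  # number of answer options (illustrative)
# One narrow domain per option; names are hypothetical, not from the paper.
DOMAINS = ["cardiology", "neurology", "oncology", "dermatology"]

def complementary_label(true_label, expert_domain):
    """A narrow expert can only rule out the option in their own domain.

    If the expert's domain is not the true answer, they can confidently
    report "the answer is not my domain" -- a complementary label.
    If it is the true answer, they cannot rule it out and abstain (None).
    """
    return expert_domain if expert_domain != true_label else None

# Asking a uniformly random expert and discarding abstentions yields
# complementary labels that are uniform over the K-1 incorrect options.
random.seed(0)
collected = []
for _ in range(100):
    y = random.randrange(K)        # unknown true answer
    expert = random.randrange(K)   # whichever narrow expert we reached
    c = complementary_label(y, expert)
    if c is not None:
        collected.append((expert, c))
```

Under this sampling scheme each retained complementary label is uniform over the incorrect options, which is the assumption the linear-correction estimator described below appears to rely on.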
Unbiased estimator of top-1 accuracy from complementary labels with variance analysis and mixture estimators

The authors derive an unbiased linear correction estimator for accuracy using only complementary labels, analyze its variance properties, and propose two mixture estimators (inverse-variance weighted and maximum-likelihood) that combine ordinary and complementary labels with finite-sample deviation guarantees.

1 retrieved paper
Can Refute
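The report does not reproduce the paper's derivation, but the claimed "unbiased linear correction" admits a standard sketch: if each complementary label c is uniform over the K-1 incorrect classes, then E[1{c = prediction}] = (1 - p)/(K - 1) for true accuracy p, and inverting that identity gives an unbiased estimator. The following hedged sketch implements that correction plus a plug-in inverse-variance combination of ordinary and complementary estimates; function names and the plug-in choice are illustrative, not the paper's.

```python
import numpy as np

def acc_from_complementary(preds, comp_labels, K):
    """Unbiased linear-correction estimate of top-1 accuracy.

    Assumes each complementary label is drawn uniformly from the K-1
    incorrect classes, so E[1{c == pred}] = (1 - p) / (K - 1), where p
    is the true accuracy. Inverting gives the estimator below.
    """
    hit = np.mean(np.asarray(preds) == np.asarray(comp_labels))
    return 1.0 - (K - 1) * hit

def acc_mixture(p_ord, n_ord, p_comp, n_comp, K):
    """Inverse-variance-weighted combination of the two estimates.

    Plug-in Bernoulli variances: ordinary labels give var p(1-p)/n_ord;
    the complementary estimator gives var (K-1)^2 * q(1-q)/n_comp with
    q = (1-p)/(K-1). Variances are evaluated at a shared plug-in value.
    """
    p = np.clip(0.5 * (p_ord + p_comp), 1e-6, 1 - 1e-6)
    v_ord = p * (1 - p) / n_ord
    q = (1 - p) / (K - 1)
    v_comp = (K - 1) ** 2 * q * (1 - q) / n_comp
    w_ord, w_comp = 1.0 / v_ord, 1.0 / v_comp
    return (w_ord * p_ord + w_comp * p_comp) / (w_ord + w_comp)
```

The variance expression also makes the paper's sample-size question visible: the complementary estimator's variance carries a (K-1)^2 factor, so matching the precision of ordinary labels requires roughly that factor more complementary labels (up to the q(1-q) vs. p(1-p) terms).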
Demonstration of evaluation and agentic training using complementary labels

The authors empirically validate that their estimators enable both evaluation of large language models without ground truth and training of agentic AI systems by using complementary labels as fitness signals in agent search pipelines, demonstrating improved downstream performance.

10 retrieved papers
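The report gives no detail on the paper's agent-search pipeline, but the idea of a complementary-label fitness signal can be illustrated with a toy selection step: score each candidate agent by its complementary-only accuracy estimate and keep the best, never consulting ground truth. All names below are illustrative.

```python
def estimated_accuracy(preds, comp_labels, K):
    # Linear-correction estimate: assumes each complementary label is
    # uniform over the K-1 incorrect options; no ground truth consulted.
    hit_rate = sum(p == c for p, c in zip(preds, comp_labels)) / len(preds)
    return 1.0 - (K - 1) * hit_rate

def select_best_agent(candidates, questions, comp_labels, K):
    """Rank candidate agents by estimated accuracy; return the best.

    Each candidate maps a question to a predicted option index. The
    fitness signal is the complementary-label estimate alone, so the
    search loop needs no ground-truth answers.
    """
    def fitness(agent):
        preds = [agent(q) for q in questions]
        return estimated_accuracy(preds, comp_labels, K)
    return max(candidates, key=fitness)
```

In a realistic pipeline this selection step would sit inside a mutation/evaluation loop over agent configurations; the sketch only shows why the estimator is usable as a fitness function.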

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Scalable oversight framework via partitioned human supervision using complementary labels

The authors introduce a framework that exploits partitioned human expertise to collect complementary labels (indicating incorrect options) at scale for superhuman tasks. This enables evaluation and training of AI systems without requiring full ground truth or comprehensive expert verification.

Contribution

Unbiased estimator of top-1 accuracy from complementary labels with variance analysis and mixture estimators

The authors derive an unbiased linear correction estimator for accuracy using only complementary labels, analyze its variance properties, and propose two mixture estimators (inverse-variance weighted and maximum-likelihood) that combine ordinary and complementary labels with finite-sample deviation guarantees.

Contribution

Demonstration of evaluation and agentic training using complementary labels

The authors empirically validate that their estimators enable both evaluation of large language models without ground truth and training of agentic AI systems by using complementary labels as fitness signals in agent search pipelines, demonstrating improved downstream performance.