Cost-Optimal Active AI Model Evaluation

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: llm evaluation, ppi, inference, efficient, active, chatbot arena, prediction-powered inference, statistical inference
Abstract:

The development lifecycle of generative AI systems requires continual evaluation, data acquisition, and annotation, which is costly in both resources and time. In practice, a desire for rapid iteration often makes it necessary to rely on synthetic annotation data because of its low cost, despite the potential for substantial bias. In this paper, we develop a rigorous theoretical framework for novel, cost-aware evaluation pipelines that actively balance the use of a cheap, but often inaccurate, weak rater---such as a model-based autorater that is designed to automatically assess the quality of generated content---with a more expensive, but also more accurate, strong rater such as a human annotator. Building on recent work in active and prediction-powered statistical inference, we theoretically derive a family of cost-optimal policies for allocating a given annotation budget between weak and strong raters so as to maximize statistical efficiency. Next, using synthetic and real-world data, we empirically characterize conditions under which these types of policies can yield significant improvements over classical methods. Finally, we find that practical approximations of the theoretically optimal policies can achieve the same estimation precision at a far lower total annotation budget than standard evaluation methods, especially in tasks where there is high variability in the difficulty of examples.
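To make the weak/strong combination concrete, the sketch below simulates a prediction-powered-style difference estimator: cheap autorater scores on every item plus a bias correction estimated from a small, strongly rated subsample. The distributions, sample sizes, and variable names are illustrative assumptions, not the paper's actual setup.

```python
# Minimal sketch of combining a cheap weak rater with a small number of strong ratings.
import numpy as np

rng = np.random.default_rng(0)

n = 10_000                              # items scored by the cheap weak rater (autorater)
weak = rng.normal(0.6, 0.2, n)          # weak rater score for every item
true = weak + rng.normal(0.05, 0.1, n)  # strong rating = weak score plus bias and noise

m = 500                                 # budget only covers 500 strong (e.g., human) ratings
idx = rng.choice(n, size=m, replace=False)
strong = true[idx]

# Prediction-powered / difference estimator: cheap mean plus a bias correction
# estimated on the doubly rated subsample.
theta_pp = weak.mean() + (strong - weak[idx]).mean()

# Classical estimator that ignores the weak rater entirely.
theta_classical = strong.mean()

# Approximate standard errors (treating the subsample as independent of the full pool).
se_pp = np.sqrt(weak.var() / n + (strong - weak[idx]).var() / m)
se_classical = np.sqrt(strong.var() / m)
print(f"PPI-style estimate:   {theta_pp:.4f} +/- {se_pp:.4f}")
print(f"Strong-only estimate: {theta_classical:.4f} +/- {se_classical:.4f}")
```

In this toy setup the corrected estimate attains a smaller standard error than the strong-only baseline because the residual variance of the weak scores is small relative to their overall variance; the gain shrinks as the weak rater becomes less reliable.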

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper develops a theoretical framework for cost-optimal allocation between weak raters (e.g., model-based autoraters) and strong raters (e.g., human annotators) in generative AI evaluation. It resides in the 'Optimal Rater Selection Policies' leaf, which contains only two papers total, indicating a relatively sparse research direction within the broader taxonomy. This leaf sits under 'Active Annotation Budget Allocation,' one of three main branches addressing cost-aware evaluation. The small sibling count suggests the paper targets a focused niche: formal optimization policies for rater selection rather than empirical crowdsourcing systems or automated LLM judges.

The taxonomy reveals neighboring work in 'Active Learning with Cost-Aware Query Strategies' (one paper) and 'Quality-Aware Crowdsourcing Annotation' (four papers across three leaves). The former emphasizes query selection for model training, while the latter focuses on aggregating noisy crowd labels under budget constraints. The paper's theoretical approach to rater allocation distinguishes it from these directions: it addresses neither training-time active learning nor static crowdsourcing aggregation, but instead derives policies for evaluation-time resource allocation. The taxonomy's scope notes clarify that the paper excludes passive quality control and general active learning, positioning it at the intersection of statistical inference and adaptive evaluation design.

Among the 25 candidates examined, none clearly refuted any of the three contributions. Contribution A (cost-optimal annotation policies) examined 9 candidates with 0 refutable; Contribution B (optimal fixed and active sampling rules) examined 10 with 0 refutable; Contribution C (heterogeneous model evaluation) examined 6 with 0 refutable. This suggests that within the limited search scope—top-K semantic matches plus citation expansion—the specific combination of theoretical cost-optimality, weak-strong rater allocation, and prediction-powered inference appears underexplored. However, the search scale (25 papers) is modest, and the analysis does not claim exhaustive coverage of all related statistical or evaluation literature.

Given the sparse taxonomy leaf and absence of refutable prior work among examined candidates, the paper appears to occupy a relatively novel position within the surveyed literature. The theoretical focus on cost-optimal policies for evaluation (rather than training or crowdsourcing aggregation) differentiates it from neighboring branches. Nonetheless, the limited search scope means this assessment reflects top-K semantic proximity, not a comprehensive field review. Broader statistical inference or active learning communities may contain relevant work not captured here.

Taxonomy

Core-task Taxonomy Papers: 7
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Papers: 0

Research Landscape Overview

Core task: cost-aware evaluation of AI models using weak and strong raters. The field addresses how to obtain high-quality annotations or evaluations under budget constraints by strategically combining raters of varying expertise and cost. The taxonomy organizes work into three main branches. Active Annotation Budget Allocation focuses on policies that decide which instances to label and which rater to assign, often drawing on active learning principles to maximize information gain per dollar spent; representative studies include Active Learning Survey[3] and Cost-Effective Crowd Labeling[4]. Quality-Aware Crowdsourcing Annotation emphasizes modeling annotator reliability and aggregating noisy labels from multiple workers, with methods that account for varying worker quality to improve final label accuracy under fixed budgets, as seen in Crowdsourced Multilabel Budget[1] and Quality Aware Budget[7]. Automated Quality Assurance Using LLMs explores leveraging large language models as cost-effective judges or verifiers, replacing or augmenting human raters for certain evaluation tasks, exemplified by LLM-as-Judge QA[5].

A central tension across these branches is the trade-off between annotation cost and quality: some lines of work prioritize optimal sample selection to reduce labeling volume, while others focus on aggregating many cheap labels to achieve reliable consensus.

Within Active Annotation Budget Allocation, a particularly active theme is designing adaptive policies that route difficult instances to expensive expert raters and easy cases to cheaper annotators. Cost-Optimal AI Evaluation[0] sits squarely in this space, specifically under Optimal Rater Selection Policies, and shares methodological ground with LoRA-Guided PPO[6] in exploring how to dynamically allocate evaluation resources. Compared to earlier crowdsourcing studies like Crowdsourced Video Annotation[2], which treat worker pools as relatively static, Cost-Optimal AI Evaluation[0] emphasizes real-time decision-making about rater assignment, reflecting a shift toward more adaptive, model-driven evaluation strategies that balance accuracy and budget in AI system development.

Claimed Contributions

Cost-optimal annotation policies for active model evaluation

The authors develop a theoretical framework that derives annotation policies optimizing the trade-off between cheap but inaccurate weak raters and expensive but accurate strong raters. These policies minimize estimation error subject to budget constraints by determining when to query each rater type.

9 retrieved papers
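As a hedged illustration of what a cost-optimal fixed policy can look like, the snippet below minimizes the variance of a difference-style estimator under a linear cost model and recovers a familiar square-root allocation rule. The cost model, variance decomposition, and symbol names are assumptions made for this sketch; they are not taken from the paper's propositions.

```python
# Sketch: variance-optimal fixed fraction of items routed to the strong rater.
import numpy as np

def optimal_fixed_fraction(c_weak, c_strong, var_weak, var_residual):
    """Assumes estimator variance ~ (var_weak + var_residual / pi) and per-item cost
    c_weak + pi * c_strong under a fixed total budget. The minimizer is the
    square-root rule pi* = sqrt(c_weak / c_strong) * sqrt(var_residual / var_weak),
    capped at 1."""
    pi_star = np.sqrt(c_weak / c_strong) * np.sqrt(var_residual / var_weak)
    return float(np.clip(pi_star, 0.0, 1.0))

# Example: strong ratings cost 50x more, and the weak rater already explains most variance.
pi = optimal_fixed_fraction(c_weak=0.01, c_strong=0.50, var_weak=0.04, var_residual=0.01)
print(f"Route roughly {pi:.1%} of items to the strong rater")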
Optimal fixed and active sampling rules under cost constraints

The paper presents two forms of optimal policies: a best fixed sampling rate (Proposition 1) and a best active sampling rule that depends on covariates (Proposition 2). Unlike prior work with fixed ratios, these policies determine the ratio of cheap to expensive ratings based on cost constraints and distributional properties.

10 retrieved papers
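The fixed rule above treats every item the same; an active rule can instead condition on covariates. The sketch below routes strong ratings toward items whose weak scores are expected to be noisiest, and reweights the correction by the inverse sampling probability to stay unbiased. The difficulty proxy, sampling rate, and variable names are illustrative assumptions, not the rule derived in Proposition 2.

```python
# Sketch: covariate-dependent (active) sampling of strong ratings with an IPW correction.
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
difficulty = rng.uniform(0.0, 1.0, n)                  # covariate: harder items are noisier
weak = rng.normal(0.6, 0.2, n)
true = weak + rng.normal(0.0, 0.05 + 0.3 * difficulty, n)

# Target an average strong-labelling rate of 5%, allocated in proportion to a proxy
# for the weak rater's conditional error spread (assumed known here for simplicity).
spread = 0.05 + 0.3 * difficulty
pi = np.clip(0.05 * spread / spread.mean(), 0.0, 1.0)  # strictly positive everywhere

labelled = rng.random(n) < pi
# Inverse-probability weighting keeps the correction unbiased even though strong
# ratings are collected non-uniformly across items.
correction = np.where(labelled, (true - weak) / pi, 0.0).mean()
theta_active = weak.mean() + correction
print(f"Active estimate: {theta_active:.4f}  (strong ratings used: {int(labelled.sum())})")
```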
Extension to heterogeneous model evaluation settings

The authors generalize prediction-powered inference beyond the typical human-versus-LLM scenario to any situation combining less expensive, less accurate ratings with more expensive, more accurate ones, including cases where both sources are automated with different cost-performance characteristics.

6 retrieved papers
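The same allocation arithmetic carries over when both rating sources are automated, which is the heterogeneous setting this contribution describes. The figures below (per-call prices, variance components, budget) are made up for illustration and simply reuse the square-root allocation from the earlier sketch.

```python
# Sketch: two automated raters with different cost-performance characteristics.
import math

c_small, c_large = 0.002, 0.12     # assumed cost per rating: small scoring model vs. large model
var_small, var_resid = 0.06, 0.015 # spread of small-model scores / of their error vs. the large model

pi_star = min(1.0, math.sqrt(c_small / c_large) * math.sqrt(var_resid / var_small))
budget = 200.0                     # total evaluation budget, in the same units as the costs
n_items = int(budget / (c_small + pi_star * c_large))
print(f"Score {n_items} items with the small model; escalate {pi_star:.1%} of them to the large model")
```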

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Cost-optimal annotation policies for active model evaluation

Contribution

Optimal fixed and active sampling rules under cost constraints

Contribution

Extension to heterogeneous model evaluation settings
