Cost-Optimal Active AI Model Evaluation

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: llm evaluation, ppi, inference, efficient, active, chatbot arena, prediction-powered inference, statistical inference
Abstract:

The development lifecycle of generative AI systems requires continual evaluation, data acquisition, and annotation, which is costly in both resources and time. In practice, a desire for rapid iteration often makes it necessary to rely on synthetic annotation data because of its low cost, despite the potential for substantial bias. In this paper, we develop a rigorous theoretical framework for novel, cost-aware evaluation pipelines that actively balance the use of a cheap, but often inaccurate, weak rater---such as a model-based autorater that is designed to automatically assess the quality of generated content---with a more expensive, but also more accurate, strong rater such as a human annotator. Building on recent work in active and prediction-powered statistical inference, we theoretically derive a family of cost-optimal policies for allocating a given annotation budget between weak and strong raters so as to maximize statistical efficiency. Next, using synthetic and real-world data, we empirically characterize conditions under which these types of policies can yield significant improvements over classical methods. Finally, we find that practical approximations of the theoretically optimal policies can achieve the same estimation precision at a far lower total annotation budget than standard evaluation methods, especially in tasks where there is high variability in the difficulty of examples.
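To make the weak/strong combination concrete, the sketch below simulates a prediction-powered-style difference estimator: cheap autorater scores on every item plus a bias correction estimated from a small, strongly rated subsample. The distributions, sample sizes, and variable names are illustrative assumptions, not the paper's actual setup.

```python
# Minimal sketch of combining a cheap weak rater with a small number of strong ratings.
import numpy as np

rng = np.random.default_rng(0)

n = 10_000                              # items scored by the cheap weak rater (autorater)
weak = rng.normal(0.6, 0.2, n)          # weak rater score for every item
true = weak + rng.normal(0.05, 0.1, n)  # strong rating = weak score plus bias and noise

m = 500                                 # budget only covers 500 strong (e.g., human) ratings
idx = rng.choice(n, size=m, replace=False)
strong = true[idx]

# Prediction-powered / difference estimator: cheap mean plus a bias correction
# estimated on the doubly rated subsample.
theta_pp = weak.mean() + (strong - weak[idx]).mean()

# Classical estimator that ignores the weak rater entirely.
theta_classical = strong.mean()

# Approximate standard errors (treating the subsample as independent of the full pool).
se_pp = np.sqrt(weak.var() / n + (strong - weak[idx]).var() / m)
se_classical = np.sqrt(strong.var() / m)
print(f"PPI-style estimate:   {theta_pp:.4f} +/- {se_pp:.4f}")
print(f"Strong-only estimate: {theta_classical:.4f} +/- {se_classical:.4f}")
```

In this toy setup the corrected estimate attains a smaller standard error than the strong-only baseline because the residual variance of the weak scores is small relative to their overall variance; the gain shrinks as the weak rater becomes less reliable.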

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper develops a theoretical framework for cost-optimal allocation between weak raters (e.g., model-based autoraters) and strong raters (e.g., human annotators) in generative AI evaluation. It resides in the 'Optimal Rater Selection Policies' leaf, which contains only two papers total, indicating a relatively sparse research direction within the broader taxonomy. This leaf sits under 'Active Annotation Budget Allocation,' one of three main branches addressing cost-aware evaluation. The small sibling count suggests the paper targets a focused niche: formal optimization policies for rater selection rather than empirical crowdsourcing systems or automated LLM judges.

The taxonomy reveals neighboring work in 'Active Learning with Cost-Aware Query Strategies' (one paper) and 'Quality-Aware Crowdsourcing Annotation' (four papers across three leaves). The former emphasizes query selection for model training, while the latter focuses on aggregating noisy crowd labels under budget constraints. The paper's theoretical approach to rater allocation distinguishes it from these directions: it addresses neither training-time active learning nor static crowdsourcing aggregation, but instead derives policies for evaluation-time resource allocation. The taxonomy's scope notes clarify that the paper excludes passive quality control and general active learning, positioning it at the intersection of statistical inference and adaptive evaluation design.

Among the 25 candidates examined, none clearly refuted any of the three contributions. Contribution A (cost-optimal annotation policies) examined 9 candidates with 0 refutable; Contribution B (optimal fixed and active sampling rules) examined 10 with 0 refutable; Contribution C (heterogeneous model evaluation) examined 6 with 0 refutable. This suggests that within the limited search scope—top-K semantic matches plus citation expansion—the specific combination of theoretical cost-optimality, weak-strong rater allocation, and prediction-powered inference appears underexplored. However, the search scale (25 papers) is modest, and the analysis does not claim exhaustive coverage of all related statistical or evaluation literature.

Given the sparse taxonomy leaf and absence of refutable prior work among examined candidates, the paper appears to occupy a relatively novel position within the surveyed literature. The theoretical focus on cost-optimal policies for evaluation (rather than training or crowdsourcing aggregation) differentiates it from neighboring branches. Nonetheless, the limited search scope means this assessment reflects top-K semantic proximity, not a comprehensive field review. Broader statistical inference or active learning communities may contain relevant work not captured here.

Taxonomy

Core-task Taxonomy Papers: 7
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Papers: 0

Research Landscape Overview

Core task: cost-aware evaluation of AI models using weak and strong raters. The field addresses how to obtain high-quality annotations or evaluations under budget constraints by strategically combining raters of varying expertise and cost. The taxonomy organizes work into three main branches. Active Annotation Budget Allocation focuses on policies that decide which instances to label and which rater to assign, often drawing on active learning principles to maximize information gain per dollar spent; representative studies include Active Learning Survey[3] and Cost-Effective Crowd Labeling[4]. Quality-Aware Crowdsourcing Annotation emphasizes modeling annotator reliability and aggregating noisy labels from multiple workers, with methods that account for varying worker quality to improve final label accuracy under fixed budgets, as seen in Crowdsourced Multilabel Budget[1] and Quality Aware Budget[7]. Automated Quality Assurance Using LLMs explores leveraging large language models as cost-effective judges or verifiers, replacing or augmenting human raters for certain evaluation tasks, exemplified by LLM-as-Judge QA[5].

A central tension across these branches is the trade-off between annotation cost and quality: some lines of work prioritize optimal sample selection to reduce labeling volume, while others focus on aggregating many cheap labels to achieve reliable consensus.

Within Active Annotation Budget Allocation, a particularly active theme is designing adaptive policies that route difficult instances to expensive expert raters and easy cases to cheaper annotators. Cost-Optimal AI Evaluation[0] sits squarely in this space, specifically under Optimal Rater Selection Policies, and shares methodological ground with LoRA-Guided PPO[6] in exploring how to dynamically allocate evaluation resources. Compared to earlier crowdsourcing studies like Crowdsourced Video Annotation[2], which treat worker pools as relatively static, Cost-Optimal AI Evaluation[0] emphasizes real-time decision-making about rater assignment, reflecting a shift toward more adaptive, model-driven evaluation strategies that balance accuracy and budget in AI system development.

Claimed Contributions

Cost-optimal annotation policies for active model evaluation

The authors develop a theoretical framework that derives annotation policies optimizing the trade-off between cheap but inaccurate weak raters and expensive but accurate strong raters. These policies minimize estimation error subject to budget constraints by determining when to query each rater type.

9 retrieved papers
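As a hedged illustration of what a cost-optimal fixed policy can look like, the snippet below minimizes the variance of a difference-style estimator under a linear cost model and recovers a familiar square-root allocation rule. The cost model, variance decomposition, and symbol names are assumptions made for this sketch; they are not taken from the paper's propositions.

```python
# Sketch: variance-optimal fixed fraction of items routed to the strong rater.
import numpy as np

def optimal_fixed_fraction(c_weak, c_strong, var_weak, var_residual):
    """Assumes estimator variance ~ (var_weak + var_residual / pi) and per-item cost
    c_weak + pi * c_strong under a fixed total budget. The minimizer is the
    square-root rule pi* = sqrt(c_weak / c_strong) * sqrt(var_residual / var_weak),
    capped at 1."""
    pi_star = np.sqrt(c_weak / c_strong) * np.sqrt(var_residual / var_weak)
    return float(np.clip(pi_star, 0.0, 1.0))

# Example: strong ratings cost 50x more, and the weak rater already explains most variance.
pi = optimal_fixed_fraction(c_weak=0.01, c_strong=0.50, var_weak=0.04, var_residual=0.01)
print(f"Route roughly {pi:.1%} of items to the strong rater")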
Optimal fixed and active sampling rules under cost constraints

The paper presents two forms of optimal policies: a best fixed sampling rate (Proposition 1) and a best active sampling rule that depends on covariates (Proposition 2). Unlike prior work with fixed ratios, these policies determine the ratio of cheap to expensive ratings based on cost constraints and distributional properties.

10 retrieved papers
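The fixed rule above treats every item the same; an active rule can instead condition on covariates. The sketch below routes strong ratings toward items whose weak scores are expected to be noisiest, and reweights the correction by the inverse sampling probability to stay unbiased. The difficulty proxy, sampling rate, and variable names are illustrative assumptions, not the rule derived in Proposition 2.

```python
# Sketch: covariate-dependent (active) sampling of strong ratings with an IPW correction.
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
difficulty = rng.uniform(0.0, 1.0, n)                  # covariate: harder items are noisier
weak = rng.normal(0.6, 0.2, n)
true = weak + rng.normal(0.0, 0.05 + 0.3 * difficulty, n)

# Target an average strong-labelling rate of 5%, allocated in proportion to a proxy
# for the weak rater's conditional error spread (assumed known here for simplicity).
spread = 0.05 + 0.3 * difficulty
pi = np.clip(0.05 * spread / spread.mean(), 0.0, 1.0)  # strictly positive everywhere

labelled = rng.random(n) < pi
# Inverse-probability weighting keeps the correction unbiased even though strong
# ratings are collected non-uniformly across items.
correction = np.where(labelled, (true - weak) / pi, 0.0).mean()
theta_active = weak.mean() + correction
print(f"Active estimate: {theta_active:.4f}  (strong ratings used: {int(labelled.sum())})")
```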
Extension to heterogeneous model evaluation settings

The authors generalize prediction-powered inference beyond the typical human-versus-LLM scenario to any situation combining less expensive, less accurate ratings with more expensive, more accurate ones, including cases where both sources are automated with different cost-performance characteristics.

6 retrieved papers
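The same allocation arithmetic carries over when both rating sources are automated, which is the heterogeneous setting this contribution describes. The figures below (per-call prices, variance components, budget) are made up for illustration and simply reuse the square-root allocation from the earlier sketch.

```python
# Sketch: two automated raters with different cost-performance characteristics.
import math

c_small, c_large = 0.002, 0.12     # assumed cost per rating: small scoring model vs. large model
var_small, var_resid = 0.06, 0.015 # spread of small-model scores / of their error vs. the large model

pi_star = min(1.0, math.sqrt(c_small / c_large) * math.sqrt(var_resid / var_small))
budget = 200.0                     # total evaluation budget, in the same units as the costs
n_items = int(budget / (c_small + pi_star * c_large))
print(f"Score {n_items} items with the small model; escalate {pi_star:.1%} of them to the large model")
```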

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Cost-optimal annotation policies for active model evaluation

Contribution

Optimal fixed and active sampling rules under cost constraints

Contribution

Extension to heterogeneous model evaluation settings
