AtC: Aggregate-then-Calibrate for Human-centered Assessment

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: human-centered assessment, judgment aggregation, calibration, misspecification, human-AI complementarity
Abstract:

Human-centered assessment tasks, which are essential for systematic decision-making, rely heavily on human judgment and typically lack verifiable ground truth. Existing approaches face a dilemma: methods using only human judgments suffer from heterogeneous expertise and inconsistent rating scales, while methods using only model-generated scores must learn from imperfect proxies or incomplete features. We propose Aggregate-then-Calibrate (AtC), a two-stage framework that combines these complementary sources. Stage-1 aggregates heterogeneous comparative judgments into a consensus ranking $\hat{\pi}$ using a rank-aggregation model that accounts for annotator reliability. Stage-2 calibrates any predictive model's scores by an isotonic projection onto the order $\hat{\pi}$, enforcing ordinal consistency while preserving as much of the model's quantitative information as possible. Theoretically, we show: (1) modeling annotator heterogeneity yields strictly more efficient consensus estimation than assuming homogeneity; (2) isotonic calibration enjoys risk bounds even when the consensus ranking is misspecified; and (3) AtC asymptotically outperforms model-only assessment. Across semi-synthetic and real-world datasets, AtC consistently improves accuracy and robustness over human-only or model-only assessments. Our results bridge judgment aggregation with model-free calibration, providing a principled recipe for human-centered assessment when ground truth is costly, scarce, or unverifiable. The data and code are available at \url{https://anonymous.4open.science/r/12500_AtC_supp-4F50}.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work; since the current automated pipeline does not reliably align or distinguish these cases, human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a two-stage framework that first aggregates heterogeneous human comparative judgments into a consensus ranking, then calibrates model scores via isotonic projection to enforce ordinal consistency. It resides in the 'Calibration and Aggregation Methods for Hybrid Assessment' leaf, which contains only three papers total. This is a relatively sparse research direction within the broader taxonomy of 50 papers across 36 topics, suggesting the specific combination of rank aggregation with isotonic calibration for human-centered assessment represents an underexplored methodological niche.

The taxonomy reveals that neighboring leaves address related but distinct challenges: 'Human-in-the-Loop Learning' focuses on iterative model refinement during training, while 'Selective Prediction and Deferral' examines when to route decisions to humans. The sibling papers in this leaf explore confusion matrix calibration and chimeric forecasting, which blend human and algorithmic signals but differ in sequencing—some calibrate before aggregation, others explore ensemble strategies. The paper's aggregate-then-calibrate ordering distinguishes it from these alternatives, positioning it at the intersection of rank aggregation theory and ordinal calibration techniques.

Among the 21 candidates examined across the three claimed contributions, the core AtC framework had one refutable candidate out of eight examined, indicating some methodological overlap within the limited search scope. For the theoretical efficiency guarantees, three candidates were examined and none was refuting, suggesting this angle may be more novel. For the formalization of human-centered assessment problems, ten candidates were examined with no refutations, though this broader framing naturally intersects with existing evaluation-methodology literature. These statistics reflect a focused semantic search rather than exhaustive coverage, so unexamined work in rank aggregation or isotonic regression could present additional overlaps.

Based on the top-21 semantic matches and taxonomy structure, the work appears to occupy a methodologically distinct position within a sparse research direction. The aggregate-then-calibrate sequencing and theoretical analysis of annotator heterogeneity differentiate it from sibling approaches, though the limited search scope means potential overlaps in broader rank aggregation or calibration literature remain unassessed. The taxonomy context suggests genuine contribution to an underexplored methodological intersection, contingent on the boundaries of the examined candidate set.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 1

Research Landscape Overview

Core task: combining human judgments with model predictions for assessment tasks. This field addresses how to integrate human expertise with automated model outputs to produce more reliable evaluations across diverse domains.

The taxonomy reveals three main branches. Human-AI Collaborative Assessment Frameworks explore architectural designs for combining human and machine intelligence, including calibration and aggregation methods that reconcile potentially conflicting signals from both sources. Evaluation Methodologies for Human-Model Alignment develop metrics and protocols to measure how well models align with human judgment, encompassing both automated evaluation techniques like G-eval[1] and benchmarks such as AGIEval[3]. Application Domains and Task-Specific Implementations demonstrate practical deployments ranging from clinical decision support (Clinical Decision Evaluation[5], Medical Diagnosis Generation[6]) to educational assessment (Automated Essay Scoring[40], Business English Assessment[35]) and creative content evaluation (Story Generation Benchmark[18]). These branches reflect a progression from foundational methods through validation frameworks to real-world instantiations.

Particularly active lines of work center on calibration strategies that adjust model confidence to match human reliability, aggregation schemes that optimally weight human and machine contributions, and domain-specific adaptations that account for task characteristics. A key tension emerges between fully automated approaches that minimize human effort and hybrid systems that preserve human oversight for high-stakes decisions, as seen in medical applications (Suicide Risk HITL[23], HPV Screening Strategy[42]) versus more automated pipelines in content moderation or translation metrics (Machine Translation Metrics[36]).
Aggregate then Calibrate[0] sits within the calibration and aggregation methods cluster, proposing a two-stage approach that first combines predictions before adjusting for systematic biases. This contrasts with works like Confusion Matrix Calibration[21] that calibrate individual model outputs before aggregation, and Chimeric Forecasting[25] which explores ensemble strategies for blending human and algorithmic forecasts, highlighting ongoing debates about optimal sequencing and weighting in hybrid assessment pipelines.

Claimed Contributions

Aggregate-then-Calibrate (AtC) framework

A two-stage framework that first aggregates heterogeneous comparative human judgments into a consensus ranking using a rank-aggregation model accounting for annotator reliability, then calibrates any predictive model's scores via isotonic projection onto this consensus order. This approach combines ordinal information from human judgments with quantitative information from model predictions.

Retrieved papers: 8 (one refutable candidate)
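The Stage-1 idea of weighting annotators by estimated reliability when forming a consensus ranking can be sketched as follows. This is an illustrative reconstruction under simplifying assumptions, not the paper's released model: the function name `aggregate_ranking`, the `(annotator, winner, loser)` judgment format, the Borda-style scoring, and the agreement-based reliability update are all assumptions made for this sketch.

```python
# Minimal sketch of reliability-weighted rank aggregation from pairwise
# judgments. Illustrative only; not the AtC paper's actual Stage-1 model.
from collections import defaultdict

def aggregate_ranking(judgments, n_items, n_iters=10):
    """judgments: list of (annotator_id, winner, loser) comparisons.
    Returns item indices sorted best-first by reliability-weighted score."""
    reliability = defaultdict(lambda: 1.0)  # start all annotators equal
    scores = [0.0] * n_items
    for _ in range(n_iters):
        # Borda-style consensus scores under current reliability weights
        scores = [0.0] * n_items
        for a, w, l in judgments:
            scores[w] += reliability[a]
            scores[l] -= reliability[a]
        # Re-estimate reliability as the fraction of an annotator's
        # judgments that agree with the current consensus ordering
        agree, total = defaultdict(int), defaultdict(int)
        for a, w, l in judgments:
            total[a] += 1
            if scores[w] >= scores[l]:
                agree[a] += 1
        for a in total:
            reliability[a] = agree[a] / total[a]
    return sorted(range(n_items), key=lambda i: -scores[i])
```

With two annotators who consistently rank items 0 > 1 > 2 and one adversarial annotator who reverses every comparison, the adversary's reliability collapses toward zero and the consensus follows the majority.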
Theoretical efficiency and optimality guarantees

Three theoretical results establishing that heterogeneous annotator modeling is more statistically efficient than homogeneous methods, isotonic calibration provides risk bounds under ranking misspecification, and the AtC framework asymptotically outperforms using model predictions alone.

Retrieved papers: 3 (none refuting)
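The Stage-2 step, an isotonic projection of model scores onto a consensus order, can be illustrated with the classic pool-adjacent-violators algorithm (PAVA), which computes the L2 projection onto monotone sequences. This is a hedged sketch, not the authors' implementation: the function names (`pava_nonincreasing`, `calibrate`) and the best-first ranking convention are assumptions for illustration.

```python
# Minimal sketch of isotonic calibration onto a consensus ranking pi_hat
# (best first). Illustrative only; not the AtC paper's actual Stage-2 code.

def pava_nonincreasing(y):
    """L2 projection of y onto non-increasing sequences via PAVA."""
    merged = []  # list of [block mean, block size] on negated values
    for v in y:
        merged.append([-v, 1])
        # Pool adjacent blocks while they violate non-decreasing order
        while len(merged) > 1 and merged[-2][0] > merged[-1][0]:
            m2, s2 = merged.pop()
            m1, s1 = merged.pop()
            merged.append([(m1 * s1 + m2 * s2) / (s1 + s2), s1 + s2])
    out = []
    for mean, size in merged:
        out.extend([-mean] * size)  # negate back
    return out

def calibrate(scores, pi_hat):
    """Return scores made ordinally consistent with pi_hat (best first)."""
    ordered = [scores[i] for i in pi_hat]   # scores in consensus order
    fitted = pava_nonincreasing(ordered)    # enforce monotonicity
    calibrated = [0.0] * len(scores)
    for rank, item in enumerate(pi_hat):
        calibrated[item] = fitted[rank]
    return calibrated
```

Where the model scores already respect the consensus order they are left unchanged; where they violate it, adjacent violating entries are pooled to their mean, which is how quantitative information is preserved subject to the ordinal constraint.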
Formalization of human-centered assessment problems

A conceptual contribution that formally defines human-centered assessment tasks as problems requiring systematic decision-making based on human judgments when ground truth is costly, unobservable, or only available in the future, distinguishing these from individual preference satisfaction problems.

Retrieved papers: 10 (none refuting)

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Aggregate-then-Calibrate (AtC) framework


Contribution

Theoretical efficiency and optimality guarantees


Contribution

Formalization of human-centered assessment problems
