AtC: Aggregate-then-Calibrate for Human-centered Assessment

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: human-centered assessment, judgment aggregation, calibration, misspecification, human-AI complementarity
Abstract:

Human-centered assessment tasks, which are essential for systematic decision-making, rely heavily on human judgment and typically lack verifiable ground truth. Existing approaches face a dilemma: methods using only human judgments suffer from heterogeneous expertise and inconsistent rating scales, while methods using only model-generated scores must learn from imperfect proxies or incomplete features. We propose Aggregate-then-Calibrate (AtC), a two-stage framework that combines these complementary sources. Stage-1 aggregates heterogeneous comparative judgments into a consensus ranking $\hat{\pi}$ using a rank-aggregation model that accounts for annotator reliability. Stage-2 calibrates any predictive model's scores by an isotonic projection onto the order $\hat{\pi}$, enforcing ordinal consistency while preserving as much of the model's quantitative information as possible. Theoretically, we show: (1) modeling annotator heterogeneity yields strictly more efficient consensus estimation than assuming homogeneity; (2) isotonic calibration enjoys risk bounds even when the consensus ranking is misspecified; and (3) AtC asymptotically outperforms model-only assessment. Across semi-synthetic and real-world datasets, AtC consistently improves accuracy and robustness over human-only or model-only assessments. Our results bridge judgment aggregation with model-free calibration, providing a principled recipe for human-centered assessment when ground truth is costly, scarce, or unverifiable. The data and code are available at \url{https://anonymous.4open.science/r/12500_AtC_supp-4F50}.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work; since the current automated pipeline does not reliably align or distinguish these cases, human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a two-stage framework that first aggregates heterogeneous human comparative judgments into a consensus ranking, then calibrates model scores via isotonic projection to enforce ordinal consistency. It resides in the 'Calibration and Aggregation Methods for Hybrid Assessment' leaf, which contains only three papers total. This is a relatively sparse research direction within the broader taxonomy of 50 papers across 36 topics, suggesting the specific combination of rank aggregation with isotonic calibration for human-centered assessment represents an underexplored methodological niche.

The taxonomy reveals that neighboring leaves address related but distinct challenges: 'Human-in-the-Loop Learning' focuses on iterative model refinement during training, while 'Selective Prediction and Deferral' examines when to route decisions to humans. The sibling papers in this leaf explore confusion matrix calibration and chimeric forecasting, which blend human and algorithmic signals but differ in sequencing—some calibrate before aggregation, others explore ensemble strategies. The paper's aggregate-then-calibrate ordering distinguishes it from these alternatives, positioning it at the intersection of rank aggregation theory and ordinal calibration techniques.

Among the 21 candidates examined across the three claimed contributions, the core AtC framework had one refutable candidate out of eight examined, indicating some methodological overlap within the limited search scope. For the theoretical efficiency guarantees, three candidates were examined and none was refuting, suggesting this angle may be more novel. For the formalization of human-centered assessment problems, ten candidates were examined with no refutations, though this broader framing naturally intersects with existing evaluation-methodology literature. These statistics reflect a focused semantic search rather than exhaustive coverage, so unexamined work in rank aggregation or isotonic regression could present additional overlaps.

Based on the top-21 semantic matches and taxonomy structure, the work appears to occupy a methodologically distinct position within a sparse research direction. The aggregate-then-calibrate sequencing and theoretical analysis of annotator heterogeneity differentiate it from sibling approaches, though the limited search scope means potential overlaps in broader rank aggregation or calibration literature remain unassessed. The taxonomy context suggests genuine contribution to an underexplored methodological intersection, contingent on the boundaries of the examined candidate set.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 1

Research Landscape Overview

Core task: combining human judgments with model predictions for assessment tasks. This field addresses how to integrate human expertise with automated model outputs to produce more reliable evaluations across diverse domains.

The taxonomy reveals three main branches. Human-AI Collaborative Assessment Frameworks explore architectural designs for combining human and machine intelligence, including calibration and aggregation methods that reconcile potentially conflicting signals from both sources. Evaluation Methodologies for Human-Model Alignment develop metrics and protocols to measure how well models align with human judgment, encompassing both automated evaluation techniques like G-eval[1] and benchmarks such as AGIEval[3]. Application Domains and Task-Specific Implementations demonstrate practical deployments ranging from clinical decision support (Clinical Decision Evaluation[5], Medical Diagnosis Generation[6]) to educational assessment (Automated Essay Scoring[40], Business English Assessment[35]) and creative content evaluation (Story Generation Benchmark[18]). These branches reflect a progression from foundational methods through validation frameworks to real-world instantiations.

Particularly active lines of work center on calibration strategies that adjust model confidence to match human reliability, aggregation schemes that optimally weight human and machine contributions, and domain-specific adaptations that account for task characteristics. A key tension emerges between fully automated approaches that minimize human effort and hybrid systems that preserve human oversight for high-stakes decisions, as seen in medical applications (Suicide Risk HITL[23], HPV Screening Strategy[42]) versus more automated pipelines in content moderation or translation metrics (Machine Translation Metrics[36]).
Aggregate then Calibrate[0] sits within the calibration and aggregation methods cluster, proposing a two-stage approach that first combines predictions before adjusting for systematic biases. This contrasts with works like Confusion Matrix Calibration[21] that calibrate individual model outputs before aggregation, and Chimeric Forecasting[25] which explores ensemble strategies for blending human and algorithmic forecasts, highlighting ongoing debates about optimal sequencing and weighting in hybrid assessment pipelines.

Claimed Contributions

Aggregate-then-Calibrate (AtC) framework

A two-stage framework that first aggregates heterogeneous comparative human judgments into a consensus ranking using a rank-aggregation model accounting for annotator reliability, then calibrates any predictive model's scores via isotonic projection onto this consensus order. This approach combines ordinal information from human judgments with quantitative information from model predictions.

Retrieved papers: 8 (one refutable candidate)
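The Stage-1 idea of weighting annotators by estimated reliability when forming a consensus ranking can be sketched as follows. This is an illustrative reconstruction under simplifying assumptions, not the paper's released model: the function name `aggregate_ranking`, the `(annotator, winner, loser)` judgment format, the Borda-style scoring, and the agreement-based reliability update are all assumptions made for this sketch.

```python
# Minimal sketch of reliability-weighted rank aggregation from pairwise
# judgments. Illustrative only; not the AtC paper's actual Stage-1 model.
from collections import defaultdict

def aggregate_ranking(judgments, n_items, n_iters=10):
    """judgments: list of (annotator_id, winner, loser) comparisons.
    Returns item indices sorted best-first by reliability-weighted score."""
    reliability = defaultdict(lambda: 1.0)  # start all annotators equal
    scores = [0.0] * n_items
    for _ in range(n_iters):
        # Borda-style consensus scores under current reliability weights
        scores = [0.0] * n_items
        for a, w, l in judgments:
            scores[w] += reliability[a]
            scores[l] -= reliability[a]
        # Re-estimate reliability as the fraction of an annotator's
        # judgments that agree with the current consensus ordering
        agree, total = defaultdict(int), defaultdict(int)
        for a, w, l in judgments:
            total[a] += 1
            if scores[w] >= scores[l]:
                agree[a] += 1
        for a in total:
            reliability[a] = agree[a] / total[a]
    return sorted(range(n_items), key=lambda i: -scores[i])
```

With two annotators who consistently rank items 0 > 1 > 2 and one adversarial annotator who reverses every comparison, the adversary's reliability collapses toward zero and the consensus follows the majority.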
Theoretical efficiency and optimality guarantees

Three theoretical results establishing that heterogeneous annotator modeling is more statistically efficient than homogeneous methods, isotonic calibration provides risk bounds under ranking misspecification, and the AtC framework asymptotically outperforms using model predictions alone.

Retrieved papers: 3 (none refuting)
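The Stage-2 step, an isotonic projection of model scores onto a consensus order, can be illustrated with the classic pool-adjacent-violators algorithm (PAVA), which computes the L2 projection onto monotone sequences. This is a hedged sketch, not the authors' implementation: the function names (`pava_nonincreasing`, `calibrate`) and the best-first ranking convention are assumptions for illustration.

```python
# Minimal sketch of isotonic calibration onto a consensus ranking pi_hat
# (best first). Illustrative only; not the AtC paper's actual Stage-2 code.

def pava_nonincreasing(y):
    """L2 projection of y onto non-increasing sequences via PAVA."""
    merged = []  # list of [block mean, block size] on negated values
    for v in y:
        merged.append([-v, 1])
        # Pool adjacent blocks while they violate non-decreasing order
        while len(merged) > 1 and merged[-2][0] > merged[-1][0]:
            m2, s2 = merged.pop()
            m1, s1 = merged.pop()
            merged.append([(m1 * s1 + m2 * s2) / (s1 + s2), s1 + s2])
    out = []
    for mean, size in merged:
        out.extend([-mean] * size)  # negate back
    return out

def calibrate(scores, pi_hat):
    """Return scores made ordinally consistent with pi_hat (best first)."""
    ordered = [scores[i] for i in pi_hat]   # scores in consensus order
    fitted = pava_nonincreasing(ordered)    # enforce monotonicity
    calibrated = [0.0] * len(scores)
    for rank, item in enumerate(pi_hat):
        calibrated[item] = fitted[rank]
    return calibrated
```

Where the model scores already respect the consensus order they are left unchanged; where they violate it, adjacent violating entries are pooled to their mean, which is how quantitative information is preserved subject to the ordinal constraint.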
Formalization of human-centered assessment problems

A conceptual contribution that formally defines human-centered assessment tasks as problems requiring systematic decision-making based on human judgments when ground truth is costly, unobservable, or only available in the future, distinguishing these from individual preference satisfaction problems.

Retrieved papers: 10 (none refuting)

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Aggregate-then-Calibrate (AtC) framework


Contribution

Theoretical efficiency and optimality guarantees


Contribution

Formalization of human-centered assessment problems
