How Can I Publish My LLM Benchmark Without Giving the True Answers Away?
Overview
Overall Novelty Assessment
The paper proposes CapBencher, a method for publishing benchmarks while concealing ground-truth answers through randomization over multiple logically correct alternatives. The work resides in the 'Answer-Hiding and Randomization Techniques' leaf of the taxonomy, of which it is currently the sole member. This isolation marks a sparse research direction within the broader 'Contamination-Resistant Benchmark Design' branch, suggesting the approach addresses a relatively unexplored strategy compared to more populated areas such as dynamic benchmarks or statistical detection methods.
The taxonomy reveals neighboring approaches in sibling leaves: 'Dynamic and Continuously Updated Benchmarks' contains five papers focusing on temporal refresh strategies like LiveBench, while 'Decontaminated and Reconstructed Static Benchmarks' includes three papers on filtering contamination post-hoc. The answer-hiding strategy diverges from these by maintaining static benchmark content while controlling information disclosure rather than updating test instances or removing contaminated data. This positions the work at the intersection of benchmark design and information-theoretic contamination prevention, complementing detection-focused branches that attempt to identify contamination after the fact.
Among the twenty-four candidates examined across the three contributions, no clearly refuting prior work was identified: ten candidates were examined for the core CapBencher method, four for the Bayes accuracy ceiling detection mechanism, and ten for the theoretical guarantees, each with zero refutations. This suggests that, within the limited search scope of top-K semantic matches and citation expansion, the specific combination of answer randomization, Bayes accuracy capping, and contamination detection has no direct precedent in the examined literature, though the search scale remains modest relative to the full field.
The analysis reflects a targeted literature search rather than exhaustive coverage, examining roughly half of the fifty papers in the taxonomy. The absence of refuting candidates in this sample indicates potential novelty for the answer-hiding paradigm, particularly given the leaf's isolation within the taxonomy structure. However, the limited search scope and the paper's position as the sole occupant of its leaf warrant cautious interpretation: a broader examination might surface related work in adjacent communities or application domains not captured by semantic similarity to this specific approach.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce CapBencher, a methodology that intentionally reduces the Bayes accuracy of benchmarks by injecting randomness into answers (providing multiple logically correct answers but publishing only one). This allows benchmark creators to release datasets publicly without revealing true answers, while still enabling open evaluation of language models.
The method provides a natural contamination detection mechanism: if a model's accuracy exceeds the deliberately reduced Bayes accuracy (the ceiling), this signals data contamination. The authors propose a binomial test to determine whether performance above the Bayes ceiling is statistically significant evidence of contamination.
The authors establish theoretical results (Theorem 1 and Corollary 1) showing an affine relationship between capped benchmark scores and original scores, and provide an unbiased estimator of original accuracy. This demonstrates that CapBencher preserves the ability to track model improvements despite reducing Bayes accuracy.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
Contribution Analysis
Detailed comparisons for each claimed contribution
CapBencher: A method to publish benchmarks without disclosing ground-truth answers
The authors introduce CapBencher, a methodology that intentionally reduces the Bayes accuracy of benchmarks by injecting randomness into answers (providing multiple logically correct answers but publishing only one). This allows benchmark creators to release datasets publicly without revealing true answers, while still enabling open evaluation of language models.
[65] … education assessment: a multidisciplinary and multi-institutional benchmarking and analysis of this generative artificial intelligence tool to investigate assessment … PDF
[66] DefAn: Definitive Answer Dataset for LLM Hallucination Evaluation PDF
[67] Evaluating Explanation Without Ground Truth in Interpretable Machine Learning PDF
[68] A large, curated, open-source stroke neuroimaging dataset to improve lesion segmentation algorithms PDF
[69] DefAn: Definitive Answer Dataset for LLMs Hallucination Evaluation PDF
[70] AIM 2024 Sparse Neural Rendering Challenge: Dataset and Benchmark PDF
[71] Dataset for the first evaluation on Chinese machine reading comprehension PDF
[72] Benchmarking Counterfactual Image Generation PDF
[73] ORBIT--Open Recommendation Benchmark for Reproducible Research with Hidden Tests PDF
[74] Lessons Learned from the ADRENALIN Load Disaggregation Challenge PDF
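To make the answer-hiding idea above concrete, here is a minimal toy sketch (an illustration of the general mechanism, not the authors' implementation): each item has several logically correct answers, and only one, chosen uniformly at random, is published as the key. A model that is always logically correct but cannot know which alternative was published matches the key with probability 1/k, so the benchmark's attainable (Bayes) accuracy is deliberately capped below 1.

```python
import random

# Toy sketch of answer hiding (hypothetical items, not from the paper):
# each item has a set of logically correct answers; one is sampled
# uniformly at random and published as "the" answer key.
random.seed(0)

items = [
    ("Name an even prime.", {"2", "two"}),
    ("Give a root of x^2 - 1 = 0.", {"1", "-1"}),
]

# Published benchmark: question plus a single randomly chosen key.
published = [(q, random.choice(sorted(answers))) for q, answers in items]

def score(model_answers, published):
    """Accuracy measured against the single published key per item."""
    hits = sum(a == key for a, (_, key) in zip(model_answers, published))
    return hits / len(published)
```

A clean model that picks uniformly among the logically correct alternatives matches each published key with probability 1/k, which is what caps the benchmark's Bayes accuracy.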
Data contamination detection via Bayes accuracy ceiling
The method provides a natural contamination detection mechanism: if a model's accuracy exceeds the deliberately reduced Bayes accuracy (the ceiling), this signals data contamination. The authors propose a binomial test to determine whether performance above the Bayes ceiling is statistically significant evidence of contamination.
[61] Benchmarking computational doublet-detection methods for single-cell RNA sequencing data PDF
[62] Evaluating Classification Model Against Bayes Error Rate PDF
[63] Multivariate Gaussian Bayes classifier with limited data for segmentation of clean and contaminated regions in the small bowel capsule endoscopy images. PDF
[64] Classification under data contamination with application to image mis-registration in remote sensing PDF
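The binomial test described above can be sketched as follows (an assumed form of the check, not the paper's exact procedure): under the null hypothesis of no contamination, a model's per-item success probability cannot exceed the published Bayes ceiling c, so observing x correct answers out of n items has an exact upper-tail p-value under Binomial(n, c). The ceiling, item count, and threshold below are hypothetical values for illustration.

```python
from math import comb

def binom_sf(x, n, p):
    """Exact upper tail P(X >= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x, n + 1))

ceiling = 0.5           # hypothetical reduced Bayes accuracy of the benchmark
n_items = 200           # hypothetical benchmark size
observed_correct = 130  # 65% accuracy, well above the 50% ceiling

# Small p-value => accuracy this far above the ceiling is implausible
# without the model having seen the published answer keys.
p_value = binom_sf(observed_correct, n_items, ceiling)
contaminated = p_value < 0.01
```

Here 130/200 correct against a 0.5 ceiling is roughly four standard deviations above the null mean, so the test flags contamination.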
Theoretical guarantees relating capped and original scores
The authors establish theoretical results (Theorem 1 and Corollary 1) showing an affine relationship between capped benchmark scores and original scores, and provide an unbiased estimator of original accuracy. This demonstrates that CapBencher preserves the ability to track model improvements despite reducing Bayes accuracy.
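The affine relationship and unbiased estimator can be illustrated with a small simulation under a simplifying assumption (uniform randomization over k alternatives per item; the paper's Theorem 1 may be more general): a model with true accuracy p that answers with a uniformly random correct alternative when right, and never matches the key when wrong, scores p/k in expectation on the capped benchmark, so k times the capped score recovers the original accuracy without bias.

```python
import random

# Simulation of the assumed capped-vs-original score relation
# (an illustrative model, not the paper's theorem): with k alternatives
# per item, a model right with probability p matches the published key
# with probability p/k, so k * capped_score estimates p unbiasedly.
random.seed(1)

def simulate_capped_score(p, k, n_items, n_trials=2000):
    """Average capped-benchmark accuracy over many simulated runs."""
    total = 0.0
    for _ in range(n_trials):
        hits = 0
        for _ in range(n_items):
            correct = random.random() < p                    # model is right
            hits += correct and random.randrange(k) == 0     # hits the key
        total += hits / n_items
    return total / n_trials

p_true, k = 0.8, 4
capped = simulate_capped_score(p_true, k, n_items=50)
estimate = k * capped  # unbiased recovery of the original accuracy
```

Under these assumptions the capped score concentrates around p/k = 0.2 and the rescaled estimate around p = 0.8, which is the sense in which model improvements remain trackable despite the reduced Bayes accuracy.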