How Can I Publish My LLM Benchmark Without Giving the True Answers Away?

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: benchmark, LLM, Bayes error, Bayes accuracy, memorization
Abstract:

Publishing a large language model (LLM) benchmark on the Internet risks contaminating future LLMs: the benchmark may be unintentionally (or intentionally) used to train or select a model. A common mitigation is to keep the benchmark private and let participants submit their models or predictions to the organizers. However, this strategy requires trust in a single organization and still permits test-set overfitting through repeated queries. To overcome these issues, we propose a way to publish benchmarks without fully disclosing the ground-truth answers to the questions, while still maintaining the ability to openly evaluate LLMs. The main underlying idea is to reduce the best possible accuracy, i.e., the Bayes accuracy, by injecting randomness into the answers: we prepare several logically correct answers and include only one of them as the solution in the benchmark. Besides keeping the ground truth undisclosed, this construction offers a test for detecting data contamination. In principle, even fully capable models should not surpass the Bayes accuracy; a model that exceeds this ceiling therefore gives a strong signal of data contamination. We present experimental evidence that our method detects data contamination accurately across a wide range of benchmarks, models, and training methodologies.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes CapBencher, a method to publish benchmarks while concealing ground-truth answers through randomization over multiple logically correct alternatives. This work resides in the 'Answer-Hiding and Randomization Techniques' leaf of the taxonomy, of which it is currently the sole member. This positioning reflects a sparse research direction within the broader 'Contamination-Resistant Benchmark Design' branch, suggesting the approach addresses a relatively unexplored strategy compared to more populated areas like dynamic benchmarks or statistical detection methods.

The taxonomy reveals neighboring approaches in sibling leaves: 'Dynamic and Continuously Updated Benchmarks' contains five papers focusing on temporal refresh strategies like LiveBench, while 'Decontaminated and Reconstructed Static Benchmarks' includes three papers on filtering contamination post-hoc. The answer-hiding strategy diverges from these by maintaining static benchmark content while controlling information disclosure rather than updating test instances or removing contaminated data. This positions the work at the intersection of benchmark design and information-theoretic contamination prevention, complementing detection-focused branches that attempt to identify contamination after the fact.

Among twenty-four candidates examined across three contributions, no clearly refuting prior work was identified. The core CapBencher method examined ten candidates with zero refutations, the Bayes accuracy ceiling detection mechanism examined four candidates with zero refutations, and theoretical guarantees examined ten candidates with zero refutations. This suggests that within the limited search scope of top-K semantic matches and citation expansion, the specific combination of answer randomization, Bayes accuracy capping, and contamination detection appears not to have direct precedent in the examined literature, though the search scale remains modest relative to the full field.

The analysis reflects a targeted literature search rather than exhaustive coverage, examining approximately half the papers in the fifty-paper taxonomy. The absence of refuting candidates among this sample indicates potential novelty in the answer-hiding paradigm, particularly given the leaf's isolation within the taxonomy structure. However, the limited search scope and the paper's position as the sole occupant of its leaf warrant cautious interpretation—broader examination might reveal related work in adjacent communities or application domains not captured by semantic similarity to this specific approach.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 0

Research Landscape Overview

Core task: Detecting data contamination in large language model benchmarks. The field has organized itself around several complementary directions. Contamination Detection Methods encompass techniques for identifying whether test data appeared during pretraining, ranging from membership inference approaches like Detecting Pretraining Data[4] to statistical measures such as perplexity-based detection. Contamination Characterization and Impact Analysis examines how contamination affects model performance and generalization, with works like Generalization or Memorization[30] exploring the boundary between true learning and data leakage. Contamination-Resistant Benchmark Design focuses on creating evaluation protocols that inherently resist contamination through dynamic generation, answer-hiding, or temporal controls, exemplified by LiveBench[26] and LiveCodeBench[15]. Contamination Mitigation and Evaluation Frameworks develop methods to adjust for or remove contamination effects, while Specialized Contamination Contexts address domain-specific challenges in areas like vision-language models or cross-lingual settings, and Benchmark and Evaluation Methodology provides meta-level analysis of evaluation practices themselves.

A particularly active tension exists between detection-focused approaches that attempt to identify contamination post hoc and design-focused solutions that prevent it proactively. Detection methods face fundamental limitations, as highlighted by Contamination Detection Limitations[9] and Evading Contamination Detection[11], since adversaries can obscure training data and models often lack transparency. This has motivated increased interest in contamination-resistant designs.

Publishing Without Answers[0] sits squarely within the Answer-Hiding and Randomization Techniques cluster, proposing to withhold ground-truth labels from public benchmarks to prevent direct memorization. This approach contrasts with dynamic benchmark strategies like LiveBench[26] that continuously refresh test instances, and complements detection surveys such as Data Contamination Survey[1] and Benchmark Contamination Survey[3] by shifting emphasis from identifying contamination to architecturally preventing it through information control.

Claimed Contributions

CapBencher: A method to publish benchmarks without disclosing ground-truth answers

The authors introduce CapBencher, a methodology that intentionally reduces the Bayes accuracy of benchmarks by injecting randomness into answers (providing multiple logically correct answers but publishing only one). This allows benchmark creators to release datasets publicly without revealing true answers, while still enabling open evaluation of language models.
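The construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the field names (`question`, `answers`) and the uniform choice over answers are assumptions, and it supposes each question's alternatives are interchangeable, so the best achievable per-question accuracy is 1/k when k answers exist.

```python
import random

def cap_benchmark(questions, seed=0):
    """Sketch of answer-randomized benchmark publication: for each
    question, publish exactly one answer chosen uniformly at random
    from its set of logically correct alternatives. A model that knows
    every correct answer, but not which one was drawn, can expect at
    most 1/k accuracy on a question with k alternatives."""
    rng = random.Random(seed)
    published = []
    for q in questions:
        published.append({
            "question": q["question"],
            "answer": rng.choice(q["answers"]),  # the only answer disclosed
        })
    # capped (reduced) Bayes accuracy: mean of 1/k over all questions
    bayes_acc = sum(1 / len(q["answers"]) for q in questions) / len(questions)
    return published, bayes_acc
```

Under these assumptions the released file reveals only one alternative per question, while `bayes_acc` gives the ceiling that clean models should not exceed.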

10 retrieved papers
Data contamination detection via Bayes accuracy ceiling

The method provides a natural contamination detection mechanism: if a model's accuracy exceeds the deliberately reduced Bayes accuracy (the ceiling), this signals data contamination. The authors propose using a binomial test to statistically determine whether performance above the Bayes accuracy indicates contamination.
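A one-sided binomial test of this kind can be written with only the standard library. This is a hedged sketch of the idea rather than the paper's exact procedure: it assumes a clean model's per-question success probability is at most the capped Bayes accuracy and that questions are answered independently.

```python
from math import comb

def binom_tail_pvalue(n_correct, n, p):
    """P[X >= n_correct] for X ~ Binomial(n, p): the chance that a clean
    model, whose per-question success probability is at most the Bayes
    ceiling p, scores n_correct or more out of n by luck alone."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n_correct, n + 1))
```

For example, 70 correct out of 100 questions on a benchmark capped at a Bayes accuracy of 0.5 yields a tiny p-value, flagging likely contamination, whereas 50/100 is entirely consistent with a clean model.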

4 retrieved papers
Theoretical guarantees relating capped and original scores

The authors establish theoretical results (Theorem 1 and Corollary 1) showing an affine relationship between capped benchmark scores and original scores, and provide an unbiased estimator of original accuracy. This demonstrates that CapBencher preserves the ability to track model improvements despite reducing Bayes accuracy.
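An affine relation of this form is easy to invert, which is what makes the capped scores still useful for tracking progress. The snippet below is an illustration under assumed parameters, not the paper's theorem: the coefficients `a` and `b` stand in for whatever the construction induces (e.g., with k interchangeable answers per question and no accidental matches, one would expect a = 1/k and b = 0).

```python
def estimate_original_accuracy(capped_score, a, b):
    """Invert an assumed affine map  capped = a * original + b
    to recover an estimate of the original (uncapped) accuracy.
    The coefficients a and b depend on the benchmark construction."""
    return (capped_score - b) / a

# Illustration with hypothetical numbers: k = 3 answers per question,
# so a = 1/3 and b = 0 under the no-accidental-match assumption.
k = 3
original = 0.9                 # true accuracy on the uncapped benchmark
capped = original / k          # expected score on the capped benchmark
assert abs(estimate_original_accuracy(capped, 1 / k, 0.0) - original) < 1e-9
```

Because the map is affine, an unbiased estimate of the capped score transforms into an unbiased estimate of the original score, so model rankings and improvement trends are preserved.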

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

CapBencher: A method to publish benchmarks without disclosing ground-truth answers

The authors introduce CapBencher, a methodology that intentionally reduces the Bayes accuracy of benchmarks by injecting randomness into answers (providing multiple logically correct answers but publishing only one). This allows benchmark creators to release datasets publicly without revealing true answers, while still enabling open evaluation of language models.

Contribution

Data contamination detection via Bayes accuracy ceiling

The method provides a natural contamination detection mechanism: if a model's accuracy exceeds the deliberately reduced Bayes accuracy (the ceiling), this signals data contamination. The authors propose using a binomial test to statistically determine whether performance above the Bayes accuracy indicates contamination.

Contribution

Theoretical guarantees relating capped and original scores

The authors establish theoretical results (Theorem 1 and Corollary 1) showing an affine relationship between capped benchmark scores and original scores, and provide an unbiased estimator of original accuracy. This demonstrates that CapBencher preserves the ability to track model improvements despite reducing Bayes accuracy.