How Can I Publish My LLM Benchmark Without Giving the True Answers Away?

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: benchmark, LLM, Bayes error, Bayes accuracy, memorization
Abstract:

Publishing a large language model (LLM) benchmark on the Internet risks contaminating future LLMs: the benchmark may be unintentionally (or intentionally) used to train or select a model. A common mitigation is to keep the benchmark private and let participants submit their models or predictions to the organizers. However, this strategy requires trust in a single organization and still permits test-set overfitting through repeated queries. To overcome these issues, we propose a way to publish benchmarks without fully disclosing the ground-truth answers to the questions, while still maintaining the ability to openly evaluate LLMs. The main underlying idea is to reduce the best possible accuracy, i.e., the Bayes accuracy, by injecting randomness into the answers: we prepare several logically correct answers and include only one of them as the solution in the benchmark. Besides keeping the ground truth undisclosed, this construction offers a test for detecting data contamination. In principle, even fully capable models should not surpass the Bayes accuracy; a model that exceeds this ceiling therefore gives a strong signal of data contamination. We present experimental evidence that our method detects data contamination accurately across a wide range of benchmarks, models, and training methodologies.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes CapBencher, a method to publish benchmarks while concealing ground-truth answers through randomization over multiple logically correct alternatives. This work resides in the 'Answer-Hiding and Randomization Techniques' leaf of the taxonomy, of which it is currently the sole member. This positioning reflects a sparse research direction within the broader 'Contamination-Resistant Benchmark Design' branch, suggesting the approach addresses a relatively unexplored strategy compared to more populated areas like dynamic benchmarks or statistical detection methods.

The taxonomy reveals neighboring approaches in sibling leaves: 'Dynamic and Continuously Updated Benchmarks' contains five papers focusing on temporal refresh strategies like LiveBench, while 'Decontaminated and Reconstructed Static Benchmarks' includes three papers on filtering contamination post-hoc. The answer-hiding strategy diverges from these by maintaining static benchmark content while controlling information disclosure rather than updating test instances or removing contaminated data. This positions the work at the intersection of benchmark design and information-theoretic contamination prevention, complementing detection-focused branches that attempt to identify contamination after the fact.

Among twenty-four candidates examined across three contributions, no clearly refuting prior work was identified. The core CapBencher method examined ten candidates with zero refutations, the Bayes accuracy ceiling detection mechanism examined four candidates with zero refutations, and theoretical guarantees examined ten candidates with zero refutations. This suggests that within the limited search scope of top-K semantic matches and citation expansion, the specific combination of answer randomization, Bayes accuracy capping, and contamination detection appears not to have direct precedent in the examined literature, though the search scale remains modest relative to the full field.

The analysis reflects a targeted literature search rather than exhaustive coverage, examining approximately half the papers in the fifty-paper taxonomy. The absence of refuting candidates among this sample indicates potential novelty in the answer-hiding paradigm, particularly given the leaf's isolation within the taxonomy structure. However, the limited search scope and the paper's position as the sole occupant of its leaf warrant cautious interpretation—broader examination might reveal related work in adjacent communities or application domains not captured by semantic similarity to this specific approach.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 0

Research Landscape Overview

Core task: Detecting data contamination in large language model benchmarks. The field has organized itself around several complementary directions. Contamination Detection Methods encompass techniques for identifying whether test data appeared during pretraining, ranging from membership inference approaches like Detecting Pretraining Data[4] to statistical measures such as perplexity-based detection. Contamination Characterization and Impact Analysis examines how contamination affects model performance and generalization, with works like Generalization or Memorization[30] exploring the boundary between true learning and data leakage. Contamination-Resistant Benchmark Design focuses on creating evaluation protocols that inherently resist contamination through dynamic generation, answer-hiding, or temporal controls, exemplified by LiveBench[26] and LiveCodeBench[15]. Contamination Mitigation and Evaluation Frameworks develop methods to adjust for or remove contamination effects, while Specialized Contamination Contexts address domain-specific challenges in areas like vision-language models or cross-lingual settings, and Benchmark and Evaluation Methodology provides meta-level analysis of evaluation practices themselves.

A particularly active tension exists between detection-focused approaches that attempt to identify contamination post hoc and design-focused solutions that prevent it proactively. Detection methods face fundamental limitations, as highlighted by Contamination Detection Limitations[9] and Evading Contamination Detection[11], since adversaries can obscure training data and models often lack transparency. This has motivated increased interest in contamination-resistant designs.

Publishing Without Answers[0] sits squarely within the Answer-Hiding and Randomization Techniques cluster, proposing to withhold ground-truth labels from public benchmarks to prevent direct memorization. This approach contrasts with dynamic benchmark strategies like LiveBench[26] that continuously refresh test instances, and complements detection surveys such as Data Contamination Survey[1] and Benchmark Contamination Survey[3] by shifting emphasis from identifying contamination to architecturally preventing it through information control.

Claimed Contributions

CapBencher: A method to publish benchmarks without disclosing ground-truth answers

The authors introduce CapBencher, a methodology that intentionally reduces the Bayes accuracy of benchmarks by injecting randomness into answers (providing multiple logically correct answers but publishing only one). This allows benchmark creators to release datasets publicly without revealing true answers, while still enabling open evaluation of language models.
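The construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the field names (`question`, `answers`) and the uniform choice over answers are assumptions, and it supposes each question's alternatives are interchangeable, so the best achievable per-question accuracy is 1/k when k answers exist.

```python
import random

def cap_benchmark(questions, seed=0):
    """Sketch of answer-randomized benchmark publication: for each
    question, publish exactly one answer chosen uniformly at random
    from its set of logically correct alternatives. A model that knows
    every correct answer, but not which one was drawn, can expect at
    most 1/k accuracy on a question with k alternatives."""
    rng = random.Random(seed)
    published = []
    for q in questions:
        published.append({
            "question": q["question"],
            "answer": rng.choice(q["answers"]),  # the only answer disclosed
        })
    # capped (reduced) Bayes accuracy: mean of 1/k over all questions
    bayes_acc = sum(1 / len(q["answers"]) for q in questions) / len(questions)
    return published, bayes_acc
```

Under these assumptions the released file reveals only one alternative per question, while `bayes_acc` gives the ceiling that clean models should not exceed.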

10 retrieved papers
Data contamination detection via Bayes accuracy ceiling

The method provides a natural contamination detection mechanism: if a model's accuracy exceeds the deliberately reduced Bayes accuracy (the ceiling), this signals data contamination. The authors propose using a binomial test to statistically determine whether performance above the Bayes accuracy indicates contamination.
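A one-sided binomial test of this kind can be written with only the standard library. This is a hedged sketch of the idea rather than the paper's exact procedure: it assumes a clean model's per-question success probability is at most the capped Bayes accuracy and that questions are answered independently.

```python
from math import comb

def binom_tail_pvalue(n_correct, n, p):
    """P[X >= n_correct] for X ~ Binomial(n, p): the chance that a clean
    model, whose per-question success probability is at most the Bayes
    ceiling p, scores n_correct or more out of n by luck alone."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n_correct, n + 1))
```

For example, 70 correct out of 100 questions on a benchmark capped at a Bayes accuracy of 0.5 yields a tiny p-value, flagging likely contamination, whereas 50/100 is entirely consistent with a clean model.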

4 retrieved papers
Theoretical guarantees relating capped and original scores

The authors establish theoretical results (Theorem 1 and Corollary 1) showing an affine relationship between capped benchmark scores and original scores, and provide an unbiased estimator of original accuracy. This demonstrates that CapBencher preserves the ability to track model improvements despite reducing Bayes accuracy.
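An affine relation of this form is easy to invert, which is what makes the capped scores still useful for tracking progress. The snippet below is an illustration under assumed parameters, not the paper's theorem: the coefficients `a` and `b` stand in for whatever the construction induces (e.g., with k interchangeable answers per question and no accidental matches, one would expect a = 1/k and b = 0).

```python
def estimate_original_accuracy(capped_score, a, b):
    """Invert an assumed affine map  capped = a * original + b
    to recover an estimate of the original (uncapped) accuracy.
    The coefficients a and b depend on the benchmark construction."""
    return (capped_score - b) / a

# Illustration with hypothetical numbers: k = 3 answers per question,
# so a = 1/3 and b = 0 under the no-accidental-match assumption.
k = 3
original = 0.9                 # true accuracy on the uncapped benchmark
capped = original / k          # expected score on the capped benchmark
assert abs(estimate_original_accuracy(capped, 1 / k, 0.0) - original) < 1e-9
```

Because the map is affine, an unbiased estimate of the capped score transforms into an unbiased estimate of the original score, so model rankings and improvement trends are preserved.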

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

CapBencher: A method to publish benchmarks without disclosing ground-truth answers

The authors introduce CapBencher, a methodology that intentionally reduces the Bayes accuracy of benchmarks by injecting randomness into answers (providing multiple logically correct answers but publishing only one). This allows benchmark creators to release datasets publicly without revealing true answers, while still enabling open evaluation of language models.

Contribution

Data contamination detection via Bayes accuracy ceiling

The method provides a natural contamination detection mechanism: if a model's accuracy exceeds the deliberately reduced Bayes accuracy (the ceiling), this signals data contamination. The authors propose using a binomial test to statistically determine whether performance above the Bayes accuracy indicates contamination.

Contribution

Theoretical guarantees relating capped and original scores

The authors establish theoretical results (Theorem 1 and Corollary 1) showing an affine relationship between capped benchmark scores and original scores, and provide an unbiased estimator of original accuracy. This demonstrates that CapBencher preserves the ability to track model improvements despite reducing Bayes accuracy.