On The Fragility of Benchmark Contamination Detection in Reasoning Models

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Benchmark Contamination, Large Reasoning Model, Benchmark Contamination Detection
Abstract:

Leaderboards for large reasoning models (LRMs) have turned evaluation into a competition, incentivizing developers to optimize directly on benchmark suites. A shortcut to higher rankings is to incorporate evaluation benchmarks into the training data, yielding inflated performance, a practice known as benchmark contamination. Although numerous contamination detection approaches have been proposed, our studies find that evading contamination detection for LRMs is alarmingly easy. We focus on the two scenarios where contamination may occur in practice. (I) When a base model evolves into an LRM via supervised fine-tuning (SFT) and reinforcement learning (RL), contamination introduced during SFT can initially be identified by existing detection methods. Yet even a brief run of Group Relative Policy Optimization (GRPO) training can markedly conceal the contamination signals that most detection methods rely on. Further empirical experiments and theoretical analysis indicate that Proximal Policy Optimization (PPO) style importance sampling and clipping objectives are the root cause of this concealment, suggesting that a broad class of RL methods may inherently exhibit similar concealment capability. (II) When SFT contamination with chain-of-thought (CoT) data is applied to an advanced LRM as the final training stage, most contamination detection methods perform close to random guessing. Even without exposure to non-members, a contaminated LRM remains more confident when responding to unseen samples whose distribution is similar to its training set, and thus evades existing memorization-based detection methods. Together, our findings reveal a unique vulnerability of LRM evaluation: model developers could easily contaminate LRMs to achieve inflated leaderboard performance while leaving minimal traces of contamination, strongly undermining the fairness of evaluation and threatening the integrity of public leaderboards. This underscores the urgent need for advanced contamination detection methods and trustworthy evaluation protocols tailored to LRMs.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper investigates how reinforcement learning training can conceal benchmark contamination signals in large reasoning models, focusing on two training stages: supervised fine-tuning and RL optimization. It resides in the 'Evasion Techniques and Vulnerabilities' leaf, which contains only two papers total. This is a notably sparse research direction within the broader taxonomy of 50 papers across 24 leaf nodes, suggesting that adversarial perspectives on contamination detection remain underexplored compared to detection method development or benchmark design.

The taxonomy reveals substantial activity in adjacent areas: the parent category 'Contamination Evasion and Detection Robustness' also includes 'Detection Method Evaluation and Limitations' with two papers examining detection method failures. Meanwhile, sibling branches like 'Contamination Detection Methods' contain 15 papers across multiple detection approaches (black-box statistical methods, performance-based detection, white-box training data analysis). The paper's focus on RL-stage concealment connects to 'Fine-Tuning and RL-Stage Detection' methods but approaches the problem from an adversarial rather than defensive angle, examining how PPO-style objectives enable evasion.

Among 25 candidates examined across three contributions, none were found to clearly refute the paper's claims. The systematic study of contamination across training stages examined 9 candidates with 0 refutations; the RL concealment discovery examined 6 candidates with 0 refutations; and the CoT contamination evasion finding examined 10 candidates with 0 refutations. This suggests that within the limited search scope, the specific focus on RL training as a contamination concealment mechanism and the theoretical analysis of PPO-style importance sampling effects represent relatively unexplored territory, though the search scale precludes definitive conclusions about the broader literature.

The analysis indicates the work addresses a genuine gap in understanding adversarial dynamics between model training and contamination detection, particularly regarding RL optimization phases. However, the limited search scope (25 candidates from semantic search) and the sparse population of the evasion-focused taxonomy leaf mean this assessment reflects top-K semantic matches rather than exhaustive coverage. The novelty appears strongest in mechanistic analysis of how specific RL objectives conceal contamination, though broader claims about detection fragility should be contextualized within the examined candidate set.

Taxonomy

- Core-task Taxonomy Papers: 50
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 25
- Refutable Papers: 0

Research Landscape Overview

Core task: Benchmark contamination detection in large reasoning models. As large language models grow in scale and capability, ensuring that their impressive performance reflects genuine reasoning rather than memorization of benchmark data has become a central concern. The field has organized itself into several complementary branches: Contamination Detection Methods develop techniques to identify whether test data appeared during pretraining, ranging from membership inference approaches like Detecting Pretraining Data[1] to statistical methods such as ConStat[7]. Contamination-Resistant Benchmark Design focuses on creating evaluation sets that remain valid over time, exemplified by continuously updated platforms like LiveCodeBench[2] and dynamic benchmarks. Domain-Specific Contamination Studies examine leakage in specialized areas such as medical reasoning or code generation, while Contamination Surveys and Empirical Analyses provide broad perspectives on the scope and severity of the problem across the ecosystem. Meanwhile, Contamination Evasion and Detection Robustness investigates how models might circumvent detection and how robust current methods truly are.

A particularly active tension exists between detection methods and their limitations: while many studies propose contamination indicators, works like Contamination Detection Limitations[12] and Evading Contamination Detection[16] reveal that sophisticated training procedures can produce contamination-like signals without actual leakage, or conversely, evade existing detection schemes. Fragility Contamination Detection[0] sits squarely within this robustness-focused branch, examining how fragile current detection approaches are when models employ evasion strategies. Its emphasis on adversarial scenarios contrasts with more straightforward detection proposals like Data Contamination Quiz[3], which assumes cooperative evaluation settings.
By exploring vulnerabilities in detection pipelines, Fragility Contamination Detection[0] complements the adversarial perspective of Evading Contamination Detection[16], together highlighting that the contamination problem extends beyond simply identifying overlap to understanding the strategic dynamics between model developers and evaluators.
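As a concrete illustration of the memorization-based detectors discussed above, here is a minimal sketch of a Min-K%-style membership score in the spirit of Detecting Pretraining Data[1]. The function name and default `k` are illustrative choices, not taken from any paper; a real detector would obtain per-token log-probabilities from the model under test.

```python
def min_k_score(token_logprobs, k=0.2):
    """Mean log-probability of the k-fraction least likely tokens.

    Intuition: a model tends to assign unusually high probability even to
    the "hard" tokens of a sequence it memorized during training, so a
    higher (less negative) score suggests membership. A threshold on this
    score yields a member / non-member guess.
    """
    n = max(1, int(len(token_logprobs) * k))  # at least one token
    lowest = sorted(token_logprobs)[:n]       # the n least likely tokens
    return sum(lowest) / n
```

A benchmark sample scoring above a calibrated threshold would be flagged as likely seen in training. The report's central point is that RL-stage training and final-stage CoT contamination can erase exactly the member/non-member gap that such scores rely on.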

Claimed Contributions

Systematic study of benchmark contamination in LRMs across two stages

The authors conduct the first comprehensive investigation of benchmark contamination in Large Reasoning Models, examining two distinct stages: Stage I (pre-LRM) when base models evolve into LRMs via SFT and RL, and Stage II (post-LRM) when contamination with CoT is applied to advanced LRMs as a final step.

9 retrieved papers
Discovery that RL training conceals SFT contamination evidence

The authors demonstrate that while SFT contamination is initially detectable, subsequent GRPO training on clean samples conceals contamination evidence. They provide theoretical analysis showing that PPO-style importance sampling and clipping objectives are the root cause of this concealment.

6 retrieved papers
Finding that CoT contamination on advanced LRMs evades existing detection methods

The authors reveal that contaminating advanced LRMs with chain-of-thought reasoning in the final training stage yields inflated performance while leaving minimal detectable evidence, causing existing memorization-based detection methods to perform near random guessing across all benchmarks.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Systematic study of benchmark contamination in LRMs across two stages

The authors conduct the first comprehensive investigation of benchmark contamination in Large Reasoning Models, examining two distinct stages: Stage I (pre-LRM) when base models evolve into LRMs via SFT and RL, and Stage II (post-LRM) when contamination with CoT is applied to advanced LRMs as a final step.

Contribution

Discovery that RL training conceals SFT contamination evidence

The authors demonstrate that while SFT contamination is initially detectable, subsequent GRPO training on clean samples conceals contamination evidence. They provide theoretical analysis showing that PPO-style importance sampling and clipping objectives are the root cause of this concealment.
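For reference, the PPO-style clipped surrogate objective that this contribution's theoretical analysis targets has the standard form below; the notation is the usual one from the RL literature, not reproduced from the paper under review.

```latex
\mathcal{L}^{\mathrm{CLIP}}(\theta)
  = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\;
    \mathrm{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

GRPO keeps this importance-sampling-and-clipping structure but replaces the critic-based advantage with a group-normalized reward, $\hat{A}_i = (R_i - \mathrm{mean}(R_1,\dots,R_G))/\mathrm{std}(R_1,\dots,R_G)$ over $G$ sampled responses. The paper attributes the concealment effect to this shared ratio-and-clip structure, which is why it argues the vulnerability extends to a broad class of RL methods.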

Contribution

Finding that CoT contamination on advanced LRMs evades existing detection methods

The authors reveal that contaminating advanced LRMs with chain-of-thought reasoning in the final training stage yields inflated performance while leaving minimal detectable evidence, causing existing memorization-based detection methods to perform near random guessing across all benchmarks.
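The "near random guessing" claim is naturally read as a statement about detector AUC. The sketch below (illustrative names; a real evaluation would feed in scores produced by an actual detector such as a Min-K%-style score) computes the rank-based AUC a detection benchmark would report, where 0.5 means the detector cannot separate contaminated members from clean non-members.

```python
def detection_auc(member_scores, nonmember_scores):
    """AUC of a contamination detector: the probability that a randomly
    chosen member (contaminated) sample scores higher than a randomly
    chosen non-member (clean) sample, with ties counted as half wins.
    1.0 = perfect separation, 0.5 = random guessing.
    """
    wins = ties = 0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1
            elif m == n:
                ties += 1
    total = len(member_scores) * len(nonmember_scores)
    return (wins + 0.5 * ties) / total
```

For example, `detection_auc([1, 2], [1, 2])` returns 0.5: when a contaminated LRM is equally confident on members and distribution-matched non-members, the detector degenerates to a coin flip, which is the failure mode this contribution describes.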