On The Fragility of Benchmark Contamination Detection in Reasoning Models

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Benchmark Contamination, Large Reasoning Model, Benchmark Contamination Detection
Abstract:

Leaderboards for large reasoning models (LRMs) have turned evaluation into a competition, incentivizing developers to optimize directly on benchmark suites. A shortcut to higher rankings is to incorporate evaluation benchmarks into the training data, yielding inflated performance, a practice known as benchmark contamination. Although numerous contamination detection approaches have been proposed, our studies find that evading contamination detection for LRMs is alarmingly easy. We focus on the two scenarios where contamination may occur in practice. (I) When a base model evolves into an LRM via supervised fine-tuning (SFT) and reinforcement learning (RL), contamination introduced during SFT can initially be identified by existing detection methods. Yet even a brief run of Group Relative Policy Optimization (GRPO) training can markedly conceal the contamination signals that most detection methods rely on. Further empirical experiments and theoretical analysis indicate that Proximal Policy Optimization (PPO) style importance sampling and clipping objectives are the root cause of this concealment, suggesting that a broad class of RL methods may inherently exhibit similar concealment capability. (II) When SFT contamination with chain-of-thought (CoT) data is applied to an advanced LRM as the final training stage, most contamination detection methods perform close to random guessing. Even without exposure to non-members, a contaminated LRM remains more confident when responding to unseen samples whose distribution is similar to its training set, and thus evades existing memorization-based detection methods. Together, our findings reveal a unique vulnerability of LRM evaluation: model developers could easily contaminate LRMs to achieve inflated leaderboard performance while leaving minimal traces of contamination, strongly undermining the fairness of evaluation and threatening the integrity of public leaderboards. This underscores the urgent need for advanced contamination detection methods and trustworthy evaluation protocols tailored to LRMs.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper investigates how reinforcement learning training can conceal benchmark contamination signals in large reasoning models, focusing on two training stages: supervised fine-tuning and RL optimization. It resides in the 'Evasion Techniques and Vulnerabilities' leaf, which contains only two papers total. This is a notably sparse research direction within the broader taxonomy of 50 papers across 24 leaf nodes, suggesting that adversarial perspectives on contamination detection remain underexplored compared to detection method development or benchmark design.

The taxonomy reveals substantial activity in adjacent areas: the parent category 'Contamination Evasion and Detection Robustness' also includes 'Detection Method Evaluation and Limitations' with two papers examining detection method failures. Meanwhile, sibling branches like 'Contamination Detection Methods' contain 15 papers across multiple detection approaches (black-box statistical methods, performance-based detection, white-box training data analysis). The paper's focus on RL-stage concealment connects to 'Fine-Tuning and RL-Stage Detection' methods but approaches the problem from an adversarial rather than defensive angle, examining how PPO-style objectives enable evasion.

Among 25 candidates examined across three contributions, none were found to clearly refute the paper's claims. The systematic study of contamination across training stages examined 9 candidates with 0 refutations; the RL concealment discovery examined 6 candidates with 0 refutations; and the CoT contamination evasion finding examined 10 candidates with 0 refutations. This suggests that within the limited search scope, the specific focus on RL training as a contamination concealment mechanism and the theoretical analysis of PPO-style importance sampling effects represent relatively unexplored territory, though the search scale precludes definitive conclusions about the broader literature.

The analysis indicates the work addresses a genuine gap in understanding adversarial dynamics between model training and contamination detection, particularly regarding RL optimization phases. However, the limited search scope (25 candidates from semantic search) and the sparse population of the evasion-focused taxonomy leaf mean this assessment reflects top-K semantic matches rather than exhaustive coverage. The novelty appears strongest in mechanistic analysis of how specific RL objectives conceal contamination, though broader claims about detection fragility should be contextualized within the examined candidate set.

Taxonomy

- Core-task Taxonomy Papers: 50
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 25
- Refutable Papers: 0

Research Landscape Overview

Core task: Benchmark contamination detection in large reasoning models. As large language models grow in scale and capability, ensuring that their impressive performance reflects genuine reasoning rather than memorization of benchmark data has become a central concern. The field has organized itself into several complementary branches: Contamination Detection Methods develop techniques to identify whether test data appeared during pretraining, ranging from membership inference approaches like Detecting Pretraining Data[1] to statistical methods such as ConStat[7]. Contamination-Resistant Benchmark Design focuses on creating evaluation sets that remain valid over time, exemplified by continuously updated platforms like LiveCodeBench[2] and dynamic benchmarks. Domain-Specific Contamination Studies examine leakage in specialized areas such as medical reasoning or code generation, while Contamination Surveys and Empirical Analyses provide broad perspectives on the scope and severity of the problem across the ecosystem. Meanwhile, Contamination Evasion and Detection Robustness investigates how models might circumvent detection and how robust current methods truly are.

A particularly active tension exists between detection methods and their limitations: while many studies propose contamination indicators, works like Contamination Detection Limitations[12] and Evading Contamination Detection[16] reveal that sophisticated training procedures can produce contamination-like signals without actual leakage, or conversely, evade existing detection schemes. Fragility Contamination Detection[0] sits squarely within this robustness-focused branch, examining how fragile current detection approaches are when models employ evasion strategies. Its emphasis on adversarial scenarios contrasts with more straightforward detection proposals like Data Contamination Quiz[3], which assumes cooperative evaluation settings.
By exploring vulnerabilities in detection pipelines, Fragility Contamination Detection[0] complements the adversarial perspective of Evading Contamination Detection[16], together highlighting that the contamination problem extends beyond simply identifying overlap to understanding the strategic dynamics between model developers and evaluators.
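As a concrete illustration of the memorization-based detectors discussed above, here is a minimal sketch of a Min-K%-style membership score in the spirit of Detecting Pretraining Data[1]. The function name and default `k` are illustrative choices, not taken from any paper; a real detector would obtain per-token log-probabilities from the model under test.

```python
def min_k_score(token_logprobs, k=0.2):
    """Mean log-probability of the k-fraction least likely tokens.

    Intuition: a model tends to assign unusually high probability even to
    the "hard" tokens of a sequence it memorized during training, so a
    higher (less negative) score suggests membership. A threshold on this
    score yields a member / non-member guess.
    """
    n = max(1, int(len(token_logprobs) * k))  # at least one token
    lowest = sorted(token_logprobs)[:n]       # the n least likely tokens
    return sum(lowest) / n
```

A benchmark sample scoring above a calibrated threshold would be flagged as likely seen in training. The report's central point is that RL-stage training and final-stage CoT contamination can erase exactly the member/non-member gap that such scores rely on.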

Claimed Contributions

Systematic study of benchmark contamination in LRMs across two stages

The authors conduct the first comprehensive investigation of benchmark contamination in Large Reasoning Models, examining two distinct stages: Stage I (pre-LRM) when base models evolve into LRMs via SFT and RL, and Stage II (post-LRM) when contamination with CoT is applied to advanced LRMs as a final step.

9 retrieved papers
Discovery that RL training conceals SFT contamination evidence

The authors demonstrate that while SFT contamination is initially detectable, subsequent GRPO training on clean samples conceals contamination evidence. They provide theoretical analysis showing that PPO-style importance sampling and clipping objectives are the root cause of this concealment.

6 retrieved papers
Finding that CoT contamination on advanced LRMs evades existing detection methods

The authors reveal that contaminating advanced LRMs with chain-of-thought reasoning in the final training stage yields inflated performance while leaving minimal detectable evidence, causing existing memorization-based detection methods to perform near random guessing across all benchmarks.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Systematic study of benchmark contamination in LRMs across two stages

The authors conduct the first comprehensive investigation of benchmark contamination in Large Reasoning Models, examining two distinct stages: Stage I (pre-LRM) when base models evolve into LRMs via SFT and RL, and Stage II (post-LRM) when contamination with CoT is applied to advanced LRMs as a final step.

Contribution

Discovery that RL training conceals SFT contamination evidence

The authors demonstrate that while SFT contamination is initially detectable, subsequent GRPO training on clean samples conceals contamination evidence. They provide theoretical analysis showing that PPO-style importance sampling and clipping objectives are the root cause of this concealment.
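For reference, the PPO-style clipped surrogate objective that this contribution's theoretical analysis targets has the standard form below; the notation is the usual one from the RL literature, not reproduced from the paper under review.

```latex
\mathcal{L}^{\mathrm{CLIP}}(\theta)
  = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\;
    \mathrm{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

GRPO keeps this importance-sampling-and-clipping structure but replaces the critic-based advantage with a group-normalized reward, $\hat{A}_i = (R_i - \mathrm{mean}(R_1,\dots,R_G))/\mathrm{std}(R_1,\dots,R_G)$ over $G$ sampled responses. The paper attributes the concealment effect to this shared ratio-and-clip structure, which is why it argues the vulnerability extends to a broad class of RL methods.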

Contribution

Finding that CoT contamination on advanced LRMs evades existing detection methods

The authors reveal that contaminating advanced LRMs with chain-of-thought reasoning in the final training stage yields inflated performance while leaving minimal detectable evidence, causing existing memorization-based detection methods to perform near random guessing across all benchmarks.
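The "near random guessing" claim is naturally read as a statement about detector AUC. The sketch below (illustrative names; a real evaluation would feed in scores produced by an actual detector such as a Min-K%-style score) computes the rank-based AUC a detection benchmark would report, where 0.5 means the detector cannot separate contaminated members from clean non-members.

```python
def detection_auc(member_scores, nonmember_scores):
    """AUC of a contamination detector: the probability that a randomly
    chosen member (contaminated) sample scores higher than a randomly
    chosen non-member (clean) sample, with ties counted as half wins.
    1.0 = perfect separation, 0.5 = random guessing.
    """
    wins = ties = 0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1
            elif m == n:
                ties += 1
    total = len(member_scores) * len(nonmember_scores)
    return (wins + 0.5 * ties) / total
```

For example, `detection_auc([1, 2], [1, 2])` returns 0.5: when a contaminated LRM is equally confident on members and distribution-matched non-members, the detector degenerates to a coin flip, which is the failure mode this contribution describes.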