On The Fragility of Benchmark Contamination Detection in Reasoning Models
Overview
Overall Novelty Assessment
The paper investigates how reinforcement learning (RL) training can conceal benchmark contamination signals in large reasoning models (LRMs), focusing on two training stages: supervised fine-tuning (SFT) and RL optimization. It resides in the 'Evasion Techniques and Vulnerabilities' leaf, which contains only two papers in total. This is a notably sparse research direction within the broader taxonomy of 50 papers across 24 leaf nodes, suggesting that adversarial perspectives on contamination detection remain underexplored relative to detection-method development or benchmark design.
The taxonomy reveals substantial activity in adjacent areas: the parent category 'Contamination Evasion and Detection Robustness' also includes 'Detection Method Evaluation and Limitations' with two papers examining detection method failures. Meanwhile, sibling branches like 'Contamination Detection Methods' contain 15 papers across multiple detection approaches (black-box statistical methods, performance-based detection, white-box training data analysis). The paper's focus on RL-stage concealment connects to 'Fine-Tuning and RL-Stage Detection' methods but approaches the problem from an adversarial rather than defensive angle, examining how PPO-style objectives enable evasion.
Among the 25 candidates examined across the three contributions, none was found to clearly refute the paper's claims. The systematic study of contamination across training stages was checked against 9 candidates with 0 refutations; the RL-concealment discovery against 6 candidates with 0 refutations; and the CoT contamination-evasion finding against 10 candidates with 0 refutations. Within this limited search scope, the specific focus on RL training as a contamination-concealment mechanism and the theoretical analysis of PPO-style importance-sampling effects appear relatively unexplored, though the search scale precludes definitive conclusions about the broader literature.
The analysis indicates the work addresses a genuine gap in understanding adversarial dynamics between model training and contamination detection, particularly regarding RL optimization phases. However, the limited search scope (25 candidates from semantic search) and the sparse population of the evasion-focused taxonomy leaf mean this assessment reflects top-K semantic matches rather than exhaustive coverage. The novelty appears strongest in mechanistic analysis of how specific RL objectives conceal contamination, though broader claims about detection fragility should be contextualized within the examined candidate set.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors conduct the first comprehensive investigation of benchmark contamination in Large Reasoning Models, examining two distinct stages: Stage I (pre-LRM) when base models evolve into LRMs via SFT and RL, and Stage II (post-LRM) when contamination with CoT is applied to advanced LRMs as a final step.
The authors demonstrate that while SFT contamination is initially detectable, subsequent GRPO training on clean samples conceals contamination evidence. They provide theoretical analysis showing that PPO-style importance sampling and clipping objectives are the root cause of this concealment.
The authors reveal that contaminating advanced LRMs with chain-of-thought reasoning in the final training stage yields inflated performance while leaving minimal detectable evidence, causing existing memorization-based detection methods to perform near random guessing across all benchmarks.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[16] Evading Data Contamination Detection for Language Models is (too) Easy
Contribution Analysis
Detailed comparisons for each claimed contribution
Systematic study of benchmark contamination in LRMs across two stages
The authors conduct the first comprehensive investigation of benchmark contamination in Large Reasoning Models, examining two distinct stages: Stage I (pre-LRM) when base models evolve into LRMs via SFT and RL, and Stage II (post-LRM) when contamination with CoT is applied to advanced LRMs as a final step.
[22] Detecting Data Contamination from Reinforcement Learning Post-training for Large Language Models
[45] DICE: Detecting In-distribution Contamination in LLM's Fine-tuning Phase for Math Reasoning
[51] Pretraining on the Test Set Is No Longer All You Need: A Debate-Driven Approach to QA Benchmarks
[52] Benchmarking Benchmark Leakage in Large Language Models
[53] Self-Explore: Enhancing Mathematical Reasoning in Language Models with Fine-grained Rewards
[54] Mathador-LM: A Dynamic Benchmark for Mathematical Reasoning on Large Language Models
[55] Addressing Data Challenges in LLM-Enhanced Software Engineering
[56] Rethinking the Reasonability of the Test Set for Simultaneous Machine Translation
[57] Commonsense Reasoning with Rules, Cases, and Connectionist Models: A Paradigmatic Comparison
Discovery that RL training conceals SFT contamination evidence
The authors demonstrate that while SFT contamination is initially detectable, subsequent GRPO training on clean samples conceals contamination evidence. They provide theoretical analysis showing that PPO-style importance sampling and clipping objectives are the root cause of this concealment.
[68] Training Language Models to Self-Correct via Reinforcement Learning
[69] Teaching Large Language Models to Reason with Reinforcement Learning
[70] Best-of-Venom: Attacking RLHF by Injecting Poisoned Preference Data
[71] Reinforcement Learning with Supervised Alignment
[72] The Impact of Post-training on Data Contamination
[73] Removing RLHF Protections in GPT-4 via Fine-Tuning
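For context on the concealment mechanism discussed in this contribution: the PPO-style surrogate that GRPO also optimizes combines an importance-sampling ratio with clipping. The formulation below is the standard clipped objective from the PPO literature, given here as background rather than an equation reproduced from the paper under review:

```latex
\mathcal{L}^{\mathrm{CLIP}}(\theta)
  = \mathbb{E}_t\!\left[
      \min\!\Big( r_t(\theta)\,\hat{A}_t,\;
      \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \Big)
    \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

Because the ratio $r_t$ is computed against the policy's own fresh rollouts and clipping bounds each update, RL on clean samples can gradually redistribute likelihood mass away from the memorized SFT distribution. This is one plausible reading of how such an objective could erode memorization signals; the paper's own derivation may differ in its details.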
Finding that CoT contamination on advanced LRMs evades existing detection methods
The authors reveal that contaminating advanced LRMs with chain-of-thought reasoning in the final training stage yields inflated performance while leaving minimal detectable evidence, causing existing memorization-based detection methods to perform near random guessing across all benchmarks.
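To make "memorization-based detection" concrete, here is a minimal sketch of one common detector family: a Min-K%-Prob-style score over per-token log-probabilities. The function name, the k=0.2 default, and the toy numbers are illustrative assumptions, not the paper's implementation or its benchmarks:

```python
def min_k_percent_score(token_logprobs, k=0.2):
    """Score a text by averaging the log-probabilities of its k%
    least-likely tokens (a Min-K%-Prob-style memorization signal).
    Memorized (contaminated) text tends to lack very low-probability
    tokens, so higher scores point toward contamination."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]  # the n most "surprising" tokens
    return sum(lowest) / n

# Toy contrast: a "memorized" item has uniformly high token
# probabilities, while an unseen item contains surprising tokens.
memorized = [-0.1, -0.2, -0.1, -0.3, -0.2]
unseen = [-0.1, -2.5, -0.2, -3.1, -0.4]
assert min_k_percent_score(memorized) > min_k_percent_score(unseen)
```

When final-stage training flattens exactly this kind of token-level likelihood gap between seen and unseen items, the score distributions overlap and thresholding degrades toward the near-random detection performance described above.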