Mapping Overlaps in Benchmarks through Perplexity in the Wild

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: meta-evaluation, benchmark overlaps, language models
Abstract:

To characterize large language model (LLM) benchmarks and their meaningful overlaps, we construct benchmark signatures that capture the capacity required for strong performance. Formally, we define a signature as a set of salient tokens drawn from in-the-wild corpora whose LLM token perplexity, reflecting training exposure, is highly predictive of benchmark performance. We extract benchmark signatures via stepwise forward selection with linear regression in a large-scale meta-evaluation across 32 LLMs and 89 benchmarks spanning knowledge, coding, logic, instruction following, math, language, reasoning, missing-information detection, and cultural/world modeling. We then analyze how these signatures relate to both the semantic similarity of benchmark questions and the correlation structure of model performance. Performance-level overlaps remain universally high and semantic overlaps stay in a narrow mid-range, but signatures distinguish between benchmarks and illuminate nuanced differences in their capacity demands. For instance, signatures uniquely reveal substantial overlap among knowledge and reasoning benchmarks, whereas humanity- and culture-oriented benchmarks show relatively low similarity, lower even than typical cross-category overlap. Notably, performance-level results are strongly shaped by benchmark-orthogonal factors such as question format, whereas benchmark signatures remain robust to such confounds. We further reveal cross-functional overlaps among logic, math, language, instruction following, and cultural/world modeling, with coding emerging as the least overlapping domain, interacting only moderately with the ability to detect missing information. Qualitative inspection of signatures shows that only the knowledge signature is aligned with actual knowledge, suggesting that LLMs may exhibit a distinctive semantic organization that differs from that of humans. Together, these findings offer insights into benchmark validity, LLM sensitivities, and the broad landscape of interconnected LLM capacities.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a framework for characterizing benchmark overlap using 'benchmark signatures'—sets of salient tokens from in-the-wild corpora whose perplexity patterns predict model performance. It sits in the 'Meta-Evaluation of Benchmark Signatures' leaf under 'Benchmark Overlap Characterization via Perplexity Signatures'. This leaf contains only the original paper itself, indicating a sparse research direction. The broader parent category ('Benchmark Overlap Characterization') is also minimally populated, suggesting this approach to using perplexity signatures for overlap analysis represents a relatively unexplored angle within the field.

The taxonomy reveals that most related work clusters in 'Perplexity-Based Contamination Detection Methods', which focuses on identifying training data leakage rather than characterizing benchmark capacity demands. The 'Token-Level Perplexity Analysis for Contamination' leaf contains three papers examining memorization detection, while 'Benchmark Evaluation Frameworks and Decontamination' addresses broader evaluation protocols. The original paper diverges by treating perplexity patterns as diagnostic fingerprints of benchmark overlap rather than contamination signals, positioning it at the intersection of contamination detection methodology and meta-evaluation of benchmark properties. This creates conceptual distance from sibling branches despite shared technical foundations.

Among 20 candidates examined across three contributions, no clearly refuting prior work was identified. The 'Benchmark signatures framework' contribution was compared against 8 candidates with 0 refutable matches, suggesting limited direct precedent for this specific formulation. The 'Forward selection and regression pipeline' contribution was compared against only 2 candidates, reflecting its narrower technical scope. The 'Discovery of unexpected cross-functional overlaps' contribution was compared against 10 candidates with no refutations, indicating that the empirical findings about knowledge-reasoning overlap and culture-oriented benchmark distinctiveness may represent novel observations within this limited search scope. The absence of refutations across all contributions suggests the approach occupies a relatively uncontested niche.

Based on the limited search of 20 candidates, the work appears to introduce a distinctive methodological angle—using perplexity signatures for capacity characterization rather than contamination detection—in a sparsely populated research direction. The taxonomy structure confirms minimal direct competition in this specific framing, though the broader contamination detection literature provides relevant technical context. The analysis cannot rule out related work outside the top-20 semantic matches or in adjacent evaluation methodology domains not captured by the taxonomy.

Taxonomy

Core-task Taxonomy Papers: 9
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Papers: 0

Research Landscape Overview

Core task: characterizing benchmark overlap through token-level perplexity patterns. The field addresses a critical challenge in evaluating large language models: determining whether training data has contaminated evaluation benchmarks, thereby inflating performance metrics.

The taxonomy organizes work into several main branches. Perplexity-Based Contamination Detection Methods focus on algorithmic approaches that use perplexity signals to identify potential data leakage, with foundational work like Contamination via Perplexity[8] establishing early techniques. Benchmark Overlap Characterization via Perplexity Signatures examines how perplexity patterns themselves can serve as diagnostic fingerprints of overlap, including meta-evaluation efforts that assess the reliability of these signatures. Benchmark Evaluation Frameworks and Decontamination encompasses broader methodologies for creating clean test sets and validating benchmark integrity, exemplified by efforts like Paloma Benchmark[3]. Finally, Perplexity Applications in Specialized Domains explores how perplexity-based analysis extends beyond contamination detection to domain-specific evaluation challenges.

A particularly active line of investigation centers on developing robust contamination metrics and detection protocols. Works like Data Contamination Investigation[1] and Contamination Trustworthy Evaluation[2] explore the reliability and limitations of various detection strategies, while Contamination Metrics Review[6] synthesizes emerging best practices.

The original paper, Perplexity Benchmark Overlaps[0], sits within the meta-evaluation cluster, focusing specifically on how perplexity signatures themselves can be characterized and validated as indicators of benchmark overlap. This positions it closely alongside work examining the trustworthiness of contamination signals, such as Contamination Trustworthy Evaluation[2], but with a distinctive emphasis on the diagnostic properties of token-level perplexity patterns rather than broader evaluation frameworks. The central tension across these branches involves balancing detection sensitivity against false positives, and understanding when perplexity anomalies genuinely indicate memorization versus other statistical artifacts.
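
Since every branch above builds on the same underlying signal, a minimal sketch of how token-level perplexity is typically computed may help orient readers. This assumes a HuggingFace causal LM; the model name and input text are illustrative, not taken from the paper.

```python
# Minimal sketch: perplexity of a text span under a causal LM.
# Perplexity = exp(mean negative log-likelihood over next-token predictions).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, text: str) -> float:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels == input_ids, the model returns the mean
        # cross-entropy loss over (shifted) next-token predictions.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative model choice
model = AutoModelForCausalLM.from_pretrained("gpt2")
print(perplexity(model, tokenizer, "The capital of France is Paris."))
```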

Claimed Contributions

Benchmark signatures framework for measuring benchmark overlap

The authors propose a novel three-level framework for quantifying overlap among LLM benchmarks. This framework examines benchmarks at the semantic level (question content similarity), performance level (correlated model outcomes), and signature level (perplexity patterns on in-the-wild corpora), providing a more comprehensive characterization than existing approaches.

8 retrieved papers

Forward selection and regression pipeline for extracting benchmark signatures

The authors develop a computational method that combines correlation-based screening with forward selection regression to identify salient tokens from large-scale corpora. These tokens form benchmark signatures whose perplexity patterns across models are highly predictive of benchmark performance, enabling systematic characterization of what capacities each benchmark actually measures.

2 retrieved papers

Discovery of unexpected cross-functional benchmark overlaps

Through their signature-based analysis, the authors reveal that many benchmarks claiming to test specific abilities (like logic) actually measure different or overlapping capabilities (like instruction-following) in practice. This finding exposes potential misalignments between benchmark design intentions and what they actually evaluate, highlighting issues in current benchmark validity and the interconnected nature of LLM capabilities.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Benchmark signatures framework for measuring benchmark overlap

The authors propose a novel three-level framework for quantifying overlap among LLM benchmarks. This framework examines benchmarks at the semantic level (question content similarity), performance level (correlated model outcomes), and signature level (perplexity patterns on in-the-wild corpora), providing a more comprehensive characterization than existing approaches.
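
As a concrete illustration of the three levels, the minimal sketch below computes one overlap score per level for a pair of benchmarks. The inputs (question embeddings, per-model score vectors, signature token sets) are taken as given, and the specific metrics (cosine similarity, Pearson correlation, Jaccard overlap) are plausible operationalizations rather than the paper's confirmed choices.

```python
# Minimal sketch of the three overlap levels, under assumed inputs and metrics.
import numpy as np

def semantic_overlap(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Mean pairwise cosine similarity between two benchmarks' question embeddings."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return float((a @ b.T).mean())

def performance_overlap(scores_a: np.ndarray, scores_b: np.ndarray) -> float:
    """Pearson correlation of per-model scores on the two benchmarks."""
    return float(np.corrcoef(scores_a, scores_b)[0, 1])

def signature_overlap(tokens_a: set, tokens_b: set) -> float:
    """Jaccard similarity of the two benchmarks' salient-token signatures."""
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
```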

Contribution 2: Forward selection and regression pipeline for extracting benchmark signatures

The authors develop a computational method that combines correlation-based screening with forward selection regression to identify salient tokens from large-scale corpora. These tokens form benchmark signatures whose perplexity patterns across models are highly predictive of benchmark performance, enabling systematic characterization of what capacities each benchmark actually measures.
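
A minimal sketch of this pipeline follows, under assumed shapes: rows are models, columns are candidate tokens, and the target is each model's benchmark score. The screening size, token budget, and in-sample R² stopping rule are illustrative choices, not the paper's reported settings.

```python
# Sketch: correlation-based screening, then greedy forward selection with
# linear regression, yielding a small set of signature tokens.
import numpy as np
from sklearn.linear_model import LinearRegression

def extract_signature(ppl, scores, screen_k=200, max_tokens=10):
    """ppl: (n_models, n_tokens) token perplexities; scores: (n_models,) benchmark scores.
    Returns column indices of the selected signature tokens."""
    # 1) Screening: keep the screen_k tokens whose perplexity correlates
    #    most strongly (in absolute value) with benchmark performance.
    r = np.array([np.corrcoef(ppl[:, j], scores)[0, 1] for j in range(ppl.shape[1])])
    candidates = list(np.argsort(-np.abs(r))[:screen_k])

    # 2) Forward selection: greedily add the token that most improves the
    #    in-sample R^2 of a linear fit to benchmark scores; stop when no
    #    candidate improves the fit or the token budget is exhausted.
    selected, best_r2 = [], -np.inf
    while candidates and len(selected) < max_tokens:
        gains = [(LinearRegression().fit(ppl[:, selected + [j]], scores)
                  .score(ppl[:, selected + [j]], scores), j) for j in candidates]
        r2, j = max(gains)
        if r2 <= best_r2:
            break
        selected.append(j)
        candidates.remove(j)
        best_r2 = r2
    return selected
```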

Contribution 3: Discovery of unexpected cross-functional benchmark overlaps

Through their signature-based analysis, the authors reveal that many benchmarks claiming to test specific abilities (like logic) actually measure different or overlapping capabilities (like instruction-following) in practice. This finding exposes potential misalignments between benchmark design intentions and what they actually evaluate, highlighting issues in current benchmark validity and the interconnected nature of LLM capabilities.
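
To illustrate how such cross-functional overlaps could be quantified from signatures, the sketch below averages pairwise signature overlap within and across benchmark categories. Jaccard similarity and the category labeling are illustrative assumptions, not necessarily the paper's protocol.

```python
# Sketch: aggregate pairwise signature overlap within vs. across categories.
from itertools import combinations

def mean_overlaps(signatures: dict[str, set], categories: dict[str, str]):
    """signatures: benchmark -> signature token set; categories: benchmark -> category.
    Returns (mean within-category overlap, mean cross-category overlap)."""
    within, across = [], []
    for a, b in combinations(signatures, 2):
        jac = len(signatures[a] & signatures[b]) / len(signatures[a] | signatures[b])
        (within if categories[a] == categories[b] else across).append(jac)
    return sum(within) / len(within), sum(across) / len(across)
```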