How NOT to benchmark your SITE metric: Beyond Static Leaderboards and Towards Realistic Evaluation.
Overview
Overall Novelty Assessment
This paper critiques the evaluation protocols used to assess transferability estimation metrics, which aim to predict how well a pre-trained model will perform on a new task without actually fine-tuning it. It resides in the 'Benchmark Robustness and Realism' leaf within the 'Benchmark Design and Evaluation Methodology' branch, alongside only two sibling papers examining metric stability and fairness. This is a relatively sparse research direction within the broader taxonomy of 50 papers, suggesting that critical examination of benchmark validity has received far less attention than the development of new transferability metrics themselves.
The taxonomy reveals a field heavily weighted toward metric design (six subtopic clusters under 'Transferability Metric Design and Methodology') and domain-specific applications (four subtopic clusters), while benchmark evaluation occupies just two leaf nodes. The paper's leaf sits adjacent to 'Standardized Evaluation Frameworks,' which proposes comprehensive comparison setups rather than questioning the assumptions underlying them. Neighboring branches focus on empirical transfer learning analysis and layer-wise transferability, addressing what transfers rather than how the measurement process itself is validated. This structural imbalance underscores the paper's position in an underexplored but methodologically critical area.
Among 27 candidates examined across the three contributions, none were found to clearly refute the paper's claims. For the first contribution (demonstrating misleading metric performance), 10 candidates were examined with zero refutations, suggesting limited prior work directly challenging benchmark validity in this manner. For the second contribution (a simple heuristic outperforming sophisticated methods), 7 candidates were examined, again with no refutations, indicating novelty in exposing this performance gap. For the third contribution (a best-practices checklist), 10 candidates were examined without refutation, though this may reflect the prescriptive nature of the recommendations rather than empirical claims. The absence of refutations across all contributions, given the modest search scope, suggests the critique occupies relatively unexplored territory.
Based on examination of 27 semantically related candidates, the work appears to address a gap in how the community validates its own evaluation tools. The limited search scope means potentially relevant work in broader meta-science or benchmark design literature may exist outside this sample. However, within the transferability estimation domain as captured by this taxonomy, the systematic critique of benchmark realism and the demonstration that simple heuristics can exploit structural flaws represent contributions with minimal documented prior overlap.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors empirically show that widely used benchmarks for evaluating transferability estimation metrics have fundamental flaws, including unrealistic model spaces and static performance hierarchies that artificially inflate the perceived performance of existing metrics.
The authors introduce a dataset-agnostic static ranking heuristic that achieves higher weighted Kendall's tau than sophisticated SITE methods on the standard benchmark, revealing that the benchmark rewards memorization of a fixed model hierarchy rather than true task-specific transferability estimation.
The authors provide concrete recommendations for building better benchmarks, including guidelines for diverse model spaces, challenging datasets with performance headroom, and engineering for rank dispersion. They also provide a SITE benchmarking and evaluation checklist, along with a vision-classification benchmark constructed according to these best practices.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
Contribution Analysis
Detailed comparisons for each claimed contribution
Empirical demonstration of misleading SITE metric performance in current benchmarks
The authors empirically show that widely used benchmarks for evaluating transferability estimation metrics have fundamental flaws, including unrealistic model spaces and static performance hierarchies that artificially inflate the perceived performance of existing metrics. (A minimal diagnostic sketch of this failure mode appears after the candidate list below.)
[10] On robustness and transferability of convolutional neural networks
[27] Newer is not always better: Rethinking transferability metrics, their peculiarities, stability and performance
[51] NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark
[52] The transferability limits of static benchmarks
[53] The performance of transferability metrics does not translate to medical tasks
[54] Quantifying and improving transferability in domain generalization
[55] Online Clustering: Algorithms, Evaluation, Metrics, Applications and Benchmarking
[56] Effects of Soft-Domain Transfer and Named Entity Information on Deception Detection
[57] CHARTOM: A Visual Theory-of-Mind Benchmark for LLMs on Misleading Charts
[58] An overview of control performance assessment technology and industrial applications
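To make the 'static performance hierarchy' flaw concrete, the diagnostic sketch referenced above is given here. It checks how similar the ground-truth model rankings are across datasets: if the per-dataset rankings nearly coincide, a single fixed ordering can solve the benchmark. The accuracy matrix is synthetic and purely illustrative; it is not data from the paper or from any existing benchmark.

# Sketch: quantify how "static" a benchmark's model hierarchy is by comparing
# the ground-truth model rankings induced by each dataset. Synthetic numbers only.
from itertools import combinations

import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
n_models, n_datasets = 8, 5

# A fixed per-model quality plus tiny per-dataset noise yields a near-static hierarchy.
base_quality = np.sort(rng.uniform(0.60, 0.90, size=n_models))
accuracy = base_quality[:, None] + rng.normal(0.0, 0.005, size=(n_models, n_datasets))

# Mean pairwise Kendall's tau between per-dataset rankings: values near 1.0 mean
# every dataset orders the models the same way, so one static ranking suffices.
taus = []
for i, j in combinations(range(n_datasets), 2):
    tau, _ = kendalltau(accuracy[:, i], accuracy[:, j])
    taus.append(tau)
print(f"mean pairwise tau across datasets: {np.mean(taus):.3f}")

On such a benchmark, a high correlation score for any SITE metric says little about task-specific estimation, which is the inflation the authors describe.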
Simple static ranking heuristic that outperforms sophisticated SITE metrics
The authors introduce a dataset-agnostic static ranking heuristic that achieves higher weighted Kendall's tau than sophisticated SITE methods on the standard benchmark, revealing that the benchmark rewards memorization of a fixed model hierarchy rather than true task-specific transferability estimation. (A minimal scoring sketch of such a heuristic appears after the candidate list below.)
[59] Designing an adaptive production control system using reinforcement learning
[60] What is a resistance gene? Ranking risk in resistomes
[61] Architectural transformation in large language models through contextual gradient pruning
[62] Comprehensive Evaluation of End-Point Free Energy Techniques in Carboxylated-Pillar[6]arene Host–Guest Binding: III. Force-Field Comparison, Three-Trajectory Realization and Further Dielectric Augmentation
[63] Transfer Entropy on Rank Vectors
[64] Benchmarking Transfer Entropy Methods for the Study of Linear and Nonlinear Cardio-Respiratory Interactions
[65] Transfer Entropy Estimation and Directional Coupling Change Detection in Biomedical Time Series
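The scoring sketch referenced above illustrates the kind of dataset-agnostic heuristic this contribution describes, though not the authors' exact construction: each model's predicted "transferability" is simply its mean accuracy over the other datasets, so the predicted ordering never changes with the target task. Scoring uses SciPy's weightedtau with default settings, which may differ in detail from the benchmark's weighted Kendall's tau variant; the accuracy matrix is again synthetic.

# Sketch: a dataset-agnostic static ranking scored with weighted Kendall's tau.
# accuracy[m, d] is the (synthetic) ground-truth transfer accuracy of model m on
# dataset d. This illustrates the idea only, not the authors' heuristic.
import numpy as np
from scipy.stats import weightedtau

rng = np.random.default_rng(1)
n_models, n_datasets = 8, 5
accuracy = (np.sort(rng.uniform(0.60, 0.90, n_models))[:, None]
            + rng.normal(0.0, 0.005, (n_models, n_datasets)))

taus = []
for d in range(n_datasets):
    others = [k for k in range(n_datasets) if k != d]
    static_score = accuracy[:, others].mean(axis=1)   # one fixed score per model; no target-task features used
    tau_w, _ = weightedtau(accuracy[:, d], static_score)  # SciPy defaults (hyperbolic rank weighting)
    taus.append(tau_w)

print(f"mean weighted Kendall's tau of the static ranking: {np.mean(taus):.3f}")

On a benchmark with a near-static hierarchy this lands close to 1.0, which is exactly the bar that task-aware SITE metrics end up being compared against.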
Best practices and checklist for constructing robust SITE benchmarks
The authors provide concrete recommendations for building better benchmarks, including guidelines for diverse model spaces, challenging datasets with performance headroom, and engineering for rank dispersion. They also provide a SITE benchmarking and evaluation checklist, along with a vision-classification benchmark constructed according to these best practices.
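As one illustration of how the "engineering for rank dispersion" recommendation could be operationalized when assembling such a benchmark, the sketch below computes a simple dispersion statistic over a candidate accuracy table. The function name, statistic, and threshold are assumptions made for illustration, not items taken verbatim from the authors' checklist.

# Sketch: a rank-dispersion gate for benchmark construction. The statistic and
# threshold are illustrative assumptions, not the authors' checklist verbatim.
import numpy as np
from scipy.stats import rankdata

def rank_dispersion(accuracy: np.ndarray) -> float:
    """Mean per-model standard deviation of rank across datasets.

    accuracy has shape (n_models, n_datasets). A value near 0 means every dataset
    induces the same model ordering (a static hierarchy); larger values mean the
    ranking genuinely changes from task to task, leaving room for SITE metrics
    to demonstrate task-specific estimation.
    """
    ranks = rankdata(-accuracy, axis=0)   # rank models within each dataset (best = 1)
    return float(ranks.std(axis=1).mean())

# Hypothetical gate while assembling a benchmark (loader and threshold are placeholders):
# accuracy = load_candidate_benchmark_results()
# assert rank_dispersion(accuracy) > 1.0, "hierarchy looks static; add datasets with more headroom"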