Abstract:

Transferability estimation metrics are used to find a high-performing pre-trained model for a given target task without fine-tuning candidate models and without access to the source dataset. Despite the growing interest in developing such metrics, the benchmarks used to measure their progress have gone largely unexamined. In this work, we argue that these widely used benchmark setups are fundamentally flawed: we empirically demonstrate that their unrealistic model spaces and static performance hierarchies artificially inflate the perceived performance of existing metrics, to the point where simple, dataset-agnostic heuristics can outperform sophisticated methods. Our analysis reveals a critical disconnect between current evaluation protocols and the complexities of real-world model selection. To address this, we provide concrete recommendations for constructing more robust and realistic benchmarks to guide future research in a more meaningful direction.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

This paper critiques the evaluation protocols used to assess transferability estimation metrics, which predict pre-trained model performance on new tasks without fine-tuning. It resides in the 'Benchmark Robustness and Realism' leaf within the 'Benchmark Design and Evaluation Methodology' branch, alongside only two sibling papers examining metric stability and fairness. This represents a relatively sparse research direction within the broader taxonomy of 50 papers, suggesting that critical examination of benchmark validity has received limited attention compared to the development of new transferability metrics themselves.

The taxonomy reveals a field heavily weighted toward metric design (six subtopic clusters under 'Transferability Metric Design and Methodology') and domain-specific applications (four subtopic clusters), while benchmark evaluation occupies just two leaf nodes. The paper's leaf sits adjacent to 'Standardized Evaluation Frameworks,' which proposes comprehensive comparison setups rather than questioning their fundamental assumptions. Neighboring branches focus on empirical transfer learning analysis and layer-wise transferability, addressing what transfers rather than how we measure the measurement process itself. This structural imbalance highlights the paper's positioning in an underexplored but methodologically critical area.

Among 27 candidates examined across three contributions, none were found to clearly refute the paper's claims. The first contribution (demonstrating misleading metric performance) examined 10 candidates with zero refutations, suggesting limited prior work directly challenging benchmark validity in this manner. The second contribution (simple heuristic outperforming sophisticated methods) examined 7 candidates, again with no refutations, indicating novelty in exposing this performance gap. The third contribution (best practices checklist) examined 10 candidates without refutation, though this may reflect the prescriptive nature of recommendations rather than empirical claims. The absence of refutations across all contributions, given the modest search scope, suggests the critique occupies relatively unexplored territory.

Based on examination of 27 semantically related candidates, the work appears to address a gap in how the community validates its own evaluation tools. The limited search scope means potentially relevant work in broader meta-science or benchmark design literature may exist outside this sample. However, within the transferability estimation domain as captured by this taxonomy, the systematic critique of benchmark realism and the demonstration that simple heuristics can exploit structural flaws represent contributions with minimal documented prior overlap.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 27
Refutable Papers: 0

Research Landscape Overview

Core task: evaluating transferability estimation metrics for pre-trained model selection. The field has organized itself around several complementary perspectives. Transferability Metric Design and Methodology focuses on developing computational measures that predict how well a pre-trained model will perform on a new task, with works like LogME[32] and LEEP[24] proposing efficient scoring functions. Domain and Task-Specific Transferability Applications explores how these metrics perform in specialized contexts such as medical imaging, object detection, and speech processing, exemplified by studies like Surgical Phase Transferability[1] and Efficient Detector Selection[2].

Benchmark Design and Evaluation Methodology addresses the critical question of how to rigorously assess these metrics themselves, ensuring they generalize beyond narrow experimental settings. Transfer Learning Mechanisms and Model Properties investigates the underlying factors that enable or hinder transfer, while Transfer Learning Enhancement Techniques develops methods to improve transferability through architectural or training modifications. Cross-Domain Evaluation and Ecological Validity examines whether findings hold across diverse real-world scenarios.

A particularly active tension exists between developing increasingly sophisticated metrics and ensuring their evaluation reflects practical constraints. Works like Active Transferability[3] and Easy Transferability[8] propose novel scoring approaches, yet questions remain about whether existing benchmarks capture the variability of real deployment conditions. Realistic Evaluation[0] sits squarely within the Benchmark Robustness and Realism cluster, emphasizing that evaluation protocols must account for factors like dataset shift and computational budgets that practitioners face. This contrasts with neighboring efforts such as Stability Evaluation[18] and Fair Evaluation Framework[45], which focus on the consistency and fairness of metric comparisons. The original work's emphasis on ecological validity addresses a gap between controlled experiments and the messy realities of model selection, pushing the community toward more representative benchmarking practices that better predict real-world utility.
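To make the kind of scoring function named above concrete, below is a minimal NumPy sketch in the spirit of LEEP[24]: it estimates the empirical conditional distribution of target labels given the frozen source model's predicted classes, then averages the log of the resulting expected prediction. The function name, array shapes, and numerical clamping are illustrative assumptions, not the reference implementation; LogME[32] follows the same usage pattern (frozen-model outputs in, a scalar score out) but computes a marginalized evidence over extracted features instead.

```python
import numpy as np

def leep_style_score(source_probs: np.ndarray, target_labels: np.ndarray) -> float:
    """LEEP-style transferability score (higher = predicted to transfer better).

    source_probs: (n, C_s) softmax outputs of the frozen source model on n target examples.
    target_labels: (n,) integer target labels in [0, C_t).
    """
    n, _ = source_probs.shape
    num_target_classes = int(target_labels.max()) + 1

    # Empirical joint distribution P(y, z) over target label y and source class z.
    joint = np.stack([source_probs[target_labels == y].sum(axis=0)
                      for y in range(num_target_classes)]) / n

    # Conditional P(y | z) = P(y, z) / P(z), guarding against unused source classes.
    cond_y_given_z = joint / np.clip(joint.sum(axis=0, keepdims=True), 1e-12, None)

    # Expected prediction of each example's true target label, then log-average.
    expected = (cond_y_given_z[target_labels] * source_probs).sum(axis=1)
    return float(np.log(np.clip(expected, 1e-12, None)).mean())

# Toy usage (random data, illustrative shapes only):
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(1000), size=500)   # source model softmax over 1000 classes
labels = rng.integers(0, 10, size=500)           # 10 target classes
print("LEEP-style score:", leep_style_score(probs, labels))
```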

Claimed Contributions

Empirical demonstration of misleading SITE metric performance in current benchmarks

The authors empirically show that widely used benchmarks for evaluating transferability estimation metrics have fundamental flaws, including unrealistic model spaces and static performance hierarchies that artificially inflate the perceived performance of existing metrics.

10 retrieved papers

Simple static ranking heuristic that outperforms sophisticated SITE metrics

The authors introduce a dataset-agnostic static ranking heuristic that achieves higher weighted Kendall's tau than sophisticated SITE methods on the standard benchmark, revealing that the benchmark rewards memorization of a fixed model hierarchy rather than true task-specific transferability estimation. A toy sketch of this evaluation setup appears after this list.

7 retrieved papers

Best practices and checklist for constructing robust SITE benchmarks

The authors provide concrete recommendations for building better benchmarks, including guidelines for diverse model spaces, challenging datasets with performance headroom, and engineering for rank dispersion. They also provide a SITE benchmarking and evaluation checklist and a benchmark based on these best practices for vision classification. A minimal sketch of such diagnostics appears after this list.

10 retrieved papers
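The following toy sketch illustrates the evaluation setup implied by the second contribution: each method assigns per-dataset scores to candidate models, and its quality is the average weighted Kendall's tau between those scores and the fine-tuned accuracies. The dataset-agnostic baseline shown here (ranking every model by its mean accuracy on the other datasets) is one plausible instantiation of a static ranking heuristic, not necessarily the one used in the paper; the array shapes and toy numbers are assumptions.

```python
import numpy as np
from scipy.stats import weightedtau

# accs[m, d] = fine-tuned accuracy of model m on target dataset d.
# Toy accuracies with a strong static hierarchy: a per-model "quality" term dominates
# small task-specific deviations, mimicking the benchmarks the paper critiques.
rng = np.random.default_rng(0)
model_quality = np.linspace(0.65, 0.90, num=8)          # 8 candidate models
accs = model_quality[:, None] + rng.normal(scale=0.01, size=(8, 5))  # 5 target datasets

def benchmark_wtau(scores: np.ndarray, accs: np.ndarray) -> float:
    """Average weighted Kendall's tau between per-dataset scores and accuracies."""
    taus = [weightedtau(scores[:, d], accs[:, d])[0] for d in range(accs.shape[1])]
    return float(np.mean(taus))

# Dataset-agnostic static ranking: score each model by its mean accuracy on the
# *other* datasets, i.e. using no information about the target task itself.
static_scores = np.empty_like(accs)
for d in range(accs.shape[1]):
    static_scores[:, d] = np.delete(accs, d, axis=1).mean(axis=1)

print("static-ranking baseline, avg weighted tau:", benchmark_wtau(static_scores, accs))
# A SITE metric is plugged in the same way: scores[m, d] = metric(model m, dataset d).
# When the benchmark's model hierarchy barely changes across datasets, this task-blind
# baseline already reaches a high correlation, which is the failure mode the paper exposes.
```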
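As a companion sketch for the third contribution, here is one plausible way to quantify two of the recommended benchmark properties: performance headroom (how far the best model sits from the accuracy ceiling on each dataset) and rank dispersion (how much the model ordering changes across target datasets). The function names, the default ceiling of 1.0, and the use of mean pairwise Kendall's tau are assumptions for illustration; the paper's checklist may operationalize these differently. Applied to the toy accuracies above, rank_dispersion would be close to 1.0, flagging exactly the static-hierarchy failure mode.

```python
import numpy as np
from scipy.stats import kendalltau

def performance_headroom(accs: np.ndarray, ceiling: float = 1.0) -> np.ndarray:
    """Per-dataset gap between the best candidate model and the accuracy ceiling.

    accs[m, d] = fine-tuned accuracy of model m on dataset d. Near-zero headroom
    means the dataset is saturated and cannot separate strong models.
    """
    return ceiling - accs.max(axis=0)

def rank_dispersion(accs: np.ndarray) -> float:
    """Mean pairwise Kendall's tau between per-dataset model rankings.

    Values near 1.0 indicate a static hierarchy that a fixed ranking can exploit;
    lower values indicate the task-dependent re-ranking a robust benchmark needs.
    """
    num_datasets = accs.shape[1]
    taus = [kendalltau(accs[:, i], accs[:, j])[0]
            for i in range(num_datasets) for j in range(i + 1, num_datasets)]
    return float(np.mean(taus))
```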

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Empirical demonstration of misleading SITE metric performance in current benchmarks

The authors empirically show that widely used benchmarks for evaluating transferability estimation metrics have fundamental flaws, including unrealistic model spaces and static performance hierarchies that artificially inflate the perceived performance of existing metrics.

Contribution

Simple static ranking heuristic that outperforms sophisticated SITE metrics

The authors introduce a dataset-agnostic static ranking heuristic that achieves higher weighted Kendall's tau than sophisticated SITE methods on the standard benchmark, revealing that the benchmark rewards memorization of a fixed model hierarchy rather than true task-specific transferability estimation.

Contribution

Best practices and checklist for constructing robust SITE benchmarks

The authors provide concrete recommendations for building better benchmarks, including guidelines for diverse model spaces, challenging datasets with performance headroom, and engineering for rank dispersion. They also provide a SITE benchmarking and evaluation checklist and a benchmark based on these best practices for vision classification.