Can Small Training Runs Reliably Guide Data Curation? Rethinking Proxy-Model Practice
Overview
Overall Novelty Assessment
The paper identifies a critical flaw in standard proxy-model-based data selection: dataset rankings can reverse when proxy training hyperparameters change. It proposes using sufficiently small learning rates to stabilize these rankings. Within the taxonomy, it occupies the 'Hyperparameter Optimization for Proxy Models' leaf under 'Proxy Model Design and Training Strategies'. Notably, this leaf contains only the original paper itself, with no sibling papers, indicating a relatively sparse research direction. The broader parent category includes three leaves addressing proxy architecture, scaling, and training dynamics, suggesting the field has explored proxy model design from multiple angles but has not deeply investigated hyperparameter sensitivity until now.
The taxonomy reveals neighboring work in 'Proxy Model Architecture and Scaling' (three papers on model size selection) and 'Training Trajectory and Learning Dynamics' (three papers on learnability signals). These adjacent leaves focus on what proxy models to build and how to interpret their training signals, whereas the original paper addresses how to configure proxy training itself. The 'Data Selection Objectives and Optimization' branch (four leaves, seven papers total) explores scoring functions and mixture optimization but does not examine the hyperparameter configurations that produce those scores. This positioning suggests the paper bridges a gap between proxy model construction and optimization frameworks by questioning the reliability of the intermediate training step.
Among the seventeen candidates examined, none clearly refutes any of the three contributions. For the first contribution (identification of hyperparameter sensitivity), one candidate was examined with no refutation found. For the second (the tiny-learning-rate strategy) and the third (theoretical proof with empirical validation), eight candidates each were examined, again with no refutations. This limited search scope (seventeen papers drawn from semantic search and citation expansion) means the analysis captures nearby work but cannot claim exhaustive coverage. The absence of refutations among these candidates suggests that the specific focus on hyperparameter-induced ranking reversals and the tiny-learning-rate remedy may be novel within the examined literature, though a broader search could surface additional relevant prior work.
Given the sparse occupancy of the hyperparameter optimization leaf and the lack of refutations among seventeen examined candidates, the work appears to address an underexplored aspect of proxy-based data curation. However, the limited search scope and the paper's position in a relatively new subfield mean this assessment reflects current visibility rather than definitive novelty. The taxonomy structure indicates active research in related proxy model design areas, suggesting the community may soon expand attention to hyperparameter robustness as proxy methods mature.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors reveal that standard proxy-model practices are fragile because dataset rankings can flip when training hyperparameters (especially learning rate) are slightly adjusted. This exposes a critical disconnect between fixed-hyperparameter evaluation and real-world workflows where hyperparameters are tuned per dataset.
The authors propose training proxy models with very small learning rates (e.g., 10^-5 to 10^-6) to improve transferability. This approach yields dataset rankings that remain consistent when models are scaled up and hyperparameters are optimized, addressing the fragility identified in current practices.
The authors provide formal theoretical justification showing that tiny learning rates preserve dataset orderings relative to infinite-width optimal losses in random-feature models. They also conduct extensive experiments spanning multiple architectures, scales, and data curation scenarios to demonstrate dramatic improvements in proxy model reliability.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
Contribution Analysis
Detailed comparisons for each claimed contribution
Identification of hyperparameter sensitivity in proxy-model-based dataset selection
The authors reveal that standard proxy-model practices are fragile because dataset rankings can flip when training hyperparameters (especially learning rate) are slightly adjusted. This exposes a critical disconnect between fixed-hyperparameter evaluation and real-world workflows where hyperparameters are tuned per dataset.
[52] Scalable Gaussian process-based transfer surrogates for hyperparameter optimization
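As a concrete illustration of the claimed fragility, the following sketch (ours, not the paper's; the recipe names and all loss values are invented placeholders) shows how a dataset ranking derived from proxy validation losses can invert when only the proxy learning rate changes:

```python
# Hypothetical proxy validation losses: dataset -> {learning rate -> loss}.
# These numbers are illustrative stand-ins, not results from the paper.
proxy_losses = {
    "recipe_A": {3e-4: 2.10, 1e-3: 1.95},
    "recipe_B": {3e-4: 2.05, 1e-3: 2.20},
}

def ranking(lr):
    """Datasets ordered best-first by proxy validation loss at this learning rate."""
    return sorted(proxy_losses, key=lambda d: proxy_losses[d][lr])

print(ranking(3e-4))  # ['recipe_B', 'recipe_A']
print(ranking(1e-3))  # ['recipe_A', 'recipe_B']  <- the ranking has reversed
```

A practitioner who tunes the proxy's learning rate per dataset could therefore reach the opposite curation decision from one who fixes it, which is exactly the disconnect the contribution highlights.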
Tiny learning rate strategy for reliable proxy model training
The authors propose training proxy models with very small learning rates (e.g., 10^-5 to 10^-6) to improve transferability. This approach yields dataset rankings that remain consistent when models are scaled up and hyperparameters are optimized, addressing the fragility identified in current practices.
[44] Step-by-step enhancement of a graph neural network-based surrogate model for Lagrangian fluid simulations with flexible time step sizes
[45] Meta-learning innovates chemical kinetics: An efficient approach for surrogate model construction
[46] pFedES: Generalized Proxy Feature Extractor Sharing for Model Heterogeneous Personalized Federated Learning
[47] Quantifying the uncertainty of structural parameters using machine learning-based surrogate models
[48] Layer-Wise Learning Rate Optimization for Task-Dependent Fine-Tuning of Pre-Trained Models: An Evolutionary Approach
[49] An Empirical Study of μP Learning Rate Transfer
[50] Classical feature map surrogates and metrics for quantum control landscapes
[51] A Study of Generalization of Stochastic Mirror Descent Algorithms on Overparameterized Nonlinear Models
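The evaluation implied by this contribution can be sketched as a rank-agreement check: rank candidate recipes with proxies at several learning rates, then measure how well each proxy ranking agrees with the ranking from a large-scale, tuned run. The sketch below is ours, with synthetic loss values (not the paper's data); `kendall_tau` is a minimal hand-rolled rank correlation:

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall rank correlation between two orderings of the same items
    (+1 = identical order, -1 = fully reversed; no ties assumed)."""
    pos_a = {x: i for i, x in enumerate(rank_a)}
    pos_b = {x: i for i, x in enumerate(rank_b)}
    pairs = list(combinations(rank_a, 2))
    score = sum(
        1 if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0 else -1
        for x, y in pairs
    )
    return score / len(pairs)

# Hypothetical proxy validation losses per recipe at each learning rate.
proxy_losses = {
    1e-3: {"r1": 2.3, "r2": 2.1, "r3": 2.4},  # larger LR: ordering drifts
    1e-6: {"r1": 2.9, "r2": 3.0, "r3": 3.1},  # tiny LR: ordering is stable
}
target_ranking = ["r1", "r2", "r3"]  # ordering from a full-scale, tuned run

for lr, losses in proxy_losses.items():
    proxy_ranking = sorted(losses, key=losses.get)
    print(lr, kendall_tau(proxy_ranking, target_ranking))
```

In this toy setup the tiny-learning-rate proxy reproduces the target ordering exactly (tau = 1.0) while the larger-rate proxy does not, mirroring the consistency property the contribution claims.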
Theoretical proof for random-feature models and empirical validation across 23 data recipes
The authors provide formal theoretical justification showing that tiny learning rates preserve dataset orderings relative to infinite-width optimal losses in random-feature models. They also conduct extensive experiments spanning multiple architectures, scales, and data curation scenarios to demonstrate dramatic improvements in proxy model reliability.
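The flavor of the theoretical claim can be written down schematically. In notation we introduce here for illustration (the paper's own symbols and precise conditions may differ): let L^(η)_m(D) be the validation loss of a width-m random-feature proxy trained on dataset D with learning rate η, and let L*_∞(D) be the corresponding infinite-width optimal loss. The ordering-preservation statement then has roughly this shape:

```latex
% Illustrative formalization in our own notation, not necessarily the
% paper's: tiny learning rates preserve the ordering induced by the
% infinite-width optimal losses in random-feature models.
\[
  L^{*}_{\infty}(D_1) < L^{*}_{\infty}(D_2)
  \;\Longrightarrow\;
  \exists\, \eta_0 > 0 \ \text{such that}\ \forall\, \eta \in (0, \eta_0):
  \quad L^{(\eta)}_{m}(D_1) < L^{(\eta)}_{m}(D_2)
\]
```

for sufficiently large width m. Read this way, the theory licenses the practical recipe: once η is below the (problem-dependent) threshold η₀, proxy comparisons inherit the ordering of the infinite-width target, which the empirical study across the 23 data recipes is designed to test at realistic scales.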