Abstract:

Data teams at frontier AI companies routinely train small proxy models to make critical decisions about pretraining data recipes for full-scale training. However, the community has a limited understanding of whether and when conclusions drawn from small-scale experiments reliably transfer to large-scale production training. In this work, we uncover a critical issue in the standard practice of training small proxy models on each data recipe with a single set of hyperparameters. We demonstrate that each dataset requires its own optimal training configuration, and that dataset rankings can completely reverse with even minor adjustments to proxy training hyperparameters. Furthermore, this creates a disconnect from the actual model development pipeline, where hyperparameter optimization is a standard step. Consequently, we propose that the objective of data selection should be to identify the dataset that yields the best performance after its own hyperparameter optimization. We introduce a simple yet effective patch to the current proxy-model-based method: training proxy models with sufficiently small learning rates produces dataset rankings that strongly correlate with those obtained when large-scale models are properly tuned for each dataset. Theoretically, we prove that, for random-feature models, this approach preserves the ordering of datasets according to their optimal achievable losses. Empirically, we validate this approach through comprehensive experiments across 23 data recipes covering four critical dimensions of data curation decisions faced in production settings, demonstrating dramatic improvements in proxy model reliability.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's claimed tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper identifies a critical flaw in standard proxy-model-based data selection: dataset rankings can reverse when proxy training hyperparameters change, and proposes using sufficiently small learning rates to stabilize these rankings. Within the taxonomy, it occupies the 'Hyperparameter Optimization for Proxy Models' leaf under 'Proxy Model Design and Training Strategies'. Notably, this leaf contains only the original paper itself—no sibling papers—indicating this is a relatively sparse research direction. The broader parent category includes three leaves addressing proxy architecture, scaling, and training dynamics, suggesting the field has explored proxy model design from multiple angles but has not deeply investigated hyperparameter sensitivity until now.

The taxonomy reveals neighboring work in 'Proxy Model Architecture and Scaling' (three papers on model size selection) and 'Training Trajectory and Learning Dynamics' (three papers on learnability signals). These adjacent leaves focus on what proxy models to build and how to interpret their training signals, whereas the original paper addresses how to configure proxy training itself. The 'Data Selection Objectives and Optimization' branch (four leaves, seven papers total) explores scoring functions and mixture optimization but does not examine the hyperparameter configurations that produce those scores. This positioning suggests the paper bridges a gap between proxy model construction and optimization frameworks by questioning the reliability of the intermediate training step.

Among the seventeen candidates examined, none clearly refutes any of the three contributions. For the first contribution (hyperparameter sensitivity identification), one candidate was examined and no refutation was found. For the second (tiny learning rate strategy) and third (theoretical proof with empirical validation), eight candidates each were examined, again with no refutations. This limited search scope (seventeen papers gathered via semantic search and citation expansion) means the analysis captures nearby work but cannot claim exhaustive coverage. The absence of refutations among these candidates suggests that the specific focus on hyperparameter-induced ranking reversals, and the tiny-learning-rate remedy, may be novel within the examined literature, though a broader search could reveal additional relevant prior work.

Given the sparse occupancy of the hyperparameter optimization leaf and the lack of refutations among seventeen examined candidates, the work appears to address an underexplored aspect of proxy-based data curation. However, the limited search scope and the paper's position in a relatively new subfield mean this assessment reflects current visibility rather than definitive novelty. The taxonomy structure indicates active research in related proxy model design areas, suggesting the community may soon expand attention to hyperparameter robustness as proxy methods mature.

Taxonomy

Core-task Taxonomy Papers: 43
Claimed Contributions: 3
Contribution Candidate Papers Compared: 17
Refutable Papers: 0

Research Landscape Overview

Core task: data curation using small proxy models for large-scale training. The field addresses the challenge of selecting high-quality training data efficiently by leveraging smaller, computationally cheaper models to guide decisions for much larger target models. The taxonomy reveals several main branches: Proxy Model Design and Training Strategies explores how to construct and optimize these small surrogates, including hyperparameter tuning and architectural choices; Data Selection Objectives and Optimization focuses on the scoring functions and optimization frameworks that determine which examples to retain; Task-Specific Data Selection Applications examines domain-tailored approaches for language modeling, instruction tuning, and specialized tasks; Efficiency and Scalability Techniques investigates methods to handle massive datasets through batching, pruning, and online mixing; and Alternative and Domain-Specific Proxy Applications considers broader uses of proxy models beyond standard data curation.

Works such as Selection via Proxy[8] and Smaller Select Data[1] illustrate foundational principles, whereas recent efforts like Nemotron CLIMB[4] and BLISS[19] demonstrate scalable implementations across diverse settings. A particularly active line of inquiry concerns the transferability of proxy-model judgments to larger targets: studies like Small to Large[5] and Concept Skill Transferability[7] investigate when and why small models reliably predict large-model performance, while Bad Students Great Teachers[6] highlights scenarios where weaker proxies can paradoxically yield strong curation outcomes. Another contrasting theme involves online versus offline selection strategies, with Online Data Mixing[16] and RL Guided Selection[9] exploring adaptive, feedback-driven approaches. Proxy Model Practice[0] sits within the Hyperparameter Optimization for Proxy Models cluster, emphasizing systematic tuning of proxy configurations to maximize downstream gains.
Compared to nearby works such as Regmix[3], which blends multiple proxy signals, or Smaller Can Be Better[2], which examines minimal proxy architectures, Proxy Model Practice[0] focuses specifically on the hyperparameter landscape that governs proxy fidelity and computational trade-offs, offering practitioners guidance on calibrating these small surrogates for diverse large-scale training regimes.

Claimed Contributions

Identification of hyperparameter sensitivity in proxy-model-based dataset selection

The authors reveal that standard proxy-model practices are fragile because dataset rankings can flip when training hyperparameters (especially learning rate) are slightly adjusted. This exposes a critical disconnect between fixed-hyperparameter evaluation and real-world workflows where hyperparameters are tuned per dataset.

1 retrieved paper
Tiny learning rate strategy for reliable proxy model training

The authors propose training proxy models with very small learning rates (e.g., 10^-5 to 10^-6) to improve transferability. This approach yields dataset rankings that remain consistent when models are scaled up and hyperparameters are optimized, addressing the fragility identified in current practices.

8 retrieved papers
Theoretical proof for random-feature models and empirical validation across 23 data recipes

The authors provide formal theoretical justification showing that tiny learning rates preserve dataset orderings relative to infinite-width optimal losses in random-feature models. They also conduct extensive experiments spanning multiple architectures, scales, and data curation scenarios to demonstrate dramatic improvements in proxy model reliability.

8 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In the retrieved landscape it therefore appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Identification of hyperparameter sensitivity in proxy-model-based dataset selection

The authors reveal that standard proxy-model practices are fragile because dataset rankings can flip when training hyperparameters (especially learning rate) are slightly adjusted. This exposes a critical disconnect between fixed-hyperparameter evaluation and real-world workflows where hyperparameters are tuned per dataset.
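The ranking-reversal phenomenon described above can be illustrated with a minimal, fully synthetic sketch. Each data recipe is given a stylized quadratic loss surface over log learning rate; all surfaces, optima, and loss values below are invented for illustration and are not taken from the paper.

```python
import math

# Stylized proxy-model validation-loss surfaces over log10(learning rate).
# Recipe "A" prefers a large LR; recipe "B" prefers a small LR and attains
# the lower loss once each recipe is tuned. All constants are invented.
LOSS_SURFACES = {
    "A": lambda log_lr: 3.00 + 0.8 * (log_lr + 3.0) ** 2,  # optimum at lr = 1e-3
    "B": lambda log_lr: 2.90 + 0.2 * (log_lr + 4.5) ** 2,  # optimum near lr = 3e-5
}

def best_recipe_at(lr):
    """Winner when every recipe is trained with one shared learning rate."""
    log_lr = math.log10(lr)
    return min(LOSS_SURFACES, key=lambda r: LOSS_SURFACES[r](log_lr))

def best_recipe_tuned():
    """Winner when each recipe uses its own optimal learning rate."""
    return min(LOSS_SURFACES, key=lambda r: min(
        LOSS_SURFACES[r](l / 10) for l in range(-70, -9)))  # scan log10(lr) in [-7, -1]

print(best_recipe_at(1e-3))   # "A": a fixed mid-range LR picks recipe A
print(best_recipe_at(1e-5))   # "B": the ranking flips at a smaller LR
print(best_recipe_tuned())    # "B": per-recipe tuning disagrees with the fixed mid-range LR
```

The sketch shows exactly the failure mode claimed: a single shared hyperparameter setting can crown a recipe that per-recipe tuning would reject.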

Contribution

Tiny learning rate strategy for reliable proxy model training

The authors propose training proxy models with very small learning rates (e.g., 10^-5 to 10^-6) to improve transferability. This approach yields dataset rankings that remain consistent when models are scaled up and hyperparameters are optimized, addressing the fragility identified in current practices.
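The claim that small-LR proxy rankings "strongly correlate" with tuned large-scale rankings is typically quantified with a rank correlation such as Spearman's rho. A minimal sketch, using invented loss numbers for five hypothetical recipes (not values from the paper):

```python
def ranks(values):
    """1-based ranks, smallest value gets rank 1 (assumes no ties)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    r = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r

def spearman(xs, ys):
    """Spearman's rho for tie-free data: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Invented validation losses for five data recipes.
proxy_small_lr = [0.81, 0.74, 0.89, 0.70, 0.77]   # small-LR proxy runs
large_tuned    = [3.12, 3.08, 3.30, 2.98, 3.05]   # large models, per-recipe tuning

print(spearman(proxy_small_lr, large_tuned))  # 0.9
```

A rho near 1.0 means the cheap proxy runs would select (nearly) the same recipe as the expensive tuned runs, which is the property the proposed small-LR strategy aims to guarantee.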

Contribution

Theoretical proof for random-feature models and empirical validation across 23 data recipes

The authors provide formal theoretical justification showing that tiny learning rates preserve dataset orderings relative to infinite-width optimal losses in random-feature models. They also conduct extensive experiments spanning multiple architectures, scales, and data curation scenarios to demonstrate dramatic improvements in proxy model reliability.
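The theoretical claim can be probed numerically: in a random-feature regression, gradient descent with a tiny learning rate and a short budget should already rank datasets in the same order as their optimal achievable losses. Below is a toy simulation under assumed settings (synthetic data, tanh random features, a "clean" vs. "noisy" recipe); none of these constants come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_dataset(noise_std, n=200, d=10):
    """Synthetic regression data: shared target function, varying label noise."""
    X = rng.normal(size=(n, d))
    y = np.sin(X[:, 0]) + noise_std * rng.normal(size=n)
    return X, y

def random_features(X, m=100):
    """Fixed random first layer with tanh nonlinearity (random-feature model)."""
    W = rng.normal(size=(X.shape[1], m)) / np.sqrt(X.shape[1])
    return np.tanh(X @ W)

def optimal_loss(Phi, y):
    """Best achievable MSE for a linear head on these features."""
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return float(np.mean((Phi @ w - y) ** 2))

def tiny_lr_loss(Phi, y, lr=1e-3, steps=200):
    """MSE after short gradient descent from zero with a small learning rate
    (small relative to the curvature of this toy problem)."""
    w = np.zeros(Phi.shape[1])
    for _ in range(steps):
        w -= lr * (2.0 / len(y)) * Phi.T @ (Phi @ w - y)
    return float(np.mean((Phi @ w - y) ** 2))

# The tiny-LR proxy runs should pick the same recipe as the optimal-loss oracle.
losses = {}
for name, noise in [("clean", 0.1), ("noisy", 1.0)]:
    X, y = make_dataset(noise)
    Phi = random_features(X)
    losses[name] = (tiny_lr_loss(Phi, y), optimal_loss(Phi, y))

proxy_pick = min(losses, key=lambda k: losses[k][0])
oracle_pick = min(losses, key=lambda k: losses[k][1])
print(proxy_pick == oracle_pick)  # True under this seed
```

This is only a sanity check of the ordering-preservation statement in a simple regime, not a reproduction of the paper's proof, which concerns infinite-width optimal losses.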

Can Small Training Runs Reliably Guide Data Curation? Rethinking Proxy-Model Practice | Novelty Validation