How to train data-efficient LLMs
Overview
Overall Novelty Assessment
The paper proposes two data selection techniques—ASK-LLM, which uses instruction-tuned LLMs to assess training-example quality, and DENSITY sampling, which models the data distribution to select diverse subsets—and benchmarks 22 curation methods across hundreds of pre-training runs. It resides in the Quality-Based Data Selection leaf, which contains five papers including this one. This leaf sits within the broader Data Selection and Curation Methods branch, indicating a moderately populated research direction focused on identifying high-value training subsets through quality metrics and model-based assessments.
The taxonomy reveals neighboring leaves addressing diversity-focused sampling (three papers) and data influence modeling (two papers), suggesting the field has organized quality-based, diversity-based, and influence-based selection into distinct but complementary categories. The paper's dual focus on quality (ASK-LLM) and coverage (DENSITY) bridges these categories. Sibling papers in the same leaf include Dataman and Group-level Data Influence, which explore scalable curation pipelines and fine-grained attribution, respectively. The taxonomy's scope notes clarify that quality-based selection excludes diversity sampling and domain-specific filtering, positioning this work at the intersection of quality assessment and distribution coverage.
Among the 30 candidates examined, none clearly refutes the three main contributions: ASK-LLM sampling (10 candidates, 0 refutable), DENSITY sampling (10 candidates, 0 refutable), and the large-scale empirical benchmark (10 candidates, 0 refutable). This suggests that, within the limited search scope, the specific combination of LLM-based quality assessment, density-based diversity sampling, and comprehensive benchmarking of 22 techniques appears relatively novel. The absence of refutable candidates across all contributions indicates that the paper's integrated approach and empirical scale may distinguish it from prior work, though the search examined only the top-30 semantic matches rather than an exhaustive literature review.
Based on the limited search scope of 30 candidates, the work appears to occupy a moderately explored area with distinct methodological contributions. The taxonomy structure shows active research in quality-based selection (five papers in the leaf), but the specific techniques and large-scale benchmarking approach may offer new empirical insights. The analysis does not cover potential overlaps beyond the top-30 semantic matches or recent concurrent work, so the novelty assessment remains provisional pending broader literature examination.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose ASK-LLM, a data selection method that leverages instruction-tuned LLMs to directly assess training example quality through zero-shot reasoning. This technique consistently outperforms other data curation routines and enables training models that exceed full-dataset performance while using only a fraction of the data.
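The scoring idea can be sketched as follows: each example is wrapped in a yes/no quality prompt, and the softmax probability of the "yes" continuation serves as the example's quality score. The prompt wording, helper names, and mock logits below are illustrative assumptions, not the paper's exact setup:

```python
import math

# Prompt template in the spirit of ASK-LLM (wording is illustrative, not
# the paper's exact prompt): the scoring LLM answers yes/no on whether
# the example is worth training on.
PROMPT = (
    "###\n{example}\n###\n"
    "Does the previous paragraph contain informative signal for "
    "pre-training a large language model? Answer yes or no."
)

def ask_llm_score(yes_logit, no_logit):
    """Quality score = softmax probability of the 'yes' token.

    The two logits stand in for a real instruction-tuned scorer's
    next-token logits after PROMPT.format(example=...).
    """
    return math.exp(yes_logit) / (math.exp(yes_logit) + math.exp(no_logit))

def select_top_k(examples, scores, k):
    """Keep the k highest-scoring examples as the training subset."""
    ranked = sorted(zip(examples, scores), key=lambda pair: pair[1], reverse=True)
    return [example for example, _ in ranked[:k]]

examples = ["a coherent encyclopedia paragraph",
            "xjq zzz buy now!!!",
            "a news article about local elections"]
mock_logits = [(3.0, -1.0), (-2.0, 2.5), (1.5, 0.0)]  # mock (yes, no) logits
scores = [ask_llm_score(y, n) for y, n in mock_logits]
kept = select_top_k(examples, scores, k=2)
```

A real pipeline would obtain the logits from the instruction-tuned scoring model; keeping the top-k scorers is just one way the scores can drive subset selection.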
The authors introduce DENSITY, a coverage-maximizing sampler that estimates local density in the embedding space using kernel sums. This method aims to maximize topic coverage by downsampling redundant high-density regions and boosting under-represented portions of the input domain.
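A minimal illustration of the kernel-sum idea, computed exactly in O(n²) over toy 2-D embeddings (the paper relies on sketch-based approximations to scale; the Gaussian kernel, bandwidth, and inverse-propensity weighting below are illustrative assumptions):

```python
import math

def kernel_density(embeddings, bandwidth=1.0):
    """Naive exact kernel-sum density estimate over example embeddings.

    Each example's density is the sum of a Gaussian kernel against every
    other example; the paper approximates these sums with sketches.
    """
    def gauss(u, v):
        d2 = sum((a - b) ** 2 for a, b in zip(u, v))
        return math.exp(-d2 / (2 * bandwidth ** 2))
    return [sum(gauss(x, y) for y in embeddings) for x in embeddings]

def inverse_propensity_weights(densities):
    """Sampling weights proportional to 1/density, so redundant
    high-density regions are downsampled and sparse, under-represented
    regions are boosted."""
    total = sum(1.0 / d for d in densities)
    return [(1.0 / d) / total for d in densities]

# Two near-duplicate points in a dense cluster plus one isolated outlier.
embeddings = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
densities = kernel_density(embeddings, bandwidth=1.0)
weights = inverse_propensity_weights(densities)
```

The two near-duplicates end up with high density and hence low sampling weight, while the outlier is boosted, which is exactly the coverage-maximizing behavior described above.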
The authors conduct an extensive comparative study testing 22 data curation techniques across hundreds of pre-training runs and over a thousand fine-tuning evaluations. This exhaustive benchmark provides new insights into the roles of coverage, quality, and sampling cost in LLM pre-training.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[2] Dataman: Data manager for pre-training large language models
[5] Data-efficient pretraining with group-level data influence modeling
[10] BERTIN: Efficient pre-training of a Spanish language model using perplexity sampling
[40] Ultra-FineWeb: Efficient data filtering and verification for high-quality LLM training data
Contribution Analysis
Detailed comparisons for each claimed contribution
ASK-LLM sampling technique
The authors propose ASK-LLM, a data selection method that leverages instruction-tuned LLMs to directly assess training example quality through zero-shot reasoning. This technique consistently outperforms other data curation routines and enables training models that exceed full-dataset performance while using only a fraction of the data.
[51] InstructBLIP: Towards general-purpose vision-language models with instruction tuning
[52] Evaluating instruction-tuned large language models on code comprehension and generation
[53] Learning to generate instruction tuning datasets for zero-shot task adaptation
[54] A zero-shot and few-shot study of instruction-finetuned large language models applied to clinical and biomedical tasks
[55] Enhancing zero-shot facial expression recognition by LLM knowledge transfer
[56] LLMs as zero-shot graph learners: Alignment of GNN representations with LLM token embeddings
[57] MoDS: Model-oriented data selection for instruction tuning
[58] Unsupervised text representation learning via instruction-tuning for zero-shot dense retrieval
[59] EchoQA: A large collection of instruction tuning data for echocardiogram reports
[60] InstructRetro: Instruction tuning post retrieval-augmented pretraining
DENSITY sampling technique
The authors introduce DENSITY, a coverage-maximizing sampler that estimates local density in the embedding space using kernel sums. This method aims to maximize topic coverage by downsampling redundant high-density regions and boosting under-represented portions of the input domain.
[69] Variational kernel density estimation recommendation algorithm for users with diverse activity levels
[70] Exploiting probability density function of deep convolutional autoencoders' latent space for reliable COVID-19 detection on CT scans
[71] Nonparametric estimation with kernel mean embeddings
[72] Kernel density estimation in metric spaces
[73] Learnable kernel density estimation for graphs
[74] Entropic analysis of time series through kernel density estimation
[75] Kernel based method for distributed derived feature tracking in high dimensions
[76] Density estimation and modeling on symmetric spaces
[77] Layer-constrained variational autoencoding kernel density estimation model for anomaly detection
[78] Kernel conditional density operators
Large-scale empirical benchmark of data curation techniques
The authors conduct an extensive comparative study testing 22 data curation techniques across hundreds of pre-training runs and over a thousand fine-tuning evaluations. This exhaustive benchmark provides new insights into the roles of coverage, quality, and sampling cost in LLM pre-training.