How to train data-efficient LLMs

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: data sampling, data efficiency, LLMs, data curation, data quality
Abstract:

The training of large language models (LLMs) is expensive. In this paper, we study data-efficient approaches for pre-training LLMs, i.e., techniques that aim to optimize the Pareto frontier of model quality and training resource/data consumption. We seek to understand the tradeoffs associated with data selection routines based on (i) expensive-to-compute data-quality estimates, and (ii) maximization of coverage and diversity-based measures in the feature space. Our first technique, AskLLM, leverages the zero-shot reasoning capabilities of instruction-tuned LLMs to directly assess the quality of a training example. To target coverage, we propose density sampling, which models the data distribution to select a diverse sample. Testing the effect of 22 different data curation techniques on the pre-training of T5-style models, involving hundreds of pre-training runs and post-fine-tuning evaluation tasks, we find that AskLLM and density are the best methods in their respective categories. While coverage sampling techniques often recover the performance of training on the entire dataset, training on data curated via AskLLM consistently outperforms full-data training, even when we sample only 10% of the original dataset, while converging up to 70% faster.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes two data selection techniques—AskLLM, which uses instruction-tuned LLMs to assess training example quality, and density sampling, which models data distributions for diverse subset selection—and benchmarks 22 curation methods across hundreds of pre-training runs. It resides in the Quality-Based Data Selection leaf, which contains five papers including the original work. This leaf sits within the broader Data Selection and Curation Methods branch, indicating a moderately populated research direction focused on identifying high-value training subsets through quality metrics and model-based assessments.

The taxonomy reveals neighboring leaves addressing diversity-focused sampling (three papers) and data influence modeling (two papers), suggesting the field has organized quality-based, diversity-based, and influence-based selection into distinct but complementary categories. The paper's dual focus on quality (AskLLM) and coverage (density sampling) bridges these categories. Sibling papers in the same leaf include Dataman and Group-level Data Influence, which explore scalable curation pipelines and fine-grained attribution respectively. The taxonomy's scope notes clarify that quality-based selection excludes diversity sampling and domain-specific filtering, positioning this work at the intersection of quality assessment and distribution coverage.

Among 30 candidates examined, none clearly refute the three main contributions: AskLLM sampling (10 candidates, 0 refutable), density sampling (10 candidates, 0 refutable), and the large-scale empirical benchmark (10 candidates, 0 refutable). This suggests that within the limited search scope, the specific combination of LLM-based quality assessment, density-based diversity sampling, and comprehensive benchmarking of 22 techniques appears relatively novel. The absence of refutable candidates across all contributions indicates that the paper's integrated approach and empirical scale may distinguish it from prior work, though the search examined only top-30 semantic matches rather than an exhaustive literature review.

Based on the limited search scope of 30 candidates, the work appears to occupy a moderately explored area with distinct methodological contributions. The taxonomy structure shows active research in quality-based selection (five papers in the leaf), but the specific techniques and large-scale benchmarking approach may offer new empirical insights. The analysis does not cover potential overlaps beyond the top-30 semantic matches or recent concurrent work, so the novelty assessment remains provisional pending broader literature examination.

Taxonomy

- Core-task taxonomy papers: 50
- Claimed contributions: 3
- Contribution candidate papers compared: 30
- Refutable papers: 0

Research Landscape Overview

Core task: data-efficient pre-training of large language models. The field has organized itself around several complementary strategies for reducing the computational and data costs of training large language models. Data Selection and Curation Methods focus on identifying high-quality subsets of training corpora through filtering, deduplication, and influence-based techniques, aiming to maximize model performance with fewer tokens.

Data Synthesis and Augmentation explore generating or transforming training examples to enrich limited datasets, while Continual and Domain-Adaptive Pre-Training address how to efficiently update or specialize models for new domains without full retraining. Low-Resource and Cross-Lingual Adaptation tackles the challenge of extending models to languages and settings with scarce data, and Model Compression and Efficient Architectures pursue smaller, faster models through quantization, pruning, and architectural innovations. Training Optimization and Efficiency Techniques improve the training process itself via better optimizers, curriculum learning, and hardware utilization, whereas Post-Training Alignment and Fine-Tuning Efficiency streamline instruction tuning and preference learning. Multimodal and Cross-Domain Adaptation extends these principles beyond text, and Specialized Pre-Training Paradigms and Benchmarks provide controlled settings like the BabyLM Challenge to study data efficiency at small scale.

Within Data Selection and Curation Methods, a particularly active line of work examines quality-based filtering and influence estimation to prioritize informative training examples. Data-efficient LLMs[0] situates itself in this quality-focused branch, emphasizing principled data selection to reduce pre-training costs.
Nearby efforts such as Dataman[2] and Group-level Data Influence[5] explore complementary angles on measuring and leveraging data quality, with Dataman[2] offering scalable curation pipelines and Group-level Data Influence[5] providing finer-grained attribution of training subsets to model behavior. Ultra-fineweb[40] represents another closely related effort, curating a high-quality web corpus through aggressive filtering. The central tension across these works lies in balancing the computational overhead of quality assessment against the downstream gains from cleaner data, and in determining whether coarse heuristics or fine-grained influence methods yield better trade-offs. Data-efficient LLMs[0] contributes to this landscape by synthesizing quality-based selection strategies, offering a perspective on how careful data curation can substantially reduce the scale requirements for effective pre-training.

Claimed Contributions

ASK-LLM sampling technique

The authors propose ASK-LLM, a data selection method that leverages instruction-tuned LLMs to directly assess training example quality through zero-shot reasoning. This technique consistently outperforms other data curation routines and enables training models that exceed full-dataset performance while using only a fraction of the data.
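The idea of scoring an example by asking an instruction-tuned LLM can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the prompt wording and the `yes_logprob_fn` callback (assumed to return the log-probability of answering "yes" to the prompt) are stand-ins for whatever scoring model and API one actually uses.

```python
import math

# Illustrative prompt; the paper's exact wording may differ.
PROMPT_TEMPLATE = (
    "###\n{example}\n###\n"
    "Does the previous paragraph contain informative signal for "
    "pre-training a large language model? Answer yes or no."
)

def ask_llm_score(example: str, yes_logprob_fn) -> float:
    """Quality score = P('yes') under the scoring LLM for this example."""
    prompt = PROMPT_TEMPLATE.format(example=example)
    return math.exp(yes_logprob_fn(prompt))

def select_top_fraction(examples, yes_logprob_fn, fraction=0.1):
    """Keep the top `fraction` of examples ranked by ASK-LLM-style score."""
    ranked = sorted(examples,
                    key=lambda ex: ask_llm_score(ex, yes_logprob_fn),
                    reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k]
```

In practice `yes_logprob_fn` would wrap a call to a served instruction-tuned model; ranking by P("yes") gives a continuous quality signal rather than a hard yes/no filter.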

10 retrieved papers
DENSITY sampling technique

The authors introduce DENSITY, a coverage-maximizing sampler that estimates local density in the embedding space using kernel sums. This method aims to maximize topic coverage by downsampling redundant high-density regions and boosting under-represented portions of the input domain.
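A minimal sketch of this kind of sampler, assuming examples are already embedded as vectors: estimate each point's local density as a Gaussian kernel sum over the dataset, then sample with probability inversely proportional to density. The quadratic pairwise-distance computation here is for illustration only; a scalable implementation would approximate the kernel sums (e.g., with sketching), and the bandwidth choice is an assumption.

```python
import numpy as np

def kernel_density(embeddings: np.ndarray, bandwidth: float = 1.0) -> np.ndarray:
    """Kernel-sum density per point: sum_j exp(-||x_i - x_j||^2 / (2 h^2))."""
    sq_dists = ((embeddings[:, None, :] - embeddings[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * bandwidth ** 2)).sum(axis=1)

def density_sample(embeddings: np.ndarray, sample_size: int,
                   bandwidth: float = 1.0, rng=None) -> np.ndarray:
    """Draw indices with probability ~ 1/density, down-weighting redundant
    high-density regions and boosting under-represented ones."""
    rng = np.random.default_rng(rng)
    density = kernel_density(embeddings, bandwidth)
    weights = 1.0 / density
    probs = weights / weights.sum()
    return rng.choice(len(embeddings), size=sample_size, replace=False, p=probs)
```

Near-duplicate points sit in high-density regions and so get low selection probability, which is exactly the coverage-maximizing behavior the contribution describes.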

10 retrieved papers
Large-scale empirical benchmark of data curation techniques

The authors conduct an extensive comparative study testing 22 data curation techniques across hundreds of pre-training runs and over a thousand fine-tuning evaluations. This exhaustive benchmark provides new insights into the roles of coverage, quality, and sampling cost in LLM pre-training.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

