How to train data-efficient LLMs

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: data sampling, data efficiency, LLMs, data curation, data quality
Abstract:

The training of large language models (LLMs) is expensive. In this paper, we study data-efficient approaches for pre-training LLMs, i.e., techniques that aim to optimize the Pareto frontier of model quality and training resource/data consumption. We seek to understand the tradeoffs associated with data selection routines based on (i) expensive-to-compute data-quality estimates, and (ii) maximization of coverage and diversity-based measures in the feature space. Our first technique, AskLLM, leverages the zero-shot reasoning capabilities of instruction-tuned LLMs to directly assess the quality of a training example. To target coverage, we propose density sampling, which models the data distribution to select a diverse sample. Testing the effect of 22 different data curation techniques on the pre-training of T5-style models, involving hundreds of pre-training runs and post-fine-tuning evaluation tasks, we find that AskLLM and density are the best methods in their respective categories. While coverage sampling techniques often recover the performance of training on the entire dataset, training on data curated via AskLLM consistently outperforms full-data training, even when we sample only 10% of the original dataset, while converging up to 70% faster.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes two data selection techniques—AskLLM, which uses instruction-tuned LLMs to assess training example quality, and density sampling, which models data distributions for diverse subset selection—and benchmarks 22 curation methods across hundreds of pre-training runs. It resides in the Quality-Based Data Selection leaf, which contains five papers including the original work. This leaf sits within the broader Data Selection and Curation Methods branch, indicating a moderately populated research direction focused on identifying high-value training subsets through quality metrics and model-based assessments.

The taxonomy reveals neighboring leaves addressing diversity-focused sampling (three papers) and data influence modeling (two papers), suggesting the field has organized quality-based, diversity-based, and influence-based selection into distinct but complementary categories. The paper's dual focus on quality (AskLLM) and coverage (density sampling) bridges these categories. Sibling papers in the same leaf include Dataman and Group-level Data Influence, which explore scalable curation pipelines and fine-grained attribution respectively. The taxonomy's scope notes clarify that quality-based selection excludes diversity sampling and domain-specific filtering, positioning this work at the intersection of quality assessment and distribution coverage.

Among 30 candidates examined, none clearly refute the three main contributions: AskLLM sampling (10 candidates, 0 refutable), density sampling (10 candidates, 0 refutable), and the large-scale empirical benchmark (10 candidates, 0 refutable). This suggests that within the limited search scope, the specific combination of LLM-based quality assessment, density-based diversity sampling, and comprehensive benchmarking of 22 techniques appears relatively novel. The absence of refutable candidates across all contributions indicates that the paper's integrated approach and empirical scale may distinguish it from prior work, though the search examined only top-30 semantic matches rather than an exhaustive literature review.

Based on the limited search scope of 30 candidates, the work appears to occupy a moderately explored area with distinct methodological contributions. The taxonomy structure shows active research in quality-based selection (five papers in the leaf), but the specific techniques and large-scale benchmarking approach may offer new empirical insights. The analysis does not cover potential overlaps beyond the top-30 semantic matches or recent concurrent work, so the novelty assessment remains provisional pending broader literature examination.

Taxonomy

- Core-task taxonomy papers: 50
- Claimed contributions: 3
- Contribution candidate papers compared: 30
- Refutable papers: 0

Research Landscape Overview

Core task: data-efficient pre-training of large language models. The field has organized itself around several complementary strategies for reducing the computational and data costs of training large language models. Data Selection and Curation Methods focus on identifying high-quality subsets of training corpora through filtering, deduplication, and influence-based techniques, aiming to maximize model performance with fewer tokens.

Data Synthesis and Augmentation explore generating or transforming training examples to enrich limited datasets, while Continual and Domain-Adaptive Pre-Training address how to efficiently update or specialize models for new domains without full retraining. Low-Resource and Cross-Lingual Adaptation tackles the challenge of extending models to languages and settings with scarce data, and Model Compression and Efficient Architectures pursue smaller, faster models through quantization, pruning, and architectural innovations. Training Optimization and Efficiency Techniques improve the training process itself via better optimizers, curriculum learning, and hardware utilization, whereas Post-Training Alignment and Fine-Tuning Efficiency streamline instruction tuning and preference learning. Multimodal and Cross-Domain Adaptation extends these principles beyond text, and Specialized Pre-Training Paradigms and Benchmarks provide controlled settings like the BabyLM Challenge to study data efficiency at small scale.

Within Data Selection and Curation Methods, a particularly active line of work examines quality-based filtering and influence estimation to prioritize informative training examples. Data-efficient LLMs[0] situates itself in this quality-focused branch, emphasizing principled data selection to reduce pre-training costs.
Nearby efforts such as Dataman[2] and Group-level Data Influence[5] explore complementary angles on measuring and leveraging data quality, with Dataman[2] offering scalable curation pipelines and Group-level Data Influence[5] providing finer-grained attribution of training subsets to model behavior. Ultra-fineweb[40] represents another closely related effort, curating a high-quality web corpus through aggressive filtering. The central tension across these works lies in balancing the computational overhead of quality assessment against the downstream gains from cleaner data, and in determining whether coarse heuristics or fine-grained influence methods yield better trade-offs. Data-efficient LLMs[0] contributes to this landscape by synthesizing quality-based selection strategies, offering a perspective on how careful data curation can substantially reduce the scale requirements for effective pre-training.

Claimed Contributions

ASK-LLM sampling technique

The authors propose ASK-LLM, a data selection method that leverages instruction-tuned LLMs to directly assess training example quality through zero-shot reasoning. This technique consistently outperforms other data curation routines and enables training models that exceed full-dataset performance while using only a fraction of the data.
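The idea of scoring an example by asking an instruction-tuned LLM can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the prompt wording and the `yes_logprob_fn` callback (assumed to return the log-probability of answering "yes" to the prompt) are stand-ins for whatever scoring model and API one actually uses.

```python
import math

# Illustrative prompt; the paper's exact wording may differ.
PROMPT_TEMPLATE = (
    "###\n{example}\n###\n"
    "Does the previous paragraph contain informative signal for "
    "pre-training a large language model? Answer yes or no."
)

def ask_llm_score(example: str, yes_logprob_fn) -> float:
    """Quality score = P('yes') under the scoring LLM for this example."""
    prompt = PROMPT_TEMPLATE.format(example=example)
    return math.exp(yes_logprob_fn(prompt))

def select_top_fraction(examples, yes_logprob_fn, fraction=0.1):
    """Keep the top `fraction` of examples ranked by ASK-LLM-style score."""
    ranked = sorted(examples,
                    key=lambda ex: ask_llm_score(ex, yes_logprob_fn),
                    reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k]
```

In practice `yes_logprob_fn` would wrap a call to a served instruction-tuned model; ranking by P("yes") gives a continuous quality signal rather than a hard yes/no filter.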

10 retrieved papers
DENSITY sampling technique

The authors introduce DENSITY, a coverage-maximizing sampler that estimates local density in the embedding space using kernel sums. This method aims to maximize topic coverage by downsampling redundant high-density regions and boosting under-represented portions of the input domain.
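A minimal sketch of this kind of sampler, assuming examples are already embedded as vectors: estimate each point's local density as a Gaussian kernel sum over the dataset, then sample with probability inversely proportional to density. The quadratic pairwise-distance computation here is for illustration only; a scalable implementation would approximate the kernel sums (e.g., with sketching), and the bandwidth choice is an assumption.

```python
import numpy as np

def kernel_density(embeddings: np.ndarray, bandwidth: float = 1.0) -> np.ndarray:
    """Kernel-sum density per point: sum_j exp(-||x_i - x_j||^2 / (2 h^2))."""
    sq_dists = ((embeddings[:, None, :] - embeddings[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * bandwidth ** 2)).sum(axis=1)

def density_sample(embeddings: np.ndarray, sample_size: int,
                   bandwidth: float = 1.0, rng=None) -> np.ndarray:
    """Draw indices with probability ~ 1/density, down-weighting redundant
    high-density regions and boosting under-represented ones."""
    rng = np.random.default_rng(rng)
    density = kernel_density(embeddings, bandwidth)
    weights = 1.0 / density
    probs = weights / weights.sum()
    return rng.choice(len(embeddings), size=sample_size, replace=False, p=probs)
```

Near-duplicate points sit in high-density regions and so get low selection probability, which is exactly the coverage-maximizing behavior the contribution describes.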

10 retrieved papers
Large-scale empirical benchmark of data curation techniques

The authors conduct an extensive comparative study testing 22 data curation techniques across hundreds of pre-training runs and over a thousand fine-tuning evaluations. This exhaustive benchmark provides new insights into the roles of coverage, quality, and sampling cost in LLM pre-training.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

