Scaling Laws Revisited: Modeling the Role of Data Quality in Language Model Pretraining

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: quality-aware scaling laws, scaling laws, data quality, LLM pretraining
Abstract:

Scaling laws for language model training traditionally characterize how performance scales with model size and dataset volume. Prior work has explored architecture variants and data treatments such as dataset filtering and noise injection in language model pretraining; however, these studies have not formalized data quality within a principled scaling law. We introduce a dimensionless data-quality parameter Q and propose a quality-aware scaling law that extends the Chinchilla framework to predict loss as a joint function of model size, data volume, and data quality. The law is motivated by an effective-sample-size and information-theoretic view of noisy or redundant corpora, and it admits two practical estimators for Q: (i) a corruption-rate proxy and (ii) a deficiency measure. Through synthetic experiments in neural machine translation and autoregressive modeling, in which we systematically control data quality via multiple levels of noise injection, we show that loss scales predictably with data quality and that higher-quality data can substantially reduce model size and hence compute requirements. Our results demonstrate a sublinear decay of effective data with quality and robustness to moderate data corruption; out-of-sample evaluations further validate the predictive form of the law. Unlike prior empirical analyses, our work establishes an explicit, generalizable law for data quality, offering concrete guidance for balancing data curation effort and model scale in large-scale pretraining.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a dimensionless quality parameter Q into the Chinchilla scaling law framework, enabling joint prediction of loss from model size, data volume, and data quality. It resides in the 'Quality-Parameterized Scaling Law Formulations' leaf, which contains only three papers total. This leaf sits within the broader 'Theoretical Frameworks for Data Quality in Scaling Laws' branch, indicating a relatively sparse research direction focused on formal mathematical integration of quality into scaling equations rather than empirical observation or heuristic filtering.

The taxonomy reveals neighboring theoretical work in 'Information-Theoretic and Capacity-Based Models' (two papers) and 'Loss-to-Loss Transfer and Cross-Domain Scaling' (one paper), alongside a much larger empirical branch with controlled perturbation experiments and observational studies. The paper bridges theory and experiment by grounding its quality parameter in effective sample size and information theory, then validating through controlled noise injection. This positions it at the intersection of formal modeling and systematic empirical validation, distinct from purely observational scaling studies or data curation pipelines in sibling branches.

Among thirty candidates examined, none clearly refute the three core contributions. The quality-aware scaling law formulation (ten candidates, zero refutations), effective sample size derivation and estimators (ten candidates, zero refutations), and controlled experimental validation across NMT and CLM tasks (ten candidates, zero refutations) all appear novel within this limited search scope. The sibling papers in the same taxonomy leaf likely address related parameterizations but do not overlap directly with the specific Q-parameter formulation and dual estimator approach presented here.

Based on top-thirty semantic matches and the sparse taxonomy leaf (three papers), the work appears to occupy a relatively unexplored niche within scaling law theory. The analysis does not cover exhaustive citation networks or domain-specific venues, so additional overlapping work may exist outside this search scope. The controlled experimental design and dual quality estimators strengthen the contribution beyond purely theoretical formulations.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Modeling data quality in language model pretraining scaling laws. The field has evolved to recognize that classical scaling laws—which relate model performance to compute, parameters, and dataset size—must be refined to account for data quality variations. The taxonomy reflects this maturation through several complementary branches. Theoretical Frameworks for Data Quality in Scaling Laws develop formal models that parameterize quality alongside quantity, as seen in Data Quality Scaling[0] and Quality Data Scaling[18]. Empirical Scaling Analysis Under Quality Variations investigates how performance degrades or improves under different quality regimes, including work on repeated data and diversity coefficients. Data Quality Assessment and Measurement provides tools like Meta-rater[12] and QuRating[41] to quantify quality at scale. Data Curation and Selection Approaches, exemplified by Scalingfilter[3] and Dataman[1], focus on filtering and ranking strategies to maximize training efficiency. Synthetic Data Generation and Augmentation explores whether generated data can supplement or replace web-scraped corpora, while Specialized Contexts and Constraints address domain-specific or resource-limited settings. Finally, Architectural and Methodological Factors examine how model design and training procedures interact with data characteristics.

Recent work has intensified around the interplay between quality-aware filtering and theoretical scaling predictions. A central tension is whether aggressive curation—removing lower-quality examples—can offset shrinking dataset sizes or whether sheer volume remains paramount. Data Quality Scaling[0] sits squarely within the theoretical branch, proposing explicit quality parameters in scaling law formulations, closely aligned with Quality Data Scaling[18] and Data Quality Training[8], which similarly integrate quality metrics into predictive models. In contrast, empirical studies like Scalingfilter[3] and Dataman[1] emphasize practical filtering pipelines, revealing that quality gains can sometimes compensate for reduced token counts but also highlighting diminishing returns.

Open questions persist: how to define quality universally across domains, whether synthetic data (BeyondWeb Synthetic[27], Synthetic Data Scaling[46]) can truly match organic text distributions, and how architectural choices modulate sensitivity to data imperfections. Data Quality Scaling[0] contributes by formalizing these trade-offs, offering a principled lens to predict performance under varying quality regimes.

Claimed Contributions

Quality-aware scaling law with dimensionless quality parameter Q

The authors extend the Chinchilla scaling law by introducing a dimensionless data-quality parameter Q that predicts pretraining loss as a joint function of model size, data volume, and data quality. The law takes the form L(N, D, Q) = A/N^α + B/(D^β Q^γ) + E, explicitly modeling how data quality affects model performance.
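The functional form is straightforward to compute. Below is a minimal sketch of the law, assuming Q ∈ (0, 1]; the constants A, B, E, α, and β are borrowed from the published Chinchilla fit (Hoffmann et al., 2022) purely for illustration, and γ = 0.2 is a hypothetical placeholder, not a value fitted in this paper.

```python
def quality_aware_loss(N, D, Q, A, B, E, alpha, beta, gamma):
    """Quality-aware scaling law: L(N, D, Q) = A/N^alpha + B/(D^beta * Q^gamma) + E.

    N: model parameters, D: training tokens, Q: dimensionless quality in (0, 1].
    At Q = 1 the expression reduces to the standard Chinchilla form.
    """
    return A / N**alpha + B / (D**beta * Q**gamma) + E

# Illustrative constants: A, B, E, alpha, beta follow the published Chinchilla
# fit (Hoffmann et al., 2022); gamma = 0.2 is a hypothetical placeholder.
A, B, E = 406.4, 410.7, 1.69
alpha, beta, gamma = 0.34, 0.28, 0.20

for Q in (1.0, 0.8, 0.5):
    loss = quality_aware_loss(1e9, 2e10, Q, A, B, E, alpha, beta, gamma)
    print(f"Q = {Q:.1f} -> predicted loss {loss:.3f}")
```

Note that lowering Q inflates only the data term, matching the intuition that corrupted or redundant tokens behave like a smaller clean corpus.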

Candidate papers compared: 10
Effective sample size derivation and quality estimators

The authors provide a theoretical foundation for the quality parameter Q by deriving it from effective sample size and information-theoretic perspectives. They introduce two practical estimators for Q: a corruption rate proxy and a data deficiency measure, showing that under natural assumptions these lead to the form D_eff = D g(Q) with g(Q) ≈ Q^γ.
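As a concrete reading of the estimators, the sketch below assumes the corruption-rate proxy equates Q with the clean fraction of the corpus; both this mapping and the exponent γ = 0.2 are illustrative assumptions, not the paper's fitted quantities.

```python
def q_from_corruption(corruption_rate):
    """Corruption-rate proxy for Q (assumed form: the clean fraction 1 - p)."""
    return 1.0 - corruption_rate

def effective_data(D, Q, gamma):
    """Effective sample size D_eff = D * g(Q), with g(Q) ~= Q**gamma."""
    return D * Q**gamma

# Example: a 10B-token corpus with 20% of tokens corrupted and a
# hypothetical exponent gamma = 0.2.
Q = q_from_corruption(0.20)                                  # Q = 0.8
print(f"D_eff = {effective_data(1e10, Q, 0.2):.3e} tokens")  # ~9.56e9
```

Because γ < 1 in this sketch, D_eff shrinks sublinearly as quality drops, consistent with the sublinear decay reported in the abstract.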

Candidate papers compared: 10
Controlled experimental validation across NMT and CLM tasks

The authors perform systematic experiments on neural machine translation and causal language modeling tasks with multiple levels of synthetic noise injection to validate the quality-aware scaling law. The experiments demonstrate that loss scales predictably with the quality parameter and that higher-quality data can compensate for reduced model size, which is particularly relevant for specialized domain applications.
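For intuition about how the noise levels might be generated, here is a minimal corpus-corruption sketch using uniform token substitution; the function and the specific noise scheme are illustrative assumptions, as the paper's exact corruption types for NMT and CLM may differ.

```python
import random

def inject_noise(tokens, rate, vocab, seed=0):
    """Replace a fraction `rate` of token positions with random vocabulary items."""
    rng = random.Random(seed)
    return [rng.choice(vocab) if rng.random() < rate else tok for tok in tokens]

# Sweep corruption levels; each level yields one quality-controlled training
# set, i.e. one value of Q at which to fit the scaling law.
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
sentence = "the cat sat on the mat".split()
for rate in (0.0, 0.1, 0.3, 0.5):
    print(f"rate={rate:.1f}:", " ".join(inject_noise(sentence, rate, vocab)))
```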

Candidate papers compared: 10

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Quality-aware scaling law with dimensionless quality parameter Q

Summarized above under Claimed Contributions; none of the ten retrieved candidate papers refutes this contribution.

Contribution

Effective sample size derivation and quality estimators

Summarized above under Claimed Contributions; none of the ten retrieved candidate papers refutes this contribution.

Contribution

Controlled experimental validation across NMT and CLM tasks

Summarized above under Claimed Contributions; none of the ten retrieved candidate papers refutes this contribution.