Scaling Laws Revisited: Modeling the Role of Data Quality in Language Model Pretraining

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: quality-aware scaling laws, scaling laws, data quality, LLM pretraining
Abstract:

Scaling laws for language model training traditionally characterize how performance scales with model size and dataset volume. Prior work has explored architecture variants and data treatments such as dataset filtering and noise injection in language model pretraining; however, these studies have not formalized data quality within a principled scaling law. We introduce a dimensionless data-quality parameter Q and propose a quality-aware scaling law that extends the Chinchilla framework to predict loss as a joint function of model size, data volume, and data quality. The law is motivated by an effective-sample-size and information-theoretic view of noisy or redundant corpora, and it admits two practical estimators for Q: (i) a corruption-rate proxy and (ii) a deficiency measure. Through synthetic experiments in neural machine translation and autoregressive modeling, in which we systematically control data quality via multiple levels of noise injection, we show that loss scales predictably with data quality and that higher-quality data can substantially reduce model size and hence compute requirements. Our results demonstrate a sublinear decay of effective data with quality and robustness to moderate data corruption; out-of-sample evaluations further validate the predictive form of the law. Unlike prior empirical analyses, our work establishes an explicit, generalizable law for data quality, offering concrete guidance for balancing data curation effort and model scale in large-scale pretraining.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a dimensionless quality parameter Q into the Chinchilla scaling law framework, enabling joint prediction of loss from model size, data volume, and data quality. It resides in the 'Quality-Parameterized Scaling Law Formulations' leaf, which contains only three papers total. This leaf sits within the broader 'Theoretical Frameworks for Data Quality in Scaling Laws' branch, indicating a relatively sparse research direction focused on formal mathematical integration of quality into scaling equations rather than empirical observation or heuristic filtering.

The taxonomy reveals neighboring theoretical work in 'Information-Theoretic and Capacity-Based Models' (two papers) and 'Loss-to-Loss Transfer and Cross-Domain Scaling' (one paper), alongside a much larger empirical branch with controlled perturbation experiments and observational studies. The paper bridges theory and experiment by grounding its quality parameter in effective sample size and information theory, then validating through controlled noise injection. This positions it at the intersection of formal modeling and systematic empirical validation, distinct from purely observational scaling studies or data curation pipelines in sibling branches.

Among thirty candidates examined, none clearly refute the three core contributions. The quality-aware scaling law formulation (ten candidates, zero refutations), effective sample size derivation and estimators (ten candidates, zero refutations), and controlled experimental validation across NMT and CLM tasks (ten candidates, zero refutations) all appear novel within this limited search scope. The sibling papers in the same taxonomy leaf likely address related parameterizations but do not overlap directly with the specific Q-parameter formulation and dual estimator approach presented here.

Based on top-thirty semantic matches and the sparse taxonomy leaf (three papers), the work appears to occupy a relatively unexplored niche within scaling law theory. The analysis does not cover exhaustive citation networks or domain-specific venues, so additional overlapping work may exist outside this search scope. The controlled experimental design and dual quality estimators strengthen the contribution beyond purely theoretical formulations.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Modeling data quality in language model pretraining scaling laws. The field has evolved to recognize that classical scaling laws—which relate model performance to compute, parameters, and dataset size—must be refined to account for data quality variations. The taxonomy reflects this maturation through several complementary branches. Theoretical Frameworks for Data Quality in Scaling Laws develop formal models that parameterize quality alongside quantity, as seen in Data Quality Scaling[0] and Quality Data Scaling[18]. Empirical Scaling Analysis Under Quality Variations investigates how performance degrades or improves under different quality regimes, including work on repeated data and diversity coefficients. Data Quality Assessment and Measurement provides tools like Meta-rater[12] and QuRating[41] to quantify quality at scale. Data Curation and Selection Approaches, exemplified by Scalingfilter[3] and Dataman[1], focus on filtering and ranking strategies to maximize training efficiency. Synthetic Data Generation and Augmentation explores whether generated data can supplement or replace web-scraped corpora, while Specialized Contexts and Constraints address domain-specific or resource-limited settings. Finally, Architectural and Methodological Factors examine how model design and training procedures interact with data characteristics.

Recent work has intensified around the interplay between quality-aware filtering and theoretical scaling predictions. A central tension is whether aggressive curation—removing lower-quality examples—can offset shrinking dataset sizes or whether sheer volume remains paramount. Data Quality Scaling[0] sits squarely within the theoretical branch, proposing explicit quality parameters in scaling law formulations, closely aligned with Quality Data Scaling[18] and Data Quality Training[8], which similarly integrate quality metrics into predictive models. In contrast, empirical studies like Scalingfilter[3] and Dataman[1] emphasize practical filtering pipelines, revealing that quality gains can sometimes compensate for reduced token counts but also highlighting diminishing returns.

Open questions persist: how to define quality universally across domains, whether synthetic data (BeyondWeb Synthetic[27], Synthetic Data Scaling[46]) can truly match organic text distributions, and how architectural choices modulate sensitivity to data imperfections. Data Quality Scaling[0] contributes by formalizing these trade-offs, offering a principled lens to predict performance under varying quality regimes.

Claimed Contributions

Quality-aware scaling law with dimensionless quality parameter Q

The authors extend the Chinchilla scaling law by introducing a dimensionless data-quality parameter Q that predicts pretraining loss as a joint function of model size, data volume, and data quality. The law takes the form L(N, D, Q) = A/N^α + B/(D^β Q^γ) + E, explicitly modeling how data quality affects model performance.
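The functional form is straightforward to compute. Below is a minimal sketch of the law, assuming Q ∈ (0, 1]; the constants A, B, E, α, and β are borrowed from the published Chinchilla fit (Hoffmann et al., 2022) purely for illustration, and γ = 0.2 is a hypothetical placeholder, not a value fitted in this paper.

```python
def quality_aware_loss(N, D, Q, A, B, E, alpha, beta, gamma):
    """Quality-aware scaling law: L(N, D, Q) = A/N^alpha + B/(D^beta * Q^gamma) + E.

    N: model parameters, D: training tokens, Q: dimensionless quality in (0, 1].
    At Q = 1 the expression reduces to the standard Chinchilla form.
    """
    return A / N**alpha + B / (D**beta * Q**gamma) + E

# Illustrative constants: A, B, E, alpha, beta follow the published Chinchilla
# fit (Hoffmann et al., 2022); gamma = 0.2 is a hypothetical placeholder.
A, B, E = 406.4, 410.7, 1.69
alpha, beta, gamma = 0.34, 0.28, 0.20

for Q in (1.0, 0.8, 0.5):
    loss = quality_aware_loss(1e9, 2e10, Q, A, B, E, alpha, beta, gamma)
    print(f"Q = {Q:.1f} -> predicted loss {loss:.3f}")
```

Note that lowering Q inflates only the data term, matching the intuition that corrupted or redundant tokens behave like a smaller clean corpus.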

Candidate papers compared: 10
Effective sample size derivation and quality estimators

The authors provide a theoretical foundation for the quality parameter Q by deriving it from effective sample size and information-theoretic perspectives. They introduce two practical estimators for Q: a corruption rate proxy and a data deficiency measure, showing that under natural assumptions these lead to the form D_eff = D g(Q) with g(Q) ≈ Q^γ.
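As a concrete reading of the estimators, the sketch below assumes the corruption-rate proxy equates Q with the clean fraction of the corpus; both this mapping and the exponent γ = 0.2 are illustrative assumptions, not the paper's fitted quantities.

```python
def q_from_corruption(corruption_rate):
    """Corruption-rate proxy for Q (assumed form: the clean fraction 1 - p)."""
    return 1.0 - corruption_rate

def effective_data(D, Q, gamma):
    """Effective sample size D_eff = D * g(Q), with g(Q) ~= Q**gamma."""
    return D * Q**gamma

# Example: a 10B-token corpus with 20% of tokens corrupted and a
# hypothetical exponent gamma = 0.2.
Q = q_from_corruption(0.20)                                  # Q = 0.8
print(f"D_eff = {effective_data(1e10, Q, 0.2):.3e} tokens")  # ~9.56e9
```

Because γ < 1 in this sketch, D_eff shrinks sublinearly as quality drops, consistent with the sublinear decay reported in the abstract.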

Candidate papers compared: 10
Controlled experimental validation across NMT and CLM tasks

The authors perform systematic experiments on neural machine translation and causal language modeling tasks with multiple levels of synthetic noise injection to validate the quality-aware scaling law. The experiments demonstrate that loss scales predictably with the quality parameter and that higher-quality data can compensate for reduced model size, which is particularly relevant for specialized domain applications.
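For intuition about how the noise levels might be generated, here is a minimal corpus-corruption sketch using uniform token substitution; the function and the specific noise scheme are illustrative assumptions, as the paper's exact corruption types for NMT and CLM may differ.

```python
import random

def inject_noise(tokens, rate, vocab, seed=0):
    """Replace a fraction `rate` of token positions with random vocabulary items."""
    rng = random.Random(seed)
    return [rng.choice(vocab) if rng.random() < rate else tok for tok in tokens]

# Sweep corruption levels; each level yields one quality-controlled training
# set, i.e. one value of Q at which to fit the scaling law.
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
sentence = "the cat sat on the mat".split()
for rate in (0.0, 0.1, 0.3, 0.5):
    print(f"rate={rate:.1f}:", " ".join(inject_noise(sentence, rate, vocab)))
```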

Candidate papers compared: 10

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Quality-aware scaling law with dimensionless quality parameter Q

Summarized above under Claimed Contributions; none of the ten retrieved candidate papers refutes this contribution.

Contribution

Effective sample size derivation and quality estimators

Summarized above under Claimed Contributions; none of the ten retrieved candidate papers refutes this contribution.

Contribution

Controlled experimental validation across NMT and CLM tasks

Summarized above under Claimed Contributions; none of the ten retrieved candidate papers refutes this contribution.