Abstract:

Neural scaling laws are widely used for performance projection and resource planning, yet their sensitivity to data quality interventions remains poorly understood. We present an empirical study of how interventions—deduplication, heuristic filtering, and LLM-guided rewriting—reshape scaling behavior in large language model training. Using QualityPajama, a suite of 23 systematically filtered and synthetic datasets, we train over 2,000 models (100M–8B parameters, 100M–200B tokens) to measure how data quality affects scaling-law parameters and compute-optimal design decisions. Our results show that data interventions reshape scaling dynamics in ways not captured by current theory: a single intervention can simultaneously shift exponents, coefficients, and constants in conflicting directions, exerting opposing forces on loss (for example, improving the constant term while worsening the exponents). Strategies that appear optimal at small scale can reverse at larger scale, and compute-optimal token–parameter ratios can vary by orders of magnitude depending on the intervention. These findings demonstrate that data curation and scaling strategy are deeply intertwined, and that evaluating interventions only at fixed scales can lead to misleading conclusions. We recommend evaluating interventions through their full scaling trajectories using scaling-law projections.
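The small-to-large-scale reversal can be illustrated with a toy saturating power law, L(N) = E + A·N^(−α). All constants below are hypothetical, chosen only to show the mechanism (they are not values from the paper): the "filtered" setting has a much better coefficient A but a flatter exponent α, so it wins at small model sizes and loses at large ones.

```python
def loss(n_params, E, A, alpha):
    """Saturating power law: L(N) = E + A * N**(-alpha)."""
    return E + A * n_params ** (-alpha)

# Hypothetical fits (illustrative only, not from the paper):
baseline = dict(E=1.70, A=400.0, alpha=0.34)
# Better coefficient, flatter exponent:
filtered = dict(E=1.70, A=115.0, alpha=0.28)

for n in (1e7, 1e8, 1e10, 1e11):
    lb, lf = loss(n, **baseline), loss(n, **filtered)
    winner = "filtered" if lf < lb else "baseline"
    print(f"N={n:.0e}  baseline={lb:.3f}  filtered={lf:.3f}  -> {winner}")
```

With these constants the crossover sits near N ≈ 1e9: "filtered" wins below it and "baseline" wins above it, which is why a comparison at a single fixed scale can rank the interventions either way.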

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper investigates how data quality interventions—deduplication, heuristic filtering, and LLM-guided rewriting—reshape neural scaling laws through extensive empirical study. It resides in the 'Quality-Aware Scaling Law Extensions' leaf, which contains only three papers total, including this work and two siblings (Farseer and Data Quality Scaling). This represents a relatively sparse research direction within the broader taxonomy of 50 papers, suggesting the specific focus on decomposing scaling law parameters under quality interventions remains underexplored compared to adjacent areas like heuristic filtering or compute optimization.

The taxonomy reveals neighboring research in 'Specialized Scaling Frameworks' (multi-domain and constrained-resource contexts) and 'Scaling Law Analysis and Methodology' (fitting and validation techniques). The paper's emphasis on how interventions simultaneously affect exponents, coefficients, and constants distinguishes it from sibling work: Farseer models quality effects on downstream performance, while Data Quality Scaling examines quality metrics more abstractly. The scope note for this leaf explicitly includes 'incorporating data quality as an explicit parameter,' which aligns with the paper's decomposition approach, though the exclude note clarifies it differs from domain-specific or multi-source frameworks.

Among 30 candidates examined across the three contributions, none was found to clearly refute any claim. For the QualityPajama Benchmark, 10 candidates were examined with zero refutable overlaps; the Full Scaling Law Decomposition and Data-Aware Scaling Strategies contributions were likewise each compared against 10 candidates, with no refutable overlaps found. This suggests that, within the limited search scope, the specific combination of systematic quality interventions, large-scale model training (2,000+ models), and full parameter decomposition appears distinctive. However, the analysis explicitly notes this is based on top-K semantic search plus citation expansion, not exhaustive coverage.

Given the sparse population of the quality-aware scaling law leaf and the absence of refuting prior work among 30 examined candidates, the contributions appear to occupy relatively novel ground within the analyzed scope. The scale of empirical validation (23 datasets, 2,000 models) and the focus on conflicting directional effects across scaling parameters distinguish this work from its immediate siblings, though the limited search scope means potentially relevant work outside the top-30 semantic matches may exist.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: understanding how interventions on text quality affect neural scaling laws. The field has organized itself into several major branches that collectively address data quality, theoretical formulations, resource allocation, augmentation strategies, domain-specific applications, training infrastructure, and specialized methodological studies.

Data Quality Characterization and Filtering Methods explore how to measure and improve dataset quality through filtering techniques, with works like General Purpose Filtering[1] and Scalingfilter[3] developing principled approaches to curate training corpora. Scaling Law Theory and Formulation investigates the mathematical relationships between model performance, data size, and compute, extending classical power-law formulations to account for quality dimensions; representative studies include Farseer[2] and Data Quality Scaling[4], which incorporate quality metrics into predictive frameworks. Compute-Optimal Resource Allocation examines trade-offs in distributing computational budgets across model size and training tokens, while Data Augmentation and Synthesis considers synthetic data generation as a lever for scaling. Domain-Specific Applications and Datasets apply these principles to specialized contexts such as vision-language tasks or scientific domains, and Training Infrastructure and Methodology addresses practical implementation challenges at scale.

A particularly active line of inquiry centers on quality-aware extensions to classical scaling laws, where researchers seek to move beyond simple data-volume metrics to incorporate notions of data cleanliness, diversity, and relevance. Text Quality Scaling[0] sits squarely within this branch, examining how targeted quality interventions shift the scaling behavior predicted by traditional formulations.
This work shares thematic overlap with Farseer[2], which also models quality effects on downstream performance, and with Data Filtering Science[5], which systematically studies filtering strategies. Compared to Revisiting Scaling Laws[15], which re-examines foundational assumptions in scaling theory, Text Quality Scaling[0] emphasizes empirical interventions and their measurable impact on the scaling exponent. A central open question across these studies is whether quality improvements can substitute for raw data volume or whether they interact multiplicatively, and how to predict the returns from quality-focused curation at different scales.

Claimed Contributions

QualityPajama Benchmark

The authors present QualityPajama, a benchmark consisting of 23 systematically curated datasets derived from Common Crawl. Each dataset represents a different text quality intervention (including filtering, deduplication, and synthetic curation) to enable controlled study of how data quality affects scaling laws in large language models.

10 retrieved papers
Full Scaling Law Decomposition

The authors conduct the first comprehensive empirical analysis showing that text quality interventions reshape all components of neural scaling laws (exponents, coefficients, and asymptotic loss terms), not just the exponents as prior work assumed. They demonstrate that stronger filtering produces conflicting shifts across different parameters rather than uniformly favorable changes.
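For reference, the components named here map onto the standard Chinchilla-style parameterization, a common form in this literature (the paper's exact functional form may differ):

```latex
L(N, D) \;=\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}}
```

where E is the asymptotic (irreducible) loss term, A and B are the coefficients, and α and β are the exponents governing returns to parameters N and tokens D. The contribution's claim is that interventions move all of these quantities at once, not only α and β.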

10 retrieved papers
Data-Aware Scaling Strategies

The authors demonstrate that data quality fundamentally affects compute-optimal design decisions in LLM training. They show that different quality interventions can shift the optimal number of parameters, training tokens, and their ratio by orders of magnitude, revealing that scaling strategies must explicitly account for data quality rather than treating it as a secondary concern.
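The link between fitted scaling-law constants and compute-optimal allocation can be sketched with the Chinchilla-style closed form: minimizing A·N^(−α) + B·D^(−β) under the budget C ≈ 6·N·D gives N* ∝ C^(β/(α+β)). The baseline constants below loosely follow the published Chinchilla fit; the "filtered" variant is hypothetical, invented only to show how a shifted data exponent β moves the optimal tokens-per-parameter ratio.

```python
def compute_optimal(C, A, B, alpha, beta):
    """Split a FLOP budget C ~ 6*N*D to minimize A*N**-alpha + B*D**-beta
    (the constant term E drops out of the argmin). Substituting D = C/(6*N)
    and setting the derivative in N to zero yields the closed form below."""
    G = (alpha * A) / (beta * B * 6 ** beta)
    N = G ** (1 / (alpha + beta)) * C ** (beta / (alpha + beta))
    D = C / (6 * N)
    return N, D

C = 1e21  # training budget in FLOPs
fits = {
    # (A, B, alpha, beta): baseline loosely follows the Chinchilla fit;
    # "filtered" is a hypothetical intervention with a steeper data exponent.
    "baseline": (406.4, 410.7, 0.34, 0.28),
    "filtered": (406.4, 410.7, 0.34, 0.36),
}
for name, (A, B, a, b) in fits.items():
    N, D = compute_optimal(C, A, B, a, b)
    print(f"{name:8s}  N*={N:.2e}  D*={D:.2e}  tokens/param={D / N:.2f}")
```

Even this modest change in β moves the optimal tokens-per-parameter ratio by roughly two orders of magnitude at the same budget, which is the kind of shift the contribution describes.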

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

QualityPajama Benchmark

The authors present QualityPajama, a benchmark consisting of 23 systematically curated datasets derived from Common Crawl. Each dataset represents a different text quality intervention (including filtering, deduplication, and synthetic curation) to enable controlled study of how data quality affects scaling laws in large language models.

Contribution

Full Scaling Law Decomposition

The authors conduct the first comprehensive empirical analysis showing that text quality interventions reshape all components of neural scaling laws (exponents, coefficients, and asymptotic loss terms), not just the exponents as prior work assumed. They demonstrate that stronger filtering produces conflicting shifts across different parameters rather than uniformly favorable changes.

Contribution

Data-Aware Scaling Strategies

The authors demonstrate that data quality fundamentally affects compute-optimal design decisions in LLM training. They show that different quality interventions can shift the optimal number of parameters, training tokens, and their ratio by orders of magnitude, revealing that scaling strategies must explicitly account for data quality rather than treating it as a secondary concern.