How Text Quality Interventions Reshape Neural Scaling Laws for LLMs: Empirical Study
Overview
Overall Novelty Assessment
The paper investigates how data quality interventions—deduplication, heuristic filtering, and LLM-guided rewriting—reshape neural scaling laws through extensive empirical study. It resides in the 'Quality-Aware Scaling Law Extensions' leaf, which contains only three papers total, including this work and two siblings (Farseer and Data Quality Scaling). This represents a relatively sparse research direction within the broader taxonomy of 50 papers, suggesting the specific focus on decomposing scaling law parameters under quality interventions remains underexplored compared to adjacent areas like heuristic filtering or compute optimization.
The taxonomy reveals neighboring research in 'Specialized Scaling Frameworks' (multi-domain and constrained-resource contexts) and 'Scaling Law Analysis and Methodology' (fitting and validation techniques). The paper's emphasis on how interventions simultaneously affect exponents, coefficients, and constants distinguishes it from sibling work: Farseer models quality effects on downstream performance, while Data Quality Scaling examines quality metrics more abstractly. The scope note for this leaf explicitly includes 'incorporating data quality as an explicit parameter,' which aligns with the paper's decomposition approach, while the exclude note distinguishes it from domain-specific and multi-source frameworks.
Among the 30 candidates examined across the three contributions, none was found to clearly refute any claim. The QualityPajama Benchmark contribution was checked against 10 candidates with zero refutable overlaps; Full Scaling Law Decomposition and Data-Aware Scaling Strategies were each checked against 10 candidates with the same outcome. This suggests that, within the limited search scope, the specific combination of systematic quality interventions, large-scale model training (2,000+ models), and full parameter decomposition appears distinctive. However, the analysis explicitly notes that it rests on top-K semantic search plus citation expansion, not exhaustive coverage.
Given the sparse population of the quality-aware scaling law leaf and the absence of refuting prior work among the 30 examined candidates, the contributions appear to occupy relatively novel ground within the analyzed scope. The scale of empirical validation (23 datasets, 2,000+ models) and the focus on conflicting directional effects across scaling parameters distinguish this work from its immediate siblings, though the limited search scope means potentially relevant work outside the top-30 semantic matches may exist.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors present QualityPajama, a benchmark consisting of 23 systematically curated datasets derived from Common Crawl. Each dataset represents a different text quality intervention (including filtering, deduplication, and synthetic curation) to enable controlled study of how data quality affects scaling laws in large language models.
The authors conduct the first comprehensive empirical analysis showing that text quality interventions reshape all components of neural scaling laws (exponents, coefficients, and asymptotic loss terms), not just the exponents as prior work assumed. They demonstrate that stronger filtering produces conflicting shifts across different parameters rather than uniformly favorable changes.
The authors demonstrate that data quality fundamentally affects compute-optimal design decisions in LLM training. They show that different quality interventions can shift the optimal number of parameters, training tokens, and their ratio by orders of magnitude, revealing that scaling strategies must explicitly account for data quality rather than treating it as a secondary concern.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[2] Farseer: A Refined Scaling Law in Large Language Models PDF
[15] Revisiting scaling laws for language models: The role of data quality and training strategies PDF
Contribution Analysis
Detailed comparisons for each claimed contribution
QualityPajama Benchmark
The authors present QualityPajama, a benchmark consisting of 23 systematically curated datasets derived from Common Crawl. Each dataset represents a different text quality intervention (including filtering, deduplication, and synthetic curation) to enable controlled study of how data quality affects scaling laws in large language models.
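To make the interventions concrete, here is a minimal sketch of two of the intervention types the benchmark covers, exact deduplication and heuristic filtering; the function names and thresholds are illustrative assumptions, not the paper's actual curation pipeline:

```python
import hashlib

def exact_dedup(docs):
    """Drop exact duplicate documents by content hash."""
    seen, kept = set(), []
    for d in docs:
        h = hashlib.sha256(d.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(d)
    return kept

def heuristic_filter(docs, min_words=5, max_symbol_ratio=0.3):
    """Keep documents passing simple Gopher-style quality heuristics:
    a minimum word count and a cap on the non-alphanumeric character ratio."""
    kept = []
    for d in docs:
        if len(d.split()) < min_words:
            continue  # too short to be useful training text
        symbols = sum(1 for ch in d if not (ch.isalnum() or ch.isspace()))
        if symbols / max(len(d), 1) > max_symbol_ratio:
            continue  # likely markup or boilerplate debris
        kept.append(d)
    return kept
```

Each QualityPajama dataset corresponds to one such intervention (or a combination) applied to the same Common Crawl source, which is what makes cross-dataset scaling comparisons controlled.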
[3] Scalingfilter: Assessing data quality through inverse utilization of scaling laws PDF
[24] Data curation via joint example selection further accelerates multimodal learning PDF
[63] Densing law of LLMs PDF
[64] CopyBench: Measuring Literal and Non-Literal Reproduction of Copyright-Protected Text in Language Model Generation PDF
[65] Exploring training and inference scaling laws in generative retrieval PDF
[66] Exponential scaling of factual inconsistency in data-to-text generation with fine-tuned LLMs PDF
[67] OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization PDF
[68] Towards trustable language models: Investigating information quality of large language models PDF
[69] LLM-generated natural language meets scaling laws: New explorations and data augmentation methods PDF
[70] Scaling Laws for Downstream Task Performance in Machine Translation PDF
Full Scaling Law Decomposition
The authors conduct the first comprehensive empirical analysis showing that text quality interventions reshape all components of neural scaling laws (exponents, coefficients, and asymptotic loss terms), not just the exponents as prior work assumed. They demonstrate that stronger filtering produces conflicting shifts across different parameters rather than uniformly favorable changes.
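For concreteness, the components in question can be written in the standard saturating power-law form (the paper's exact parameterization is assumed here to follow this common convention):

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

where N is the parameter count, D the number of training tokens, E the asymptotic (irreducible) loss, A and B the coefficients, and α, β the exponents. The contribution's claim is that quality interventions shift all five fitted quantities, so a stronger filter might, for instance, lower E while worsening β, yielding the conflicting shifts described above.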
[3] Scalingfilter: Assessing data quality through inverse utilization of scaling laws PDF
[7] Datacomp: In search of the next generation of multimodal datasets PDF
[9] No "zero-shot" without exponential data: Pretraining concept frequency determines multimodal model performance PDF
[18] Not-just-scaling laws: Towards a better understanding of the downstream impact of language model design decisions PDF
[19] (Mis)fitting scaling laws: A survey of scaling law fitting techniques in deep learning PDF
[24] Data curation via joint example selection further accelerates multimodal learning PDF
[47] Scaling Laws for Data Filtering—Data Curation Cannot be Compute Agnostic PDF
[60] Scaling laws for reward model overoptimization in direct alignment algorithms PDF
[61] Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision PDF
[62] Sub-scaling laws: on the role of data density and training strategies in LLMs PDF
Data-Aware Scaling Strategies
The authors demonstrate that data quality fundamentally affects compute-optimal design decisions in LLM training. They show that different quality interventions can shift the optimal number of parameters, training tokens, and their ratio by orders of magnitude, revealing that scaling strategies must explicitly account for data quality rather than treating it as a secondary concern.
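To illustrate how shifts in fitted scaling-law parameters propagate into compute-optimal allocations, the following sketch applies the standard Chinchilla-style analysis: minimize L(N, D) = E + A/N^alpha + B/D^beta under a compute budget C ≈ 6·N·D. The numeric parameter values are hypothetical placeholders, not values reported by the paper:

```python
def optimal_allocation(C, A, B, alpha, beta):
    """Closed-form compute-optimal (N, D) for L = E + A/N**alpha + B/D**beta
    under the constraint C = 6 * N * D (the constant E drops out of the argmin)."""
    # Setting d/dN [A*N**-alpha + B*(6*N/C)**beta] = 0 gives:
    #   N_opt**(alpha + beta) = (alpha * A) / (beta * B) * (C / 6)**beta
    N_opt = ((alpha * A) / (beta * B)) ** (1 / (alpha + beta)) \
            * (C / 6) ** (beta / (alpha + beta))
    D_opt = C / (6 * N_opt)
    return N_opt, D_opt

# Hypothetical "raw" vs. "filtered" parameter sets (placeholder values only):
# an intervention that changes A, B, alpha, beta moves the optimal
# tokens-per-parameter ratio D/N, potentially by orders of magnitude.
C = 1e21  # FLOPs budget
N_raw, D_raw = optimal_allocation(C, A=406.4, B=410.7, alpha=0.34, beta=0.28)
N_flt, D_flt = optimal_allocation(C, A=300.0, B=600.0, alpha=0.38, beta=0.22)
print(f"raw:      D/N = {D_raw / N_raw:.1f}")
print(f"filtered: D/N = {D_flt / N_flt:.1f}")
```

The point of the example is that the optimal N, D, and D/N ratio are functions of every fitted parameter, so once an intervention reshapes the full decomposition, compute-optimal training recipes must be re-derived per dataset rather than reused across quality regimes.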