Abstract:

Neural scaling laws are widely used for performance projection and resource planning, yet their sensitivity to data quality interventions remains poorly understood. We present an empirical study of how interventions—deduplication, heuristic filtering, and LLM-guided rewriting—reshape scaling behavior in large language model training. Using QualityPajama, a suite of 23 systematically filtered and synthetic datasets, we train over 2,000 models (100M–8B parameters, 100M–200B tokens) to measure how data quality affects scaling-law parameters and compute-optimal design decisions. Our results show that data interventions reshape scaling dynamics in ways not captured by current theory: a single intervention can simultaneously shift exponents, coefficients, and constants in conflicting directions, exerting opposing forces on loss (for example, improving the constant term while worsening the exponents). Strategies that appear optimal at small scale can reverse at larger scale, and compute-optimal token–parameter ratios can vary by orders of magnitude depending on the intervention. These findings demonstrate that data curation and scaling strategy are deeply intertwined, and that evaluating interventions only at fixed scales can lead to misleading conclusions. We recommend evaluating interventions through their full scaling trajectories using scaling-law projections.
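The small-to-large-scale reversal can be illustrated with a toy saturating power law, L(N) = E + A·N^(−α). All constants below are hypothetical, chosen only to show the mechanism (they are not values from the paper): the "filtered" setting has a much better coefficient A but a flatter exponent α, so it wins at small model sizes and loses at large ones.

```python
def loss(n_params, E, A, alpha):
    """Saturating power law: L(N) = E + A * N**(-alpha)."""
    return E + A * n_params ** (-alpha)

# Hypothetical fits (illustrative only, not from the paper):
baseline = dict(E=1.70, A=400.0, alpha=0.34)
# Better coefficient, flatter exponent:
filtered = dict(E=1.70, A=115.0, alpha=0.28)

for n in (1e7, 1e8, 1e10, 1e11):
    lb, lf = loss(n, **baseline), loss(n, **filtered)
    winner = "filtered" if lf < lb else "baseline"
    print(f"N={n:.0e}  baseline={lb:.3f}  filtered={lf:.3f}  -> {winner}")
```

With these constants the crossover sits near N ≈ 1e9: "filtered" wins below it and "baseline" wins above it, which is why a comparison at a single fixed scale can rank the interventions either way.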

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper investigates how data quality interventions—deduplication, heuristic filtering, and LLM-guided rewriting—reshape neural scaling laws through extensive empirical study. It resides in the 'Quality-Aware Scaling Law Extensions' leaf, which contains only three papers total, including this work and two siblings (Farseer and Data Quality Scaling). This represents a relatively sparse research direction within the broader taxonomy of 50 papers, suggesting the specific focus on decomposing scaling law parameters under quality interventions remains underexplored compared to adjacent areas like heuristic filtering or compute optimization.

The taxonomy reveals neighboring research in 'Specialized Scaling Frameworks' (multi-domain and constrained-resource contexts) and 'Scaling Law Analysis and Methodology' (fitting and validation techniques). The paper's emphasis on how interventions simultaneously affect exponents, coefficients, and constants distinguishes it from sibling work: Farseer models quality effects on downstream performance, while Data Quality Scaling examines quality metrics more abstractly. The scope note for this leaf explicitly includes 'incorporating data quality as an explicit parameter,' which aligns with the paper's decomposition approach, though the exclude note clarifies it differs from domain-specific or multi-source frameworks.

Among 30 candidates examined across the three contributions, none was found to clearly refute any claim. For the QualityPajama Benchmark, 10 candidates were examined with zero refutable overlaps; the Full Scaling Law Decomposition and Data-Aware Scaling Strategies contributions were likewise each compared against 10 candidates, with no refutable overlaps found. This suggests that, within the limited search scope, the specific combination of systematic quality interventions, large-scale model training (2,000+ models), and full parameter decomposition appears distinctive. However, the analysis explicitly notes this is based on top-K semantic search plus citation expansion, not exhaustive coverage.

Given the sparse population of the quality-aware scaling law leaf and the absence of refuting prior work among 30 examined candidates, the contributions appear to occupy relatively novel ground within the analyzed scope. The scale of empirical validation (23 datasets, 2,000 models) and the focus on conflicting directional effects across scaling parameters distinguish this work from its immediate siblings, though the limited search scope means potentially relevant work outside the top-30 semantic matches may exist.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: understanding how interventions on text quality affect neural scaling laws. The field has organized itself into several major branches that collectively address data quality, theoretical formulations, resource allocation, augmentation strategies, domain-specific applications, training infrastructure, and specialized methodological studies.

Data Quality Characterization and Filtering Methods explore how to measure and improve dataset quality through filtering techniques, with works like General Purpose Filtering[1] and Scalingfilter[3] developing principled approaches to curate training corpora. Scaling Law Theory and Formulation investigates the mathematical relationships between model performance, data size, and compute, extending classical power-law formulations to account for quality dimensions; representative studies include Farseer[2] and Data Quality Scaling[4], which incorporate quality metrics into predictive frameworks. Compute-Optimal Resource Allocation examines trade-offs in distributing computational budgets across model size and training tokens, while Data Augmentation and Synthesis considers synthetic data generation as a lever for scaling. Domain-Specific Applications and Datasets apply these principles to specialized contexts such as vision-language tasks or scientific domains, and Training Infrastructure and Methodology addresses practical implementation challenges at scale.

A particularly active line of inquiry centers on quality-aware extensions to classical scaling laws, where researchers seek to move beyond simple data-volume metrics to incorporate notions of data cleanliness, diversity, and relevance. Text Quality Scaling[0] sits squarely within this branch, examining how targeted quality interventions shift the scaling behavior predicted by traditional formulations.
This work shares thematic overlap with Farseer[2], which also models quality effects on downstream performance, and with Data Filtering Science[5], which systematically studies filtering strategies. Compared to Revisiting Scaling Laws[15], which re-examines foundational assumptions in scaling theory, Text Quality Scaling[0] emphasizes empirical interventions and their measurable impact on the scaling exponent. A central open question across these studies is whether quality improvements can substitute for raw data volume or whether they interact multiplicatively, and how to predict the returns from quality-focused curation at different scales.

Claimed Contributions

QualityPajama Benchmark

The authors present QualityPajama, a benchmark consisting of 23 systematically curated datasets derived from Common Crawl. Each dataset represents a different text quality intervention (including filtering, deduplication, and synthetic curation) to enable controlled study of how data quality affects scaling laws in large language models.

10 retrieved papers
Full Scaling Law Decomposition

The authors conduct the first comprehensive empirical analysis showing that text quality interventions reshape all components of neural scaling laws (exponents, coefficients, and asymptotic loss terms), not just the exponents as prior work assumed. They demonstrate that stronger filtering produces conflicting shifts across different parameters rather than uniformly favorable changes.
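For reference, the components named here map onto the standard Chinchilla-style parameterization, a common form in this literature (the paper's exact functional form may differ):

```latex
L(N, D) \;=\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}}
```

where E is the asymptotic (irreducible) loss term, A and B are the coefficients, and α and β are the exponents governing returns to parameters N and tokens D. The contribution's claim is that interventions move all of these quantities at once, not only α and β.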

10 retrieved papers
Data-Aware Scaling Strategies

The authors demonstrate that data quality fundamentally affects compute-optimal design decisions in LLM training. They show that different quality interventions can shift the optimal number of parameters, training tokens, and their ratio by orders of magnitude, revealing that scaling strategies must explicitly account for data quality rather than treating it as a secondary concern.
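The link between fitted scaling-law constants and compute-optimal allocation can be sketched with the Chinchilla-style closed form: minimizing A·N^(−α) + B·D^(−β) under the budget C ≈ 6·N·D gives N* ∝ C^(β/(α+β)). The baseline constants below loosely follow the published Chinchilla fit; the "filtered" variant is hypothetical, invented only to show how a shifted data exponent β moves the optimal tokens-per-parameter ratio.

```python
def compute_optimal(C, A, B, alpha, beta):
    """Split a FLOP budget C ~ 6*N*D to minimize A*N**-alpha + B*D**-beta
    (the constant term E drops out of the argmin). Substituting D = C/(6*N)
    and setting the derivative in N to zero yields the closed form below."""
    G = (alpha * A) / (beta * B * 6 ** beta)
    N = G ** (1 / (alpha + beta)) * C ** (beta / (alpha + beta))
    D = C / (6 * N)
    return N, D

C = 1e21  # training budget in FLOPs
fits = {
    # (A, B, alpha, beta): baseline loosely follows the Chinchilla fit;
    # "filtered" is a hypothetical intervention with a steeper data exponent.
    "baseline": (406.4, 410.7, 0.34, 0.28),
    "filtered": (406.4, 410.7, 0.34, 0.36),
}
for name, (A, B, a, b) in fits.items():
    N, D = compute_optimal(C, A, B, a, b)
    print(f"{name:8s}  N*={N:.2e}  D*={D:.2e}  tokens/param={D / N:.2f}")
```

Even this modest change in β moves the optimal tokens-per-parameter ratio by roughly two orders of magnitude at the same budget, which is the kind of shift the contribution describes.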

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

QualityPajama Benchmark

The authors present QualityPajama, a benchmark consisting of 23 systematically curated datasets derived from Common Crawl. Each dataset represents a different text quality intervention (including filtering, deduplication, and synthetic curation) to enable controlled study of how data quality affects scaling laws in large language models.

Contribution

Full Scaling Law Decomposition

The authors conduct the first comprehensive empirical analysis showing that text quality interventions reshape all components of neural scaling laws (exponents, coefficients, and asymptotic loss terms), not just the exponents as prior work assumed. They demonstrate that stronger filtering produces conflicting shifts across different parameters rather than uniformly favorable changes.

Contribution

Data-Aware Scaling Strategies

The authors demonstrate that data quality fundamentally affects compute-optimal design decisions in LLM training. They show that different quality interventions can shift the optimal number of parameters, training tokens, and their ratio by orders of magnitude, revealing that scaling strategies must explicitly account for data quality rather than treating it as a secondary concern.