Why Less is More (Sometimes): A Theory of Data Curation
Overview
Overall Novelty Assessment
The paper contributes a theoretical framework with exact scaling laws for data curation, analyzing when pruned datasets outperform full datasets and how curation prevents model collapse. It resides in the 'Data Curation Theory and Scaling Behavior' leaf under 'Theoretical Foundations and Scaling Laws', sharing this leaf with only one sibling paper (Data Filtering Scaling). This represents a relatively sparse research direction within the broader taxonomy of 50 papers across 23 leaf nodes, suggesting the theoretical analysis of curation scaling laws remains an emerging area compared to more populated branches like sample selection methods or domain-specific applications.
The taxonomy reveals neighboring work in 'Generalization Mechanisms and Model Behavior' examining overfitting and memorization transitions, while adjacent branches address 'Instance Difficulty and Hardness-Based Selection' and 'Model-Aware and Optimization-Based Selection'. The paper's theoretical focus on oracle-based curation rules and phase transitions distinguishes it from these empirical selection methods. Its scope explicitly covers both label-aware and label-agnostic strategies, connecting to the field's broader tension between scaling efficiency and sample quality while diverging from it by providing analytical conditions rather than algorithmic recipes for subset construction.
Among the 30 candidates examined across the three contributions, none yielded clear refutations. Ten candidates were examined for the theoretical framework with exact scaling laws, and none overlapped refutably; the same held for the conditions for pruned-dataset superiority and for the model-collapse prevention analysis. This suggests limited prior work directly addressing the same theoretical questions within the search scope. However, the single sibling paper in the same taxonomy leaf indicates that some related theoretical investigation exists. The absence of refutations across all contributions may reflect either genuine novelty in the specific analytical approach or limitations in the semantic search's coverage of the theoretical scaling-law literature.
Based on the limited search of 30 semantically similar candidates, the work appears to occupy a relatively unexplored theoretical niche within data curation research. The sparse population of its taxonomy leaf and the lack of direct overlaps suggest novelty in formalizing curation scaling laws, though this analysis cannot rule out closer precedents in the optimization-theory or statistical-learning literature that an exhaustive search might surface. The empirical validation on ImageNet provides grounding, but the novelty assessment of the core theoretical contributions remains constrained by the search scope.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce a mathematical framework that provides exact analytical formulas for test error under data curation strategies. This framework characterizes how pruning training examples based on difficulty and correctness affects generalization performance in high-dimensional binary classification.
The authors establish precise conditions and phase transitions that determine when keeping only a subset of the data improves performance over using the full dataset. They show that the answer depends on data size, generator quality, and oracle reliability, yielding analytical boundaries that separate the 'less is more' regime from the 'more is more' regime.
The authors prove that strategic data curation can prevent model collapse when training iteratively on noisy or synthetic data. They identify phase boundaries where uncurated training leads to catastrophic degradation while curated training maintains stability.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[15] Scaling Laws for Data Filtering--Data Curation cannot be Compute Agnostic
Contribution Analysis
Detailed comparisons for each claimed contribution
Theoretical framework for data curation with exact scaling laws
The authors introduce a mathematical framework that provides exact analytical formulas for test error under data curation strategies. This framework characterizes how pruning training examples based on difficulty and correctness affects generalization performance in high-dimensional binary classification.
[69] Beyond neural scaling laws: beating power law scaling via data pruning
[70] Validating large-scale quantum machine learning: efficient simulation of quantum support vector machines using tensor networks
[71] Cliploss and norm-based data selection methods for multimodal contrastive learning
[72] Unveiling the impact of dataset size on machine learning models for anxiety and depression prediction amid the COVID-19 pandemic: determining optimal data …
[73] Impact of Dataset Size on Machine Learning Regression Accuracy in Solar Power Prediction.
[74] High-dimensional Analysis of Knowledge Distillation: Weak-to-Strong Generalization and Scaling Laws
[75] Reproducible Scaling Laws for Contrastive Language-Image Learning
[76] Scaling Laws of Decoder-Only Models on the Multilingual Machine Translation Task
[77] Uncovering Neural Scaling Laws in Molecular Representation Learning
[78] Securing distributed gradient descent in high dimensional statistical learning
Conditions under which pruned datasets outperform full datasets
The authors establish precise conditions and phase transitions that determine when keeping only a subset of the data improves performance over using the full dataset. They show that the answer depends on data size, generator quality, and oracle reliability, yielding analytical boundaries that separate the 'less is more' regime from the 'more is more' regime.
[11] Dataset pruning: Reducing training data by examining generalization influence
[51] Feature reduction for hepatocellular carcinoma prediction using machine learning algorithms
[52] Impact of dataset size on classification performance: an empirical evaluation in the medical domain
[53] Lightweight dataset pruning without full training via example difficulty and prediction uncertainty
[54] When less is more: Investigating data pruning for pretraining LLMs at scale
[55] Microstructure segmentation with deep learning encoders pre-trained on a large microscopy dataset
[56] Performance Analysis of YOLO and Detectron2 Models for Detecting Corn and Soybean Pests Employing Customized Dataset
[57] Automatic Pruning and Quality Assurance of Object Detection Datasets for Autonomous Driving
[58] Distill the best, ignore the rest: Improving dataset distillation with loss-value-based pruning
[59] DONOD: Robust and Generalizable Instruction Fine-Tuning for LLMs via Model-Intrinsic Dataset Pruning
Analytical demonstration that curation prevents model collapse
The authors prove that strategic data curation can prevent model collapse when training iteratively on noisy or synthetic data. They identify phase boundaries where uncurated training leads to catastrophic degradation while curated training maintains stability.
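The collapse-prevention claim concerns iterative training on self-generated data. As a hedged toy sketch, again not the authors' construction, the simulation below re-labels fresh inputs with the current model at each generation and refits; with curation, an idealized oracle (here, the true teacher) discards points whose synthetic label it disagrees with. All parameters (dimension, samples per generation, number of generations, the plug-in estimator) are illustrative assumptions. Tracking the angle to the teacher direction shows the uncurated chain diffusing away while the curated chain stays anchored.

```python
import numpy as np

def angle_error(w, w_star):
    """Angle (radians) between estimate and teacher; pi/2 means no signal left."""
    c = w @ w_star / (np.linalg.norm(w) * np.linalg.norm(w_star))
    return np.arccos(np.clip(c, -1.0, 1.0))

def iterate(curate, seed, d=30, n=150, generations=15):
    rng = np.random.default_rng(seed)
    w_star = rng.standard_normal(d)
    w_star /= np.linalg.norm(w_star)
    w = w_star.copy()  # generation 0 effectively trains on real teacher labels
    for _ in range(generations):
        X = rng.standard_normal((n, d))
        y = np.sign(X @ w)  # synthetic labels from the current model
        if curate:
            # Oracle filter: drop points where the (idealized) oracle disagrees.
            keep = y == np.sign(X @ w_star)
            X, y = X[keep], y[keep]
        w = (y[:, None] * X).mean(axis=0)  # refit plug-in estimator
    return angle_error(w, w_star)

# Average over several seeds; one chain's random walk is too noisy to compare.
seeds = range(8)
err_uncurated = float(np.mean([iterate(False, s) for s in seeds]))
err_curated = float(np.mean([iterate(True, s) for s in seeds]))
print(f"uncurated: {err_uncurated:.2f} rad  curated: {err_curated:.2f} rad")
```

Without filtering, each refit reproduces the previous model's direction plus estimation noise, so the angle performs an unanchored random walk toward pi/2; filtering against the oracle biases each refit back toward the teacher, which is a cartoon of the phase-boundary behavior the paper formalizes.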