Why Less is More (Sometimes): A Theory of Data Curation
Overview
Overall Novelty Assessment
The paper contributes a theoretical framework with exact scaling laws for data curation, analyzing when pruned datasets outperform full datasets and how curation prevents model collapse. It resides in the 'Data Curation Theory and Scaling Behavior' leaf under 'Theoretical Foundations and Scaling Laws', sharing this leaf with only one sibling paper (Data Filtering Scaling). This represents a relatively sparse research direction within the broader taxonomy of 50 papers across 23 leaf nodes, suggesting the theoretical analysis of curation scaling laws remains an emerging area compared to more populated branches like sample selection methods or domain-specific applications.
The taxonomy reveals neighboring work in 'Generalization Mechanisms and Model Behavior' examining overfitting and memorization transitions, while adjacent branches address 'Instance Difficulty and Hardness-Based Selection' and 'Model-Aware and Optimization-Based Selection'. The paper's theoretical focus on oracle-based curation rules and phase transitions distinguishes it from these empirical selection methods. Its scope explicitly covers both label-aware and label-agnostic strategies, connecting to the field's broader tension between scaling efficiency and sample quality while diverging from it by providing analytical conditions rather than algorithmic recipes for subset construction.
Among the 30 candidates examined across the three contributions, none yielded clear refutations. Ten candidates were examined for the theoretical framework with exact scaling laws, and none overlapped refutably; the same held for the conditions for pruned-dataset superiority and for the model-collapse prevention analysis. This suggests limited prior work directly addressing the same theoretical questions within the search scope. However, the single sibling paper in the same taxonomy leaf indicates that some related theoretical investigation exists. The absence of refutations across all contributions may reflect either genuine novelty in the specific analytical approach or limitations in the semantic search's coverage of the theoretical scaling-law literature.
Based on the limited search of 30 semantically similar candidates, the work appears to occupy a relatively unexplored theoretical niche within data curation research. The sparse population of its taxonomy leaf and the lack of direct overlaps suggest novelty in formalizing curation scaling laws, though this analysis cannot rule out closer precedents in the optimization-theory or statistical-learning literature that an exhaustive search might surface. The empirical validation on ImageNet provides grounding, but the novelty assessment of the core theoretical contributions remains constrained by the search scope.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce a mathematical framework that provides exact analytical formulas for test error under data curation strategies. This framework characterizes how pruning training examples based on difficulty and correctness affects generalization performance in high-dimensional binary classification.
The authors establish precise conditions and phase transitions that determine when keeping only a subset of the data improves performance over using the full dataset. They show that the answer depends on data size, generator quality, and oracle reliability, yielding analytical boundaries that separate the 'less is more' regime from the 'more is more' regime.
The authors prove that strategic data curation can prevent model collapse when training iteratively on noisy or synthetic data. They identify phase boundaries where uncurated training leads to catastrophic degradation while curated training maintains stability.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[15] Scaling Laws for Data Filtering--Data Curation cannot be Compute Agnostic
Contribution Analysis
Detailed comparisons for each claimed contribution
Theoretical framework for data curation with exact scaling laws
The authors introduce a mathematical framework that provides exact analytical formulas for test error under data curation strategies. This framework characterizes how pruning training examples based on difficulty and correctness affects generalization performance in high-dimensional binary classification.
[69] Beyond neural scaling laws: beating power law scaling via data pruning
[70] Validating large-scale quantum machine learning: efficient simulation of quantum support vector machines using tensor networks
[71] Cliploss and norm-based data selection methods for multimodal contrastive learning
[72] Unveiling the impact of dataset size on machine learning models for anxiety and depression prediction amid the COVID-19 pandemic: determining optimal data …
[73] Impact of Dataset Size on Machine Learning Regression Accuracy in Solar Power Prediction.
[74] High-dimensional Analysis of Knowledge Distillation: Weak-to-Strong Generalization and Scaling Laws
[75] Reproducible Scaling Laws for Contrastive Language-Image Learning
[76] Scaling Laws of Decoder-Only Models on the Multilingual Machine Translation Task
[77] Uncovering Neural Scaling Laws in Molecular Representation Learning
[78] Securing distributed gradient descent in high dimensional statistical learning
Conditions under which pruned datasets outperform full datasets
The authors establish precise conditions and phase transitions that determine when keeping only a subset of the data improves performance over using the full dataset. They show that the answer depends on data size, generator quality, and oracle reliability, yielding analytical boundaries that separate the 'less is more' regime from the 'more is more' regime.
[11] Dataset pruning: Reducing training data by examining generalization influence
[51] Feature reduction for hepatocellular carcinoma prediction using machine learning algorithms
[52] Impact of dataset size on classification performance: an empirical evaluation in the medical domain
[53] Lightweight dataset pruning without full training via example difficulty and prediction uncertainty
[54] When less is more: Investigating data pruning for pretraining LLMs at scale
[55] Microstructure segmentation with deep learning encoders pre-trained on a large microscopy dataset
[56] Performance Analysis of YOLO and Detectron2 Models for Detecting Corn and Soybean Pests Employing Customized Dataset
[57] Automatic Pruning and Quality Assurance of Object Detection Datasets for Autonomous Driving
[58] Distill the best, ignore the rest: Improving dataset distillation with loss-value-based pruning
[59] DONOD: Robust and Generalizable Instruction Fine-Tuning for LLMs via Model-Intrinsic Dataset Pruning
Analytical demonstration that curation prevents model collapse
The authors prove that strategic data curation can prevent model collapse when training iteratively on noisy or synthetic data. They identify phase boundaries where uncurated training leads to catastrophic degradation while curated training maintains stability.
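The collapse-prevention claim concerns iterative training on self-generated data. As a hedged toy sketch, again not the authors' construction, the simulation below re-labels fresh inputs with the current model at each generation and refits; with curation, an idealized oracle (here, the true teacher) discards points whose synthetic label it disagrees with. All parameters (dimension, samples per generation, number of generations, the plug-in estimator) are illustrative assumptions. Tracking the angle to the teacher direction shows the uncurated chain diffusing away while the curated chain stays anchored.

```python
import numpy as np

def angle_error(w, w_star):
    """Angle (radians) between estimate and teacher; pi/2 means no signal left."""
    c = w @ w_star / (np.linalg.norm(w) * np.linalg.norm(w_star))
    return np.arccos(np.clip(c, -1.0, 1.0))

def iterate(curate, seed, d=30, n=150, generations=15):
    rng = np.random.default_rng(seed)
    w_star = rng.standard_normal(d)
    w_star /= np.linalg.norm(w_star)
    w = w_star.copy()  # generation 0 effectively trains on real teacher labels
    for _ in range(generations):
        X = rng.standard_normal((n, d))
        y = np.sign(X @ w)  # synthetic labels from the current model
        if curate:
            # Oracle filter: drop points where the (idealized) oracle disagrees.
            keep = y == np.sign(X @ w_star)
            X, y = X[keep], y[keep]
        w = (y[:, None] * X).mean(axis=0)  # refit plug-in estimator
    return angle_error(w, w_star)

# Average over several seeds; one chain's random walk is too noisy to compare.
seeds = range(8)
err_uncurated = float(np.mean([iterate(False, s) for s in seeds]))
err_curated = float(np.mean([iterate(True, s) for s in seeds]))
print(f"uncurated: {err_uncurated:.2f} rad  curated: {err_curated:.2f} rad")
```

Without filtering, each refit reproduces the previous model's direction plus estimation noise, so the angle performs an unanchored random walk toward pi/2; filtering against the oracle biases each refit back toward the teacher, which is a cartoon of the phase-boundary behavior the paper formalizes.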