Why Less is More (Sometimes): A Theory of Data Curation

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: data curation; LIMO (Less Is More); MIMO (More Is More); synthetic data; beating scaling laws; mitigating model collapse; random matrix theory
Abstract:

This paper introduces a theoretical framework to resolve a central paradox in modern machine learning: when is it better to use less data? This question has become critical as classical scaling laws suggesting "more is more" (Sun et al., 2025) are challenged by methods like LIMO ("less is more") and s1 (Ye et al., 2025; Muennighoff et al., 2025), which achieve superior performance with small, aggressively curated datasets. Here, we study data curation strategies in which an imperfect oracle selects the training examples according to their difficulty and correctness. Our results provide exact scaling-law curves for test error under both label-agnostic and label-aware curation rules, revealing when and why keeping only a subset of the data can improve generalization. In contrast to classical scaling laws, we show that under certain conditions small curated datasets can outperform full datasets, and we provide analytical conditions for this by deriving precise phase-transition curves tied to data size and quality. We validate these theoretical claims with empirical results on ImageNet, confirming our predictions about when curation improves accuracy and can even mitigate model collapse. Furthermore, our framework provides a principled explanation for the contradictory curation strategies recently observed in LLM mathematical reasoning.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper contributes a theoretical framework with exact scaling laws for data curation, analyzing when pruned datasets outperform full datasets and how curation prevents model collapse. It resides in the 'Data Curation Theory and Scaling Behavior' leaf under 'Theoretical Foundations and Scaling Laws', sharing this leaf with only one sibling paper (Data Filtering Scaling). This represents a relatively sparse research direction within the broader taxonomy of 50 papers across 23 leaf nodes, suggesting the theoretical analysis of curation scaling laws remains an emerging area compared to more populated branches like sample selection methods or domain-specific applications.

The taxonomy reveals neighboring work in 'Generalization Mechanisms and Model Behavior' examining overfitting and memorization transitions, while adjacent branches address 'Instance Difficulty and Hardness-Based Selection' and 'Model-Aware and Optimization-Based Selection'. The paper's theoretical focus on oracle-based curation rules and phase transitions distinguishes it from these empirical selection methods. Its scope explicitly covers label-aware and label-agnostic strategies, connecting to the broader field's tension between scaling efficiency and sample quality, yet diverging by providing analytical conditions rather than algorithmic implementations for subset construction.

Among 30 candidates examined across three contributions, none yielded clear refutations. The theoretical framework for exact scaling laws examined 10 candidates with no refutable overlaps, as did the conditions for pruned dataset superiority and the model collapse prevention analysis. This suggests limited prior work directly addressing the same theoretical questions within the search scope. However, the single sibling paper in the same taxonomy leaf indicates some related theoretical investigation exists. The absence of refutations across all contributions may reflect either genuine novelty in the specific analytical approach or limitations in the semantic search coverage of theoretical scaling law literature.

Based on the limited search of 30 semantically similar candidates, the work appears to occupy a relatively unexplored theoretical niche within data curation research. The sparse population of its taxonomy leaf and lack of direct overlaps suggest novelty in formalizing curation scaling laws, though the analysis cannot confirm whether exhaustive searches of optimization theory or statistical learning literature might reveal closer precedents. The empirical validation on ImageNet provides grounding, but the core theoretical contributions' novelty assessment remains constrained by the search scope.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: data curation strategies for improving generalization in machine learning. The field encompasses a broad spectrum of approaches organized into several major branches. Theoretical Foundations and Scaling Laws examine how data selection and filtering principles scale with model size and dataset composition, as seen in works like Data Filtering Scaling[15]. Sample Selection and Subset Construction Methods focus on identifying informative subsets through coreset techniques and active learning, exemplified by Glister[8] and Optimal Transport Coreset[38]. Data Augmentation and Synthetic Data Generation explore techniques to expand training distributions, including graph-based methods surveyed in Graph Augmentation Survey[49] and domain-mixing approaches like Style Mix Segmentation[39]. Data Quality Management and Preprocessing address cleaning, annotation, and preparation pipelines across domains such as Medical Imaging Preparation[10] and Data Preprocessing Survey[4]. Domain Generalization and Distribution Shift tackle out-of-distribution robustness, with Domain Generalization Survey[2] providing foundational perspectives. Evaluation Frameworks guide model selection and testing under distributional changes, while Domain-Specific Applications demonstrate curation in specialized contexts from Therapeutics Data Commons[7] to Battery Machine Learning[13].

Several active research directions reveal key trade-offs between theoretical rigor and practical deployment. The tension between scaling efficiency and sample quality appears prominently: while Data Filtering Scaling[15] investigates how filtering heuristics behave at scale, works like Glister[8] and Quantum Coreset Selection[31] pursue principled subset construction with computational overhead.
Data Curation Theory[0] sits within the Theoretical Foundations branch alongside Data Filtering Scaling[15], emphasizing formal understanding of how curation choices influence generalization guarantees and scaling behavior. Compared to more application-driven neighbors like Intrusion Detection Generalization[3] or domain-specific pipelines such as Nanoparticle Toxicity Curation[18], Data Curation Theory[0] provides a foundational lens on the underlying principles governing data selection across contexts. Open questions persist around automating curation decisions, as explored in Automated Curation Finetuning[32], and understanding when synthetic augmentation helps versus when it risks Model Collapse[41], highlighting ongoing challenges in balancing data quantity, quality, and diversity for robust generalization.

Claimed Contributions

Theoretical framework for data curation with exact scaling laws

The authors introduce a mathematical framework that provides exact analytical formulas for test error under data curation strategies. This framework characterizes how pruning training examples based on difficulty and correctness affects generalization performance in high-dimensional binary classification.

10 retrieved papers

Conditions under which pruned datasets outperform full datasets

The authors establish precise conditions and phase transitions that determine when keeping only a subset of data improves performance over using the full dataset. They show that this depends on data size, generator quality, and oracle reliability, providing analytical boundaries for when less is more versus more is more.

10 retrieved papers

Analytical demonstration that curation prevents model collapse

The authors prove that strategic data curation can prevent model collapse when training iteratively on noisy or synthetic data. They identify phase boundaries where uncurated training leads to catastrophic degradation while curated training maintains stability.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Theoretical framework for data curation with exact scaling laws

Contribution

Conditions under which pruned datasets outperform full datasets

Contribution

Analytical demonstration that curation prevents model collapse


Why Less is More (Sometimes): A Theory of Data Curation | Novelty Validation