Larger Datasets Can Be Repeated More: A Theoretical Analysis of Multi-Epoch Scaling in Linear Regression

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Deep learning theory, Multi-epoch training, Data reuse, Optimization, Scaling law, Large language model
Abstract:

Large Language Model (LLM) training often processes vast text corpora in a single pass, leaving much available data underutilized. This paper presents a theoretical analysis of how a common workaround, training for multiple epochs on the same dataset, reshapes the data scaling laws. Concretely, given K-epoch training on N samples, how many fresh samples would one-pass training require to match the same performance? We quantify this using the effective reuse rate of the data, E(K, N), which we define as the factor by which the dataset must grow under one-pass training to match the test loss of multi-epoch training. Our analysis precisely characterizes the scaling behavior of E(K, N) for SGD in linear regression under either strong convexity or Zipf-distributed data: (1) when K is small, we prove that E(K, N) ≈ K, indicating that every new epoch yields a linear gain; (2) as K increases, E(K, N) plateaus at a problem-dependent value that grows with N (Θ(log N) for the strongly convex case), implying that larger datasets can be repeated more times before the marginal benefit vanishes. These theoretical findings complement a recent empirical study by Muennighoff et al. (2023), which found that training LLMs for up to 4 epochs results in negligible loss differences compared to using fresh data at each step, i.e., E(K, N) ≈ K for K ≤ 4 in our notation. Supported by further empirical validation with LLMs, our results reveal how this behavior depends on the underlying data size and distribution, and underscore the need to explicitly model both factors in future studies of scaling laws with data reuse.
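To make the abstract's quantities concrete, the sketch below simulates multi-epoch SGD on a toy one-dimensional linear regression (y = θ*·x + noise) and compares K epochs over N samples against a single pass over K·N fresh samples. The model, the step-size rule (η = log T / T for T total updates), and all constants here are illustrative assumptions, not the paper's actual setting.

```python
import numpy as np

def sgd_risk(n, epochs, trials=2000, sigma=1.0, seed=0):
    """Mean excess risk (theta - theta*)^2 of constant-step SGD that cycles
    `epochs` times through n samples of y = theta* x + sigma*noise (theta* = 1)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((trials, n))
    y = x + sigma * rng.standard_normal((trials, n))
    total = epochs * n
    lr = np.log(total) / total          # illustrative step-size rule, not the paper's
    theta = np.zeros(trials)
    for t in range(total):
        i = t % n                       # reuse the same n samples on every epoch
        theta -= lr * (theta * x[:, i] - y[:, i]) * x[:, i]
    return float(np.mean((theta - 1.0) ** 2))

r_one   = sgd_risk(200, epochs=1)   # one pass over N = 200 samples
r_multi = sgd_risk(200, epochs=4)   # K = 4 epochs over the same 200 samples
r_fresh = sgd_risk(800, epochs=1)   # one pass over K*N = 800 fresh samples
```

On this toy problem, four epochs over 200 reused samples beat a single pass over those samples and land within a small constant factor of one pass over 800 fresh samples, which is the qualitative content of E(K, N) ≈ K for small K.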

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a theoretical framework for quantifying the effective reuse rate E(K,N) in multi-epoch linear regression, characterizing how repeated passes over N samples compare to single-pass training on fresh data. It resides in the 'Scaling Laws and Effective Reuse Rates' leaf of the taxonomy, which contains only two papers total (including this one). This leaf sits within the broader 'Theoretical Analysis of Multi-Epoch Training and Data Reuse' branch, indicating a relatively sparse research direction focused on fundamental scaling properties rather than algorithmic or applied concerns.

The taxonomy reveals neighboring branches addressing related but distinct problems. 'Sensitivity and Complexity Analysis' examines model robustness to data perturbations, while 'Model Transfer and Reuse for Efficient Training' focuses on leveraging pretrained representations across tasks rather than iterative training on fixed data. The 'Linear Mixed-Effects Models' branch handles repeated measurements through hierarchical random effects, representing a statistical modeling tradition fundamentally different from the iterative optimization perspective adopted here. The paper's theoretical focus on scaling laws positions it at the intersection of classical statistical learning theory and modern concerns about data efficiency in large-scale training.

Among 21 candidates examined across three contributions, the analysis found limited prior work overlap. The core effective reuse rate characterization (8 candidates examined, 0 refutable) and scaling behavior analysis for strongly convex and Zipf cases (10 candidates examined, 0 refutable) appear relatively novel within this search scope. However, the optimal learning rate derivation and risk approximation for multi-epoch SGD (3 candidates examined, 1 refutable) shows more substantial prior work, suggesting this technical component may have existing coverage in the optimization literature.

Based on the limited search scope of 21 semantically similar candidates, the work appears to occupy a sparsely populated research direction with only one sibling paper in its taxonomy leaf. The core theoretical contributions around effective reuse rates show minimal overlap with examined candidates, though the learning rate analysis component has at least one overlapping prior result. The analysis does not cover exhaustive citation networks or domain-specific venues that might contain additional relevant theoretical work on data reuse in iterative training.

Taxonomy

27 Core-task Taxonomy Papers
3 Claimed Contributions
21 Contribution Candidate Papers Compared
1 Refutable Paper

Research Landscape Overview

Core task: Multi-epoch training in linear regression with data reuse examines how repeatedly cycling through the same dataset affects model performance and generalization.

The field structure suggested by the taxonomy spans several distinct branches. Theoretical Analysis of Multi-Epoch Training and Data Reuse investigates fundamental questions about scaling laws, effective reuse rates, and the statistical properties of iterative training, as seen in works like MultiEpoch Scaling[0] and Data Reuse Scaling[1]. Model Transfer and Reuse for Efficient Training focuses on leveraging pre-trained representations or knowledge across tasks. Linear Mixed-Effects Models for Repeated Measurements (e.g., Linear Mixed Models[12], PLS Mixed Models[10]) address hierarchical data structures with random effects, while Machine Learning Integration with Linear Models for Repeated Data and Machine Learning Models with Repeated Data in Training explore how modern learning algorithms handle correlated observations. Computational Optimization with Linear Models and Data Reuse emphasizes algorithmic efficiency and numerical methods for large-scale problems.

Particularly active lines of work contrast theoretical guarantees with practical algorithmic design. The theoretical branch grapples with how data reuse influences bias-variance trade-offs and whether effective sample size diminishes with repeated passes, while computational branches prioritize scalable solvers and convergence acceleration.

MultiEpoch Scaling[0] sits squarely within the Theoretical Analysis branch, specifically under Scaling Laws and Effective Reuse Rates, where it shares close thematic ties with Data Reuse Scaling[1]. Both works examine how training dynamics evolve as data is reused across epochs, but MultiEpoch Scaling[0] appears to emphasize the interplay between epoch count and model capacity in linear settings.
This contrasts with neighboring branches like Linear Mixed-Effects Models, which handle repeated measurements through hierarchical random effects rather than iterative optimization, highlighting a fundamental divide between statistical modeling traditions and modern machine learning perspectives on data reuse.

Claimed Contributions

Theoretical characterization of effective reuse rate E(K,N) in linear regression

The authors theoretically analyze how the effective reuse rate E(K,N)—the multiplicative factor by which a dataset must grow under one-pass training to match K-epoch training performance—depends on both the number of epochs K and dataset size N. They prove that for small K, E(K,N) is approximately K, while for large K it plateaus at a problem-dependent value that grows with N (order log N for strongly convex cases).

8 retrieved papers
Scaling behavior analysis for strongly convex and Zipf-distributed data cases

The authors establish precise scaling laws for E(K,N) in two settings: strongly convex linear regression where saturation occurs at order log N, and Zipf-distributed data where saturation scales as a power of N. These results reveal a phase transition between an effective-reuse regime and a limited-reuse regime.
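A heuristic way to see why the plateau scales as Θ(log N) in the strongly convex case (a back-of-the-envelope sketch based on classical one-dimensional SGD rates, not the paper's proof):

```latex
% As K \to \infty, multi-epoch SGD approaches the empirical minimizer of the
% N reused samples, while a tuned constant-step one-pass run on M fresh
% samples classically pays a logarithmic factor:
\[
\mathcal{R}_{\mathrm{multi}}(\infty, N) \asymp \frac{\sigma^2}{N},
\qquad
\mathcal{R}_{\mathrm{one}}(M) \asymp \frac{\sigma^2 \log M}{M}.
\]
% Matching the two risks, \(\sigma^2 \log M / M \asymp \sigma^2 / N\) gives
% \(M \asymp N \log N\), hence
\[
E(\infty, N) = \frac{M}{N} \asymp \log N .
\]
```

This matches the claimed saturation level: once multi-epoch training has effectively reached the empirical minimizer, further epochs cannot help, and the equivalent fresh-data budget is only a log N factor larger than N.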

10 retrieved papers
Optimal learning rate derivation and risk approximation formula for multi-epoch SGD

The authors derive the optimal learning rate for multi-epoch stochastic gradient descent in linear regression and provide an approximation formula for expected excess risk with multiplicative error n^{o(1)}. These technical results enable precise characterization of multi-epoch training dynamics.
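The paper's risk approximation is stated for its general setting; as a sanity check of the same flavor, the sketch below compares simulated constant-step, one-pass SGD against the classical closed-form stationary risk for a one-dimensional Gaussian model. The model and all constants are illustrative assumptions, not the paper's formula.

```python
import numpy as np

def onepass_risk(lr, steps=3000, trials=4000, sigma=1.0, seed=0):
    """Excess risk (theta - theta*)^2 of constant-step one-pass SGD on fresh
    samples of y = theta* x + sigma*noise (theta* = 1) after `steps` updates."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(trials)
    for _ in range(steps):
        x = rng.standard_normal(trials)
        y = x + sigma * rng.standard_normal(trials)
        theta -= lr * (theta * x - y) * x
    return float(np.mean((theta - 1.0) ** 2))

lr = 0.01
# Stationary variance v of the 1-D Gaussian model solves
#   v = v * E[(1 - lr*x^2)^2] + lr^2 * sigma^2 * E[x^2],
# and with E[x^2] = 1, E[x^4] = 3 this gives v = lr * sigma^2 / (2 - 3*lr).
predicted = lr / (2 - 3 * lr)
simulated = onepass_risk(lr)
```

With 3000 steps the transient from the zero initialization is fully contracted, so the simulated risk should track the closed-form stationary value closely; this is the kind of exact-to-leading-order prediction the contribution describes, in the simplest possible instance.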

3 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Theoretical characterization of effective reuse rate E(K,N) in linear regression

The authors theoretically analyze how the effective reuse rate E(K,N)—the multiplicative factor by which a dataset must grow under one-pass training to match K-epoch training performance—depends on both the number of epochs K and dataset size N. They prove that for small K, E(K,N) is approximately K, while for large K it plateaus at a problem-dependent value that grows with N (order log N for strongly convex cases).

Contribution

Scaling behavior analysis for strongly convex and Zipf-distributed data cases

The authors establish precise scaling laws for E(K,N) in two settings: strongly convex linear regression where saturation occurs at order log N, and Zipf-distributed data where saturation scales as a power of N. These results reveal a phase transition between an effective-reuse regime and a limited-reuse regime.

Contribution

Optimal learning rate derivation and risk approximation formula for multi-epoch SGD

The authors derive the optimal learning rate for multi-epoch stochastic gradient descent in linear regression and provide an approximation formula for expected excess risk with multiplicative error n^{o(1)}. These technical results enable precise characterization of multi-epoch training dynamics.
