Larger Datasets Can Be Repeated More: A Theoretical Analysis of Multi-Epoch Scaling in Linear Regression

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Deep learning theory, Multi-epoch training, Data reuse, Optimization, Scaling law, Large language model
Abstract:

Large Language Model (LLM) training often processes vast text corpora in a single pass, leaving much available data underutilized. This paper presents a theoretical analysis of how a common workaround, training for multiple epochs on the same dataset, reshapes the data scaling laws. Concretely, given K-epoch training on N samples, how many fresh samples would one-pass training require to match the same performance? We quantify this using the effective reuse rate of the data, E(K, N), which we define as the factor by which the dataset must grow under one-pass training to match the test loss of multi-epoch training. Our analysis precisely characterizes the scaling behavior of E(K, N) for SGD in linear regression under either strong convexity or Zipf-distributed data: (1) when K is small, we prove that E(K, N) ≈ K, indicating that every new epoch yields a linear gain; (2) as K increases, E(K, N) plateaus at a problem-dependent value that grows with N (Θ(log N) for the strongly convex case), implying that larger datasets can be repeated more times before the marginal benefit vanishes. These theoretical findings complement a recent empirical study by Muennighoff et al. (2023), which found that training LLMs for up to 4 epochs results in negligible loss differences compared to using fresh data at each step, i.e., E(K, N) ≈ K for K ≤ 4 in our notation. Supported by further empirical validation with LLMs, our results reveal how this behavior depends on the underlying data size and distribution, and underscore the need to explicitly model both factors in future studies of scaling laws with data reuse.
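To make the abstract's quantities concrete, the sketch below simulates multi-epoch SGD on a toy one-dimensional linear regression (y = θ*·x + noise) and compares K epochs over N samples against a single pass over K·N fresh samples. The model, the step-size rule (η = log T / T for T total updates), and all constants here are illustrative assumptions, not the paper's actual setting.

```python
import numpy as np

def sgd_risk(n, epochs, trials=2000, sigma=1.0, seed=0):
    """Mean excess risk (theta - theta*)^2 of constant-step SGD that cycles
    `epochs` times through n samples of y = theta* x + sigma*noise (theta* = 1)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((trials, n))
    y = x + sigma * rng.standard_normal((trials, n))
    total = epochs * n
    lr = np.log(total) / total          # illustrative step-size rule, not the paper's
    theta = np.zeros(trials)
    for t in range(total):
        i = t % n                       # reuse the same n samples on every epoch
        theta -= lr * (theta * x[:, i] - y[:, i]) * x[:, i]
    return float(np.mean((theta - 1.0) ** 2))

r_one   = sgd_risk(200, epochs=1)   # one pass over N = 200 samples
r_multi = sgd_risk(200, epochs=4)   # K = 4 epochs over the same 200 samples
r_fresh = sgd_risk(800, epochs=1)   # one pass over K*N = 800 fresh samples
```

On this toy problem, four epochs over 200 reused samples beat a single pass over those samples and land within a small constant factor of one pass over 800 fresh samples, which is the qualitative content of E(K, N) ≈ K for small K.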

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a theoretical framework for quantifying the effective reuse rate E(K,N) in multi-epoch linear regression, characterizing how repeated passes over N samples compare to single-pass training on fresh data. It resides in the 'Scaling Laws and Effective Reuse Rates' leaf of the taxonomy, which contains only two papers total (including this one). This leaf sits within the broader 'Theoretical Analysis of Multi-Epoch Training and Data Reuse' branch, indicating a relatively sparse research direction focused on fundamental scaling properties rather than algorithmic or applied concerns.

The taxonomy reveals neighboring branches addressing related but distinct problems. 'Sensitivity and Complexity Analysis' examines model robustness to data perturbations, while 'Model Transfer and Reuse for Efficient Training' focuses on leveraging pretrained representations across tasks rather than iterative training on fixed data. The 'Linear Mixed-Effects Models' branch handles repeated measurements through hierarchical random effects, representing a statistical modeling tradition fundamentally different from the iterative optimization perspective adopted here. The paper's theoretical focus on scaling laws positions it at the intersection of classical statistical learning theory and modern concerns about data efficiency in large-scale training.

Among 21 candidates examined across three contributions, the analysis found limited prior work overlap. The core effective reuse rate characterization (8 candidates examined, 0 refutable) and scaling behavior analysis for strongly convex and Zipf cases (10 candidates examined, 0 refutable) appear relatively novel within this search scope. However, the optimal learning rate derivation and risk approximation for multi-epoch SGD (3 candidates examined, 1 refutable) shows more substantial prior work, suggesting this technical component may have existing coverage in the optimization literature.

Based on the limited search scope of 21 semantically similar candidates, the work appears to occupy a sparsely populated research direction with only one sibling paper in its taxonomy leaf. The core theoretical contributions around effective reuse rates show minimal overlap with examined candidates, though the learning rate analysis component has at least one overlapping prior result. The analysis does not cover exhaustive citation networks or domain-specific venues that might contain additional relevant theoretical work on data reuse in iterative training.

Taxonomy

27 Core-task Taxonomy Papers
3 Claimed Contributions
21 Contribution Candidate Papers Compared
1 Refutable Paper

Research Landscape Overview

Core task: Multi-epoch training in linear regression with data reuse examines how repeatedly cycling through the same dataset affects model performance and generalization.

The field structure suggested by the taxonomy spans several distinct branches. Theoretical Analysis of Multi-Epoch Training and Data Reuse investigates fundamental questions about scaling laws, effective reuse rates, and the statistical properties of iterative training, as seen in works like MultiEpoch Scaling[0] and Data Reuse Scaling[1]. Model Transfer and Reuse for Efficient Training focuses on leveraging pre-trained representations or knowledge across tasks. Linear Mixed-Effects Models for Repeated Measurements (e.g., Linear Mixed Models[12], PLS Mixed Models[10]) address hierarchical data structures with random effects, while Machine Learning Integration with Linear Models for Repeated Data and Machine Learning Models with Repeated Data in Training explore how modern learning algorithms handle correlated observations. Computational Optimization with Linear Models and Data Reuse emphasizes algorithmic efficiency and numerical methods for large-scale problems.

Particularly active lines of work contrast theoretical guarantees with practical algorithmic design. The theoretical branch grapples with how data reuse influences bias-variance trade-offs and whether effective sample size diminishes with repeated passes, while computational branches prioritize scalable solvers and convergence acceleration.

MultiEpoch Scaling[0] sits squarely within the Theoretical Analysis branch, specifically under Scaling Laws and Effective Reuse Rates, where it shares close thematic ties with Data Reuse Scaling[1]. Both works examine how training dynamics evolve as data is reused across epochs, but MultiEpoch Scaling[0] appears to emphasize the interplay between epoch count and model capacity in linear settings.
This contrasts with neighboring branches like Linear Mixed-Effects Models, which handle repeated measurements through hierarchical random effects rather than iterative optimization, highlighting a fundamental divide between statistical modeling traditions and modern machine learning perspectives on data reuse.

Claimed Contributions

Theoretical characterization of effective reuse rate E(K,N) in linear regression

The authors theoretically analyze how the effective reuse rate E(K,N)—the multiplicative factor by which a dataset must grow under one-pass training to match K-epoch training performance—depends on both the number of epochs K and dataset size N. They prove that for small K, E(K,N) is approximately K, while for large K it plateaus at a problem-dependent value that grows with N (order log N for strongly convex cases).

8 retrieved papers
Scaling behavior analysis for strongly convex and Zipf-distributed data cases

The authors establish precise scaling laws for E(K,N) in two settings: strongly convex linear regression where saturation occurs at order log N, and Zipf-distributed data where saturation scales as a power of N. These results reveal a phase transition between an effective-reuse regime and a limited-reuse regime.
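A heuristic way to see why the plateau scales as Θ(log N) in the strongly convex case (a back-of-the-envelope sketch based on classical one-dimensional SGD rates, not the paper's proof):

```latex
% As K \to \infty, multi-epoch SGD approaches the empirical minimizer of the
% N reused samples, while a tuned constant-step one-pass run on M fresh
% samples classically pays a logarithmic factor:
\[
\mathcal{R}_{\mathrm{multi}}(\infty, N) \asymp \frac{\sigma^2}{N},
\qquad
\mathcal{R}_{\mathrm{one}}(M) \asymp \frac{\sigma^2 \log M}{M}.
\]
% Matching the two risks, \(\sigma^2 \log M / M \asymp \sigma^2 / N\) gives
% \(M \asymp N \log N\), hence
\[
E(\infty, N) = \frac{M}{N} \asymp \log N .
\]
```

This matches the claimed saturation level: once multi-epoch training has effectively reached the empirical minimizer, further epochs cannot help, and the equivalent fresh-data budget is only a log N factor larger than N.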

10 retrieved papers
Optimal learning rate derivation and risk approximation formula for multi-epoch SGD

The authors derive the optimal learning rate for multi-epoch stochastic gradient descent in linear regression and provide an approximation formula for expected excess risk with multiplicative error n^{o(1)}. These technical results enable precise characterization of multi-epoch training dynamics.
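The paper's risk approximation is stated for its general setting; as a sanity check of the same flavor, the sketch below compares simulated constant-step, one-pass SGD against the classical closed-form stationary risk for a one-dimensional Gaussian model. The model and all constants are illustrative assumptions, not the paper's formula.

```python
import numpy as np

def onepass_risk(lr, steps=3000, trials=4000, sigma=1.0, seed=0):
    """Excess risk (theta - theta*)^2 of constant-step one-pass SGD on fresh
    samples of y = theta* x + sigma*noise (theta* = 1) after `steps` updates."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(trials)
    for _ in range(steps):
        x = rng.standard_normal(trials)
        y = x + sigma * rng.standard_normal(trials)
        theta -= lr * (theta * x - y) * x
    return float(np.mean((theta - 1.0) ** 2))

lr = 0.01
# Stationary variance v of the 1-D Gaussian model solves
#   v = v * E[(1 - lr*x^2)^2] + lr^2 * sigma^2 * E[x^2],
# and with E[x^2] = 1, E[x^4] = 3 this gives v = lr * sigma^2 / (2 - 3*lr).
predicted = lr / (2 - 3 * lr)
simulated = onepass_risk(lr)
```

With 3000 steps the transient from the zero initialization is fully contracted, so the simulated risk should track the closed-form stationary value closely; this is the kind of exact-to-leading-order prediction the contribution describes, in the simplest possible instance.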

3 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Theoretical characterization of effective reuse rate E(K,N) in linear regression

The authors theoretically analyze how the effective reuse rate E(K,N)—the multiplicative factor by which a dataset must grow under one-pass training to match K-epoch training performance—depends on both the number of epochs K and dataset size N. They prove that for small K, E(K,N) is approximately K, while for large K it plateaus at a problem-dependent value that grows with N (order log N for strongly convex cases).

Contribution

Scaling behavior analysis for strongly convex and Zipf-distributed data cases

The authors establish precise scaling laws for E(K,N) in two settings: strongly convex linear regression where saturation occurs at order log N, and Zipf-distributed data where saturation scales as a power of N. These results reveal a phase transition between an effective-reuse regime and a limited-reuse regime.

Contribution

Optimal learning rate derivation and risk approximation formula for multi-epoch SGD

The authors derive the optimal learning rate for multi-epoch stochastic gradient descent in linear regression and provide an approximation formula for expected excess risk with multiplicative error n^{o(1)}. These technical results enable precise characterization of multi-epoch training dynamics.
