How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: LLM pretraining, Curriculum Learning, Model Weight Averaging
Abstract:

Due to the scarcity of high-quality data, large language models (LLMs) are often trained on mixtures of data with varying quality levels, even after sophisticated data curation. A natural approach to better leverage high-quality data is curriculum-based pretraining, where the model is trained on data sorted in ascending order of quality as determined by a quality metric. However, prior studies have reported limited improvements from such curriculum-based pretraining strategies. This work identifies a critical factor constraining these methods: the incompatibility between the ascending data quality order and the decaying learning rate (LR) schedule. We find that while curriculum-based training substantially outperforms random shuffling when using a constant LR, its advantage diminishes under standard LR decay schedules. Our experiments show this incompatibility can be mitigated by two simple strategies: (1) employing a more moderate LR decay schedule, where the final LR is only moderately smaller than the peak LR, and (2) replacing LR decay with model averaging, i.e., computing a weighted average of the final few checkpoints. By combining these strategies, we improve the average score on a suite of standard benchmarks by 1.64% over random shuffling, without additional data refinement. Validated on 1.5B-parameter models trained over 30B tokens with various data-quality metrics, our findings call for a re-evaluation of curriculum-based LLM pretraining and underscore the potential of co-designing data curricula with optimization methods.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper identifies a fundamental incompatibility between curriculum-based data ordering (ascending quality) and standard learning rate decay schedules in LLM pretraining. It proposes two mitigation strategies: moderate LR decay and Curriculum Model Averaging (CMA). Within the taxonomy, this work resides in the 'Curriculum-Compatible Learning Rate Decay' leaf under 'Optimization and Training Dynamics'. This leaf contains only two papers total, indicating a relatively sparse research direction focused specifically on reconciling curriculum progression with optimization schedules.

The taxonomy reveals that most curriculum learning research separates data ordering concerns (under 'Curriculum Design and Scheduling Strategies') from optimization dynamics. Neighboring leaves include 'General Scheduling and Decay Patterns' with standard approaches like cosine annealing, and 'Layer-Wise and Adaptive Optimization' exploring parameter-specific learning rates. The paper bridges these areas by arguing that curriculum effectiveness depends critically on co-designing data order and LR schedules, a perspective less emphasized in sibling work like 'Learning Rate Curriculum' which treats LR itself as a curriculum signal rather than examining schedule-curriculum interactions.

Among ten candidates examined across three contributions, none clearly refute the core claims. The incompatibility identification examined three candidates with zero refutations, suggesting limited prior work explicitly analyzing this curriculum-LR decay interaction. The CMA strategy examined six candidates without refutation, indicating novelty in applying model averaging specifically to curriculum contexts. The co-design framework examined one candidate. The small search scope (ten total candidates from semantic search) means these findings reflect top-ranked related work rather than exhaustive coverage, but the absence of overlapping prior work across all contributions is noteworthy.

Based on the limited literature search, the work appears to occupy a genuinely sparse intersection between curriculum learning and learning rate scheduling. The taxonomy structure confirms that while both curriculum design and optimization dynamics are well-studied independently, their interaction receives less attention. The analysis covers top-ranked semantic matches but cannot rule out relevant work outside this scope, particularly in broader optimization or continual learning literature not captured by the curriculum-focused taxonomy.

Taxonomy

Core-task Taxonomy Papers: 31
Claimed Contributions: 3
Contribution Candidate Papers Compared: 10
Refutable Papers: 0

Research Landscape Overview

Core task: curriculum-based pretraining with learning rate decay schedules. The field organizes around three main branches that capture complementary perspectives on how to structure training progressions. Curriculum Design and Scheduling Strategies encompasses works that determine the ordering and pacing of training examples, ranging from instance hardness metrics like Dynamic Instance Hardness[3] to domain-specific sequencing approaches such as Length-based Curriculum[4] and CurriMAE[6]. Optimization and Training Dynamics focuses on the interplay between curriculum choices and optimizer behavior, including learning rate scheduling methods like Learning Rate Curriculum[5] and adaptive strategies such as NARS Scheduler[19], as well as broader training dynamics studies like Mid-training Survey[8]. Domain-Specific Applications and Implementations demonstrates how curriculum principles translate to specialized settings, from robotics tasks like Concentric Tube Robot Curriculum[20] to large-scale language model pretraining exemplified by Billion-scale GPT Curriculum[22] and financial domain work in Financial NLP Supercomputing[14].

A central tension across these branches concerns whether and how learning rate decay should be coordinated with curriculum progression. Some works treat scheduling as an independent optimization concern, while others tightly couple data ordering with adaptive learning rates, as seen in MetaSLRCL[16] and Stage-Decaying Clipping[15].

Learning Rate Decay Wastes Data[0] sits squarely within the Optimization and Training Dynamics branch, specifically addressing curriculum-compatible learning rate decay strategies. Its emphasis contrasts with nearby work like Billion-scale GPT Curriculum[22], which focuses more on data ordering at scale, and Learning Rate Curriculum[5], which explores learning rate as a curriculum signal itself.
The original paper's positioning suggests a critical examination of how standard decay schedules may underutilize training data when combined with curriculum approaches, raising questions about whether conventional optimization recipes remain optimal under structured data presentation.

Claimed Contributions

Identification of incompatibility between data curriculum and learning rate decay

The authors identify that curriculum-based pretraining, which orders data by ascending quality, is fundamentally incompatible with standard decaying learning rate schedules. This incompatibility causes high-quality data to be processed when the learning rate is at its lowest, thereby wasting the potential benefits of curriculum learning.

3 retrieved papers
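To make the magnitude of this incompatibility concrete, the toy calculation below evaluates a standard cosine decay schedule near the end of training, where an ascending-quality curriculum places its highest-quality data. The peak learning rate and step count are illustrative assumptions, not values taken from the paper.

```python
import math

def cosine_lr(step, total_steps, peak_lr=3e-4, final_lr=0.0):
    """Standard cosine decay from peak_lr down to final_lr."""
    progress = step / total_steps
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))

total = 10_000          # assumed step count, for illustration only
peak = 3e-4             # assumed peak LR, for illustration only

# Under an ascending-quality curriculum, the best data arrives in the
# final stretch of training, when cosine decay has nearly reached zero.
late_lr = cosine_lr(int(0.95 * total), total, peak_lr=peak)
print(f"LR at 95% of training: {late_lr:.2e} ({late_lr / peak:.2%} of peak)")
```

At 95% of training the learning rate is well under 1% of its peak, so the highest-quality examples contribute almost nothing to the final weights.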
Curriculum Model Averaging (CMA) strategy

The authors propose Curriculum Model Averaging (CMA), which replaces learning rate decay with model averaging (computing a weighted average of final checkpoints) while maintaining a constant learning rate. This allows the model to fully exploit high-quality data introduced later in the curriculum, achieving substantial benchmark improvements.

6 retrieved papers
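A minimal sketch of the checkpoint-averaging step described above. The checkpoint format (plain dicts of float lists) and the linearly increasing weights are illustrative assumptions; the paper's exact weighting scheme is not specified in this report.

```python
def average_checkpoints(checkpoints, weights=None):
    """Compute a weighted average of model checkpoints.

    checkpoints: list of dicts mapping parameter name -> list of floats
                 (a stand-in for real model state dicts).
    weights: optional per-checkpoint weights; uniform if omitted.
    """
    if weights is None:
        weights = [1.0] * len(checkpoints)
    total = sum(weights)
    weights = [w / total for w in weights]  # normalize to sum to 1
    averaged = {}
    for name in checkpoints[0]:
        averaged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints))
            for i in range(len(checkpoints[0][name]))
        ]
    return averaged

# Three toy "final checkpoints"; later checkpoints (trained on
# higher-quality curriculum data) are up-weighted here by assumption.
ckpts = [{"w": [1.0, 2.0]}, {"w": [2.0, 4.0]}, {"w": [3.0, 6.0]}]
avg = average_checkpoints(ckpts, weights=[1, 2, 3])
# weighted avg of [1, 2, 3] with weights [1, 2, 3]/6 = 14/6 ≈ 2.333
```

Because the learning rate stays constant, the noise-reduction role normally played by LR decay is taken over by the averaging itself, while late high-quality data still drives full-sized updates.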
Co-design framework for data curricula, learning rate schedules, and model averaging

The authors demonstrate that combining moderate learning rate decay, curriculum learning, and weight averaging produces synergistic advantages. They identify a previously unexplored high-performing pretraining regime and advocate for co-designing these components rather than optimizing them in isolation.

1 retrieved paper
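The "moderate decay" component of this co-design can be sketched as a cosine schedule with a raised floor. The final-to-peak ratio of 0.5 below is an illustrative assumption rather than the paper's tuned value.

```python
import math

def moderate_cosine_lr(step, total_steps, peak_lr=3e-4, final_ratio=0.5):
    """Cosine decay whose floor is final_ratio * peak_lr, so high-quality
    data seen late in the curriculum still receives meaningfully large
    updates. final_ratio and peak_lr are assumed, illustrative values."""
    final_lr = final_ratio * peak_lr
    progress = step / total_steps
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))

# The schedule starts at the peak LR and ends at half of it,
# instead of decaying to (near) zero as standard cosine schedules do.
start_lr = moderate_cosine_lr(0, 10_000)
end_lr = moderate_cosine_lr(10_000, 10_000)
```

Under the co-design view, this raised floor is then combined with ascending-quality data ordering and final-checkpoint averaging, rather than tuning each component in isolation.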

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Identification of incompatibility between data curriculum and learning rate decay

The authors identify that curriculum-based pretraining, which orders data by ascending quality, is fundamentally incompatible with standard decaying learning rate schedules. This incompatibility causes high-quality data to be processed when the learning rate is at its lowest, thereby wasting the potential benefits of curriculum learning.

Contribution

Curriculum Model Averaging (CMA) strategy

The authors propose Curriculum Model Averaging (CMA), which replaces learning rate decay with model averaging (computing a weighted average of final checkpoints) while maintaining a constant learning rate. This allows the model to fully exploit high-quality data introduced later in the curriculum, achieving substantial benchmark improvements.

Contribution

Co-design framework for data curricula, learning rate schedules, and model averaging

The authors demonstrate that combining moderate learning rate decay, curriculum learning, and weight averaging produces synergistic advantages. They identify a previously unexplored high-performing pretraining regime and advocate for co-designing these components rather than optimizing them in isolation.