How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining
Overview
Overall Novelty Assessment
The paper identifies a fundamental incompatibility between curriculum-based data ordering (ascending quality) and standard learning rate decay schedules in LLM pretraining. It proposes two mitigation strategies: moderate LR decay and Curriculum Model Averaging (CMA). Within the taxonomy, this work resides in the 'Curriculum-Compatible Learning Rate Decay' leaf under 'Optimization and Training Dynamics'. This leaf contains only two papers total, indicating a relatively sparse research direction focused specifically on reconciling curriculum progression with optimization schedules.
The taxonomy reveals that most curriculum learning research separates data ordering concerns (under 'Curriculum Design and Scheduling Strategies') from optimization dynamics. Neighboring leaves include 'General Scheduling and Decay Patterns' with standard approaches like cosine annealing, and 'Layer-Wise and Adaptive Optimization' exploring parameter-specific learning rates. The paper bridges these areas by arguing that curriculum effectiveness depends critically on co-designing data order and LR schedules, a perspective less emphasized in sibling work like 'Learning Rate Curriculum' which treats LR itself as a curriculum signal rather than examining schedule-curriculum interactions.
Among the ten candidate papers examined across the three contributions, none clearly refutes the core claims. The incompatibility identification was checked against three candidates with zero refutations, suggesting limited prior work explicitly analyzing this curriculum-LR decay interaction. The CMA strategy was checked against six candidates without refutation, indicating novelty in applying model averaging specifically to curriculum contexts. The co-design framework was checked against one candidate. The small search scope (ten candidates total, drawn from semantic search) means these findings reflect top-ranked related work rather than exhaustive coverage, but the absence of overlapping prior work across all three contributions is noteworthy.
Based on the limited literature search, the work appears to occupy a genuinely sparse intersection between curriculum learning and learning rate scheduling. The taxonomy structure confirms that while both curriculum design and optimization dynamics are well-studied independently, their interaction receives less attention. The analysis covers top-ranked semantic matches but cannot rule out relevant work outside this scope, particularly in broader optimization or continual learning literature not captured by the curriculum-focused taxonomy.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors identify that curriculum-based pretraining, which orders data by ascending quality, is fundamentally incompatible with standard decaying learning rate schedules. This incompatibility causes high-quality data to be processed when the learning rate is at its lowest, thereby wasting the potential benefits of curriculum learning.
The authors propose Curriculum Model Averaging (CMA), which replaces learning rate decay with model averaging (computing a weighted average of final checkpoints) while maintaining a constant learning rate. This allows the model to fully exploit high-quality data introduced later in the curriculum, achieving substantial benchmark improvements.
The authors demonstrate that combining moderate learning rate decay, curriculum learning, and weight averaging produces synergistic advantages. They identify a previously unexplored high-performing pretraining regime and advocate for co-designing these components rather than optimizing them in isolation.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[22] Curriculum learning: A regularization method for efficient and stable billion-scale GPT model pre-training PDF
Contribution Analysis
Detailed comparisons for each claimed contribution
Identification of incompatibility between data curriculum and learning rate decay
The authors identify that curriculum-based pretraining, which orders data by ascending quality, is fundamentally incompatible with standard decaying learning rate schedules. This incompatibility causes high-quality data to be processed when the learning rate is at its lowest, thereby wasting the potential benefits of curriculum learning.
[39] Research on Tibetan-Chinese Translation Method Based on Improved Transformer Model PDF
[40] Curriculum Learning: From Human Strategies to Learning Dynamics PDF
[41] A New Methodology for Edge Intelligence Data Quality Evaluation in IID and Non-IID Dataset in Federated Learning PDF
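The incompatibility claim can be illustrated with a short sketch: under a standard cosine schedule, the steps an ascending-quality curriculum reserves for the best data are exactly the steps where the learning rate has nearly vanished. The schedule below is a generic cosine anneal with illustrative peak and floor values, not the paper's exact configuration.

```python
import math

def cosine_lr(step, total_steps, lr_max=3e-4, lr_min=3e-5):
    # Standard cosine annealing from lr_max at step 0 down to lr_min
    # at the final step.
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

total = 10_000
# Ascending-quality curriculum: the highest-quality data occupies the
# final steps, precisely where the decayed learning rate is smallest.
lr_low_quality_phase = cosine_lr(int(0.10 * total), total)   # early training
lr_high_quality_phase = cosine_lr(int(0.95 * total), total)  # late training
```

Because the cosine schedule is monotonically decreasing, the high-quality tail of the curriculum always receives the smallest updates, which is the waste the paper identifies.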
Curriculum Model Averaging (CMA) strategy
The authors propose Curriculum Model Averaging (CMA), which replaces learning rate decay with model averaging (computing a weighted average of final checkpoints) while maintaining a constant learning rate. This allows the model to fully exploit high-quality data introduced later in the curriculum, achieving substantial benchmark improvements.
[32] On efficient training of large-scale deep learning models PDF
[33] On Efficient Training of Large-Scale Deep Learning Models: A Literature Review PDF
[34] Curriculum learning effectively improves low data VQA PDF
[35] Lan-Bridge's Submission to CCMT 2024 PDF
[36] Lan-Bridge's Submission to CCMT 2024 Translation Evaluation Task PDF
[37] Quantization of Deep Neural Networks for Improving the Generalization Capability PDF
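A minimal sketch of the averaging step behind a CMA-style strategy: train at a constant learning rate, save several checkpoints near the end of training, and replace learning rate decay with a weighted average of their parameters. The dict-of-floats checkpoint representation and the uniform normalization are assumptions for illustration, not the paper's implementation.

```python
def curriculum_model_average(checkpoints, weights):
    """Weighted average of parameter dicts standing in for model checkpoints.

    checkpoints: list of {param_name: value} dicts saved near the end of
                 constant-LR curriculum training.
    weights:     one non-negative weight per checkpoint (normalized here).
    """
    total = sum(weights)
    averaged = {name: 0.0 for name in checkpoints[0]}
    for ckpt, w in zip(checkpoints, weights):
        for name, value in ckpt.items():
            averaged[name] += (w / total) * value
    return averaged

# Two toy "checkpoints" with a single parameter, averaged uniformly.
avg = curriculum_model_average([{"w": 1.0}, {"w": 3.0}], [1.0, 1.0])
```

The averaging plays the smoothing role that LR decay normally plays, while the constant learning rate lets the model take full-sized steps on the high-quality data at the end of the curriculum.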
Co-design framework for data curricula, learning rate schedules, and model averaging
The authors demonstrate that combining moderate learning rate decay, curriculum learning, and weight averaging produces synergistic advantages. They identify a previously unexplored high-performing pretraining regime and advocate for co-designing these components rather than optimizing them in isolation.
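The co-designed regime described above can be sketched as two interacting pieces: a "moderate" cosine schedule that decays only to a floor, so late high-quality data still sees a usable learning rate, and a running weight average. The floor fraction, the EMA form of averaging, and all function names here are illustrative assumptions, not the paper's exact recipe.

```python
import math

def moderate_cosine_lr(step, total_steps, lr_max=3e-4, floor_frac=0.5):
    # "Moderate" decay: cosine anneal, but only down to floor_frac * lr_max,
    # preserving meaningful step sizes for the curriculum's final phase.
    lr_min = floor_frac * lr_max
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

def ema_update(avg_params, params, decay=0.99):
    # Exponential moving average of weights: one common form of the
    # weight averaging combined with moderate decay in this regime.
    return {k: decay * avg_params[k] + (1 - decay) * params[k] for k in params}

final_lr = moderate_cosine_lr(10_000, 10_000)        # bottoms out at the floor
ema = ema_update({"w": 0.0}, {"w": 1.0})             # averaged parameters
```

The point of the co-design is that no single component is tuned in isolation: the decay floor keeps late updates large, and the averaging supplies the stability that aggressive decay would otherwise provide.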