How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: LLM pretraining, Curriculum Learning, Model Weight Averaging
Abstract:

Due to the scarcity of high-quality data, large language models (LLMs) are often trained on mixtures of data with varying quality levels, even after sophisticated data curation. A natural approach to better leverage high-quality data is curriculum-based pretraining, where the model is trained on data sorted in ascending order of quality as determined by a quality metric. However, prior studies have reported limited improvements from such curriculum-based pretraining strategies. This work identifies a critical factor constraining these methods: the incompatibility between the ascending data quality order and the decaying learning rate (LR) schedule. We find that while curriculum-based training substantially outperforms random shuffling when using a constant LR, its advantage diminishes under standard LR decay schedules. Our experiments show this incompatibility can be mitigated by two simple strategies: (1) employing a more moderate LR decay schedule, where the final LR is only moderately smaller than the peak LR, and (2) replacing LR decay with model averaging, i.e., computing a weighted average of the final few checkpoints. By combining these strategies, we improve the average score on a suite of standard benchmarks by 1.64% over random shuffling, without additional data refinement. Validated on 1.5B-parameter models trained over 30B tokens with various data-quality metrics, our findings call for a re-evaluation of curriculum-based LLM pretraining and underscore the potential of co-designing data curricula with optimization methods.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper identifies a fundamental incompatibility between curriculum-based data ordering (ascending quality) and standard learning rate decay schedules in LLM pretraining. It proposes two mitigation strategies: moderate LR decay and Curriculum Model Averaging (CMA). Within the taxonomy, this work resides in the 'Curriculum-Compatible Learning Rate Decay' leaf under 'Optimization and Training Dynamics'. This leaf contains only two papers total, indicating a relatively sparse research direction focused specifically on reconciling curriculum progression with optimization schedules.

The taxonomy reveals that most curriculum learning research separates data ordering concerns (under 'Curriculum Design and Scheduling Strategies') from optimization dynamics. Neighboring leaves include 'General Scheduling and Decay Patterns' with standard approaches like cosine annealing, and 'Layer-Wise and Adaptive Optimization' exploring parameter-specific learning rates. The paper bridges these areas by arguing that curriculum effectiveness depends critically on co-designing data order and LR schedules, a perspective less emphasized in sibling work like 'Learning Rate Curriculum' which treats LR itself as a curriculum signal rather than examining schedule-curriculum interactions.

Among ten candidates examined across three contributions, none clearly refute the core claims. The incompatibility identification examined three candidates with zero refutations, suggesting limited prior work explicitly analyzing this curriculum-LR decay interaction. The CMA strategy examined six candidates without refutation, indicating novelty in applying model averaging specifically to curriculum contexts. The co-design framework examined one candidate. The small search scope (ten total candidates from semantic search) means these findings reflect top-ranked related work rather than exhaustive coverage, but the absence of overlapping prior work across all contributions is noteworthy.

Based on the limited literature search, the work appears to occupy a genuinely sparse intersection between curriculum learning and learning rate scheduling. The taxonomy structure confirms that while both curriculum design and optimization dynamics are well-studied independently, their interaction receives less attention. The analysis covers top-ranked semantic matches but cannot rule out relevant work outside this scope, particularly in broader optimization or continual learning literature not captured by the curriculum-focused taxonomy.

Taxonomy

Core-task Taxonomy Papers: 31
Claimed Contributions: 3
Contribution Candidate Papers Compared: 10
Refutable Papers: 0

Research Landscape Overview

Core task: curriculum-based pretraining with learning rate decay schedules. The field organizes around three main branches that capture complementary perspectives on how to structure training progressions. Curriculum Design and Scheduling Strategies encompasses works that determine the ordering and pacing of training examples, ranging from instance hardness metrics like Dynamic Instance Hardness[3] to domain-specific sequencing approaches such as Length-based Curriculum[4] and CurriMAE[6]. Optimization and Training Dynamics focuses on the interplay between curriculum choices and optimizer behavior, including learning rate scheduling methods like Learning Rate Curriculum[5] and adaptive strategies such as NARS Scheduler[19], as well as broader training dynamics studies like Mid-training Survey[8]. Domain-Specific Applications and Implementations demonstrates how curriculum principles translate to specialized settings, from robotics tasks like Concentric Tube Robot Curriculum[20] to large-scale language model pretraining exemplified by Billion-scale GPT Curriculum[22] and financial domain work in Financial NLP Supercomputing[14].

A central tension across these branches concerns whether and how learning rate decay should be coordinated with curriculum progression. Some works treat scheduling as an independent optimization concern, while others tightly couple data ordering with adaptive learning rates, as seen in MetaSLRCL[16] and Stage-Decaying Clipping[15].

Learning Rate Decay Wastes Data[0] sits squarely within the Optimization and Training Dynamics branch, specifically addressing curriculum-compatible learning rate decay strategies. Its emphasis contrasts with nearby work like Billion-scale GPT Curriculum[22], which focuses more on data ordering at scale, and Learning Rate Curriculum[5], which explores learning rate as a curriculum signal itself.
The original paper's positioning suggests a critical examination of how standard decay schedules may underutilize training data when combined with curriculum approaches, raising questions about whether conventional optimization recipes remain optimal under structured data presentation.

Claimed Contributions

Identification of incompatibility between data curriculum and learning rate decay

The authors identify that curriculum-based pretraining, which orders data by ascending quality, is fundamentally incompatible with standard decaying learning rate schedules. This incompatibility causes high-quality data to be processed when the learning rate is at its lowest, thereby wasting the potential benefits of curriculum learning.

3 retrieved papers
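To make the magnitude of this incompatibility concrete, the toy calculation below evaluates a standard cosine decay schedule near the end of training, where an ascending-quality curriculum places its highest-quality data. The peak learning rate and step count are illustrative assumptions, not values taken from the paper.

```python
import math

def cosine_lr(step, total_steps, peak_lr=3e-4, final_lr=0.0):
    """Standard cosine decay from peak_lr down to final_lr."""
    progress = step / total_steps
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))

total = 10_000          # assumed step count, for illustration only
peak = 3e-4             # assumed peak LR, for illustration only

# Under an ascending-quality curriculum, the best data arrives in the
# final stretch of training, when cosine decay has nearly reached zero.
late_lr = cosine_lr(int(0.95 * total), total, peak_lr=peak)
print(f"LR at 95% of training: {late_lr:.2e} ({late_lr / peak:.2%} of peak)")
```

At 95% of training the learning rate is well under 1% of its peak, so the highest-quality examples contribute almost nothing to the final weights.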
Curriculum Model Averaging (CMA) strategy

The authors propose Curriculum Model Averaging (CMA), which replaces learning rate decay with model averaging (computing a weighted average of final checkpoints) while maintaining a constant learning rate. This allows the model to fully exploit high-quality data introduced later in the curriculum, achieving substantial benchmark improvements.

6 retrieved papers
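A minimal sketch of the checkpoint-averaging step described above. The checkpoint format (plain dicts of float lists) and the linearly increasing weights are illustrative assumptions; the paper's exact weighting scheme is not specified in this report.

```python
def average_checkpoints(checkpoints, weights=None):
    """Compute a weighted average of model checkpoints.

    checkpoints: list of dicts mapping parameter name -> list of floats
                 (a stand-in for real model state dicts).
    weights: optional per-checkpoint weights; uniform if omitted.
    """
    if weights is None:
        weights = [1.0] * len(checkpoints)
    total = sum(weights)
    weights = [w / total for w in weights]  # normalize to sum to 1
    averaged = {}
    for name in checkpoints[0]:
        averaged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints))
            for i in range(len(checkpoints[0][name]))
        ]
    return averaged

# Three toy "final checkpoints"; later checkpoints (trained on
# higher-quality curriculum data) are up-weighted here by assumption.
ckpts = [{"w": [1.0, 2.0]}, {"w": [2.0, 4.0]}, {"w": [3.0, 6.0]}]
avg = average_checkpoints(ckpts, weights=[1, 2, 3])
# weighted avg of [1, 2, 3] with weights [1, 2, 3]/6 = 14/6 ≈ 2.333
```

Because the learning rate stays constant, the noise-reduction role normally played by LR decay is taken over by the averaging itself, while late high-quality data still drives full-sized updates.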
Co-design framework for data curricula, learning rate schedules, and model averaging

The authors demonstrate that combining moderate learning rate decay, curriculum learning, and weight averaging produces synergistic advantages. They identify a previously unexplored high-performing pretraining regime and advocate for co-designing these components rather than optimizing them in isolation.

1 retrieved paper
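The "moderate decay" component of this co-design can be sketched as a cosine schedule with a raised floor. The final-to-peak ratio of 0.5 below is an illustrative assumption rather than the paper's tuned value.

```python
import math

def moderate_cosine_lr(step, total_steps, peak_lr=3e-4, final_ratio=0.5):
    """Cosine decay whose floor is final_ratio * peak_lr, so high-quality
    data seen late in the curriculum still receives meaningfully large
    updates. final_ratio and peak_lr are assumed, illustrative values."""
    final_lr = final_ratio * peak_lr
    progress = step / total_steps
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))

# The schedule starts at the peak LR and ends at half of it,
# instead of decaying to (near) zero as standard cosine schedules do.
start_lr = moderate_cosine_lr(0, 10_000)
end_lr = moderate_cosine_lr(10_000, 10_000)
```

Under the co-design view, this raised floor is then combined with ascending-quality data ordering and final-checkpoint averaging, rather than tuning each component in isolation.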

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Identification of incompatibility between data curriculum and learning rate decay

The authors identify that curriculum-based pretraining, which orders data by ascending quality, is fundamentally incompatible with standard decaying learning rate schedules. This incompatibility causes high-quality data to be processed when the learning rate is at its lowest, thereby wasting the potential benefits of curriculum learning.

Contribution

Curriculum Model Averaging (CMA) strategy

The authors propose Curriculum Model Averaging (CMA), which replaces learning rate decay with model averaging (computing a weighted average of final checkpoints) while maintaining a constant learning rate. This allows the model to fully exploit high-quality data introduced later in the curriculum, achieving substantial benchmark improvements.

Contribution

Co-design framework for data curricula, learning rate schedules, and model averaging

The authors demonstrate that combining moderate learning rate decay, curriculum learning, and weight averaging produces synergistic advantages. They identify a previously unexplored high-performing pretraining regime and advocate for co-designing these components rather than optimizing them in isolation.