WSM: Decay-Free Learning Rate Schedule via Checkpoint Merging for LLM Pre-training

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: LLM pre-training, learning rate schedule, checkpoint merging, decay-free approach
Abstract:

Recent advances in learning rate (LR) scheduling have demonstrated the effectiveness of decay-free approaches that eliminate the traditional decay phase while maintaining competitive performance. Model merging techniques have emerged as particularly promising solutions in this domain. We present Warmup-Stable and Merge (WSM), a general framework that establishes a formal connection between learning rate decay and model merging. WSM provides a unified theoretical foundation for emulating various decay strategies—including cosine decay, linear decay, and inverse square root decay—as principled model averaging schemes, while remaining fully compatible with diverse optimization methods. Through extensive experiments, we identify merge duration—the training window for checkpoint aggregation—as the most critical factor influencing model performance, surpassing the importance of both checkpoint interval and merge quantity. Our framework consistently outperforms the widely adopted Warmup-Stable-Decay (WSD) approach across multiple benchmarks, achieving significant improvements of +3.5% on MATH, +2.9% on HumanEval, and +5.5% on MMLU-Pro. The performance advantages extend to supervised fine-tuning scenarios, highlighting WSM's potential for long-term model refinement.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes WSM, a framework that connects learning rate decay strategies to checkpoint merging by deriving merge weights from decay schedules. It resides in the 'Novel Schedule Architectures' leaf, which contains four papers exploring alternatives to standard cosine or linear decay. This leaf sits within the broader 'Learning Rate Schedule Design and Optimization' branch, indicating a moderately populated research direction focused on rethinking fundamental schedule structures. The taxonomy shows that while schedule design is an active area, this specific leaf is not overcrowded, suggesting room for novel architectural contributions.

The taxonomy reveals neighboring work in 'Checkpoint Averaging and Model Merging' (two papers) and 'Theoretical Foundations and Scaling Laws' (four papers), both closely related to WSM's dual focus on merging mechanics and theoretical grounding. The 'Adaptive and Dynamic Schedule Optimization' leaf (three papers) explores online tuning, contrasting with WSM's fixed framework approach. The taxonomy's scope notes clarify that WSM's theoretical connection between decay and merging distinguishes it from purely empirical checkpoint reuse methods, positioning it at the intersection of schedule design and training efficiency.

Among seventeen candidates examined, no contribution was clearly refuted. The core WSM framework examined ten candidates with zero refutations, the theorem derivation examined one candidate with no overlap, and the merge duration finding examined six candidates without refutation. This limited search scope—seventeen papers from semantic retrieval—means the analysis captures nearby work but cannot claim exhaustive coverage. The absence of refutations among examined candidates suggests the specific framing of decay-to-merging equivalence may be underexplored, though the small sample size limits confidence in this assessment.

Based on top-seventeen semantic matches, the work appears to occupy a relatively sparse intersection between schedule architecture and checkpoint merging theory. The taxonomy structure confirms that while both schedule design and merging are active areas, their formal unification is less densely studied. However, the limited search scope means potentially relevant work in adjacent leaves or outside the top-K results may not be reflected here.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 17
Refutable Papers: 0

Research Landscape Overview

Core task: learning rate scheduling for large language model pre-training. The field has evolved into a rich landscape of interconnected research directions. At the highest level, the taxonomy distinguishes between foundational schedule design (exploring novel architectures and optimization principles), continual and incremental pre-training (addressing how to adapt schedules when extending or updating models), training efficiency and acceleration methods (focusing on computational trade-offs and system-level speedups), and specialized tuning approaches (including layer-wise or adaptive strategies). Additional branches cover empirical comparisons, infrastructure perspectives, model compression, and domain-specific extensions.

Works such as Rethinking Learning Rate[1] and Beyond Cosine Decay[10] exemplify efforts to rethink classical decay patterns, while Continual Pretraining Warmup[7] and Scalable Continual Pretraining[9] illustrate the growing interest in schedule adjustments for ongoing training. Meanwhile, studies like Critical Batch Size[15] and Compute Optimal Training[17] bridge schedule design with resource allocation, and Sharpness Disparity Principle[16] connects learning rate choices to loss landscape geometry.

Several active lines of work highlight contrasting priorities and open questions. One thread investigates whether traditional cosine or linear decay remains optimal, with papers like Straight to Zero[8] and Optimal Linear Decay[37] proposing alternatives that challenge conventional wisdom. Another thread explores checkpoint reuse and model merging strategies—such as Early Weight Averaging[5] and Reuse Dont Retrain[3]—to reduce redundant computation. Within this landscape, WSM Checkpoint Merging[0] sits naturally among novel schedule architectures, sharing thematic ties with Learning Rate Path Switching[2] and River Valley Landscape[23], all of which examine how to navigate or combine different training trajectories. Compared to Reuse Dont Retrain[3], which emphasizes recycling existing checkpoints, WSM Checkpoint Merging[0] appears to focus more directly on merging strategies as a schedule design primitive. The broader tension between discovering fundamentally new schedules versus refining existing ones for continual or resource-constrained settings remains a central question across these branches.

Claimed Contributions

WSM framework connecting LR decay and checkpoint merging

The authors introduce WSM, a framework that theoretically connects learning rate decay strategies to checkpoint merging operations. This framework provides a unified foundation for emulating various decay strategies (cosine, linear, inverse square root) as principled model averaging schemes while remaining compatible with diverse optimization methods.

10 retrieved papers

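The decay-as-averaging correspondence described above can be sketched concretely. This is our own illustration of the general idea, not the paper's code: for SGD with constant LR η, checkpoint θ_k = θ_0 − η·Σ_{t≤k} g_t, so a weighted average Σ_k w_k·θ_k applies the effective coefficient η·Σ_{k≥t} w_k to gradient g_t; choosing each w_k as the drop of a normalized decay curve over the k-th sub-interval therefore emulates that decay schedule.

```python
import math

def merge_weights(decay, n):
    """Merge weights for n evenly spaced checkpoints so that their
    weighted average emulates the given decay curve.

    `decay` maps t in [0, 1] to a relative LR with decay(0)=1 and
    decay(1)=0; weight k is the curve's drop over the k-th interval,
    so the weights sum to decay(0) - decay(1) = 1."""
    return [decay((k - 1) / n) - decay(k / n) for k in range(1, n + 1)]

linear = lambda t: 1.0 - t
cosine = lambda t: 0.5 * (1.0 + math.cos(math.pi * t))

# Uniform averaging corresponds to linear decay:
print(merge_weights(linear, 4))  # → [0.25, 0.25, 0.25, 0.25]
# Cosine decay yields non-uniform weights that still sum to 1:
print(merge_weights(cosine, 4))
```

The linear case recovers the familiar fact that plain checkpoint averaging over a window behaves like a linear decay phase; other schedules simply reshape the weights.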
Theorem for deriving checkpoint weights from decay schedules

The authors formalize Theorem 3.1, which provides a principled method to derive checkpoint merging weights from any desired gradient decay schedule. This theorem enables the conversion of LR decay methods into theoretically approximate model averaging implementations.

1 retrieved paper

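A toy check of the kind of statement Theorem 3.1 formalizes (our reconstruction from the claim above, not the paper's theorem or proof): over a fixed gradient sequence, uniformly averaging the last n constant-LR checkpoints is algebraically identical to a single run whose effective LR decays linearly over that window, by a telescoping-sum argument.

```python
import random

random.seed(0)
eta, n = 0.1, 8
grads = [random.gauss(0.0, 1.0) for _ in range(n)]  # stand-in scalar gradients

# Constant-LR trajectory: theta_k = theta_0 - eta * sum(g_1..g_k)
theta0 = 3.0
ckpts, theta = [], theta0
for g in grads:
    theta -= eta * g
    ckpts.append(theta)

merged = sum(ckpts) / n  # uniform average of the n checkpoints

# Same gradients applied once, with linearly decaying effective LR:
# gradient g_t retains the coefficient eta * (n - t + 1) / n.
decayed = theta0
for t, g in enumerate(grads, start=1):
    decayed -= eta * (n - t + 1) / n * g

print(abs(merged - decayed))  # ~0, up to float rounding
```

Note the identity is exact only for the gradients actually taken along the constant-LR trajectory; a genuine decayed run would visit different points and produce different gradients, which is where the "theoretically approximate" qualifier comes in.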
Identification of merge duration as critical performance factor

Through systematic experiments, the authors identify merge duration (the training window for checkpoint aggregation) as the most important factor affecting model performance in their framework, surpassing the importance of checkpoint saving intervals and the number of checkpoints merged.

6 retrieved papers
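The three factors are coupled: merge duration ≈ checkpoint interval × merge quantity, so fixing the duration determines how many checkpoints a given saving interval contributes. A hypothetical sketch of selecting the checkpoints inside a final merge window (function and parameter names are our own, not the paper's):

```python
def window_checkpoints(total_steps, save_interval, merge_duration):
    """Training steps of the checkpoints falling inside the final merge
    window. `merge_duration` is the span of training (in steps) being
    aggregated; the number of checkpoints merged then follows as
    merge_duration // save_interval."""
    start = total_steps - merge_duration
    return [s for s in range(save_interval, total_steps + 1, save_interval)
            if s > start]

steps = window_checkpoints(total_steps=10_000, save_interval=500,
                           merge_duration=2_000)
print(len(steps), steps[0], steps[-1])  # → 4 8500 10000
```

Under this framing, sweeping the merge duration while holding the interval fixed (or vice versa) is what lets the ablation separate the window length from the checkpoint count.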

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: WSM framework connecting LR decay and checkpoint merging

Contribution 2: Theorem for deriving checkpoint weights from decay schedules

Contribution 3: Identification of merge duration as critical performance factor