WSM: Decay-Free Learning Rate Schedule via Checkpoint Merging for LLM Pre-training

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: LLM pre-training, learning rate schedule, checkpoint merging, decay-free approach
Abstract:

Recent advances in learning rate (LR) scheduling have demonstrated the effectiveness of decay-free approaches that eliminate the traditional decay phase while maintaining competitive performance. Model merging techniques have emerged as particularly promising solutions in this domain. We present Warmup-Stable and Merge (WSM), a general framework that establishes a formal connection between learning rate decay and model merging. WSM provides a unified theoretical foundation for emulating various decay strategies—including cosine decay, linear decay, and inverse square root decay—as principled model averaging schemes, while remaining fully compatible with diverse optimization methods. Through extensive experiments, we identify merge duration—the training window for checkpoint aggregation—as the most critical factor influencing model performance, surpassing the importance of both checkpoint interval and merge quantity. Our framework consistently outperforms the widely adopted Warmup-Stable-Decay (WSD) approach across multiple benchmarks, achieving significant improvements of +3.5% on MATH, +2.9% on HumanEval, and +5.5% on MMLU-Pro. The performance advantages extend to supervised fine-tuning scenarios, highlighting WSM's potential for long-term model refinement.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes WSM, a framework that connects learning rate decay strategies to checkpoint merging by deriving merge weights from decay schedules. It resides in the 'Novel Schedule Architectures' leaf, which contains four papers exploring alternatives to standard cosine or linear decay. This leaf sits within the broader 'Learning Rate Schedule Design and Optimization' branch, indicating a moderately populated research direction focused on rethinking fundamental schedule structures. The taxonomy shows that while schedule design is an active area, this specific leaf is not overcrowded, suggesting room for novel architectural contributions.

The taxonomy reveals neighboring work in 'Checkpoint Averaging and Model Merging' (two papers) and 'Theoretical Foundations and Scaling Laws' (four papers), both closely related to WSM's dual focus on merging mechanics and theoretical grounding. The 'Adaptive and Dynamic Schedule Optimization' leaf (three papers) explores online tuning, contrasting with WSM's fixed framework approach. The taxonomy's scope notes clarify that WSM's theoretical connection between decay and merging distinguishes it from purely empirical checkpoint reuse methods, positioning it at the intersection of schedule design and training efficiency.

Among seventeen candidates examined, no contribution was clearly refuted. The core WSM framework examined ten candidates with zero refutations, the theorem derivation examined one candidate with no overlap, and the merge duration finding examined six candidates without refutation. This limited search scope—seventeen papers from semantic retrieval—means the analysis captures nearby work but cannot claim exhaustive coverage. The absence of refutations among examined candidates suggests the specific framing of decay-to-merging equivalence may be underexplored, though the small sample size limits confidence in this assessment.

Based on top-seventeen semantic matches, the work appears to occupy a relatively sparse intersection between schedule architecture and checkpoint merging theory. The taxonomy structure confirms that while both schedule design and merging are active areas, their formal unification is less densely studied. However, the limited search scope means potentially relevant work in adjacent leaves or outside the top-K results may not be reflected here.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 17
Refutable Papers: 0

Research Landscape Overview

Core task: learning rate scheduling for large language model pre-training. The field has evolved into a rich landscape of interconnected research directions. At the highest level, the taxonomy distinguishes between foundational schedule design (exploring novel architectures and optimization principles), continual and incremental pre-training (addressing how to adapt schedules when extending or updating models), training efficiency and acceleration methods (focusing on computational trade-offs and system-level speedups), and specialized tuning approaches (including layer-wise or adaptive strategies). Additional branches cover empirical comparisons, infrastructure perspectives, model compression, and domain-specific extensions.

Works such as Rethinking Learning Rate[1] and Beyond Cosine Decay[10] exemplify efforts to rethink classical decay patterns, while Continual Pretraining Warmup[7] and Scalable Continual Pretraining[9] illustrate the growing interest in schedule adjustments for ongoing training. Meanwhile, studies like Critical Batch Size[15] and Compute Optimal Training[17] bridge schedule design with resource allocation, and Sharpness Disparity Principle[16] connects learning rate choices to loss landscape geometry.

Several active lines of work highlight contrasting priorities and open questions. One thread investigates whether traditional cosine or linear decay remains optimal, with papers like Straight to Zero[8] and Optimal Linear Decay[37] proposing alternatives that challenge conventional wisdom. Another thread explores checkpoint reuse and model merging strategies—such as Early Weight Averaging[5] and Reuse Dont Retrain[3]—to reduce redundant computation. Within this landscape, WSM Checkpoint Merging[0] sits naturally among novel schedule architectures, sharing thematic ties with Learning Rate Path Switching[2] and River Valley Landscape[23], all of which examine how to navigate or combine different training trajectories. Compared to Reuse Dont Retrain[3], which emphasizes recycling existing checkpoints, WSM Checkpoint Merging[0] appears to focus more directly on merging strategies as a schedule design primitive. The broader tension between discovering fundamentally new schedules versus refining existing ones for continual or resource-constrained settings remains a central question across these branches.

Claimed Contributions

WSM framework connecting LR decay and checkpoint merging

The authors introduce WSM, a framework that theoretically connects learning rate decay strategies to checkpoint merging operations. This framework provides a unified foundation for emulating various decay strategies (cosine, linear, inverse square root) as principled model averaging schemes while remaining compatible with diverse optimization methods.

10 retrieved papers

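The decay-as-averaging correspondence described above can be sketched concretely. This is our own illustration of the general idea, not the paper's code: for SGD with constant LR η, checkpoint θ_k = θ_0 − η·Σ_{t≤k} g_t, so a weighted average Σ_k w_k·θ_k applies the effective coefficient η·Σ_{k≥t} w_k to gradient g_t; choosing each w_k as the drop of a normalized decay curve over the k-th sub-interval therefore emulates that decay schedule.

```python
import math

def merge_weights(decay, n):
    """Merge weights for n evenly spaced checkpoints so that their
    weighted average emulates the given decay curve.

    `decay` maps t in [0, 1] to a relative LR with decay(0)=1 and
    decay(1)=0; weight k is the curve's drop over the k-th interval,
    so the weights sum to decay(0) - decay(1) = 1."""
    return [decay((k - 1) / n) - decay(k / n) for k in range(1, n + 1)]

linear = lambda t: 1.0 - t
cosine = lambda t: 0.5 * (1.0 + math.cos(math.pi * t))

# Uniform averaging corresponds to linear decay:
print(merge_weights(linear, 4))  # → [0.25, 0.25, 0.25, 0.25]
# Cosine decay yields non-uniform weights that still sum to 1:
print(merge_weights(cosine, 4))
```

The linear case recovers the familiar fact that plain checkpoint averaging over a window behaves like a linear decay phase; other schedules simply reshape the weights.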
Theorem for deriving checkpoint weights from decay schedules

The authors formalize Theorem 3.1, which provides a principled method to derive checkpoint merging weights from any desired gradient decay schedule. This theorem enables the conversion of LR decay methods into theoretically approximate model averaging implementations.

1 retrieved paper

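A toy check of the kind of statement Theorem 3.1 formalizes (our reconstruction from the claim above, not the paper's theorem or proof): over a fixed gradient sequence, uniformly averaging the last n constant-LR checkpoints is algebraically identical to a single run whose effective LR decays linearly over that window, by a telescoping-sum argument.

```python
import random

random.seed(0)
eta, n = 0.1, 8
grads = [random.gauss(0.0, 1.0) for _ in range(n)]  # stand-in scalar gradients

# Constant-LR trajectory: theta_k = theta_0 - eta * sum(g_1..g_k)
theta0 = 3.0
ckpts, theta = [], theta0
for g in grads:
    theta -= eta * g
    ckpts.append(theta)

merged = sum(ckpts) / n  # uniform average of the n checkpoints

# Same gradients applied once, with linearly decaying effective LR:
# gradient g_t retains the coefficient eta * (n - t + 1) / n.
decayed = theta0
for t, g in enumerate(grads, start=1):
    decayed -= eta * (n - t + 1) / n * g

print(abs(merged - decayed))  # ~0, up to float rounding
```

Note the identity is exact only for the gradients actually taken along the constant-LR trajectory; a genuine decayed run would visit different points and produce different gradients, which is where the "theoretically approximate" qualifier comes in.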
Identification of merge duration as critical performance factor

Through systematic experiments, the authors identify merge duration (the training window for checkpoint aggregation) as the most important factor affecting model performance in their framework, surpassing the importance of checkpoint saving intervals and the number of checkpoints merged.

6 retrieved papers
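The three factors are coupled: merge duration ≈ checkpoint interval × merge quantity, so fixing the duration determines how many checkpoints a given saving interval contributes. A hypothetical sketch of selecting the checkpoints inside a final merge window (function and parameter names are our own, not the paper's):

```python
def window_checkpoints(total_steps, save_interval, merge_duration):
    """Training steps of the checkpoints falling inside the final merge
    window. `merge_duration` is the span of training (in steps) being
    aggregated; the number of checkpoints merged then follows as
    merge_duration // save_interval."""
    start = total_steps - merge_duration
    return [s for s in range(save_interval, total_steps + 1, save_interval)
            if s > start]

steps = window_checkpoints(total_steps=10_000, save_interval=500,
                           merge_duration=2_000)
print(len(steps), steps[0], steps[-1])  # → 4 8500 10000
```

Under this framing, sweeping the merge duration while holding the interval fixed (or vice versa) is what lets the ablation separate the window length from the checkpoint count.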

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: WSM framework connecting LR decay and checkpoint merging

Contribution 2: Theorem for deriving checkpoint weights from decay schedules

Contribution 3: Identification of merge duration as critical performance factor