Abstract:

AdamW has long been the default optimizer for transformer pretraining. For years, our community has searched for faster and more stable optimizers, with only limited success. In this work, we propose a \textbf{single-line modification in PyTorch} to any momentum-based optimizer, which we term a cautious optimizer, e.g., C-AdamW and C-Lion. Our theoretical result shows that this modification preserves Adam's Hamiltonian function and does not break the convergence guarantee under Lyapunov analysis. In addition, our theoretical insight reveals a whole new family of optimizers. Among them, we pick the simplest one for empirical experiments, showing not only consistent speed-ups on LLM pretraining and post-training tasks, but also better results in MAE pretraining, with minimal extra hyperparameter tuning.
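The one-line idea can be sketched as follows, in plain Python rather than PyTorch so the snippet is self-contained. The function name, and the mask-mean renormalization shown, are our reconstruction of the commonly cited formulation, not the paper's exact code:

```python
def cautious_update(update, grad, eps=1e-8):
    """Zero the components of `update` whose sign disagrees with `grad`,
    then rescale so the mean mask value is normalized back to 1
    (our sketch of a C-AdamW-style masking step)."""
    mask = [1.0 if u * g > 0 else 0.0 for u, g in zip(update, grad)]
    scale = len(mask) / (sum(mask) + eps)  # 1 / mean(mask), guarded by eps
    return [u * m * scale for u, m in zip(update, mask)]
```

For example, `cautious_update([0.5, -0.2], [1.0, 1.0])` zeroes the second coordinate, whose sign conflicts with the gradient, and rescales the surviving first coordinate.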

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a single-line modification to momentum-based optimizers that masks parameter updates when gradient and momentum directions conflict. It sits within the Gradient-Momentum Alignment-Based Masking leaf of the taxonomy, which contains only two papers total. This represents a relatively sparse research direction within the broader field of momentum-based optimization improvements, suggesting the specific approach of alignment-based masking has received limited prior exploration compared to alternative strategies like signal processing frameworks or adaptive strength modulation.

The taxonomy reveals six major branches addressing momentum optimization from different angles. The paper's leaf sits under Cautious Update Mechanisms, which contrasts with neighboring branches like Momentum Filtering (using frequency-domain analysis) and Momentum Dynamics (addressing step size control). The sibling leaf, Adaptive Update Strength Modulation, focuses on dynamically adjusting update magnitudes rather than binary masking based on alignment. The taxonomy's scope notes clarify that alignment-based masking excludes uniform momentum application and dynamic alpha adjustment without alignment checks, positioning this work as a targeted intervention rather than a global signal reshaping approach.

Among the 26 candidates examined across three contributions, the core optimizer modification shows overlap with prior work: of 6 candidates examined, 2 are refutable. The theoretical convergence guarantees and the new optimizer family appear more novel, with 10 candidates examined for each and zero refutable matches. This suggests the alignment-based masking concept has some precedent within the limited search scope, while the theoretical analysis and the broader family of optimizers may represent more distinctive contributions. The statistics reflect a focused literature search rather than exhaustive coverage of all momentum optimization research.

Based on the top-26 semantic matches examined, the work appears to occupy a sparsely populated niche within momentum optimization. The alignment-based masking approach has limited direct precedent, though the search scope cannot rule out related work outside the candidate set. The theoretical contributions show no clear overlap among examined papers, though this reflects search limitations rather than definitive novelty claims.

Taxonomy

- Core-task Taxonomy Papers: 15
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 26
- Refutable Papers: 2

Research Landscape Overview

Core task: improving momentum-based optimization with cautious update masking.

The field of momentum-based optimization has evolved into a rich landscape of techniques that refine how momentum accumulates and how parameter updates are applied. The taxonomy reveals six main branches:

- Cautious Update Mechanisms: selective parameter modification through alignment-based masking and adaptive gating
- Momentum Filtering: signal-processing perspectives that denoise or reshape momentum signals
- Domain-Specific Adaptations: momentum tailored to specialized settings such as federated learning or adversarial robustness
- Momentum Dynamics: step-size control and overshoot prevention
- Communication-Efficient Methods: distributed-training bottlenecks
- Non-Standard Domains: momentum extended to unconventional optimization contexts

Representative works such as Mofo[1] and Continual Momentum Filtering[3] illustrate how filtering frameworks can stabilize training, while MomentumSMoe[2] demonstrates domain-specific tuning for mixture-of-experts architectures.

Several active lines of work reveal contrasting philosophies and open questions. One cluster emphasizes cautious or selective updates (deciding when and where to apply momentum), trading off convergence speed against stability and generalization. Another cluster treats momentum as a signal to be filtered or spectrally analyzed, as in Signal Processing SGD[12] and Momentum Frequency Analysis[13], raising questions about optimal filter design and computational overhead.

Cautious Optimizers[0] sits squarely within the gradient-momentum alignment-based masking approach, closely related to MGUP[15]; both mask updates when gradient and momentum disagree. Compared to broader filtering methods like Continual Momentum Filtering[3] or adaptive step-size schemes like AlphaAdam[5], Cautious Optimizers[0] favors a more surgical, alignment-driven intervention rather than global signal reshaping, positioning it as a targeted solution that prevents harmful momentum-driven overshoots while preserving beneficial acceleration.

Claimed Contributions

Cautious Optimizers: One-Line Modification for Momentum-Based Optimizers

The authors introduce a simple modification to momentum-based optimizers that masks updates based on alignment between the update direction and current gradients. This modification requires only one line of code and can be applied to any momentum-based optimizer such as AdamW and Lion.

6 retrieved papers (can refute)
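To illustrate how the mask drops into an arbitrary momentum optimizer, here is a toy heavy-ball SGD loop in plain Python. The function name and hyperparameters are ours, and the mask-mean renormalization used by the paper's variants is omitted for brevity; the paper applies the same idea to AdamW and Lion:

```python
def cautious_sgd_momentum_step(w, m, grad, lr=0.1, beta=0.9):
    """One heavy-ball SGD step with a cautious mask: momentum components
    pointing against the current gradient are excluded from the parameter
    update (illustrative sketch, not the paper's code)."""
    m = [beta * mi + gi for mi, gi in zip(m, grad)]
    masked = [mi if mi * gi > 0 else 0.0 for mi, gi in zip(m, grad)]
    w = [wi - lr * ui for wi, ui in zip(w, masked)]
    return w, m

# Toy run: minimize f(w) = w^2 starting from w = 3; w decays toward 0.
w, m = [3.0], [0.0]
for _ in range(200):
    grad = [2.0 * w[0]]
    w, m = cautious_sgd_momentum_step(w, m, grad)
```

Note that the momentum buffer keeps accumulating even while its contribution is masked out of the step, so the base optimizer's state is left untouched; only the applied update changes.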
Theoretical Convergence Guarantees and Hamiltonian Preservation

The authors provide theoretical analysis demonstrating that cautious optimizers preserve the convergence properties of base optimizers while ensuring monotonic decrease of the loss function. They show this holds under the Hamiltonian+Descent framework and Lyapunov analysis.

10 retrieved papers
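Under simplifying assumptions (a binary mask and a first-order Taylor expansion of the loss; the symbols below are ours, not the paper's exact statement), the monotone-decrease part of the claim can be sketched in a few lines:

```latex
% Let u_t be the base optimizer's update, g_t the gradient, and
% \phi_t \in \{0,1\}^d the cautious mask with
% (\phi_t)_i = \mathbf{1}\big[(u_t)_i\,(g_t)_i > 0\big].
% Each coordinate of the masked inner product is nonnegative by construction:
\langle g_t,\; \phi_t \odot u_t \rangle
  = \sum_i (g_t)_i\,(u_t)_i\,(\phi_t)_i \;\ge\; 0,
% so, to first order, the loss cannot increase under the masked step:
f(x_t - \eta\, \phi_t \odot u_t)
  \approx f(x_t) - \eta\,\langle g_t,\; \phi_t \odot u_t \rangle
  \;\le\; f(x_t).
```

The Hamiltonian-preservation part of the claim strengthens this from a first-order sketch to a statement about a conserved-plus-dissipated energy function of the full optimizer state, which is where the Lyapunov analysis enters.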
New Family of Optimizers Revealed by Theoretical Insight

The theoretical framework developed by the authors reveals a broader family of optimizers beyond the specific cautious variant tested empirically. This family is characterized by different choices of the masking function that satisfy certain theoretical conditions.

10 retrieved papers
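The family can be sketched generically: any elementwise nonnegative mask that vanishes where update and gradient disagree in sign keeps the masked update's inner product with the gradient nonnegative, coordinate by coordinate. The following plain-Python sketch shows the binary cautious mask next to one hypothetical smooth relative (both mask names are ours, chosen for illustration):

```python
import math

def masked_step_direction(update, grad, phi):
    """Generic member of the cautious family: scale each update
    coordinate by a nonnegative mask phi(u, g)."""
    return [phi(u, g) * u for u, g in zip(update, grad)]

# Two illustrative masks. Any phi that vanishes when u * g <= 0 keeps
# <grad, masked update> >= 0 coordinate-wise.
binary_mask = lambda u, g: 1.0 if u * g > 0 else 0.0   # the cautious variant
smooth_mask = lambda u, g: max(math.tanh(u * g), 0.0)  # a hypothetical soft relative

update, grad = [0.8, -0.5, 0.3], [1.0, 1.0, -1.0]
for phi in (binary_mask, smooth_mask):
    d = masked_step_direction(update, grad, phi)
    assert sum(gi * di for gi, di in zip(grad, d)) >= 0.0
```

Which members of this family are practical, beyond the simplest binary mask tested empirically, is exactly the open question the contribution points at.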
