Cautious Weight Decay

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: optimization, regularization, weight decay, decoupled, Lyapunov, training, deep learning
Abstract:

We introduce Cautious Weight Decay (CWD), a one-line, optimizer-agnostic modification that applies weight decay only to parameter coordinates whose signs align with the optimizer update. Unlike standard decoupled decay, which implicitly optimizes a regularized or constrained objective, CWD preserves the original loss and admits a bilevel interpretation: it induces sliding-mode behavior upon reaching the stationary manifold, allowing it to search for locally Pareto-optimal stationary points of the unmodified objective. In practice, CWD is a drop-in change for optimizers such as AdamW, Lion, and Muon, requiring no new hyperparameters or additional tuning. For language model pre-training and ImageNet classification, CWD consistently improves final loss and accuracy at million- to billion-parameter scales.
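Concretely, the mechanism described in the abstract admits a compact per-coordinate formulation. The following is a reconstruction from the abstract's wording, not the paper's own notation; the exact sign convention and masking details may differ:

```latex
% Sketch of a generic optimizer step u_t with cautious weight decay:
% the decay term is masked per coordinate i by sign agreement
% between the optimizer update and the parameter.
\theta_{t+1,i} = \theta_{t,i} - \eta_t \Big( u_{t,i}
  + \lambda \, \theta_{t,i} \,
    \mathbb{1}\!\left[ \operatorname{sign}(u_{t,i}) = \operatorname{sign}(\theta_{t,i}) \right] \Big)
```

Here $u_t$ is whatever direction the base optimizer (AdamW, Lion, Muon) would apply, $\eta_t$ the learning rate, and $\lambda$ the decay coefficient; standard decoupled decay corresponds to dropping the indicator.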

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Cautious Weight Decay (CWD), which applies weight decay only when parameter signs align with optimizer updates. According to the taxonomy, this work resides in the 'Selective Decay via Sign Alignment' leaf under 'Sign-Based Weight Decay Mechanisms'. Notably, this leaf contains only the original paper itself—no sibling papers are listed. This positioning suggests the paper occupies a relatively sparse research direction within the broader field of sign-based regularization strategies, where most related work focuses on federated aggregation or transfer learning rather than direct sign-conditioned decay.

The taxonomy reveals three main branches: sign-based decay mechanisms, federated gradient aggregation, and theoretical weight decay analysis. The original paper's leaf sits within the first branch, which also includes a transfer learning regularization approach (paper 52a87076). Neighboring branches address Byzantine-robust aggregation and pruning-based defenses in federated settings, as well as convergence theory for standard weight decay. The scope notes clarify that CWD's sign-alignment conditioning distinguishes it from methods using sign information for aggregation or pruning, placing it in a distinct methodological niche focused on single-machine optimizer modifications.

Among the three contributions analyzed, the CWD algorithm itself examined two candidates with zero refutable prior work, while the bilevel interpretation examined three candidates with zero refutations. The Lyapunov-based convergence analysis examined ten candidates and found two that appear to provide overlapping theoretical frameworks. Given the limited search scope of fifteen total candidates, these statistics suggest the algorithmic and interpretive contributions may be more novel, whereas the convergence analysis builds on established techniques. The analysis does not claim exhaustive coverage but indicates that among top-ranked semantic matches, substantial algorithmic overlap is minimal.

Based on the limited literature search, CWD appears to introduce a relatively underexplored mechanism within sign-based regularization. The taxonomy structure and sibling-paper absence suggest this direction has received less attention than federated or transfer learning applications of sign information. However, the search examined only fifteen candidates, so broader prior work outside top semantic matches remains unassessed. The convergence analysis shows more connection to existing theory, while the core algorithm and bilevel framing appear more distinctive within the examined scope.

Taxonomy

Core-task Taxonomy Papers: 4
Claimed Contributions: 3
Contribution Candidate Papers Compared: 15
Refutable Papers: 2

Research Landscape Overview

Core task: Selective weight decay based on parameter-update sign alignment. This emerging area explores how regularization can be made adaptive by examining the directional consistency between weight updates and decay forces. The taxonomy reveals three main branches that capture distinct facets of this idea. The first branch, Sign-Based Weight Decay Mechanisms, focuses on methods that modulate or selectively apply decay by checking whether parameter updates align or conflict with the decay direction. The second branch, Sign-Based Gradient Aggregation in Federated Learning, examines how sign information can guide communication and aggregation strategies when training is distributed across clients. The third branch, Theoretical Analysis of Weight Decay in Optimization, provides formal guarantees and convergence insights for weight decay variants in both convex and nonconvex settings. Together, these branches illustrate that sign alignment is relevant not only for single-machine training but also for distributed scenarios and for understanding the underlying optimization dynamics.

Within the first branch, a small handful of works have begun to explore selective or cautious decay strategies. Cautious Weight Decay[0] introduces a mechanism that applies regularization only when the sign of the optimizer update aligns with the sign of the parameter itself, aiming to prevent premature shrinkage of weights that are still being adjusted in conflicting directions. This approach contrasts with classical uniform decay and sits alongside other recent efforts such as Silencer[2], which targets specific subsets of parameters based on different criteria. Meanwhile, Weight Decay Nonconvex[1] offers theoretical perspectives on how decay interacts with nonconvex landscapes, providing a complementary lens on why selective strategies might improve generalization.

By conditioning decay on sign alignment, Cautious Weight Decay[0] occupies a niche that bridges heuristic parameter selection and principled regularization, addressing scenarios where indiscriminate penalization may hinder learning.

Claimed Contributions

Cautious Weight Decay (CWD) algorithm

The authors propose a simple modification to decoupled weight decay that selectively applies decay only when the optimizer update and parameter signs agree. This is implemented as a one-line change requiring no new hyperparameters and is compatible with optimizers such as AdamW, Lion, and Muon.
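As a concrete illustration, here is a minimal NumPy sketch of the masked decay described above, applied to a generic optimizer step. The function name `cwd_step` and the exact sign convention are assumptions based on this summary, not the authors' reference implementation:

```python
import numpy as np

def cwd_step(theta, update, lr, weight_decay):
    """One optimizer step with Cautious Weight Decay (illustrative sketch).

    `update` is whatever direction the base optimizer (SGD with momentum,
    AdamW, Lion, Muon, ...) would subtract from the parameters. Decay is
    applied only on coordinates where the update's sign matches the
    parameter's sign, i.e. only where it pushes in the same direction
    as the update itself.
    """
    mask = (np.sign(update) == np.sign(theta)).astype(theta.dtype)
    return theta - lr * (update + weight_decay * mask * theta)

theta = np.array([1.0, -1.0, 2.0])
update = np.array([0.5, 0.5, -0.1])
# Only coordinate 0 has sign agreement, so only it is decayed:
# 1.0 - 0.1*(0.5 + 0.1*1.0) = 0.94; the others take a plain step.
print(cwd_step(theta, update, lr=0.1, weight_decay=0.1))
```

Standard decoupled decay corresponds to replacing `mask` with all-ones, which matches the claim that CWD is a one-line change with no new hyperparameters.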

2 retrieved papers
Bilevel interpretation and sliding-mode dynamics

The authors establish that CWD optimizes the original objective without implicit regularization bias. They show it induces sliding-mode dynamics within the stationary manifold, converging to locally Pareto-optimal stationary points that minimize parameter magnitudes while remaining stationary.
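One way to read this bilevel claim, paraphrasing the summary above rather than quoting the paper's exact statement: the outer problem selects, among stationary points of the unmodified loss $L$, those with small parameter magnitude. The choice of norm and the local Pareto notion are simplifications here:

```latex
% Bilevel reading of CWD (sketch): the inner level enforces
% stationarity of the original loss; the outer level prefers
% small-magnitude parameters on that stationary manifold.
\min_{\theta \in S} \; \|\theta\|
\qquad \text{where } S = \{\theta : \nabla L(\theta) = 0\}.
```

The "sliding-mode" language then describes the dynamics after the iterates reach $S$: the masked decay slides them along the manifold toward lower-magnitude stationary points without leaving it.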

3 retrieved papers
Lyapunov-based convergence analysis

The authors construct Lyapunov functions for several optimizers equipped with CWD and prove asymptotic stability and convergence to the stationary set of the original objective. They also provide a convergence rate for discrete-time Adam with CWD under additional assumptions.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is one partial signal of novelty, though a signal constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Cautious Weight Decay (CWD) algorithm

The authors propose a simple modification to decoupled weight decay that selectively applies decay only when the optimizer update and parameter signs agree. This is implemented as a one-line change requiring no new hyperparameters and is compatible with optimizers such as AdamW, Lion, and Muon.

Contribution

Bilevel interpretation and sliding-mode dynamics

The authors establish that CWD optimizes the original objective without implicit regularization bias. They show it induces sliding-mode dynamics within the stationary manifold, converging to locally Pareto-optimal stationary points that minimize parameter magnitudes while remaining stationary.

Contribution

Lyapunov-based convergence analysis

The authors construct Lyapunov functions for several optimizers equipped with CWD and prove asymptotic stability and convergence to the stationary set of the original objective. They also provide a convergence rate for discrete-time Adam with CWD under additional assumptions.