A Tale of Two Smoothness Notions: Adaptive Optimizers and Non-Euclidean Descent

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: adaptive optimizer, steepest descent, loss geometry, convergence rate
Abstract:

Adaptive optimizers can reduce to normalized steepest descent (NSD) when only adapting to the current gradient, suggesting a close connection between the two algorithmic families. A key distinction in their analyses, however, lies in the smoothness assumptions they rely on. In the convex setting, adaptive optimizers are governed by a stronger adaptive smoothness condition, while NSD relies on the standard notion of smoothness. We extend the theory of adaptive smoothness to the nonconvex setting and show that it precisely characterizes the convergence of adaptive optimizers. Moreover, we establish that adaptive smoothness enables acceleration of adaptive optimizers with Nesterov momentum, a guarantee unattainable under standard smoothness. We further develop an analogous comparison for stochastic optimization by introducing adaptive variance, which parallels adaptive smoothness and leads to qualitatively stronger guarantees than the standard variance.
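As a concrete illustration of the reduction mentioned above: an Adam-style update that adapts only to the current gradient (no moment averaging) collapses to sign descent, i.e. normalized steepest descent under the l∞ norm. The following sketch is illustrative; the function name and step layout are not the paper's notation.

```python
import numpy as np

def adam_step(g, m, v, lr, beta1, beta2, eps=0.0):
    """One Adam-style step direction (bias correction omitted for brevity)."""
    m = beta1 * m + (1 - beta1) * g      # first-moment estimate
    v = beta2 * v + (1 - beta2) * g**2   # second-moment estimate
    return -lr * m / (np.sqrt(v) + eps), m, v

# With beta1 = beta2 = 0 the step depends only on the current gradient:
# -lr * g / |g| = -lr * sign(g), which is normalized steepest descent
# with respect to the l-infinity norm (sign descent).
g = np.array([0.3, -2.0, 0.5])
step, _, _ = adam_step(g, m=np.zeros(3), v=np.zeros(3), lr=0.1, beta1=0.0, beta2=0.0)
print(step)  # equals -0.1 * np.sign(g)
```

This is the sense in which adaptive optimizers "reduce to NSD when only adapting to the current gradient": the coordinate-wise normalization by the second-moment estimate degenerates to a per-coordinate sign.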

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper extends adaptive smoothness theory from convex to nonconvex settings and establishes convergence guarantees for adaptive optimizers with Nesterov momentum. It resides in the 'Adam and RMSProp Convergence Theory' leaf, which contains six papers examining convergence under relaxed smoothness assumptions. This leaf sits within the broader 'Adaptive Optimizer Convergence under Relaxed Smoothness' branch, indicating a moderately populated research direction focused on moving beyond uniform Lipschitz smoothness. The taxonomy shows this is an active but not overcrowded area, with parallel work on AdaGrad and complexity lower bounds occupying sibling leaves.

The taxonomy reveals several neighboring research directions that contextualize this work. The 'AdaGrad and Normalized Steepest Descent' leaf explores connections between adaptive methods and normalized gradient descent under anisotropic smoothness, while the 'Gradient Descent and Proximal Methods under Local or Directional Smoothness' branch examines non-adaptive methods exploiting path-dependent curvature. The paper's focus on adaptive smoothness distinguishes it from these alternatives: unlike directional smoothness approaches that condition on optimization paths, adaptive smoothness captures coordinate-wise scaling inherent to adaptive optimizers. The taxonomy's scope notes clarify that this work excludes variational inequalities and non-adaptive methods, positioning it firmly within optimizer-specific convergence theory.

Among twenty-one candidates examined through semantic search and citation expansion, the analysis identified limited prior work overlap. The first contribution on unified nonconvex convergence examined ten candidates with zero refutations, suggesting novelty in extending adaptive smoothness to nonconvex settings. The second contribution on acceleration with Nesterov momentum examined ten candidates and found one refutable match, indicating some existing work on momentum-based adaptive methods. The third contribution on adaptive variance examined only one candidate with no refutation. These statistics reflect a focused but not exhaustive search scope, suggesting the contributions address gaps in the examined literature while acknowledging that broader searches might reveal additional related work.

Based on the limited search of twenty-one semantically similar papers, the work appears to make substantive theoretical contributions, particularly in unifying nonconvex analysis and introducing adaptive variance concepts. The single refutable match found for the momentum acceleration claim suggests this aspect may have some precedent, though the specific combination with adaptive smoothness may still be novel. The analysis does not cover the full breadth of optimization literature, and a more comprehensive search across related branches like stochastic methods or parameter-free algorithms might reveal additional connections.

Taxonomy

Core-task Taxonomy Papers: 35
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 1

Research Landscape Overview

Core task: convergence analysis of adaptive optimizers under adaptive smoothness conditions. The field has evolved to address the gap between classical smoothness assumptions and the realities of modern machine learning, where gradient Lipschitz constants may grow with iterates or vary across directions.

The taxonomy reflects this diversification into several major branches: one focuses on adaptive optimizers like Adam and RMSProp under relaxed smoothness (e.g., Adam Relaxed Assumptions[1], Adam Non-uniform Smoothness[3]); another examines variational inequalities and saddle-point problems with similar relaxations (Variational Inequalities Relaxed[5]); a third explores gradient descent and proximal methods under local or directional smoothness (Directional Smoothness[11], Proximal Local Lipschitz[18]); and yet another tackles stochastic methods when smoothness and variance are unbounded (SignSGD Unbounded[16], Adaptivity Unbounded Gradients[22]). Additional branches cover online learning with gradient-variation adaptivity (Gradient-Variation Online[14]), parameter-free and universal methods (Universal Variational Inequalities[10], Universal Online Convex[27]), acceleration under structural conditions, specialized smoothing techniques, theoretical foundations, and application-specific approaches.

Recent work has concentrated on refining convergence guarantees for popular adaptive optimizers beyond standard Lipschitz smoothness. Two Smoothness Notions[0] sits squarely within the branch on Adam and RMSProp convergence theory, alongside Adam Relaxed Assumptions[1], Adam Stochastic Relaxed[2], and Adam Non-uniform Smoothness[3]. While Adam Relaxed Assumptions[1] and Adam Non-uniform Smoothness[3] relax global smoothness to coordinate-wise or non-uniform variants, Two Smoothness Notions[0] introduces dual smoothness frameworks that capture how curvature may adapt to the optimization trajectory itself. This contrasts with RMSProp Adam Generalized[8] and RMSProp Stochastic Oracles[34], which extend similar ideas to broader stochastic settings. A central theme across these studies is balancing theoretical rigor with practical relevance: how to prove convergence when classical assumptions fail, yet still recover meaningful rates. Two Smoothness Notions[0] contributes to this dialogue by offering new analytical tools that bridge the gap between overly restrictive classical theory and the flexible, adaptive behavior observed in deep learning.

Claimed Contributions

Unified nonconvex convergence analysis for adaptive optimizers via adaptive smoothness

The authors extend the theory of adaptive smoothness to the nonconvex setting and prove that it precisely characterizes the convergence of a broad class of adaptive optimizers (including AdaGrad, AdaGrad-Norm, and one-sided Shampoo) on nonconvex functions, achieving an optimal rate that depends on adaptive smoothness rather than standard smoothness.

Retrieved papers: 10
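For context on what "optimal rate" typically means here, the standard nonconvex benchmark for a deterministic first-order method on a smooth function is a stationarity bound of the form below; this is the classical template, not the paper's theorem, which is not reproduced in this report.

```latex
\min_{1 \le t \le T} \lVert \nabla f(x_t) \rVert^2
\;\le\; O\!\left(\frac{L\,\bigl(f(x_1) - f^\star\bigr)}{T}\right)
```

The claimed contribution is that, for the listed adaptive methods, the standard smoothness constant $L$ in such a bound can be replaced by an adaptive smoothness quantity matched to the optimizer's geometry.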
Acceleration of adaptive optimizers with Nesterov momentum under adaptive smoothness

The authors demonstrate that adaptive smoothness enables an accelerated Õ(T^{-2}) convergence rate for adaptive optimizers equipped with Nesterov momentum in the convex setting, a rate that cannot be achieved under standard l∞ smoothness, thereby showing a concrete benefit of the stronger adaptive smoothness assumption.

Retrieved papers: 10
Status: can refute
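For reference, one standard form of the classical accelerated guarantee under Euclidean L-smoothness for convex f is (constants vary by variant):

```latex
f(x_T) - f(x^\star)
\;\le\; \frac{2L\,\lVert x_0 - x^\star \rVert_2^2}{(T+1)^2}
\;=\; O\!\left(T^{-2}\right)
```

The contribution claims an analogous $\tilde{O}(T^{-2})$ rate with $L$ replaced by an adaptive smoothness constant, while asserting that no $T^{-2}$ rate is attainable for these methods under plain $\ell_\infty$ smoothness.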
Introduction of adaptive variance and dimension-free convergence for NSD

The authors introduce adaptive variance, a noise assumption that parallels adaptive smoothness, and prove that it enables a dimension-free convergence rate for normalized steepest descent with momentum on nonconvex functions, which is unattainable under the standard variance assumption as shown by a matching lower bound.

Retrieved papers: 1
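The report does not state the paper's definition of adaptive variance. For orientation, the standard bounded-variance assumption on a stochastic gradient oracle $g(x)$ is:

```latex
\mathbb{E}\!\left[\lVert g(x) - \nabla f(x) \rVert_2^2\right] \;\le\; \sigma^2
```

Translating this Euclidean bound into the $\ell_\infty$ geometry natural to sign-based NSD typically costs a factor of the dimension $d$. A coordinate-wise refinement such as $\mathbb{E}\,|g_i(x) - \partial_i f(x)|^2 \le \sigma_i^2$ is one hypothetical shape an "adaptive" noise condition could take, which would explain how a dimension-free rate becomes possible; the paper's actual definition may differ.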

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1

Unified nonconvex convergence analysis for adaptive optimizers via adaptive smoothness

The authors extend the theory of adaptive smoothness to the nonconvex setting and prove that it precisely characterizes the convergence of a broad class of adaptive optimizers (including AdaGrad, AdaGrad-Norm, and one-sided Shampoo) on nonconvex functions, achieving an optimal rate that depends on adaptive smoothness rather than standard smoothness.

Contribution 2

Acceleration of adaptive optimizers with Nesterov momentum under adaptive smoothness

The authors demonstrate that adaptive smoothness enables an accelerated Õ(T^{-2}) convergence rate for adaptive optimizers equipped with Nesterov momentum in the convex setting, a rate that cannot be achieved under standard l∞ smoothness, thereby showing a concrete benefit of the stronger adaptive smoothness assumption.

Contribution 3

Introduction of adaptive variance and dimension-free convergence for NSD

The authors introduce adaptive variance, a noise assumption that parallels adaptive smoothness, and prove that it enables a dimension-free convergence rate for normalized steepest descent with momentum on nonconvex functions, which is unattainable under the standard variance assumption as shown by a matching lower bound.