A Tale of Two Smoothness Notions: Adaptive Optimizers and Non-Euclidean Descent

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: adaptive optimizer, steepest descent, loss geometry, convergence rate
Abstract:

Adaptive optimizers can reduce to normalized steepest descent (NSD) when only adapting to the current gradient, suggesting a close connection between the two algorithmic families. A key distinction in their analyses, however, lies in the smoothness assumptions they rely on. In the convex setting, adaptive optimizers are governed by a stronger adaptive smoothness condition, while NSD relies on the standard notion of smoothness. We extend the theory of adaptive smoothness to the nonconvex setting and show that it precisely characterizes the convergence of adaptive optimizers. Moreover, we establish that adaptive smoothness enables acceleration of adaptive optimizers with Nesterov momentum, a guarantee unattainable under standard smoothness. We further develop an analogous comparison for stochastic optimization by introducing adaptive variance, which parallels adaptive smoothness and leads to qualitatively stronger guarantees than the standard variance.
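As a concrete illustration of the reduction mentioned above: an Adam-style update that adapts only to the current gradient (no moment averaging) collapses to sign descent, i.e. normalized steepest descent under the l∞ norm. The following sketch is illustrative; the function name and step layout are not the paper's notation.

```python
import numpy as np

def adam_step(g, m, v, lr, beta1, beta2, eps=0.0):
    """One Adam-style step direction (bias correction omitted for brevity)."""
    m = beta1 * m + (1 - beta1) * g      # first-moment estimate
    v = beta2 * v + (1 - beta2) * g**2   # second-moment estimate
    return -lr * m / (np.sqrt(v) + eps), m, v

# With beta1 = beta2 = 0 the step depends only on the current gradient:
# -lr * g / |g| = -lr * sign(g), which is normalized steepest descent
# with respect to the l-infinity norm (sign descent).
g = np.array([0.3, -2.0, 0.5])
step, _, _ = adam_step(g, m=np.zeros(3), v=np.zeros(3), lr=0.1, beta1=0.0, beta2=0.0)
print(step)  # equals -0.1 * np.sign(g)
```

This is the sense in which adaptive optimizers "reduce to NSD when only adapting to the current gradient": the coordinate-wise normalization by the second-moment estimate degenerates to a per-coordinate sign.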

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper extends adaptive smoothness theory from convex to nonconvex settings and establishes convergence guarantees for adaptive optimizers with Nesterov momentum. It resides in the 'Adam and RMSProp Convergence Theory' leaf, which contains six papers examining convergence under relaxed smoothness assumptions. This leaf sits within the broader 'Adaptive Optimizer Convergence under Relaxed Smoothness' branch, indicating a moderately populated research direction focused on moving beyond uniform Lipschitz smoothness. The taxonomy shows this is an active but not overcrowded area, with parallel work on AdaGrad and complexity lower bounds occupying sibling leaves.

The taxonomy reveals several neighboring research directions that contextualize this work. The 'AdaGrad and Normalized Steepest Descent' leaf explores connections between adaptive methods and normalized gradient descent under anisotropic smoothness, while the 'Gradient Descent and Proximal Methods under Local or Directional Smoothness' branch examines non-adaptive methods exploiting path-dependent curvature. The paper's focus on adaptive smoothness distinguishes it from these alternatives: unlike directional smoothness approaches that condition on optimization paths, adaptive smoothness captures coordinate-wise scaling inherent to adaptive optimizers. The taxonomy's scope notes clarify that this work excludes variational inequalities and non-adaptive methods, positioning it firmly within optimizer-specific convergence theory.

Among twenty-one candidates examined through semantic search and citation expansion, the analysis identified limited prior work overlap. The first contribution on unified nonconvex convergence examined ten candidates with zero refutations, suggesting novelty in extending adaptive smoothness to nonconvex settings. The second contribution on acceleration with Nesterov momentum examined ten candidates and found one refutable match, indicating some existing work on momentum-based adaptive methods. The third contribution on adaptive variance examined only one candidate with no refutation. These statistics reflect a focused but not exhaustive search scope, suggesting the contributions address gaps in the examined literature while acknowledging that broader searches might reveal additional related work.

Based on the limited search of twenty-one semantically similar papers, the work appears to make substantive theoretical contributions, particularly in unifying nonconvex analysis and introducing adaptive variance concepts. The single refutable match found for the momentum acceleration claim suggests this aspect may have some precedent, though the specific combination with adaptive smoothness may still be novel. The analysis does not cover the full breadth of optimization literature, and a more comprehensive search across related branches like stochastic methods or parameter-free algorithms might reveal additional connections.

Taxonomy

Core-task Taxonomy Papers: 35
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 1

Research Landscape Overview

Core task: convergence analysis of adaptive optimizers under adaptive smoothness conditions. The field has evolved to address the gap between classical smoothness assumptions and the realities of modern machine learning, where gradient Lipschitz constants may grow with iterates or vary across directions.

The taxonomy reflects this diversification into several major branches: one focuses on adaptive optimizers like Adam and RMSProp under relaxed smoothness (e.g., Adam Relaxed Assumptions[1], Adam Non-uniform Smoothness[3]); another examines variational inequalities and saddle-point problems with similar relaxations (Variational Inequalities Relaxed[5]); a third explores gradient descent and proximal methods under local or directional smoothness (Directional Smoothness[11], Proximal Local Lipschitz[18]); and yet another tackles stochastic methods when smoothness and variance are unbounded (SignSGD Unbounded[16], Adaptivity Unbounded Gradients[22]). Additional branches cover online learning with gradient-variation adaptivity (Gradient-Variation Online[14]), parameter-free and universal methods (Universal Variational Inequalities[10], Universal Online Convex[27]), acceleration under structural conditions, specialized smoothing techniques, theoretical foundations, and application-specific approaches.

Recent work has concentrated on refining convergence guarantees for popular adaptive optimizers beyond standard Lipschitz smoothness. Two Smoothness Notions[0] sits squarely within the branch on Adam and RMSProp convergence theory, alongside Adam Relaxed Assumptions[1], Adam Stochastic Relaxed[2], and Adam Non-uniform Smoothness[3]. While Adam Relaxed Assumptions[1] and Adam Non-uniform Smoothness[3] relax global smoothness to coordinate-wise or non-uniform variants, Two Smoothness Notions[0] introduces dual smoothness frameworks that capture how curvature may adapt to the optimization trajectory itself. This contrasts with RMSProp Adam Generalized[8] and RMSProp Stochastic Oracles[34], which extend similar ideas to broader stochastic settings. A central theme across these studies is balancing theoretical rigor with practical relevance: how to prove convergence when classical assumptions fail, yet still recover meaningful rates. Two Smoothness Notions[0] contributes to this dialogue by offering new analytical tools that bridge the gap between overly restrictive classical theory and the flexible, adaptive behavior observed in deep learning.

Claimed Contributions

Unified nonconvex convergence analysis for adaptive optimizers via adaptive smoothness

The authors extend the theory of adaptive smoothness to the nonconvex setting and prove that it precisely characterizes the convergence of a broad class of adaptive optimizers (including AdaGrad, AdaGrad-Norm, and one-sided Shampoo) on nonconvex functions, achieving an optimal rate that depends on adaptive smoothness rather than standard smoothness.

Retrieved papers: 10
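For context on what "optimal rate" typically means here, the standard nonconvex benchmark for a deterministic first-order method on a smooth function is a stationarity bound of the form below; this is the classical template, not the paper's theorem, which is not reproduced in this report.

```latex
\min_{1 \le t \le T} \lVert \nabla f(x_t) \rVert^2
\;\le\; O\!\left(\frac{L\,\bigl(f(x_1) - f^\star\bigr)}{T}\right)
```

The claimed contribution is that, for the listed adaptive methods, the standard smoothness constant $L$ in such a bound can be replaced by an adaptive smoothness quantity matched to the optimizer's geometry.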
Acceleration of adaptive optimizers with Nesterov momentum under adaptive smoothness

The authors demonstrate that adaptive smoothness enables an accelerated Õ(T^{-2}) convergence rate for adaptive optimizers equipped with Nesterov momentum in the convex setting, a rate that cannot be achieved under standard l∞ smoothness, thereby showing a concrete benefit of the stronger adaptive smoothness assumption.

Retrieved papers: 10
Status: can refute
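For reference, one standard form of the classical accelerated guarantee under Euclidean L-smoothness for convex f is (constants vary by variant):

```latex
f(x_T) - f(x^\star)
\;\le\; \frac{2L\,\lVert x_0 - x^\star \rVert_2^2}{(T+1)^2}
\;=\; O\!\left(T^{-2}\right)
```

The contribution claims an analogous $\tilde{O}(T^{-2})$ rate with $L$ replaced by an adaptive smoothness constant, while asserting that no $T^{-2}$ rate is attainable for these methods under plain $\ell_\infty$ smoothness.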
Introduction of adaptive variance and dimension-free convergence for NSD

The authors introduce adaptive variance, a noise assumption that parallels adaptive smoothness, and prove that it enables a dimension-free convergence rate for normalized steepest descent with momentum on nonconvex functions, which is unattainable under the standard variance assumption as shown by a matching lower bound.

Retrieved papers: 1
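The report does not state the paper's definition of adaptive variance. For orientation, the standard bounded-variance assumption on a stochastic gradient oracle $g(x)$ is:

```latex
\mathbb{E}\!\left[\lVert g(x) - \nabla f(x) \rVert_2^2\right] \;\le\; \sigma^2
```

Translating this Euclidean bound into the $\ell_\infty$ geometry natural to sign-based NSD typically costs a factor of the dimension $d$. A coordinate-wise refinement such as $\mathbb{E}\,|g_i(x) - \partial_i f(x)|^2 \le \sigma_i^2$ is one hypothetical shape an "adaptive" noise condition could take, which would explain how a dimension-free rate becomes possible; the paper's actual definition may differ.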

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1

Unified nonconvex convergence analysis for adaptive optimizers via adaptive smoothness

The authors extend the theory of adaptive smoothness to the nonconvex setting and prove that it precisely characterizes the convergence of a broad class of adaptive optimizers (including AdaGrad, AdaGrad-Norm, and one-sided Shampoo) on nonconvex functions, achieving an optimal rate that depends on adaptive smoothness rather than standard smoothness.

Contribution 2

Acceleration of adaptive optimizers with Nesterov momentum under adaptive smoothness

The authors demonstrate that adaptive smoothness enables an accelerated Õ(T^{-2}) convergence rate for adaptive optimizers equipped with Nesterov momentum in the convex setting, a rate that cannot be achieved under standard l∞ smoothness, thereby showing a concrete benefit of the stronger adaptive smoothness assumption.

Contribution 3

Introduction of adaptive variance and dimension-free convergence for NSD

The authors introduce adaptive variance, a noise assumption that parallels adaptive smoothness, and prove that it enables a dimension-free convergence rate for normalized steepest descent with momentum on nonconvex functions, which is unattainable under the standard variance assumption as shown by a matching lower bound.