Improving Online-to-Nonconvex Conversion for Smooth Optimization via Double Optimism

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Online-to-Nonconvex Conversion, Optimistic Gradient Descent, Smooth Nonconvex Optimization
Abstract:

A recent breakthrough in nonconvex optimization is the online-to-nonconvex conversion framework of Cutkosky et al. (2023), which reformulates the task of finding an ε-first-order stationary point as an online learning problem. When both the gradient and the Hessian are Lipschitz continuous, instantiating this framework with two different online learners achieves a complexity of O(ε^{-1.75} log(1/ε)) in the deterministic case and a complexity of O(ε^{-3.5}) in the stochastic case. However, this approach suffers from several limitations: (i) the deterministic method relies on a complex double-loop scheme that solves a fixed-point equation to construct hint vectors for an optimistic online learner, introducing an extra logarithmic factor; (ii) the stochastic method assumes a bounded second-order moment of the stochastic gradient, which is stronger than standard variance bounds; and (iii) different online learning algorithms are used in the two settings. In this paper, we address these issues by introducing an online optimistic gradient method based on a novel doubly optimistic hint function. Specifically, we use the gradient at an extrapolated point as the hint, motivated by two optimistic assumptions: that the difference between the hint and the target gradient remains near constant, and that consecutive update directions change slowly due to smoothness. Our method eliminates the need for a double loop and removes the logarithmic factor. Furthermore, by simply replacing full gradients with stochastic gradients and under the standard assumption that their variance is bounded by σ², we obtain a unified algorithm with complexity O(ε^{-1.75} + σ² ε^{-3.5}), smoothly interpolating between the best-known deterministic rate and the optimal stochastic rate.
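As a rough illustration of the mechanism the abstract describes, the sketch below runs a projected optimistic online gradient step whose hint is the gradient at an extrapolated point x_t + Δ_{t-1}, on the optimistic guesses that hint errors and update directions change slowly. The test function, step size η, update radius D, and the exact extrapolation rule are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def f(x):
    # smooth nonconvex test function with minima near x_i = +/-1
    return np.sum(0.25 * x**4 - 0.5 * x**2)

def grad_f(x):
    return x**3 - x

def clip_to_ball(v, D):
    # project onto the Euclidean ball of radius D
    n = np.linalg.norm(v)
    return v if n <= D else v * (D / n)

def doubly_optimistic_descent(x0, T=300, eta=0.05, D=0.1):
    x = x0.astype(float).copy()
    delta = np.zeros_like(x)   # previous update direction Delta_{t-1}
    w = np.zeros_like(x)       # online learner's internal state
    best_grad = np.inf
    for _ in range(T):
        hint = grad_f(x + delta)                 # hint: gradient at extrapolated point
        delta = clip_to_ball(w - eta * hint, D)  # optimistic play of the online learner
        x = x + delta                            # outer iterate update
        g = grad_f(x)                            # gradient actually observed
        w = clip_to_ball(w - eta * g, D)         # learner's projected gradient step
        best_grad = min(best_grad, np.linalg.norm(g))
    return x, best_grad

x_final, best = doubly_optimistic_descent(np.array([2.0, -1.5]))
```

From x0 = (2.0, -1.5), the smallest gradient norm observed over the run falls well below its initial value of about 6.3, with the iterate settling near a stationary point.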

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a doubly optimistic hint function within the online-to-nonconvex conversion framework, aiming to find first-order stationary points in smooth nonconvex optimization. It sits in the 'Online-to-Nonconvex Conversion Frameworks' leaf, which contains only two papers including the original work. This is a notably sparse research direction within a taxonomy of 50 papers across 26 leaf nodes, suggesting that the conversion-based approach remains relatively underexplored compared to more established branches like adaptive gradient methods or momentum-based schemes.

The taxonomy reveals that neighboring leaves include 'Adaptive Gradient Methods' (four papers on Adam-type algorithms), 'Fixed-Stepsize and Momentum Methods' (two papers on classical acceleration), and 'Variance-Reduced and Retrospective Methods' (two papers on stochastic gradient refinement). The online-to-nonconvex conversion approach diverges from these by reformulating optimization as an online learning problem with regret-based analysis, rather than directly designing step-size rules or momentum schedules. This structural separation indicates that the paper's core methodology—leveraging optimism from online learning—occupies a distinct conceptual niche within the broader algorithm design landscape.

Among 20 candidates examined through semantic search and citation expansion, no papers were found that clearly refute the three main contributions. The unified algorithm achieving best-known rates in both deterministic and stochastic settings was assessed against 10 candidates, none of which provided overlapping prior work. Similarly, the adaptive step-size scheme for parameter-free optimization was compared to 10 candidates without finding refutable overlap. The doubly optimistic hint function itself was not matched against any candidates in the search. These statistics reflect a limited search scope rather than exhaustive coverage, but suggest that within the examined literature, the specific combination of techniques appears novel.

Given the sparse population of the online-to-nonconvex conversion leaf and the absence of refutable prior work among 20 examined candidates, the paper's contributions appear to advance a relatively young research direction. However, the limited search scale means that closely related work in adjacent areas—such as optimistic online learning or parameter-free adaptive methods—may not have been fully captured. The analysis provides evidence of novelty within the examined scope but does not rule out relevant prior work beyond the top-20 semantic matches.

Taxonomy

- Core-task Taxonomy Papers: 50
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 20
- Refutable Papers: 0

Research Landscape Overview

Core task: finding first-order stationary points in smooth nonconvex optimization. The field has matured into several well-defined branches that address complementary aspects of this central challenge. Complexity Analysis and Lower Bounds (e.g., Nonconvex Lower Bounds[1], Oracle Complexity[34]) establishes fundamental limits on what any algorithm can achieve, while Algorithm Design and Convergence explores practical schemes—ranging from adaptive methods like Adaptive Nonconvex[3] and UAdam[18] to momentum-based approaches such as Learning Rate Free Momentum[4] and online-to-nonconvex conversion frameworks like Online to Nonconvex[31]. Constrained settings are handled by a dedicated branch (Constrained Nonconvex Optimization) that includes works on functional constraints (Functional Constrained[5]) and affine constraints (Affinely Constrained Bounds[13]), while Nonsmooth and Composite Optimization tackles problems where smoothness assumptions break down (Gradient Free Nonsmooth[7], Proximal Nonsmooth[44]). Higher-Order and Second-Order Methods pursue faster convergence by exploiting curvature information (Fast Newton Variant[16], Second Order Stationary[30]), and Specialized Problem Structures investigates domains like manifolds (Manifold Acceleration[15]) and bilevel optimization (Bilevel Stationary Points[35]). Recent activity has concentrated on bridging online learning techniques with nonconvex settings and on refining complexity guarantees under weaker assumptions. Double Optimism[0] sits within the online-to-nonconvex conversion framework, closely aligned with Online to Nonconvex[31], which systematically translates regret bounds from online convex optimization into nonconvex convergence rates. 
This line of work contrasts with more classical adaptive gradient methods (Adaptive Nonconvex[3], Adaptive Gradient Convergence[8]) that tune step sizes based on observed gradient norms, and with momentum schemes (Learning Rate Free Momentum[4]) that accelerate convergence through inertial dynamics. By leveraging optimism—a technique originally designed for online regret minimization—Double Optimism[0] offers a distinct algorithmic pathway that exploits predictive gradient information, positioning it as a natural extension of conversion-based approaches while differing in emphasis from direct adaptive or momentum-driven designs.

Claimed Contributions

Doubly optimistic hint function for online-to-nonconvex conversion

The authors propose a new hint function that evaluates the gradient at an extrapolated point, motivated by two optimistic assumptions: that the difference between hint and target gradient remains near constant, and that consecutive update directions change slowly due to smoothness. This eliminates the need for a double-loop fixed-point iteration used in prior work.

0 retrieved papers

Unified algorithm achieving best-known rates in deterministic and stochastic settings

The authors develop a single algorithm that achieves complexity O(ε^{-1.75}) in the deterministic setting (matching the best-known rate without logarithmic factors) and O(σ² ε^{-3.5}) in the stochastic setting (matching the optimal rate), smoothly interpolating between both regimes using only standard variance assumptions.
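Concretely, the claimed unified bound interpolates between the two regimes; a reading of the rate, with ε and σ as in the abstract:

```latex
\underbrace{\mathcal{O}\bigl(\varepsilon^{-1.75} + \sigma^{2}\,\varepsilon^{-3.5}\bigr)}_{\text{unified bound}}
=
\begin{cases}
\mathcal{O}(\varepsilon^{-1.75}) & \text{if } \sigma^{2} \lesssim \varepsilon^{1.75} \quad \text{(near-deterministic regime)},\\
\mathcal{O}(\sigma^{2}\varepsilon^{-3.5}) & \text{if } \sigma^{2} \gtrsim \varepsilon^{1.75} \quad \text{(noise-dominated regime)},
\end{cases}
```

with the crossover at σ² = ε^{1.75}, where the two terms are equal.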

10 retrieved papers

Adaptive step size scheme for parameter-free optimization

The authors introduce an adaptive step size scheme that automatically adjusts based on past gradient information, eliminating the need for manual tuning while maintaining the same complexity guarantees and offering improved dependence on local rather than global Lipschitz constants.
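The description above matches a broader family of curvature-adaptive step sizes. A generic sketch of that idea (not the paper's scheme) estimates a local Lipschitz constant from successive gradients and scales the step accordingly; the test function, the 0.5 safety factor, and the initial step η₀ are illustrative assumptions.

```python
import numpy as np

def grad_f(x):
    # gradient of the smooth nonconvex function 0.25*x^4 - 0.5*x^2
    return x**3 - x

def adaptive_gd(x0, T=100, eta0=0.1):
    x_prev = x0.astype(float).copy()
    g_prev = grad_f(x_prev)
    x = x_prev - eta0 * g_prev          # one warm-up step at a fixed rate
    for _ in range(T):
        g = grad_f(x)
        dx = np.linalg.norm(x - x_prev)
        dg = np.linalg.norm(g - g_prev)
        # secant estimate of the *local* Lipschitz constant of the gradient
        L_local = dg / dx if dx > 0 else 1.0
        eta = 0.5 / max(L_local, 1e-8)  # step adapts to local, not global, curvature
        x_prev, g_prev = x, g
        x = x - eta * g
    return x

x_star = adaptive_gd(np.array([2.0]))   # converges toward the stationary point x = 1
```

Starting from x = 2, the secant estimate shrinks from the steep local curvature near the start to the much flatter curvature near the minimizer, so the step size grows automatically without manual tuning.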

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Doubly optimistic hint function for online-to-nonconvex conversion


Contribution

Unified algorithm achieving best-known rates in deterministic and stochastic settings


Contribution

Adaptive step size scheme for parameter-free optimization
