Corner Gradient Descent

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: mini-batch stochastic gradient descent, momentum, sampling noise, convergence rates, acceleration, power laws, phase diagram, contour integration, rational approximations, asymptotic methods, MNIST, frequency response function
Abstract:

We consider SGD-type optimization on infinite-dimensional quadratic problems with power-law spectral conditions. It is well known that on such problems deterministic GD has loss convergence rate L_t = O(t^{-ζ}), which can be improved to L_t = O(t^{-2ζ}) by using Heavy Ball with a non-stationary Jacobi-based schedule (and the latter rate is optimal among fixed schedules). However, in the mini-batch Stochastic GD setting, the sampling noise causes the Jacobi HB to diverge; accordingly, no O(t^{-2ζ}) algorithm is known. In this paper we show that rates up to O(t^{-2ζ}) can be achieved by a generalized stationary SGD with infinite memory. We start by identifying generalized (S)GD algorithms with contours in the complex plane. We then show that contours having a corner with external angle θπ accelerate the plain GD rate O(t^{-ζ}) to O(t^{-θζ}). For deterministic GD, increasing θ allows rates arbitrarily close to O(t^{-2ζ}) to be achieved. However, in Stochastic GD, increasing θ also amplifies the sampling noise, so in general θ needs to be optimized by balancing the acceleration and noise effects. We prove that the optimal rate is given by θ_max = min(2, ν, 2/(ζ + 1/ν)), where ν, ζ are the exponents appearing in the capacity and source spectral conditions. Furthermore, using fast rational approximations of power functions, we show that ideal corner algorithms can be efficiently approximated by practical finite-memory algorithms.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes generalized stationary SGD algorithms with infinite memory that achieve convergence rates up to O(t^{-2ζ}) on infinite-dimensional quadratic problems with power-law spectra. It resides in the 'Generalized Stationary SGD with Infinite Memory' leaf under 'Momentum and Acceleration Techniques', where it is currently the sole paper. This placement indicates a sparse research direction within the broader momentum-based acceleration landscape, suggesting the infinite-memory perspective on SGD acceleration is relatively unexplored compared to finite-memory momentum methods or explicit preconditioning approaches.

The taxonomy reveals neighboring work in adjacent leaves: Heavy Ball Momentum Under Gradient Noise addresses anisotropic noise effects, Nesterov Acceleration for Stochastic Settings focuses on interpolation-based guarantees, and Krylov Subspace Acceleration projects gradients onto subspaces. The paper's contour-based framework and corner algorithms diverge from these by framing acceleration through complex-plane geometry rather than classical momentum schedules or subspace projections. The taxonomy's exclusion note clarifies that finite-memory momentum methods belong elsewhere, positioning this work as fundamentally distinct in its use of an infinite weighted gradient history to handle power-law spectral conditions.

Among 23 candidates examined across three contributions, none were found to clearly refute the proposed ideas. The contour view examined 10 candidates with no refutations, corner algorithms with acceleration factor θ examined 3 candidates with none refuting, and finite-memory approximations examined 10 candidates with no overlapping prior work identified. This suggests that within the limited search scope—focused on top semantic matches and citation expansion—the specific combination of contour-based analysis, corner geometry for acceleration, and infinite-memory stationary schemes appears novel. The absence of sibling papers in the same taxonomy leaf reinforces that this particular framing has not been extensively studied.

The analysis covers a targeted literature search rather than an exhaustive survey of all optimization methods. The taxonomy structure shows the field has many alternative acceleration strategies (preconditioning, finite-memory momentum, learning rate schedules), but the infinite-memory stationary perspective with complex-plane contours represents a distinct angle. The limited candidate pool and sparse taxonomy leaf suggest this direction is relatively underexplored, though the search scope does not rule out related work in adjacent mathematical optimization communities not captured by the semantic search.

Taxonomy

Core-task Taxonomy Papers: 33
Claimed Contributions: 3
Contribution Candidate Papers Compared: 23
Refutable Papers: 0

Research Landscape Overview

Core task: accelerating stochastic gradient descent on ill-conditioned quadratic problems. The field addresses the challenge of slow convergence when the objective's Hessian has widely varying eigenvalues, a common bottleneck in optimization. The taxonomy reveals several complementary strategies. Preconditioning and Adaptive Scaling Methods reshape the geometry by transforming the problem space, often using second-order information or sketching techniques like those in Preconditioned SGD[1] and Sketched Conditioning[26].

Momentum and Acceleration Techniques exploit temporal structure in gradient estimates to dampen oscillations and speed convergence, with works ranging from classical heavy-ball variants to more recent noise-adaptive schemes such as Noise-adaptive Heavy-Ball[30]. Learning Rate Scheduling and Restart Strategies, exemplified by Warm Restarts[2], dynamically adjust step sizes or periodically reset momentum to escape plateaus. Distributed and Federated Optimization tackles ill-conditioning in decentralized settings, as seen in Local SGD[6] and Local SGD Quadratic[28]. Problem-Specific Acceleration tailors methods to particular structures like matrix completion or fine-tuning tasks, Theoretical Analysis and Convergence Guarantees provides rigorous understanding of convergence rates under various noise and conditioning assumptions, and Empirical Studies validate these approaches on real applications.

A particularly active line of research explores how to incorporate long-range gradient history without prohibitive memory costs, balancing the benefits of infinite-memory schemes against computational overhead. Corner Gradient Descent[0] sits within the Momentum and Acceleration Techniques branch, specifically under Generalized Stationary SGD with Infinite Memory, emphasizing methods that maintain weighted averages of past gradients to implicitly precondition updates. This contrasts with explicit preconditioning approaches like Kalman SGD[3], which uses Kalman filtering to estimate curvature, and differs from finite-memory momentum methods that discard older information. The interplay between memory depth, noise robustness, and per-iteration cost remains a central trade-off: while infinite-memory strategies can better capture global curvature, they must carefully manage numerical stability and computational complexity, especially when condition numbers vary over time, as studied in Varying Condition Numbers[8].

Claimed Contributions

Contour view of generalized (S)GD algorithms

The authors establish that stationary SGD algorithms with arbitrary linear memory can be represented as contours in the complex plane through a response map Ψ. This geometric viewpoint allows algorithm design to be reformulated as contour design, where the loss evolution is determined by the contour properties.

10 retrieved papers
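The contour/response-map machinery itself is not reproduced in this report, but the algorithm family it describes (stationary SGD with linear memory) is easy to illustrate. Below is a minimal NumPy sketch of its simplest member, classical heavy ball with one step of memory, run as deterministic GD on a diagonal quadratic with a power-law spectrum λ_k = k^{-ν}. The spectrum, step size, and momentum values here are illustrative assumptions, not the paper's construction.

```python
import numpy as np

# Diagonal quadratic with power-law spectrum (illustrative: nu = 1, 50 modes)
nu = 1.0
k = np.arange(1, 51)
lam = k ** (-nu)                  # eigenvalues lambda_k = k^{-nu}
w_star = np.ones_like(lam)        # minimizer of the quadratic

def loss(w):
    return 0.5 * np.sum(lam * (w - w_star) ** 2)

def heavy_ball(alpha=0.9, beta=0.5, steps=500):
    """Stationary GD with one-step linear memory (classical heavy ball)."""
    w = np.zeros_like(lam)
    v = np.zeros_like(lam)        # memory state: exponentially weighted past gradients
    history = [loss(w)]
    for _ in range(steps):
        g = lam * (w - w_star)    # exact gradient of the quadratic
        v = beta * v - alpha * g  # same linear update applied at every step
        w = w + v
        history.append(loss(w))
    return w, history

w, history = heavy_ball()
```

A stationary algorithm applies the same linear update at every step; the paper's response map Ψ characterizes how such a recurrence acts on each eigenvalue λ, which is what turns algorithm design into contour design in the complex plane.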
Corner algorithms with acceleration factor θ

The authors introduce corner algorithms based on contours with corners having external angle θπ (1 < θ < 2). They prove these algorithms accelerate convergence from O(t−ζ) to O(t−θζ) and establish a complete phase diagram showing the maximum achievable acceleration θmax = min(2, ν, 2/(ζ+1/ν)) in the signal-dominated regime.

3 retrieved papers
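The phase-diagram formula above is simple to evaluate. The sketch below (with hypothetical helper names) computes θ_max = min(2, ν, 2/(ζ + 1/ν)) and the resulting loss exponent θ_max · ζ; plain (S)GD corresponds to θ = 1 and the deterministic optimum to θ = 2.

```python
def theta_max(zeta, nu):
    """Maximal corner acceleration factor in the signal-dominated regime:
    theta_max = min(2, nu, 2 / (zeta + 1/nu))."""
    if zeta <= 0 or nu <= 0:
        raise ValueError("exponents zeta and nu must be positive")
    return min(2.0, nu, 2.0 / (zeta + 1.0 / nu))

def best_loss_exponent(zeta, nu):
    """Loss decays as O(t^{-theta * zeta}); theta = 1 is plain (S)GD."""
    return theta_max(zeta, nu) * zeta

# Example: with zeta = 0.5, nu = 2 the noise terms do not bind and the full
# deterministic acceleration theta = 2 is attainable; with zeta = nu = 1
# the formula gives theta_max = 1, i.e. no acceleration over plain SGD.
```

This makes the trade-off in the contribution concrete: larger ν (faster eigenvalue decay) leaves room for acceleration, while the noise-driven term 2/(ζ + 1/ν) caps how sharp a corner can be before sampling noise dominates.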
Finite-memory approximation of corner algorithms

The authors show that ideal corner algorithms, which require infinite memory, can be efficiently approximated by practical finite-memory algorithms. This is achieved using fast rational approximations of power functions with error bounds of O(e−c√M), making the theoretical corner algorithms implementable in practice.

10 retrieved papers
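The paper's own rational-approximation construction is not reproduced here. As a loosely related, self-contained illustration of why power laws compress well into finite memory, the NumPy sketch below approximates t^{-α} by a sum of M decaying exponentials, obtained by trapezoidal quadrature of the Gamma integral t^{-α} = (1/Γ(α)) ∫₀^∞ s^{α-1} e^{-ts} ds after the substitution s = e^u; the grid limits and M are illustrative assumptions.

```python
import numpy as np
from math import gamma

def power_kernel_nodes(alpha, M=80, u_min=-20.0, u_max=3.0):
    """Nodes s_j and weights w_j such that t^{-alpha} ~ sum_j w_j exp(-t s_j),
    from trapezoidal quadrature of the Gamma integral after s = e^u."""
    u = np.linspace(u_min, u_max, M)
    h = u[1] - u[0]
    s = np.exp(u)                              # exponential decay rates
    w = h * np.exp(alpha * u) / gamma(alpha)   # quadrature weights
    w[0] *= 0.5
    w[-1] *= 0.5                               # trapezoid endpoint halving
    return s, w

def approx_power(t, s, w):
    """Finite sum-of-exponentials surrogate for t^{-alpha}."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    return np.exp(-np.outer(t, s)) @ w
```

For α = 0.7 and M = 80 nodes this surrogate matches t^{-0.7} to several digits over t ∈ [1, 50]. Each exponential costs one scalar of recurrent state, which is the general sense in which long power-law memory admits compact finite-memory approximations; the paper's rational approximations achieve this with the much stronger O(e^{-c√M}) error bound.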

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the currently retrieved top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is one partial signal of novelty, though still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Contour view of generalized (S)GD algorithms

The authors establish that stationary SGD algorithms with arbitrary linear memory can be represented as contours in the complex plane through a response map Ψ. This geometric viewpoint allows algorithm design to be reformulated as contour design, where the loss evolution is determined by the contour properties.

Contribution

Corner algorithms with acceleration factor θ

The authors introduce corner algorithms based on contours with corners having external angle θπ (1 < θ < 2). They prove these algorithms accelerate convergence from O(t−ζ) to O(t−θζ) and establish a complete phase diagram showing the maximum achievable acceleration θmax = min(2, ν, 2/(ζ+1/ν)) in the signal-dominated regime.

Contribution

Finite-memory approximation of corner algorithms

The authors show that ideal corner algorithms, which require infinite memory, can be efficiently approximated by practical finite-memory algorithms. This is achieved using fast rational approximations of power functions with error bounds of O(e−c√M), making the theoretical corner algorithms implementable in practice.