Sign-SGD via Parameter-Free Optimization

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Parameter-free optimization, Sign descent, Convex optimization, Stochastic optimization
Abstract:

Large language models have achieved major advances across domains, yet training them remains extremely resource-intensive. We revisit Sign-SGD, which serves both as a memory-efficient optimizer for single-node training and as a gradient compression mechanism for distributed learning. This paper addresses a central limitation: the effective stepsize cannot be determined a priori because it depends on unknown, problem-specific quantities. We present a parameter-free Sign-SGD that removes manual stepsize selection. We analyze the deterministic single-node case, and extend the method to stochastic single-node training and multi-node settings. We also incorporate the momentum technique into our algorithms and propose a memory-efficient variant that stores only gradient signs instead of full gradients. We evaluate our methods on pre-training LLaMA models (130M and 350M) and fine-tuning a Swin Transformer (28M). Across the considered tasks, the proposed methods match the performance of tuned Sign-SGD and AdamW (grid-searched stepsizes with a cosine schedule) while avoiding tuning overhead. Employing parameter-free training yields approximately a 1.5× end-to-end speedup compared to runs with grid-searched stepsizes.
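For context, the Sign-SGD update the abstract refers to replaces each gradient coordinate with its sign, so every coordinate moves by the same fixed amount per step. A minimal NumPy sketch of the basic update (the constant `lr` here is exactly the hand-tuned stepsize the paper aims to eliminate):

```python
import numpy as np

def sign_sgd_step(x, grad, lr):
    """One Sign-SGD update: each coordinate moves by a fixed stepsize
    in the direction opposite to the sign of its gradient component."""
    return x - lr * np.sign(grad)

# Toy quadratic f(x) = 0.5 * ||x||^2, whose gradient at x is x itself.
x = np.array([3.0, -2.0, 0.5])
for _ in range(100):
    x = sign_sgd_step(x, x, lr=0.05)
```

On this toy problem the iterates approach the optimum and then oscillate within one stepsize of it, which is why the choice of `lr` matters so much in practice.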

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a parameter-free variant of Sign-SGD that automatically determines stepsizes without manual tuning, addressing a core limitation of sign-based optimization. Within the taxonomy it falls in the 'Sign-Based Parameter-Free Optimization' leaf, which contains only two papers (this one and a single sibling), indicating a sparse and focused research direction. The work targets the intersection of memory-efficient optimization and automatic stepsize adaptation, a niche within the broader landscape of parameter-free methods.

The taxonomy reveals that parameter-free stepsize adaptation is organized into two main directions: sign-based methods and variance-reduced approaches. The paper's leaf sits under 'Parameter-Free Stepsize Adaptation Methods', which excludes general tuned optimization and focuses on automatic determination mechanisms. Neighboring branches include 'Theoretical Foundations' covering deep learning optimization theory and 'Specialized Applications' addressing adversarial attacks and quantization. The scope notes clarify that this work belongs specifically to sign-based parameter-free optimization rather than broader adaptive methods or application-specific techniques.

Among thirty candidates examined, the first contribution (parameter-free Sign-SGD with automatic stepsize) shows no clear refutation across ten candidates, suggesting relative novelty in this specific formulation. The second contribution (stochastic and distributed extensions with theory) encountered three refutable candidates among ten examined, indicating moderate prior work overlap. The third contribution (memory-efficient variant and momentum extension) found two refutable candidates among ten, suggesting some existing techniques address similar memory and momentum concerns. The limited search scope means these findings reflect top-ranked semantic matches rather than exhaustive coverage.

Based on the analysis of thirty candidates, the core parameter-free mechanism appears relatively novel within sign-based optimization, while extensions to distributed settings and memory-efficient variants show more substantial connections to prior work. The sparse taxonomy leaf and limited sibling papers suggest this specific combination of sign-based updates and parameter-free adaptation remains an emerging area, though the search scope does not capture the full breadth of related adaptive optimization literature.

Taxonomy

Core-task Taxonomy Papers: 5
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 5

Research Landscape Overview

Core task: parameter-free optimization for sign-based stochastic gradient descent. The field centers on developing adaptive stepsize methods that eliminate manual tuning of learning rates in sign-based gradient algorithms. The taxonomy reveals three main branches: Parameter-Free Stepsize Adaptation Methods, which focus on automatic learning rate schedules and adaptive mechanisms; Theoretical Foundations and Optimization Perspectives, which provide convergence guarantees and mathematical insights into sign-based dynamics; and Sign-Based Methods in Specialized Applications, which explore domain-specific uses such as communication-efficient distributed learning and deep hashing.

Representative works like Sign-SGD Parameter-Free[0] and Sign-SGD Golden Gate[5] illustrate how parameter-free techniques can be tailored to sign-based updates, while studies such as Adaptive Variance Reduction[1] demonstrate broader adaptive strategies that may inform sign-based designs. A particularly active line of work explores the interplay between sign-based compression and parameter-free adaptation, addressing the challenge of maintaining convergence guarantees when gradient information is reduced to sign bits. Sign Bits Blackbox[3] exemplifies early efforts to understand optimization with minimal gradient information, while more recent approaches like Sign-SGD Golden Gate[5] integrate adaptive stepsize rules directly into sign-based frameworks. Sign-SGD Parameter-Free[0] sits squarely within this cluster, emphasizing automatic tuning mechanisms that avoid hyperparameter search in sign-based settings. Compared to Sign-SGD Golden Gate[5], which may focus on specific architectural or algorithmic innovations, Sign-SGD Parameter-Free[0] appears to prioritize general-purpose parameter-free strategies.

The main open questions revolve around balancing the simplicity of sign-based updates with the need for robust, adaptive stepsizes that perform well across diverse problem settings without manual intervention.

Claimed Contributions

Parameter-free Sign-SGD algorithm with automatic stepsize selection

The authors introduce ALIAS (Automatic Local per-Iteration Approximation of the Stepsize), a parameter-free variant of Sign-SGD that automatically adapts the stepsize at each iteration by estimating problem-specific quantities (the initial distance to the solution and the smoothness constant) without requiring prior knowledge or manual tuning.

10 retrieved papers
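The report does not reproduce the actual ALIAS update rule. As a purely illustrative sketch of the kind of mechanism described, the following estimates a local smoothness constant from two consecutive iterates and picks the stepsize that minimizes the resulting quadratic model along the sign direction. Every name and constant here is an assumption, not the paper's method:

```python
import numpy as np

def alias_like_step(x, x_prev, g, g_prev):
    """One sign-descent step with an on-the-fly stepsize estimate.

    Illustrative only: the paper's actual ALIAS estimator is not
    reproduced in this report.  A local smoothness constant L_hat is
    estimated from two consecutive gradients, and the stepsize then
    minimizes the quadratic model  f - t*||g||_1 + L_hat*t^2*d/2
    along the sign direction, giving  t* = ||g||_1 / (L_hat * d).
    """
    dx = np.linalg.norm(x - x_prev)
    dg = np.linalg.norm(g - g_prev)
    L_hat = dg / dx if dx > 0 else 1.0       # local smoothness estimate
    lr = np.abs(g).sum() / (max(L_hat, 1e-12) * x.size)
    return x - lr * np.sign(g)

# Toy quadratic f(x) = 0.5 * x^T diag(1, 4) x, so grad(x) = A * x.
A = np.array([1.0, 4.0])
x_prev = np.array([5.0, 5.0])
x = np.array([4.9, 4.9])                     # second iterate seeds L_hat
for _ in range(200):
    x, x_prev = alias_like_step(x, x_prev, A * x, A * x_prev), x
```

The point of the sketch is only that no stepsize is supplied by the user: every quantity in the update is computed from observed iterates and gradients.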
Extension to stochastic and distributed settings with theoretical analysis

The authors extend their parameter-free Sign-SGD method from the deterministic exact-gradient setting to both stochastic gradient oracles and distributed multi-node training scenarios, providing comprehensive theoretical convergence guarantees for each setting.

10 retrieved papers
Can Refute
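The report does not detail the paper's multi-node scheme. For orientation, distributed Sign-SGD variants in the literature commonly have each worker transmit only sign bits, which a server aggregates by coordinate-wise majority vote; a sketch of that standard pattern (not necessarily the paper's aggregation rule) follows:

```python
import numpy as np

def majority_vote_step(x, worker_grads, lr):
    """One distributed sign-descent step: each worker contributes only
    the sign of its stochastic gradient (1 bit per coordinate), and the
    server aggregates by coordinate-wise majority vote.  This is the
    common distributed Sign-SGD pattern, not necessarily the paper's
    exact scheme."""
    signs = np.sign(worker_grads)            # shape: (workers, dim)
    vote = np.sign(signs.sum(axis=0))        # majority per coordinate
    return x - lr * vote

rng = np.random.default_rng(0)
x = np.array([1.0, -1.0])
# Three workers observe noisy copies of the true gradient (here, x).
grads = np.stack([x + 0.1 * rng.standard_normal(2) for _ in range(3)])
x_new = majority_vote_step(x, grads, lr=0.1)
```

Because only sign bits cross the network, the per-round communication cost is one bit per coordinate per worker, which is the compression benefit the abstract alludes to.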
Memory-efficient variant and momentum-based extension

The authors develop two practical extensions: a memory-efficient version that stores only gradient signs from the previous iteration rather than full gradients, and a momentum-based variant (the ALIAS Adam version) that incorporates exponential moving averages similar to Adam for improved practical performance.

10 retrieved papers
Can Refute
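To make the two extensions concrete, here is a hedged sketch of both ideas: an exponential-moving-average momentum followed by a sign step, and bit-packing of a previous gradient's signs so that only one bit per coordinate is retained. The function names and constants are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def momentum_sign_step(x, g, m, lr, beta=0.9):
    """Momentum variant: keep an exponential moving average of the
    gradient (the Adam-like pattern the report describes) and step
    along its sign.  Constants are illustrative, not the paper's."""
    m = beta * m + (1.0 - beta) * g          # EMA of gradients
    return x - lr * np.sign(m), m

def pack_signs(g):
    """Memory-saving trick: keep only the previous gradient's sign
    bits (1 bit per coordinate) instead of the full float tensor."""
    return np.packbits(g >= 0)

def unpack_signs(bits, n):
    """Recover a +/-1 sign vector of length n from packed bits."""
    return np.where(np.unpackbits(bits, count=n), 1.0, -1.0)
```

Storing packed signs instead of a float32 tensor shrinks the optimizer's per-parameter state for that buffer by roughly 32×, which is the memory saving the contribution claims to target.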

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Parameter-free Sign-SGD algorithm with automatic stepsize selection

The authors introduce ALIAS (Automatic Local per-Iteration Approximation of the Stepsize), a parameter-free variant of Sign-SGD that automatically adapts the stepsize at each iteration by estimating problem-specific quantities (the initial distance to the solution and the smoothness constant) without requiring prior knowledge or manual tuning.

Contribution

Extension to stochastic and distributed settings with theoretical analysis

The authors extend their parameter-free Sign-SGD method from the deterministic exact-gradient setting to both stochastic gradient oracles and distributed multi-node training scenarios, providing comprehensive theoretical convergence guarantees for each setting.

Contribution

Memory-efficient variant and momentum-based extension

The authors develop two practical extensions: a memory-efficient version that stores only gradient signs from the previous iteration rather than full gradients, and a momentum-based variant (the ALIAS Adam version) that incorporates exponential moving averages similar to Adam for improved practical performance.