Sign-SGD via Parameter-Free Optimization

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Parameter-free optimization, Sign descent, Convex optimization, Stochastic optimization
Abstract:

Large language models have achieved major advances across domains, yet training them remains extremely resource-intensive. We revisit Sign-SGD, which serves both as a memory-efficient optimizer for single-node training and as a gradient compression mechanism for distributed learning. This paper addresses a central limitation: the effective stepsize cannot be determined a priori because it depends on unknown, problem-specific quantities. We present a parameter-free Sign-SGD that removes manual stepsize selection. We analyze the deterministic single-node case, and extend the method to stochastic single-node training and multi-node settings. We also incorporate the momentum technique into our algorithms and propose a memory-efficient variant that stores only gradient signs instead of full gradients. We evaluate our methods on pre-training LLaMA models (130M and 350M) and fine-tuning a Swin Transformer (28M). Across the considered tasks, the proposed methods match the performance of tuned Sign-SGD and AdamW (grid-searched stepsizes with a cosine schedule) while avoiding tuning overhead. Employing parameter-free training yields approximately a 1.5× end-to-end speedup compared to runs with grid-searched stepsizes.
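For context, the Sign-SGD update the abstract refers to replaces each gradient coordinate with its sign, so every coordinate moves by the same fixed amount per step. A minimal NumPy sketch of the basic update (the constant `lr` here is exactly the hand-tuned stepsize the paper aims to eliminate):

```python
import numpy as np

def sign_sgd_step(x, grad, lr):
    """One Sign-SGD update: each coordinate moves by a fixed stepsize
    in the direction opposite to the sign of its gradient component."""
    return x - lr * np.sign(grad)

# Toy quadratic f(x) = 0.5 * ||x||^2, whose gradient at x is x itself.
x = np.array([3.0, -2.0, 0.5])
for _ in range(100):
    x = sign_sgd_step(x, x, lr=0.05)
```

On this toy problem the iterates approach the optimum and then oscillate within one stepsize of it, which is why the choice of `lr` matters so much in practice.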

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a parameter-free variant of Sign-SGD that automatically determines stepsizes without manual tuning, addressing a core limitation of sign-based optimization. Within the taxonomy it falls in the 'Sign-Based Parameter-Free Optimization' leaf, which contains only two papers (this one and a single sibling), indicating a sparse and focused research direction. The work targets the intersection of memory-efficient optimization and automatic stepsize adaptation, a niche within the broader landscape of parameter-free methods.

The taxonomy reveals that parameter-free stepsize adaptation is organized into two main directions: sign-based methods and variance-reduced approaches. The paper's leaf sits under 'Parameter-Free Stepsize Adaptation Methods', which excludes general tuned optimization and focuses on automatic determination mechanisms. Neighboring branches include 'Theoretical Foundations' covering deep learning optimization theory and 'Specialized Applications' addressing adversarial attacks and quantization. The scope notes clarify that this work belongs specifically to sign-based parameter-free optimization rather than broader adaptive methods or application-specific techniques.

Among thirty candidates examined, the first contribution (parameter-free Sign-SGD with automatic stepsize) shows no clear refutation across ten candidates, suggesting relative novelty in this specific formulation. The second contribution (stochastic and distributed extensions with theory) encountered three refutable candidates among ten examined, indicating moderate prior work overlap. The third contribution (memory-efficient variant and momentum extension) found two refutable candidates among ten, suggesting some existing techniques address similar memory and momentum concerns. The limited search scope means these findings reflect top-ranked semantic matches rather than exhaustive coverage.

Based on the analysis of thirty candidates, the core parameter-free mechanism appears relatively novel within sign-based optimization, while extensions to distributed settings and memory-efficient variants show more substantial connections to prior work. The sparse taxonomy leaf and limited sibling papers suggest this specific combination of sign-based updates and parameter-free adaptation remains an emerging area, though the search scope does not capture the full breadth of related adaptive optimization literature.

Taxonomy

Core-task Taxonomy Papers: 5
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 5

Research Landscape Overview

Core task: parameter-free optimization for sign-based stochastic gradient descent. The field centers on developing adaptive stepsize methods that eliminate manual tuning of learning rates in sign-based gradient algorithms. The taxonomy reveals three main branches: Parameter-Free Stepsize Adaptation Methods, which focus on automatic learning rate schedules and adaptive mechanisms; Theoretical Foundations and Optimization Perspectives, which provide convergence guarantees and mathematical insights into sign-based dynamics; and Sign-Based Methods in Specialized Applications, which explore domain-specific uses such as communication-efficient distributed learning and deep hashing.

Representative works like Sign-SGD Parameter-Free[0] and Sign-SGD Golden Gate[5] illustrate how parameter-free techniques can be tailored to sign-based updates, while studies such as Adaptive Variance Reduction[1] demonstrate broader adaptive strategies that may inform sign-based designs. A particularly active line of work explores the interplay between sign-based compression and parameter-free adaptation, addressing the challenge of maintaining convergence guarantees when gradient information is reduced to sign bits. Sign Bits Blackbox[3] exemplifies early efforts to understand optimization with minimal gradient information, while more recent approaches like Sign-SGD Golden Gate[5] integrate adaptive stepsize rules directly into sign-based frameworks. Sign-SGD Parameter-Free[0] sits squarely within this cluster, emphasizing automatic tuning mechanisms that avoid hyperparameter search in sign-based settings. Compared to Sign-SGD Golden Gate[5], which may focus on specific architectural or algorithmic innovations, Sign-SGD Parameter-Free[0] appears to prioritize general-purpose parameter-free strategies.

The main open questions revolve around balancing the simplicity of sign-based updates with the need for robust, adaptive stepsizes that perform well across diverse problem settings without manual intervention.

Claimed Contributions

Parameter-free Sign-SGD algorithm with automatic stepsize selection

The authors introduce ALIAS (Automatic Local per-Iteration Approximation of the Stepsize), a parameter-free variant of Sign-SGD that automatically adapts the stepsize at each iteration by estimating problem-specific quantities (the initial distance to the solution and the smoothness constant) without requiring prior knowledge or manual tuning.

10 retrieved papers
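The report does not reproduce the actual ALIAS update rule. As a purely illustrative sketch of the kind of mechanism described, the following estimates a local smoothness constant from two consecutive iterates and picks the stepsize that minimizes the resulting quadratic model along the sign direction. Every name and constant here is an assumption, not the paper's method:

```python
import numpy as np

def alias_like_step(x, x_prev, g, g_prev):
    """One sign-descent step with an on-the-fly stepsize estimate.

    Illustrative only: the paper's actual ALIAS estimator is not
    reproduced in this report.  A local smoothness constant L_hat is
    estimated from two consecutive gradients, and the stepsize then
    minimizes the quadratic model  f - t*||g||_1 + L_hat*t^2*d/2
    along the sign direction, giving  t* = ||g||_1 / (L_hat * d).
    """
    dx = np.linalg.norm(x - x_prev)
    dg = np.linalg.norm(g - g_prev)
    L_hat = dg / dx if dx > 0 else 1.0       # local smoothness estimate
    lr = np.abs(g).sum() / (max(L_hat, 1e-12) * x.size)
    return x - lr * np.sign(g)

# Toy quadratic f(x) = 0.5 * x^T diag(1, 4) x, so grad(x) = A * x.
A = np.array([1.0, 4.0])
x_prev = np.array([5.0, 5.0])
x = np.array([4.9, 4.9])                     # second iterate seeds L_hat
for _ in range(200):
    x, x_prev = alias_like_step(x, x_prev, A * x, A * x_prev), x
```

The point of the sketch is only that no stepsize is supplied by the user: every quantity in the update is computed from observed iterates and gradients.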
Extension to stochastic and distributed settings with theoretical analysis

The authors extend their parameter-free Sign-SGD method from the deterministic exact-gradient setting to both stochastic gradient oracles and distributed multi-node training scenarios, providing comprehensive theoretical convergence guarantees for each setting.

10 retrieved papers
Can Refute
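The report does not detail the paper's multi-node scheme. For orientation, distributed Sign-SGD variants in the literature commonly have each worker transmit only sign bits, which a server aggregates by coordinate-wise majority vote; a sketch of that standard pattern (not necessarily the paper's aggregation rule) follows:

```python
import numpy as np

def majority_vote_step(x, worker_grads, lr):
    """One distributed sign-descent step: each worker contributes only
    the sign of its stochastic gradient (1 bit per coordinate), and the
    server aggregates by coordinate-wise majority vote.  This is the
    common distributed Sign-SGD pattern, not necessarily the paper's
    exact scheme."""
    signs = np.sign(worker_grads)            # shape: (workers, dim)
    vote = np.sign(signs.sum(axis=0))        # majority per coordinate
    return x - lr * vote

rng = np.random.default_rng(0)
x = np.array([1.0, -1.0])
# Three workers observe noisy copies of the true gradient (here, x).
grads = np.stack([x + 0.1 * rng.standard_normal(2) for _ in range(3)])
x_new = majority_vote_step(x, grads, lr=0.1)
```

Because only sign bits cross the network, the per-round communication cost is one bit per coordinate per worker, which is the compression benefit the abstract alludes to.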
Memory-efficient variant and momentum-based extension

The authors develop two practical extensions: a memory-efficient version that stores only gradient signs from the previous iteration rather than full gradients, and a momentum-based variant (the ALIAS Adam version) that incorporates exponential moving averages similar to Adam for improved practical performance.

10 retrieved papers
Can Refute
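To make the two extensions concrete, here is a hedged sketch of both ideas: an exponential-moving-average momentum followed by a sign step, and bit-packing of a previous gradient's signs so that only one bit per coordinate is retained. The function names and constants are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def momentum_sign_step(x, g, m, lr, beta=0.9):
    """Momentum variant: keep an exponential moving average of the
    gradient (the Adam-like pattern the report describes) and step
    along its sign.  Constants are illustrative, not the paper's."""
    m = beta * m + (1.0 - beta) * g          # EMA of gradients
    return x - lr * np.sign(m), m

def pack_signs(g):
    """Memory-saving trick: keep only the previous gradient's sign
    bits (1 bit per coordinate) instead of the full float tensor."""
    return np.packbits(g >= 0)

def unpack_signs(bits, n):
    """Recover a +/-1 sign vector of length n from packed bits."""
    return np.where(np.unpackbits(bits, count=n), 1.0, -1.0)
```

Storing packed signs instead of a float32 tensor shrinks the optimizer's per-parameter state for that buffer by roughly 32×, which is the memory saving the contribution claims to target.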

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Parameter-free Sign-SGD algorithm with automatic stepsize selection

The authors introduce ALIAS (Automatic Local per-Iteration Approximation of the Stepsize), a parameter-free variant of Sign-SGD that automatically adapts the stepsize at each iteration by estimating problem-specific quantities (the initial distance to the solution and the smoothness constant) without requiring prior knowledge or manual tuning.

Contribution

Extension to stochastic and distributed settings with theoretical analysis

The authors extend their parameter-free Sign-SGD method from the deterministic exact-gradient setting to both stochastic gradient oracles and distributed multi-node training scenarios, providing comprehensive theoretical convergence guarantees for each setting.

Contribution

Memory-efficient variant and momentum-based extension

The authors develop two practical extensions: a memory-efficient version that stores only gradient signs from the previous iteration rather than full gradients, and a momentum-based variant (the ALIAS Adam version) that incorporates exponential moving averages similar to Adam for improved practical performance.