Eliminating the first moment state in Adam optimizer

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Half-memory Adam, efficient Adam, memory-efficient optimizer
Abstract:

The Adam optimizer and its variants are widely used in large-scale machine learning, but their memory footprint is high because they maintain two state variables per parameter. In Adam, the exponential moving average (EMA) of gradients (m) serves as the first-moment estimator, but it also carries variance information that can be exploited to estimate the second moment. Furthermore, the gradient buffer can be repurposed to handle both gradient accumulation and a proxy for the first moment, effectively folding m into the gradient buffer itself. Together, these modifications reduce the number of optimizer state variables from two to one, yielding Half-Memory Adam (HMAdam) and its decoupled-weight-decay variant (HMAdamW). Both variants retain Adam's update rule, hyperparameters, and convergence properties. Experiments across discriminative and generative tasks, including CNNs, transformers, and diffusion models, show that HMAdamW matches standard AdamW in convergence speed, final accuracy, and runtime while substantially lowering memory usage, making it a practical choice for memory-constrained training scenarios such as large-scale language modeling.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 27
Refutable Papers: 1

Research Landscape Overview

Core task: Reducing optimizer state memory in adaptive gradient methods. The field addresses the substantial memory overhead of adaptive optimizers like Adam, which maintain per-parameter momentum and variance statistics. The taxonomy reveals several major strategies: Low-Rank and Subspace Projection Methods exploit gradient structure through techniques like GaLore[2] and SVD-Free Low-Rank[3] to compress state representations; Quantization and Compression Methods such as 4-bit States[7] and 4-bit Preconditioned[4] reduce precision while preserving convergence; Selective Parameter Update Methods including Adam-mini[11] and Selective Diffusion Optimization[6] update only critical subsets of parameters; and Optimizer State Reduction and Elimination approaches like Memory-Efficient Adaptive[5] and Came[15] fundamentally redesign state tracking. Additional branches cover federated settings, meta-learning, theoretical foundations, and specialized domains, reflecting the breadth of contexts where memory constraints matter.

Recent work has intensified around balancing memory savings with training stability and convergence speed. Low-rank methods offer dramatic reductions but require careful subspace selection, while quantization approaches must manage numerical precision trade-offs.

Within the Optimizer State Reduction and Elimination branch, Eliminating First Moment[0] takes a particularly aggressive stance by removing the momentum term entirely, contrasting with neighbors like Apollo[23] which retains some adaptive structure while simplifying state management. This direction raises fundamental questions about which optimizer components are truly essential: while first-moment elimination achieves maximal memory savings, it must demonstrate that convergence quality remains competitive with methods like AdaLomo[16] or No More Adam[22] that preserve partial adaptivity.
The interplay between memory efficiency and optimization performance continues to drive exploration across quantization, selective updates, and state elimination strategies.

Claimed Contributions

Half-Memory Adam optimizer with single state variable

The authors propose a memory-efficient variant of the Adam optimizer that eliminates the first-moment state variable by reusing the gradient buffer and estimating the second moment from the exponentially weighted gradient average, thereby halving optimizer state memory while preserving Adam's update rule and convergence behavior.

10 retrieved papers
Can Refute
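The report does not reproduce the paper's update rule, so the following is a minimal sketch of what a single-state step could look like. It assumes the buffer holds a normalized EMA of gradients, and the (1 + beta) / (1 - beta) rescaling used as the second-moment proxy is a hypothetical stationary-variance correction, not necessarily the authors' exact formula; all names and constants are illustrative.

```python
import math

def hmadam_step(params, grad_ema, lr=1e-3, beta=0.9, eps=1e-8, wd=0.0):
    """Sketch of a single-state Adam-style update.

    `grad_ema` is the ONLY per-parameter optimizer state: the gradient
    buffer kept as an EMA across steps. The second moment is estimated
    from it on the fly (the (1+beta)/(1-beta) factor is an assumed
    variance rescaling, not taken from the paper).
    """
    scale = (1 + beta) / (1 - beta)
    out = []
    for p, m in zip(params, grad_ema):
        v_hat = scale * m * m            # second-moment proxy built from m alone
        if wd:
            p = p * (1 - lr * wd)        # decoupled weight decay (HMAdamW)
        out.append(p - lr * m / (math.sqrt(v_hat) + eps))
    return out
```

With this particular proxy the per-coordinate step magnitude is bounded by lr / sqrt(scale), giving sign-descent-like behavior in the low-SNR regime; the paper's actual estimator may differ.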
Gradient buffer reuse for first-moment estimation

The method replaces the standard zero-gradient operation with a decay operation, transforming the gradient accumulator into an exponential moving average across training steps that serves dual purposes: accumulating gradients within a step and maintaining the first-moment estimate without requiring a separate state variable.

7 retrieved papers
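As a concrete illustration of the decay-instead-of-zero idea, here is a sketch of one step's gradient handling. The unnormalized EMA form m_t = beta * m_{t-1} + g_t and the beta value are assumptions inferred from the description above; any (1 - beta) normalization would have to be absorbed elsewhere in the update.

```python
def accumulate_with_decay(grad_buffer, fresh_grads, beta=0.9):
    """One step of gradient handling (sketch). Where a standard loop
    calls zero_grad() before backward(), the buffer is decayed instead,
    so the accumulation that follows yields an unnormalized EMA:
        m_t = beta * m_{t-1} + g_t
    The same buffer thus serves intra-step gradient accumulation and
    the cross-step first-moment estimate, with no extra state variable.
    """
    for i in range(len(grad_buffer)):
        grad_buffer[i] = beta * grad_buffer[i] + fresh_grads[i]
    return grad_buffer
```

With a constant gradient of 1, the buffer converges to the usual unnormalized-EMA fixed point 1 / (1 - beta) = 10.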
Second-moment estimation from first-moment variance

The authors exploit the observation that the exponential moving average of gradients inherently contains variance information due to gradient stochasticity, enabling construction of a second-moment estimator from the first-moment accumulator rather than from raw gradients; this is particularly effective under the low signal-to-noise-ratio conditions typical of deep learning.

10 retrieved papers
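A quick Monte Carlo sanity check of that observation (a sketch with assumed numbers, not the authors' estimator): for zero-mean i.i.d. gradient noise of variance sigma^2, the normalized EMA m_t = beta * m_{t-1} + (1 - beta) * g_t has stationary variance sigma^2 * (1 - beta) / (1 + beta), so rescaling m^2 by (1 + beta) / (1 - beta) recovers the raw second moment in the low-SNR regime.

```python
import random

def second_moment_from_ema(beta=0.9, sigma=2.0, steps=200_000, seed=0):
    """Estimate E[g^2] using only the first-moment EMA (sketch).

    Gradients are simulated as pure zero-mean noise (the low-SNR limit).
    The stationary-variance identity
        Var(m) = sigma^2 * (1 - beta) / (1 + beta)
    lets m^2, suitably rescaled, stand in for the second moment.
    """
    rng = random.Random(seed)
    m, acc, n = 0.0, 0.0, 0
    for t in range(steps):
        g = rng.gauss(0.0, sigma)
        m = beta * m + (1 - beta) * g        # normalized first-moment EMA
        if t >= 1_000:                        # skip burn-in
            acc += m * m * (1 + beta) / (1 - beta)
            n += 1
    return acc / n                            # approximately sigma**2
```

With sigma = 2.0 the estimate should land near sigma^2 = 4.0 up to Monte Carlo error. The same identity breaks down when gradients have a large mean relative to their noise (high SNR), which is why the low-SNR regime is emphasized above.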

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Half-Memory Adam optimizer with single state variable

Contribution

Gradient buffer reuse for first-moment estimation

Contribution

Second-moment estimation from first-moment variance
