Eliminating the first moment state in Adam optimizer

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Half-memory Adam, efficient Adam, memory-efficient optimizer
Abstract:

The Adam optimizer and its variants are widely used in large-scale machine learning, but their memory footprint is high because they maintain two state variables per parameter. In Adam, the exponential moving average (EMA) of gradients (m) serves as the first-moment estimator, but it also carries variance information that can be exploited to estimate the second moment. Furthermore, the gradient buffer can be repurposed to handle both gradient accumulation and a proxy for the first moment, effectively folding m into the gradient buffer itself. Together, these modifications reduce the number of optimizer state variables from two to one, yielding Half-Memory Adam (HMAdam) and its decoupled-weight-decay variant (HMAdamW). Both variants retain Adam's update rule, hyperparameters, and convergence properties. Experiments across discriminative and generative tasks, including CNNs, transformers, and diffusion models, show that HMAdamW matches standard AdamW in convergence speed, final accuracy, and runtime while substantially lowering memory usage, making it a practical choice for memory-constrained training scenarios such as large-scale language modeling.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 27
Refutable Papers: 1

Research Landscape Overview

Core task: Reducing optimizer state memory in adaptive gradient methods. The field addresses the substantial memory overhead of adaptive optimizers like Adam, which maintain per-parameter momentum and variance statistics. The taxonomy reveals several major strategies: Low-Rank and Subspace Projection Methods exploit gradient structure through techniques like GaLore[2] and SVD-Free Low-Rank[3] to compress state representations; Quantization and Compression Methods such as 4-bit States[7] and 4-bit Preconditioned[4] reduce precision while preserving convergence; Selective Parameter Update Methods including Adam-mini[11] and Selective Diffusion Optimization[6] update only critical subsets of parameters; and Optimizer State Reduction and Elimination approaches like Memory-Efficient Adaptive[5] and Came[15] fundamentally redesign state tracking. Additional branches cover federated settings, meta-learning, theoretical foundations, and specialized domains, reflecting the breadth of contexts where memory constraints matter.

Recent work has intensified around balancing memory savings with training stability and convergence speed. Low-rank methods offer dramatic reductions but require careful subspace selection, while quantization approaches must manage numerical precision trade-offs.

Within the Optimizer State Reduction and Elimination branch, Eliminating First Moment[0] takes a particularly aggressive stance by removing the momentum term entirely, contrasting with neighbors like Apollo[23] which retains some adaptive structure while simplifying state management. This direction raises fundamental questions about which optimizer components are truly essential: while first-moment elimination achieves maximal memory savings, it must demonstrate that convergence quality remains competitive with methods like AdaLomo[16] or No More Adam[22] that preserve partial adaptivity.
The interplay between memory efficiency and optimization performance continues to drive exploration across quantization, selective updates, and state elimination strategies.

Claimed Contributions

Half-Memory Adam optimizer with single state variable

The authors propose a memory-efficient variant of the Adam optimizer that eliminates the first-moment state variable by reusing the gradient buffer and estimating the second moment from the exponentially weighted gradient average, thereby halving optimizer state memory while preserving Adam's update rule and convergence behavior.

10 retrieved papers
Can Refute
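The report does not reproduce the paper's update rule, so the following is a minimal sketch of what a single-state step could look like. It assumes the buffer holds a normalized EMA of gradients, and the (1 + beta) / (1 - beta) rescaling used as the second-moment proxy is a hypothetical stationary-variance correction, not necessarily the authors' exact formula; all names and constants are illustrative.

```python
import math

def hmadam_step(params, grad_ema, lr=1e-3, beta=0.9, eps=1e-8, wd=0.0):
    """Sketch of a single-state Adam-style update.

    `grad_ema` is the ONLY per-parameter optimizer state: the gradient
    buffer kept as an EMA across steps. The second moment is estimated
    from it on the fly (the (1+beta)/(1-beta) factor is an assumed
    variance rescaling, not taken from the paper).
    """
    scale = (1 + beta) / (1 - beta)
    out = []
    for p, m in zip(params, grad_ema):
        v_hat = scale * m * m            # second-moment proxy built from m alone
        if wd:
            p = p * (1 - lr * wd)        # decoupled weight decay (HMAdamW)
        out.append(p - lr * m / (math.sqrt(v_hat) + eps))
    return out
```

With this particular proxy the per-coordinate step magnitude is bounded by lr / sqrt(scale), giving sign-descent-like behavior in the low-SNR regime; the paper's actual estimator may differ.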
Gradient buffer reuse for first-moment estimation

The method replaces the standard zero-gradient operation with a decay operation, transforming the gradient accumulator into an exponential moving average across training steps that serves dual purposes: accumulating gradients within a step and maintaining the first-moment estimate without requiring a separate state variable.

7 retrieved papers
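As a concrete illustration of the decay-instead-of-zero idea, here is a sketch of one step's gradient handling. The unnormalized EMA form m_t = beta * m_{t-1} + g_t and the beta value are assumptions inferred from the description above; any (1 - beta) normalization would have to be absorbed elsewhere in the update.

```python
def accumulate_with_decay(grad_buffer, fresh_grads, beta=0.9):
    """One step of gradient handling (sketch). Where a standard loop
    calls zero_grad() before backward(), the buffer is decayed instead,
    so the accumulation that follows yields an unnormalized EMA:
        m_t = beta * m_{t-1} + g_t
    The same buffer thus serves intra-step gradient accumulation and
    the cross-step first-moment estimate, with no extra state variable.
    """
    for i in range(len(grad_buffer)):
        grad_buffer[i] = beta * grad_buffer[i] + fresh_grads[i]
    return grad_buffer
```

With a constant gradient of 1, the buffer converges to the usual unnormalized-EMA fixed point 1 / (1 - beta) = 10.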
Second-moment estimation from first-moment variance

The authors exploit the observation that the exponential moving average of gradients inherently contains variance information due to gradient stochasticity, enabling construction of a second-moment estimator from the first-moment accumulator rather than from raw gradients; this is particularly effective under the low signal-to-noise-ratio conditions typical of deep learning.

10 retrieved papers
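A quick Monte Carlo sanity check of that observation (a sketch with assumed numbers, not the authors' estimator): for zero-mean i.i.d. gradient noise of variance sigma^2, the normalized EMA m_t = beta * m_{t-1} + (1 - beta) * g_t has stationary variance sigma^2 * (1 - beta) / (1 + beta), so rescaling m^2 by (1 + beta) / (1 - beta) recovers the raw second moment in the low-SNR regime.

```python
import random

def second_moment_from_ema(beta=0.9, sigma=2.0, steps=200_000, seed=0):
    """Estimate E[g^2] using only the first-moment EMA (sketch).

    Gradients are simulated as pure zero-mean noise (the low-SNR limit).
    The stationary-variance identity
        Var(m) = sigma^2 * (1 - beta) / (1 + beta)
    lets m^2, suitably rescaled, stand in for the second moment.
    """
    rng = random.Random(seed)
    m, acc, n = 0.0, 0.0, 0
    for t in range(steps):
        g = rng.gauss(0.0, sigma)
        m = beta * m + (1 - beta) * g        # normalized first-moment EMA
        if t >= 1_000:                        # skip burn-in
            acc += m * m * (1 + beta) / (1 - beta)
            n += 1
    return acc / n                            # approximately sigma**2
```

With sigma = 2.0 the estimate should land near sigma^2 = 4.0 up to Monte Carlo error. The same identity breaks down when gradients have a large mean relative to their noise (high SNR), which is why the low-SNR regime is emphasized above.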

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Half-Memory Adam optimizer with single state variable

Contribution

Gradient buffer reuse for first-moment estimation

Contribution

Second-moment estimation from first-moment variance
