Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Large Language Models; Efficient Training; Low-Rank; LoRA
Abstract:

Modern optimizers such as Adam and Muon are central to training large language models, but their reliance on first- and second-moment estimates introduces significant memory overhead, constraining scalability and computational efficiency. In this work, we reframe the exponential moving average (EMA) underlying these moment estimates as the training of a linear regressor via online gradient flow. Building on this equivalence, we introduce LoRA-Pre, a novel low-rank optimizer designed for efficient pre-training. Specifically, LoRA-Pre reduces the optimizer's memory footprint by decomposing the full momentum matrix into a compact low-rank subspace within the online linear learner, maintaining optimization performance while improving memory efficiency. We empirically validate LoRA-Pre by pre-training Llama-family models ranging from 60M to 1B parameters, where it achieves the highest performance across all model sizes. Notably, LoRA-Pre demonstrates remarkable rank efficiency, matching or exceeding baselines while using only 1/8 of their rank. Beyond pre-training, we evaluate LoRA-Pre in fine-tuning scenarios: at the same rank, it consistently outperforms all efficient fine-tuning baselines, improving over standard LoRA by 3.14 points on Llama-3.1-8B and 6.17 points on Llama-2-7B. These results validate our approach across both pre-training and fine-tuning paradigms.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces LoRA-Pre, an optimizer that reduces memory overhead by decomposing momentum states into low-rank subspaces, framed through an online linear regressor perspective. Within the taxonomy, it occupies the 'Online Linear Regressor Framework' leaf under 'Direct Momentum State Compression'. Notably, this leaf contains only the original paper itself—no sibling papers are present—indicating a relatively sparse and novel research direction. The broader 'Direct Momentum State Compression' branch contains five papers across three leaves, suggesting moderate exploration of momentum factorization strategies but limited work on the specific online learning formulation proposed here.

The taxonomy reveals that neighboring approaches pursue momentum compression through different mechanisms. The sibling leaves 'Rank-One Factorization Approaches' and 'Flexible-Rank Factorization Methods' contain papers employing direct matrix decomposition without the online regressor framing. Adjacent branches explore 'Gradient-Based Low-Rank Optimization' (projecting gradients into subspaces) and 'Parameter-Efficient Fine-Tuning with Memory Optimization' (combining LoRA adaptation with activation memory reduction). The paper's positioning suggests it bridges momentum compression with online learning theory, diverging from purely empirical factorization methods and gradient-projection techniques that dominate neighboring leaves.

Among 23 candidates examined, the core theoretical contribution—equivalence between EMA momentum and online linear regression—shows no clear refutation across 3 candidates reviewed. However, the LoRA-Pre optimizer itself faces substantial prior work: 8 of 10 candidates examined appear to provide overlapping momentum factorization techniques, indicating this contribution operates in a more crowded space. The experimental validation contribution remains unrefuted across 10 candidates, though this likely reflects the specificity of the Llama architecture experiments rather than fundamental novelty. The limited search scope (23 papers, not exhaustive) means these findings characterize the immediate semantic neighborhood rather than the entire field.

Given the top-23 semantic matches analyzed, the online regressor framing appears conceptually distinct within momentum compression literature, while the low-rank factorization mechanism itself aligns with established techniques. The taxonomy structure—particularly the isolated leaf position—suggests the theoretical angle may be underexplored, though the practical optimizer design overlaps with prior factorization work. A broader literature search beyond semantic similarity might reveal additional connections, especially in online learning or regressor-based optimization domains not captured by the current taxonomy scope.

Taxonomy

Core-task Taxonomy Papers: 15
Claimed Contributions: 3
Contribution Candidate Papers Compared: 23
Refutable Papers: 8

Research Landscape Overview

Core task: memory-efficient optimization through low-rank momentum compression. The field addresses the challenge of reducing memory overhead in modern optimizers, particularly adaptive methods like Adam, by exploiting low-rank structure in optimizer states.

The taxonomy reveals several complementary strategies: Direct Momentum State Compression methods explicitly factorize or compress momentum buffers (e.g., Low-rank Momentum Factorization[1], SMMF[5]); Gradient-Based Low-Rank Optimization techniques apply rank constraints directly to gradient updates or parameter trajectories (e.g., Adarankgrad[13], Feature-based Low-rank[10]); Parameter-Efficient Fine-Tuning with Memory Optimization combines low-rank adaptation with optimizer state reduction (e.g., LoRA-FA[3], Alada[4]); Second-Order and Kronecker-Factored Methods leverage structured approximations of curvature information (e.g., MKOR[14]); and Hybrid Compression and Model-Specific Optimization blends multiple compression strategies or tailors them to specific architectures (e.g., MLorc[8], SVD-Free Adaptive[9]).

Together, these branches span a spectrum from purely memory-focused compression to methods that also aim for computational efficiency or improved convergence. Recent work has explored diverse trade-offs between compression fidelity, computational overhead, and convergence guarantees. Some approaches perform explicit low-rank decompositions at each step, while others maintain implicit rank constraints or use randomized sketching to avoid expensive SVD operations.

Taming Momentum[0] sits within the Direct Momentum State Compression branch, specifically under an Online Linear Regressor Framework, emphasizing a principled view of momentum as a predictive model that can be compressed without sacrificing theoretical guarantees.
This contrasts with methods like Low-rank Momentum Factorization[1] or Factorized Hamiltonian Descent[2], which may focus more on empirical tuning of rank schedules or integration with second-order information. By framing compression through an online learning lens, Taming Momentum[0] offers a distinct angle on balancing memory savings with optimizer stability, complementing the broader landscape of low-rank optimization techniques.

Claimed Contributions

Equivalence between EMA momentum and online linear regression

The authors prove that the exponential moving average update used in optimizer momentum can be reformulated as gradient descent on an online linear regression objective. This equivalence reveals that momentum accumulation is mathematically identical to fitting a linear model to approximate gradient history.

3 retrieved papers
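The claimed equivalence is easy to illustrate numerically: one gradient-descent step on the online least-squares objective L(m) = ½‖m − g‖², taken with learning rate 1 − β, reproduces the EMA recursion m ← βm + (1 − β)g exactly. The sketch below checks this identity; the vector size, β, and the random gradient stream are illustrative choices, not taken from the paper:

```python
import numpy as np

def ema_update(m, g, beta):
    """Standard momentum EMA: m <- beta*m + (1-beta)*g."""
    return beta * m + (1.0 - beta) * g

def regression_step(m, g, lr):
    """One gradient-descent step on the online least-squares
    objective L(m) = 0.5*||m - g||^2, whose gradient is (m - g)."""
    return m - lr * (m - g)

rng = np.random.default_rng(0)
m_ema = np.zeros(4)
m_reg = np.zeros(4)
beta = 0.9
for _ in range(100):
    g = rng.normal(size=4)  # a fresh "gradient" observation each step
    m_ema = ema_update(m_ema, g, beta)
    m_reg = regression_step(m_reg, g, lr=1.0 - beta)

# The two recursions coincide exactly when lr = 1 - beta.
print(np.allclose(m_ema, m_reg))  # True
```

Algebraically, m − (1 − β)(m − g) = βm + (1 − β)g, so the two update rules are the same map; the simulation merely confirms this over a stream of gradients.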
LoRA-Pre optimizer with low-rank momentum factorization

The authors introduce LoRA-Pre, which decomposes the full momentum matrix into a product of two low-rank matrices, reducing memory complexity from p×q to (p+q)×r. They provide closed-form update rules (Theorem 3.1) and construct variants for both Adam and Muon optimizers.

10 retrieved papers
Can Refute
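The memory accounting behind this contribution is straightforward to sketch: storing factors A ∈ R^{p×r} and B ∈ R^{r×q} in place of the full momentum matrix M ∈ R^{p×q} cuts the buffer from p·q to (p+q)·r entries. The snippet below uses hypothetical dimensions, and the truncated-SVD projection is only an illustrative stand-in for a rank-r momentum state, not the paper's closed-form update from Theorem 3.1:

```python
import numpy as np

p, q, r = 1024, 4096, 32

# Full momentum buffer: p*q floats.
full_entries = p * q

# Factorized buffer M ~= A @ B with A in R^{p x r}, B in R^{r x q}:
# (p + q) * r floats.
lowrank_entries = (p + q) * r

print(full_entries, lowrank_entries, full_entries / lowrank_entries)
# → 4194304 163840 25.6

# Illustrative rank-r compression of a momentum matrix via truncated SVD
# (smaller sizes here to keep the demo fast).
rng = np.random.default_rng(0)
M = rng.normal(size=(p // 8, q // 8))
U, s, Vt = np.linalg.svd(M, full_matrices=False)
A = U[:, :r] * s[:r]   # (128, 32): left factor, singular values folded in
B = Vt[:r, :]          # (32, 512): right factor
M_hat = A @ B          # best rank-r approximation of M in Frobenius norm
print(A.shape, B.shape, M_hat.shape)  # (128, 32) (32, 512) (128, 512)
```

At r = 32 the hypothetical 1024×4096 buffer shrinks by a factor of 25.6, which is the kind of saving the reported 1/8-rank results would compound further.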
Experimental validation across pre-training and fine-tuning tasks

The authors validate LoRA-Pre on Llama models ranging from 60M to 1B parameters for pre-training and on Llama-2-7B and Llama-3.1-8B for fine-tuning. They show LoRA-Pre achieves comparable or better results using only 1/8 the rank of baseline methods.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution
Equivalence between EMA momentum and online linear regression

Contribution
LoRA-Pre optimizer with low-rank momentum factorization

Contribution
Experimental validation across pre-training and fine-tuning tasks