Achieving low-bit Muon through subspace preservation and grid quantization

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: LLM, memory-efficient, quantization, low-bit, Muon optimizer
Abstract:

Training Large Language Models (LLMs) faces severe memory constraints due to the increasing size of model parameters and optimizer states. The Muon optimizer, which is based on matrix orthogonalization, has recently demonstrated significant potential and offers considerable memory advantages over AdamW by maintaining only the first moment. However, how to apply memory-reduction techniques to further compress Muon's optimizer states remains underexplored, and directly applying existing methods can run into significant difficulties because of the orthogonalization step. In this work, we investigate low-bit compression of Muon and systematically analyze the quantization error exacerbated by orthogonalization. We identify that the error primarily originates from the top singular subspace and from outlier patterns of the moment matrix that appear across both dimensions. To address this, we propose 4-bit-Muon-GRASP (GRid And Subspace Preserving), which compresses the Muon optimizer states to 4 bits using grid quantization, while preserving the top singular subspace with minimal overhead. We evaluate 4-bit-Muon-GRASP through pre-training on LLaMA-130M, 350M, and 1.1B architectures and fine-tuning on 7B models for various reasoning tasks. Extensive experimental results show that 4-bit-Muon-GRASP achieves accuracy comparable to its full-precision counterpart while reducing training memory consumption by up to 28%. Code will be made public upon acceptance.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes 4-bit-Muon-GRASP, a compression method tailored to the Muon optimizer, which replaces AdamW's dual-moment states with orthogonalized first-moment updates. It resides in the 'Quantization for Second-Order and Novel Optimizers' leaf alongside two sibling papers: one on 4-bit Shampoo compression and another on quantized Muon states. This leaf contains only three papers within the broader 'Optimizer State Quantization Methods' branch, indicating a relatively sparse but emerging research direction focused on memory reduction for non-standard optimizers.
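Muon's defining step, orthogonalizing the first-moment matrix, is typically implemented with a Newton-Schulz iteration. The sketch below uses the classic cubic variant as a stand-in; the Muon optimizer in practice uses a tuned quintic polynomial, but the fixed point, the orthogonal polar factor, is the same.

```python
import numpy as np

def newton_schulz_orthogonalize(m, steps=30):
    """Drive the singular values of m toward 1, i.e., approximate the
    orthogonal polar factor U @ V.T of m = U @ S @ V.T.

    Classic cubic Newton-Schulz iteration (illustrative stand-in for
    Muon's tuned quintic variant, which shares the same fixed point).
    """
    # Frobenius normalization bounds the singular values by 1,
    # keeping the iteration inside its convergence region.
    x = m / np.linalg.norm(m)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T @ x)
    return x

rng = np.random.default_rng(0)
m = rng.standard_normal((8, 8))
q = newton_schulz_orthogonalize(m)
err = np.linalg.norm(q @ q.T - np.eye(8))  # distance from orthogonality
```

The same iteration applies to rectangular moment matrices; only matrix multiplications are needed, which is why the update is cheap on accelerators.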

The taxonomy tree shows that the broader field divides into six main branches: direct optimizer quantization, low-rank subspace methods, zeroth-order optimization, quantized fine-tuning, ultra-low-bit model quantization, and theoretical analysis. The paper's leaf sits under direct quantization but targets a novel optimizer architecture, distinguishing it from the crowded 'Block-Wise Quantization for First-Order Optimizers' leaf (three papers on Adam/AdamW compression). Neighboring leaves include 'Gradient Low-Rank Projection' and 'State Folding,' which avoid quantization entirely by exploiting gradient structure or approximation, whereas this work directly compresses Muon's states while preserving critical subspace information.

Among 27 candidates examined across three contributions, no refutable prior work was identified. The systematic quantization error analysis examined 10 candidates with no overlaps; the 4-bit-Muon-GRASP method examined 7 candidates with no refutations; and the empirical validation examined 10 candidates, also without conflicts. Given the limited search scope—27 papers from semantic and citation-based retrieval—the absence of refutations suggests the specific combination of Muon quantization, grid-based compression, and top singular subspace preservation has not been directly addressed in the examined literature, though the search does not cover the entire field exhaustively.

Based on the top-27 semantic matches and taxonomy structure, the work appears to occupy a sparsely populated niche within optimizer state compression, specifically targeting the Muon optimizer's unique orthogonalization properties. The analysis does not extend to broader quantization literature outside the LLM training domain or to methods published after the search cutoff, so the novelty assessment remains contingent on this bounded scope.

Taxonomy

Core-task Taxonomy Papers: 38
Claimed Contributions: 3
Contribution Candidate Papers Compared: 27
Refutable Papers: 0

Research Landscape Overview

Core task: low-bit compression of optimizer states for large language model training. The field addresses the memory bottleneck imposed by optimizer states during LLM training by exploring diverse compression strategies. The taxonomy reveals several main branches: direct quantization of optimizer states (e.g., 8-bit Optimizers[7], 4-bit State Optimizers[15]) reduces memory by storing momentum and variance in lower precision; low-rank and subspace methods (e.g., GaLore[9], Q-GaLore[16]) exploit gradient structure to compress updates; zeroth-order optimization (e.g., QuZO[1], ZO2[20]) sidesteps gradient computation entirely; quantized fine-tuning techniques (e.g., QFT[23], Sub-4-bit Fine-tuning[6]) combine model and optimizer compression for parameter-efficient adaptation; ultra-low-bit model quantization (e.g., BitNet Scaling[3], Bitnet One-bit Pretraining[2]) pushes weights and activations to extreme precision; theoretical and empirical analyses (e.g., Floating-point Optimizer Convergence[35], FP8 Training Stability[30]) provide convergence guarantees and stability insights; and surveys (e.g., Quantization Survey[14], On-Device LLM Survey[22]) offer broad overviews of the quantization landscape.
Recent work has intensified around second-order and novel optimizers, where methods like 4-bit Shampoo[17] and Quantized Muon States[28] compress preconditioner matrices that are traditionally memory-intensive. Low-bit Muon[0] sits within this cluster, focusing on quantizing the Muon optimizer's states to enable efficient large-scale training. Compared to 4-bit Shampoo[17], which targets Shampoo's second-order statistics, Low-bit Muon[0] addresses a different optimizer architecture, while Quantized Muon States[28] explores similar themes but may differ in quantization granularity or training stability.
A key tension across branches is the trade-off between compression aggressiveness and convergence quality: ultra-low-bit approaches (e.g., 1-bit Optimization Rethinking[27]) promise maximal memory savings but risk training instability, whereas moderate quantization (e.g., 8-bit Optimizers[7]) offers safer convergence. Low-bit Muon[0] contributes to understanding how novel optimizers can be compressed without sacrificing the benefits of advanced update rules, bridging optimizer innovation and memory efficiency.
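For context, the direct-quantization branch generally builds on block-wise absmax quantization, the scheme popularized by 8-bit optimizers: each block of the state tensor stores one floating-point scale plus low-bit integers. A simplified sketch (not the exact memory layout of any particular library):

```python
import numpy as np

def blockwise_absmax_quantize(x, block=64, bits=8):
    """Quantize-dequantize x with one absmax scale per `block` values.

    Simplified sketch of block-wise optimizer-state quantization;
    real implementations pack the integers and keep the scales separately.
    """
    flat = x.ravel()
    pad = (-flat.size) % block            # pad so blocks divide evenly
    flat = np.concatenate([flat, np.zeros(pad)])
    blocks = flat.reshape(-1, block)
    scales = np.abs(blocks).max(axis=1, keepdims=True)
    scales[scales == 0] = 1.0             # avoid division by zero
    qmax = 2 ** (bits - 1) - 1
    q = np.round(blocks / scales * qmax).astype(np.int8)
    deq = q.astype(np.float64) / qmax * scales
    return deq.ravel()[: x.size].reshape(x.shape)

x = np.random.default_rng(1).standard_normal((128, 128))
xq = blockwise_absmax_quantize(x)
rel_err = np.linalg.norm(xq - x) / np.linalg.norm(x)
```

Small blocks confine the damage an outlier can do to its own block, which is exactly the failure mode the grid scheme discussed below generalizes to two dimensions.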

Claimed Contributions

Systematic analysis of quantization error in Muon optimizer

The authors conduct a systematic analysis revealing that Newton-Schulz orthogonalization amplifies quantization error primarily in the top singular subspace of the moment matrix, and identify outlier patterns appearing across both dimensions. This analysis motivates dividing the moment matrix into top and residual singular subspaces for separate compression.

10 retrieved papers
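The amplification mechanism can be seen in a deliberately small, deterministic example. This is a generic illustration, not the paper's subspace-resolved analysis: the orthogonal polar factor, which Newton-Schulz iterations converge to, reacts violently to quantization-scale perturbations in directions whose singular values are tiny.

```python
import numpy as np

def polar_factor(m):
    """Orthogonal polar factor U @ V.T of m = U @ S @ V.T, the fixed
    point that Muon's Newton-Schulz iterations converge to."""
    u, _, vt = np.linalg.svd(m)
    return u @ vt

# Ill-conditioned "moment" matrix: one dominant, two tiny singular values.
m = np.diag([1.0, 1e-3, 1e-3])

# Quantization-scale perturbation acting on the small singular subspace.
noise = 1e-2 * np.array([[0.0, 0.0,  0.0],
                         [0.0, 0.0, -1.0],
                         [0.0, 1.0,  0.0]])

rel_in = np.linalg.norm(noise) / np.linalg.norm(m)
diff = polar_factor(m + noise) - polar_factor(m)
rel_out = np.linalg.norm(diff) / np.linalg.norm(polar_factor(m))
# rel_out exceeds rel_in by nearly two orders of magnitude: the tiny
# singular values carry almost no directional signal, so the noise
# decides which orthogonal directions come out of the polar factor.
```

A roughly 1% input error becomes an order-one output error, which is why naive 4-bit quantization of the moment matrix degrades Muon far more than it would a plain momentum update.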
4-bit-Muon-GRASP optimizer with subspace preservation and grid quantization

The authors propose a novel 4-bit compression method for Muon that uses 8-bit compression for the top singular subspace (obtained via power iteration) and 4-bit grid quantization for the residual subspace. Grid quantization normalizes both row and column directions to handle outliers appearing across both dimensions.

7 retrieved papers
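Based on the description above, the two ingredients can be sketched as follows. Everything here is a hypothetical reconstruction: the helper names, the rank-1 subspace, and keeping the top component in full precision (the paper compresses it to 8 bits) are simplifying assumptions, not the authors' implementation.

```python
import numpy as np

def grid_quantize(x, bits=4):
    """Two-sided ("grid") quantization sketch: divide out a per-row and a
    per-column absmax scale so outliers along either dimension cannot
    inflate the shared low-bit step, then round to a signed grid."""
    row = np.abs(x).max(axis=1, keepdims=True)
    row[row == 0] = 1.0
    y = x / row
    col = np.abs(y).max(axis=0, keepdims=True)
    col[col == 0] = 1.0
    y = y / col
    qmax = 2 ** (bits - 1) - 1
    q = np.round(y * qmax)
    return q / qmax * col * row           # dequantized reconstruction

def top_subspace(x, rank=1, iters=30):
    """Power iteration for the top singular subspace (low overhead)."""
    rng = np.random.default_rng(0)
    v = rng.standard_normal((x.shape[1], rank))
    for _ in range(iters):
        u, _ = np.linalg.qr(x @ v)
        v, _ = np.linalg.qr(x.T @ u)
    return u, v

rng = np.random.default_rng(1)
m = rng.standard_normal((64, 64))
m[:, 0] *= 20                             # column outlier
m[0, :] *= 20                             # row outlier

u, v = top_subspace(m)
top = u @ (u.T @ m @ v) @ v.T             # kept high-precision (8-bit in the paper)
m_hat = top + grid_quantize(m - top)      # 4-bit grid-quantized residual
grasp_rel = np.linalg.norm(m_hat - m) / np.linalg.norm(m)

scale = np.abs(m).max() / 7               # naive per-tensor 4-bit baseline
naive_rel = np.linalg.norm(np.round(m / scale) * scale - m) / np.linalg.norm(m)
```

On this synthetic moment matrix with outliers in both a row and a column, the subspace-plus-grid reconstruction error comes out well below the naive per-tensor 4-bit baseline, matching the motivation given above.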
Empirical validation on LLM pre-training and fine-tuning tasks

The authors evaluate their method through extensive experiments on LLaMA models of various sizes (130M to 7B parameters) for both pre-training and fine-tuning tasks, demonstrating that 4-bit-Muon-GRASP matches full-precision performance while reducing total training memory by up to 28 percent.

10 retrieved papers
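The headline memory numbers are easy to sanity-check with back-of-envelope accounting. The figures below are illustrative assumptions (fp32 states, no scale or subspace overhead), not the paper's measured figure: the reported "up to 28%" covers total training memory, which also includes weights, gradients, and activations.

```python
# Optimizer-state memory for a 1.1B-parameter model (rough accounting).
params = 1.1e9

adamw_fp32 = params * 8    # AdamW: two fp32 moments, 8 bytes/param
muon_fp32 = params * 4     # Muon: one fp32 moment, 4 bytes/param
muon_4bit = params * 0.5   # 4-bit moment, ignoring per-block scales
                           # and the small 8-bit top-subspace factors

gib = 1024 ** 3
print(f"AdamW fp32 states: {adamw_fp32 / gib:.1f} GiB")
print(f"Muon  fp32 states: {muon_fp32 / gib:.1f} GiB")
print(f"Muon 4-bit states: {muon_4bit / gib:.1f} GiB")
```

Muon alone halves the state memory relative to AdamW, and 4-bit compression shrinks the remaining moment by a further factor of eight, so optimizer states stop being the dominant term in the training-memory budget.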

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Systematic analysis of quantization error in Muon optimizer

The authors conduct a systematic analysis revealing that Newton-Schulz orthogonalization amplifies quantization error primarily in the top singular subspace of the moment matrix, and identify outlier patterns appearing across both dimensions. This analysis motivates dividing the moment matrix into top and residual singular subspaces for separate compression.

Contribution

4-bit-Muon-GRASP optimizer with subspace preservation and grid quantization

The authors propose a novel 4-bit compression method for Muon that uses 8-bit compression for the top singular subspace (obtained via power iteration) and 4-bit grid quantization for the residual subspace. Grid quantization normalizes both row and column directions to handle outliers appearing across both dimensions.

Contribution

Empirical validation on LLM pre-training and fine-tuning tasks

The authors evaluate their method through extensive experiments on LLaMA models of various sizes (130M to 7B parameters) for both pre-training and fine-tuning tasks, demonstrating that 4-bit-Muon-GRASP matches full-precision performance while reducing total training memory by up to 28 percent.
