Achieving low-bit Muon through subspace preservation and grid quantization

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: LLM, memory-efficient, quantization, low-bit, Muon optimizer
Abstract:

Training Large Language Models (LLMs) faces severe memory constraints due to the increasing size of model parameters and optimizer states. The Muon optimizer, which is based on matrix orthogonalization, has recently demonstrated significant potential and offers considerable memory advantages over AdamW by maintaining only the first moment. However, how to apply memory-reduction techniques to further compress Muon's optimizer states remains underexplored, and directly applying existing methods can run into significant difficulties because of the orthogonalization step. In this work, we investigate low-bit compression of Muon and systematically analyze the quantization error exacerbated by orthogonalization. We identify that the error primarily originates from the top singular subspace and from outlier patterns of the moment matrix that appear across both dimensions. To address this, we propose 4-bit-Muon-GRASP (GRid And Subspace Preserving), which compresses the Muon optimizer states to 4 bits using grid quantization, while preserving the top singular subspace with minimal overhead. We evaluate 4-bit-Muon-GRASP through pre-training on LLaMA-130M, 350M, and 1.1B architectures and fine-tuning on 7B models for various reasoning tasks. Extensive experimental results show that 4-bit-Muon-GRASP achieves accuracy comparable to its full-precision counterpart while reducing training memory consumption by up to 28%. Code will be made public upon acceptance.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes 4-bit-Muon-GRASP, a compression method tailored to the Muon optimizer, which replaces AdamW's dual-moment states with orthogonalized first-moment updates. It resides in the 'Quantization for Second-Order and Novel Optimizers' leaf alongside two sibling papers: one on 4-bit Shampoo compression and another on quantized Muon states. This leaf contains only three papers within the broader 'Optimizer State Quantization Methods' branch, indicating a relatively sparse but emerging research direction focused on memory reduction for non-standard optimizers.
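Muon's defining step, orthogonalizing the first-moment matrix, is typically implemented with a Newton-Schulz iteration. The sketch below uses the classic cubic variant as a stand-in; the Muon optimizer in practice uses a tuned quintic polynomial, but the fixed point, the orthogonal polar factor, is the same.

```python
import numpy as np

def newton_schulz_orthogonalize(m, steps=30):
    """Drive the singular values of m toward 1, i.e., approximate the
    orthogonal polar factor U @ V.T of m = U @ S @ V.T.

    Classic cubic Newton-Schulz iteration (illustrative stand-in for
    Muon's tuned quintic variant, which shares the same fixed point).
    """
    # Frobenius normalization bounds the singular values by 1,
    # keeping the iteration inside its convergence region.
    x = m / np.linalg.norm(m)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T @ x)
    return x

rng = np.random.default_rng(0)
m = rng.standard_normal((8, 8))
q = newton_schulz_orthogonalize(m)
err = np.linalg.norm(q @ q.T - np.eye(8))  # distance from orthogonality
```

The same iteration applies to rectangular moment matrices; only matrix multiplications are needed, which is why the update is cheap on accelerators.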

The taxonomy tree shows that the broader field divides into six main branches: direct optimizer quantization, low-rank subspace methods, zeroth-order optimization, quantized fine-tuning, ultra-low-bit model quantization, and theoretical analysis. The paper's leaf sits under direct quantization but targets a novel optimizer architecture, distinguishing it from the crowded 'Block-Wise Quantization for First-Order Optimizers' leaf (three papers on Adam/AdamW compression). Neighboring leaves include 'Gradient Low-Rank Projection' and 'State Folding,' which avoid quantization entirely by exploiting gradient structure or approximation, whereas this work directly compresses Muon's states while preserving critical subspace information.

Among 27 candidates examined across three contributions, no refutable prior work was identified. The systematic quantization error analysis examined 10 candidates with no overlaps; the 4-bit-Muon-GRASP method examined 7 candidates with no refutations; and the empirical validation examined 10 candidates, also without conflicts. Given the limited search scope—27 papers from semantic and citation-based retrieval—the absence of refutations suggests the specific combination of Muon quantization, grid-based compression, and top singular subspace preservation has not been directly addressed in the examined literature, though the search does not cover the entire field exhaustively.

Based on the top-27 semantic matches and taxonomy structure, the work appears to occupy a sparsely populated niche within optimizer state compression, specifically targeting the Muon optimizer's unique orthogonalization properties. The analysis does not extend to broader quantization literature outside the LLM training domain or to methods published after the search cutoff, so the novelty assessment remains contingent on this bounded scope.

Taxonomy

Core-task Taxonomy Papers: 38
Claimed Contributions: 3
Contribution Candidate Papers Compared: 27
Refutable Papers: 0

Research Landscape Overview

Core task: low-bit compression of optimizer states for large language model training. The field addresses the memory bottleneck imposed by optimizer states during LLM training by exploring diverse compression strategies. The taxonomy reveals several main branches: direct quantization of optimizer states (e.g., 8-bit Optimizers[7], 4-bit State Optimizers[15]) reduces memory by storing momentum and variance in lower precision; low-rank and subspace methods (e.g., GaLore[9], Q-GaLore[16]) exploit gradient structure to compress updates; zeroth-order optimization (e.g., QuZO[1], ZO2[20]) sidesteps gradient computation entirely; quantized fine-tuning techniques (e.g., QFT[23], Sub-4-bit Fine-tuning[6]) combine model and optimizer compression for parameter-efficient adaptation; ultra-low-bit model quantization (e.g., BitNet Scaling[3], Bitnet One-bit Pretraining[2]) pushes weights and activations to extreme precision; theoretical and empirical analyses (e.g., Floating-point Optimizer Convergence[35], FP8 Training Stability[30]) provide convergence guarantees and stability insights; and surveys (e.g., Quantization Survey[14], On-Device LLM Survey[22]) offer broad overviews of the quantization landscape.
Recent work has intensified around second-order and novel optimizers, where methods like 4-bit Shampoo[17] and Quantized Muon States[28] compress preconditioner matrices that are traditionally memory-intensive. Low-bit Muon[0] sits within this cluster, focusing on quantizing the Muon optimizer's states to enable efficient large-scale training. Compared to 4-bit Shampoo[17], which targets Shampoo's second-order statistics, Low-bit Muon[0] addresses a different optimizer architecture, while Quantized Muon States[28] explores similar themes but may differ in quantization granularity or training stability.
A key tension across branches is the trade-off between compression aggressiveness and convergence quality: ultra-low-bit approaches (e.g., 1-bit Optimization Rethinking[27]) promise maximal memory savings but risk training instability, whereas moderate quantization (e.g., 8-bit Optimizers[7]) offers safer convergence. Low-bit Muon[0] contributes to understanding how novel optimizers can be compressed without sacrificing the benefits of advanced update rules, bridging optimizer innovation and memory efficiency.
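For context, the direct-quantization branch generally builds on block-wise absmax quantization, the scheme popularized by 8-bit optimizers: each block of the state tensor stores one floating-point scale plus low-bit integers. A simplified sketch (not the exact memory layout of any particular library):

```python
import numpy as np

def blockwise_absmax_quantize(x, block=64, bits=8):
    """Quantize-dequantize x with one absmax scale per `block` values.

    Simplified sketch of block-wise optimizer-state quantization;
    real implementations pack the integers and keep the scales separately.
    """
    flat = x.ravel()
    pad = (-flat.size) % block            # pad so blocks divide evenly
    flat = np.concatenate([flat, np.zeros(pad)])
    blocks = flat.reshape(-1, block)
    scales = np.abs(blocks).max(axis=1, keepdims=True)
    scales[scales == 0] = 1.0             # avoid division by zero
    qmax = 2 ** (bits - 1) - 1
    q = np.round(blocks / scales * qmax).astype(np.int8)
    deq = q.astype(np.float64) / qmax * scales
    return deq.ravel()[: x.size].reshape(x.shape)

x = np.random.default_rng(1).standard_normal((128, 128))
xq = blockwise_absmax_quantize(x)
rel_err = np.linalg.norm(xq - x) / np.linalg.norm(x)
```

Small blocks confine the damage an outlier can do to its own block, which is exactly the failure mode the grid scheme discussed below generalizes to two dimensions.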

Claimed Contributions

Systematic analysis of quantization error in Muon optimizer

The authors conduct a systematic analysis revealing that Newton-Schulz orthogonalization amplifies quantization error primarily in the top singular subspace of the moment matrix, and identify outlier patterns appearing across both dimensions. This analysis motivates dividing the moment matrix into top and residual singular subspaces for separate compression.

10 retrieved papers
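The amplification mechanism can be seen in a deliberately small, deterministic example. This is a generic illustration, not the paper's subspace-resolved analysis: the orthogonal polar factor, which Newton-Schulz iterations converge to, reacts violently to quantization-scale perturbations in directions whose singular values are tiny.

```python
import numpy as np

def polar_factor(m):
    """Orthogonal polar factor U @ V.T of m = U @ S @ V.T, the fixed
    point that Muon's Newton-Schulz iterations converge to."""
    u, _, vt = np.linalg.svd(m)
    return u @ vt

# Ill-conditioned "moment" matrix: one dominant, two tiny singular values.
m = np.diag([1.0, 1e-3, 1e-3])

# Quantization-scale perturbation acting on the small singular subspace.
noise = 1e-2 * np.array([[0.0, 0.0,  0.0],
                         [0.0, 0.0, -1.0],
                         [0.0, 1.0,  0.0]])

rel_in = np.linalg.norm(noise) / np.linalg.norm(m)
diff = polar_factor(m + noise) - polar_factor(m)
rel_out = np.linalg.norm(diff) / np.linalg.norm(polar_factor(m))
# rel_out exceeds rel_in by nearly two orders of magnitude: the tiny
# singular values carry almost no directional signal, so the noise
# decides which orthogonal directions come out of the polar factor.
```

A roughly 1% input error becomes an order-one output error, which is why naive 4-bit quantization of the moment matrix degrades Muon far more than it would a plain momentum update.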
4-bit-Muon-GRASP optimizer with subspace preservation and grid quantization

The authors propose a novel 4-bit compression method for Muon that uses 8-bit compression for the top singular subspace (obtained via power iteration) and 4-bit grid quantization for the residual subspace. Grid quantization normalizes both row and column directions to handle outliers appearing across both dimensions.

7 retrieved papers
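Based on the description above, the two ingredients can be sketched as follows. Everything here is a hypothetical reconstruction: the helper names, the rank-1 subspace, and keeping the top component in full precision (the paper compresses it to 8 bits) are simplifying assumptions, not the authors' implementation.

```python
import numpy as np

def grid_quantize(x, bits=4):
    """Two-sided ("grid") quantization sketch: divide out a per-row and a
    per-column absmax scale so outliers along either dimension cannot
    inflate the shared low-bit step, then round to a signed grid."""
    row = np.abs(x).max(axis=1, keepdims=True)
    row[row == 0] = 1.0
    y = x / row
    col = np.abs(y).max(axis=0, keepdims=True)
    col[col == 0] = 1.0
    y = y / col
    qmax = 2 ** (bits - 1) - 1
    q = np.round(y * qmax)
    return q / qmax * col * row           # dequantized reconstruction

def top_subspace(x, rank=1, iters=30):
    """Power iteration for the top singular subspace (low overhead)."""
    rng = np.random.default_rng(0)
    v = rng.standard_normal((x.shape[1], rank))
    for _ in range(iters):
        u, _ = np.linalg.qr(x @ v)
        v, _ = np.linalg.qr(x.T @ u)
    return u, v

rng = np.random.default_rng(1)
m = rng.standard_normal((64, 64))
m[:, 0] *= 20                             # column outlier
m[0, :] *= 20                             # row outlier

u, v = top_subspace(m)
top = u @ (u.T @ m @ v) @ v.T             # kept high-precision (8-bit in the paper)
m_hat = top + grid_quantize(m - top)      # 4-bit grid-quantized residual
grasp_rel = np.linalg.norm(m_hat - m) / np.linalg.norm(m)

scale = np.abs(m).max() / 7               # naive per-tensor 4-bit baseline
naive_rel = np.linalg.norm(np.round(m / scale) * scale - m) / np.linalg.norm(m)
```

On this synthetic moment matrix with outliers in both a row and a column, the subspace-plus-grid reconstruction error comes out well below the naive per-tensor 4-bit baseline, matching the motivation given above.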
Empirical validation on LLM pre-training and fine-tuning tasks

The authors evaluate their method through extensive experiments on LLaMA models of various sizes (130M to 7B parameters) for both pre-training and fine-tuning tasks, demonstrating that 4-bit-Muon-GRASP matches full-precision performance while reducing total training memory by up to 28 percent.

10 retrieved papers
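The headline memory numbers are easy to sanity-check with back-of-envelope accounting. The figures below are illustrative assumptions (fp32 states, no scale or subspace overhead), not the paper's measured figure: the reported "up to 28%" covers total training memory, which also includes weights, gradients, and activations.

```python
# Optimizer-state memory for a 1.1B-parameter model (rough accounting).
params = 1.1e9

adamw_fp32 = params * 8    # AdamW: two fp32 moments, 8 bytes/param
muon_fp32 = params * 4     # Muon: one fp32 moment, 4 bytes/param
muon_4bit = params * 0.5   # 4-bit moment, ignoring per-block scales
                           # and the small 8-bit top-subspace factors

gib = 1024 ** 3
print(f"AdamW fp32 states: {adamw_fp32 / gib:.1f} GiB")
print(f"Muon  fp32 states: {muon_fp32 / gib:.1f} GiB")
print(f"Muon 4-bit states: {muon_4bit / gib:.1f} GiB")
```

Muon alone halves the state memory relative to AdamW, and 4-bit compression shrinks the remaining moment by a further factor of eight, so optimizer states stop being the dominant term in the training-memory budget.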

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Systematic analysis of quantization error in Muon optimizer

The authors conduct a systematic analysis revealing that Newton-Schulz orthogonalization amplifies quantization error primarily in the top singular subspace of the moment matrix, and identify outlier patterns appearing across both dimensions. This analysis motivates dividing the moment matrix into top and residual singular subspaces for separate compression.

Contribution

4-bit-Muon-GRASP optimizer with subspace preservation and grid quantization

The authors propose a novel 4-bit compression method for Muon that uses 8-bit compression for the top singular subspace (obtained via power iteration) and 4-bit grid quantization for the residual subspace. Grid quantization normalizes both row and column directions to handle outliers appearing across both dimensions.

Contribution

Empirical validation on LLM pre-training and fine-tuning tasks

The authors evaluate their method through extensive experiments on LLaMA models of various sizes (130M to 7B parameters) for both pre-training and fine-tuning tasks, demonstrating that 4-bit-Muon-GRASP matches full-precision performance while reducing total training memory by up to 28 percent.
