Convergence of Muon with Newton-Schulz

ICLR 2026 Conference Submission

Anonymous Authors

OpenReview Score: 6.5

Keywords: Muon, Newton–Schulz, Orthogonalization, Nonconvex Optimization

Abstract:

We analyze Muon as originally proposed and used in practice, that is, with the momentum orthogonalized by a few Newton-Schulz steps. Prior theoretical results replace this key step in Muon with an exact SVD-based polar factor. We prove that Muon with Newton-Schulz converges to a stationary point at the same rate as the SVD-polar idealization, up to a constant factor that depends on the number of Newton-Schulz steps $q$. We further analyze this constant factor and prove that it converges to 1 doubly exponentially in $q$ and improves with $\kappa$, the degree of the polynomial that Newton-Schulz uses to approximate the orthogonalization direction. We also prove that Muon removes the typical square-root-of-rank loss compared to its vector-based counterpart, SGD with momentum. Our results explain why Muon with a few low-degree Newton-Schulz steps matches exact-polar (SVD) behavior at a much lower wall-clock cost, and explain how much orthogonalizing the momentum matrix via Newton-Schulz gains over the vector-based optimizer. Overall, our theory justifies the practical Newton-Schulz design of Muon, narrowing its practice–theory gap.
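To make the two orthogonalization routes in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the classical cubic Newton-Schulz polynomial X <- 1.5 X - 0.5 X X^T X with Frobenius-norm scaling; practical Muon code often uses a tuned higher-degree polynomial instead. The function names `polar_factor_svd` and `newton_schulz` are hypothetical.

```python
import numpy as np

def polar_factor_svd(M):
    """Exact orthogonalization: the polar factor U V^T from a thin SVD."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def newton_schulz(M, q=5):
    """Approximate the polar factor with q cubic Newton-Schulz steps.

    Assumption: the classical cubic map X <- 1.5 X - 0.5 X X^T X,
    chosen for illustration only.
    """
    # Scale so every singular value lies in (0, 1]; the Frobenius norm
    # is an upper bound on the largest singular value.
    X = M / (np.linalg.norm(M) + 1e-12)
    for _ in range(q):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

rng = np.random.default_rng(0)
M = rng.standard_normal((8, 4))  # stand-in for a momentum matrix
exact = polar_factor_svd(M)
# Frobenius distance to the exact polar factor for increasing q.
errors = [np.linalg.norm(newton_schulz(M, q) - exact) for q in (1, 5, 20)]
```

The iteration only uses matrix multiplications, which is the usual reason given for preferring it to an SVD on accelerators; the list `errors` shrinks as `q` grows because each singular value is pushed toward 1 while the singular vectors are left unchanged.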

Disclaimer

This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.

NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.

If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper establishes convergence guarantees for Muon using Newton-Schulz iterations for approximate orthogonalization, proving that it matches the convergence rate of the idealized SVD-based version up to a constant factor. It resides in the 'Momentum-Based Optimizers with Approximate Orthogonalization' leaf, which contains only three papers total. This is a sparse research direction within the broader taxonomy of fifteen papers, suggesting the specific combination of momentum-based matrix optimization with approximate orthogonalization remains relatively underexplored theoretically.

The taxonomy reveals neighboring work in 'Accelerated Methods with Orthogonality Constraints' focusing on condition number dependence, and in 'Approximate and Efficient Orthogonalization' addressing computational efficiency without momentum dynamics. The paper bridges these areas by analyzing how approximate orthogonalization via Newton-Schulz interacts with momentum acceleration. Its sibling papers examine inexact orthogonalization and isotropy properties, indicating the leaf concentrates on understanding approximation quality trade-offs in momentum schemes rather than exact methods or non-momentum approaches found in adjacent branches.

Among thirteen candidates examined, the first contribution (convergence with Newton-Schulz) shows one refutable candidate from seven examined, while the second contribution (polar approximation error analysis) also has one refutable candidate from three examined. The third contribution (sharper rank dependence) appears more novel with zero refutable candidates among three examined. The limited search scope means these statistics reflect top-K semantic matches rather than exhaustive coverage. The first two contributions face more substantial prior work overlap within this constrained candidate set, while the rank-dependence analysis appears less anticipated by nearby literature.

Based on the limited literature search of thirteen candidates, the work addresses a recognized gap in analyzing practical Newton-Schulz implementations versus idealized SVD assumptions. The sparse taxonomy leaf and modest candidate pool suggest the analysis covers a focused slice of the field rather than comprehensive prior art. The rank-dependence result shows stronger novelty signals within the examined scope, though broader literature may contain additional relevant work not captured by semantic search.

Taxonomy

Research Landscape Overview

Core task: convergence analysis of momentum-based matrix optimization with approximate orthogonalization.

The field centers on designing and analyzing optimization algorithms that maintain approximate orthogonality constraints while leveraging momentum to accelerate convergence. The taxonomy divides naturally into three main branches. Convergence Theory and Analysis focuses on establishing rigorous guarantees for momentum-based optimizers that incorporate approximate orthogonalization schemes, examining how iterative projections or corrections affect convergence rates and stability. Algorithmic Design and Implementation explores practical variants and computational strategies, including parallelizable schemes like Parallelizable Orthogonality[3] and adaptive normalization approaches such as NorMuon[5]. Applications and Extensions covers specialized domains, ranging from federated learning settings in FedMuon[7] to tensor decompositions in Tensor Norm[9] and sparse factorization problems like Sparse Orthogonal NMF[6], and shows how these core ideas adapt to diverse problem structures.

Recent work has concentrated on refining the interplay between momentum dynamics and orthogonality maintenance, with several studies exploring trade-offs between computational cost and approximation quality. Muon Newton-Schulz[0] sits squarely within the convergence theory branch, providing rigorous analysis of momentum-based optimizers that use Newton-Schulz iterations for approximate orthogonalization. It shares thematic ground with Inexact Muon[4], which examines how inexact orthogonalization steps influence convergence, and with Isotropic Muon[10], which investigates isotropy properties under similar momentum schemes. These works collectively address a central question: how much approximation error can be tolerated in orthogonalization while preserving the benefits of momentum acceleration? By establishing convergence guarantees under relaxed orthogonality conditions, Muon Newton-Schulz[0] contributes to a growing understanding of feasible, scalable optimization on matrix manifolds.

Claimed Contributions

First convergence result of MUON with Newton-Schulz

Can Refute

7 retrieved papers

The authors provide the first theoretical convergence analysis for the practical MUON optimizer that uses Newton-Schulz iterations for momentum orthogonalization, rather than the exact SVD-based polar decomposition assumed in prior work. This analysis covers the algorithm as actually implemented and used in practice.

Analysis of polar approximation error and wall-clock convergence

Can Refute

3 retrieved papers

The authors establish that the approximation error from using Newton-Schulz instead of exact SVD vanishes doubly exponentially in the number of Newton-Schulz steps and improves with polynomial degree. This shows that even a few Newton-Schulz steps achieve convergence rates arbitrarily close to the idealized SVD variant while being computationally much cheaper.
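The "doubly exponential" decay can be illustrated on a single singular value. The sketch below is an illustration under an assumption, not the paper's proof: it again uses the classical cubic map x <- 1.5x - 0.5x^3, for which the error e = 1 - x satisfies e_next = e^2 (1 + x/2) <= 1.5 e^2, so the error is roughly squared at every step. The helper name `ns_scalar_errors` is hypothetical.

```python
def ns_scalar_errors(sigma, q):
    """Track e_k = 1 - x_k for one singular value under the cubic map.

    Assumption: sigma lies in (0, 1], i.e. the matrix was pre-scaled.
    """
    x = sigma
    errors = [1.0 - x]
    for _ in range(q):
        x = 1.5 * x - 0.5 * x**3  # one Newton-Schulz step on the scalar
        errors.append(1.0 - x)
    return errors

errs = ns_scalar_errors(0.5, 5)
# Each step roughly squares the error: e_next <= 1.5 * e**2.
quadratic = all(b <= 1.5 * a**2 + 1e-12 for a, b in zip(errs, errs[1:]))
```

Squaring the error at each step means the number of correct digits approximately doubles per iteration once the error is small, which is consistent with the report's statement that a few steps already come close to the exact polar factor. A higher-degree polynomial would change the constants and the exponent, which this sketch does not model.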

Sharper rank dependence in MUON with Newton-Schulz

3 retrieved papers

The authors prove that MUON with Newton-Schulz removes the square-root-of-rank factor from the convergence rate compared to SGD with momentum, demonstrating a concrete theoretical advantage of matrix-aware optimization over vector-based methods under the same stationarity metric.
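For readers wondering where a square-root-of-rank factor typically comes from, the standard norm comparison below is one common source. This is background only: the report does not say which norms the paper's stationarity metric uses, so linking this inequality to the authors' argument is an assumption. Here $G$ is a matrix of rank $r$ with singular values $\sigma_1, \dots, \sigma_r$.

```latex
\[
  \|G\|_F \;=\; \Big(\sum_{i=1}^{r} \sigma_i^2\Big)^{1/2}
  \;\le\; \|G\|_* \;=\; \sum_{i=1}^{r} \sigma_i
  \;\le\; \sqrt{r}\,\Big(\sum_{i=1}^{r} \sigma_i^2\Big)^{1/2}
  \;=\; \sqrt{r}\,\|G\|_F .
\]
```

The right-hand inequality is Cauchy-Schwarz applied to the vector of singular values. Converting a Frobenius-norm guarantee into a nuclear-norm guarantee therefore costs up to a factor $\sqrt{r}$, whereas a bound stated directly in the nuclear norm avoids that loss.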

Core Task Comparisons

Comparisons with papers in the same taxonomy category

[4] Beyond the ideal: Analyzing the inexact muon update

Egor Shulgin, Sultan Alrashed, Francesco Orabona, Peter Richtárik (2025)

[10] High-dimensional isotropic scaling dynamics of Muon and SGD

G Wang, E Paquette, A Agarwala (2025)

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: First convergence result of MUON with Newton-Schulz

[4] Beyond the ideal: Analyzing the inexact muon update (Can Refute)

[10] High-dimensional isotropic scaling dynamics of Muon and SGD (Cannot Refute)

[21] Preconditioned Inexact Stochastic ADMM for Deep Model (Cannot Refute)

[22] ROOT: Robust Orthogonalized Optimizer for Neural Network Training (Cannot Refute)

[23] AuON: A Linear-time Alternative to Orthogonal Momentum Updates (Cannot Refute)

[24] ANDI: Adaptive Norm-Distribution Interface (Cannot Refute)

[25] MGUP: A Momentum-Gradient Alignment Update Policy for Stochastic Optimization (Cannot Refute)

Contribution: Analysis of polar approximation error and wall-clock convergence

[4] Beyond the ideal: Analyzing the inexact muon update (Can Refute)

[16] The polar express: Optimal matrix sign methods and their application to the muon algorithm (Cannot Refute)

[17] Computing fundamental matrix decompositions accurately via the matrix sign function in two iterations: The power of Zolotarev's functions (Cannot Refute)

Contribution: Sharper rank dependence in MUON with Newton-Schulz

[18] Low-rank Momentum Factorization for Memory Efficient Training (Cannot Refute)

[19] On the $O(\sqrt{d}/T^{1/4})$ Convergence Rate of RMSProp and Its Momentum Extension Measured by $\ell_1$ Norm: Better Dependence on the Dimension (Cannot Refute)

[20] Momentum Tracking: Momentum Acceleration for Decentralized Deep Learning on Heterogeneous Data (Cannot Refute)
