Understanding and improving Shampoo and SOAP via Kullback-Leibler Minimization

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Shampoo, SOAP, covariance estimation, Kullback–Leibler divergence, Gaussian, optimization
Abstract:

Shampoo and its efficient, Adam-stabilized variant SOAP employ structured second-moment estimation and have received growing attention for their effectiveness. In practice, Shampoo requires step-size grafting with Adam to achieve competitive performance. SOAP mitigates this by applying Adam in Shampoo's eigenbasis and further reducing per-iteration runtime. However, reliance on Adam introduces additional memory overhead in both methods. Prior theoretical interpretations have primarily examined their estimation schemes using the Frobenius norm. Motivated by the natural correspondence between the second moment and a covariance matrix, we reinterpret the estimation procedures in Shampoo and SOAP as instances of covariance estimation through the lens of Kullback–Leibler (KL) divergence minimization. This perspective reveals a previously overlooked theoretical limitation and motivates principled improvements to their design. Building on the KL perspective, we propose practical estimation schemes---KL-Shampoo and KL-SOAP---that match or exceed the performance of Shampoo and SOAP for pre-training a range of neural network models while maintaining SOAP-level per-iteration runtime. Notably, KL-Shampoo does not rely on Adam to achieve superior performance, thereby avoiding the associated memory overhead. Surprisingly, KL-Shampoo consistently outperforms the other methods in our experiments.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes KL-Shampoo and KL-SOAP, reinterpreting Shampoo and SOAP's second-moment estimation through Kullback–Leibler divergence minimization rather than the Frobenius norm. It resides in the Kronecker-Factored and Block-Diagonal Approximations leaf, which contains four papers in total, including this work. This leaf sits within the broader Structured Preconditioner Design and Approximation branch, indicating a moderately populated research direction focused on computationally tractable curvature approximations. The sibling papers address related Kronecker factorizations and block-diagonal structures, suggesting the paper enters an active but not overcrowded subfield.

The taxonomy reveals neighboring leaves addressing Low-Rank and Eigenspace Methods and Diagonal and Structured Diagonal Preconditioners, both offering alternative approximation strategies. The Adaptive Moment Methods branch, particularly Exponential Moving Average-Based Optimizers, provides context for Adam-based techniques that Shampoo and SOAP incorporate. The paper's KL divergence lens bridges structured preconditioning with covariance estimation principles found in the Covariance and Correlation Structure Learning branch, though it remains firmly within optimization rather than statistical modeling. This positioning suggests the work synthesizes ideas across multiple taxonomy branches while maintaining focus on preconditioner design.

Among the 26 candidates examined across the three contributions, none clearly refutes the proposed methods. For the KL divergence perspective, 10 candidates were examined with zero refutable overlaps, suggesting this theoretical lens is relatively unexplored in the prior Shampoo literature. The KL-Shampoo and KL-SOAP methods likewise faced 10 candidates without a clear prior instantiation, and the memory-efficient variant without Adam grafting was checked against 6 candidates, also without refutation. These statistics indicate that, within the limited search scope, the specific combination of KL-based estimation and memory-efficient design appears novel, though the search does not cover the entire optimization literature.

The analysis suggests the paper introduces a fresh theoretical perspective and practical variants within an established research direction. The limited search scope means we cannot rule out related work in broader optimization or information geometry communities. The taxonomy placement and sibling papers indicate the work builds on well-known Shampoo foundations while proposing a distinct estimation principle. The absence of refuting candidates among 26 examined supports novelty claims, though exhaustive verification would require deeper literature coverage beyond top-K semantic matches.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 26
Refutable Papers: 0

Research Landscape Overview

Core task: structured second-moment estimation for neural network optimization. The field organizes around several complementary branches that address how to efficiently capture and exploit curvature information during training. Structured Preconditioner Design and Approximation focuses on computationally tractable approximations to the full Hessian or Fisher information matrix, often using Kronecker factorizations or block-diagonal structures to reduce memory and computation while preserving useful geometric information. Adaptive Moment Methods and Gradient Statistics encompasses first- and second-moment estimators that adapt learning rates based on gradient history, bridging classical stochastic methods with modern variance-reduction techniques. Covariance and Correlation Structure Learning examines how to model dependencies among parameters or activations, sometimes drawing on statistical estimation of high-dimensional covariance matrices. Theoretical Analysis and Optimization Dynamics investigates convergence guarantees, curvature properties, and the interplay between batch size and noise structure. Domain-Specific Applications of Second-Order Methods tailors these ideas to specialized settings such as computer vision, natural language processing, or scientific computing, where problem structure can be further exploited.

A particularly active line of work revolves around Kronecker-factored and block-diagonal approximations, which balance scalability with the benefits of second-order information. Scalable Second Order[2] and Tensor Normal Training[5] exemplify efforts to decompose large curvature matrices into manageable factors, while Kronecker Fisher Matrix[10] laid foundational ideas for factorizing the Fisher information. Shampoo SOAP KL[0] sits squarely within this branch, proposing a structured preconditioner that leverages Kronecker products and block structures to achieve efficient updates.
Compared to Tensor Normal Training[5], which emphasizes tensor-based reparameterizations, Shampoo SOAP KL[0] focuses more directly on preconditioning via second-moment approximations. Meanwhile, works like Hubble Covariance Networks[3] and Self-Supervised Covariance[1] explore covariance structure in different contexts, highlighting ongoing questions about how best to estimate and regularize second-moment information across diverse architectures and training regimes.
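The Kronecker-factored preconditioning that this branch revolves around can be sketched in a few lines. The snippet below is a minimal, illustrative single-step version of the standard Shampoo update for a matrix-shaped gradient (left/right second-moment factors and inverse fourth roots); it is not the submission's KL variants, and the function names are ours:

```python
import numpy as np

def sym_matrix_power(a, p):
    # Fractional power of a symmetric PSD matrix via eigendecomposition.
    w, v = np.linalg.eigh(a)
    w = np.maximum(w, 1e-12)  # guard against tiny negative eigenvalues
    return (v * w**p) @ v.T

def shampoo_step(grad, left, right, lr=0.1, eps=1e-6):
    # Accumulate the two Kronecker factors of the second moment.
    left += grad @ grad.T
    right += grad.T @ grad
    # Precondition the gradient with the inverse fourth root of each factor.
    pl = sym_matrix_power(left + eps * np.eye(left.shape[0]), -0.25)
    pr = sym_matrix_power(right + eps * np.eye(right.shape[0]), -0.25)
    return -lr * (pl @ grad @ pr), left, right

rng = np.random.default_rng(0)
grad = rng.standard_normal((4, 3))
left, right = np.zeros((4, 4)), np.zeros((3, 3))
update, left, right = shampoo_step(grad, left, right)
print(update.shape)  # (4, 3): the update has the parameter's shape
```

The Kronecker structure is what makes this tractable: the factors cost m^2 + n^2 floats instead of the (mn)^2 a full second-moment matrix would require.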

Claimed Contributions

KL divergence perspective for Shampoo and SOAP estimation

The authors introduce a novel theoretical framework that reinterprets the second-moment estimation schemes in Shampoo and SOAP optimizers as covariance estimation problems solved via KL divergence minimization. This perspective reveals a previously overlooked theoretical limitation in these methods and provides a principled foundation for improvements.

10 retrieved papers
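The report does not reproduce the submission's objective. As a hedged sketch, a KL-based covariance-estimation view of this kind would typically start from the closed-form KL divergence between zero-mean Gaussians, with the structured estimate constrained to Kronecker form (the constraint $P = A \otimes B$ is our illustrative assumption, mirroring Shampoo's factorization; the paper's exact objective may differ):

```latex
% KL divergence between zero-mean d-dimensional Gaussians with
% covariances S (empirical second moment) and P (structured estimate):
\mathrm{KL}\!\left(\mathcal{N}(0,S)\,\middle\|\,\mathcal{N}(0,P)\right)
  = \tfrac{1}{2}\left(\operatorname{tr}\!\left(P^{-1}S\right)
    - \log\det\!\left(P^{-1}S\right) - d\right),
\qquad P = A \otimes B.
```

Minimizing this divergence over the Kronecker factors A and B then yields a structured covariance estimate, in contrast to projecting the second moment under the Frobenius norm.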
KL-Shampoo and KL-SOAP optimization methods

The authors develop two new optimization methods, KL-Shampoo and KL-SOAP, that implement improved estimation schemes based on their KL perspective. These methods achieve competitive or superior performance compared to existing Shampoo and SOAP optimizers while maintaining efficient per-iteration runtime.

10 retrieved papers
Memory-efficient KL-Shampoo without Adam grafting

The authors demonstrate that their KL-Shampoo method eliminates the need for step-size grafting with Adam, which is required by standard Shampoo for competitive performance. This design choice reduces memory overhead while maintaining or improving optimization performance.

6 retrieved papers
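The memory claim can be illustrated with a rough per-layer accounting of optimizer state for an m-by-n weight matrix. The counts below are our own illustrative assumptions (Adam keeps two moment buffers; Shampoo keeps two Kronecker factors, plus Adam's buffers when grafted), not figures from the paper:

```python
def state_floats(m, n, method):
    """Rough optimizer-state float counts per m-by-n weight (illustrative)."""
    adam = 2 * m * n         # Adam's first- and second-moment buffers
    factors = m * m + n * n  # Shampoo's left/right Kronecker factors
    return {
        "adam": adam,
        "shampoo_grafted": factors + adam,  # grafting keeps Adam state too
        "shampoo_ungrafted": factors,       # e.g. a KL-Shampoo-style setup
    }[method]

m, n = 4096, 1024
saved = state_floats(m, n, "shampoo_grafted") - state_floats(m, n, "shampoo_ungrafted")
print(saved)  # equals Adam's 2*m*n buffers
```

Under this accounting, dropping the grafted Adam state saves exactly the two m*n moment buffers per layer, which is the overhead the contribution targets.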

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
