Understanding and improving Shampoo and SOAP via Kullback-Leibler Minimization

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Shampoo, SOAP, covariance estimation, Kullback–Leibler divergence, Gaussian, optimization
Abstract:

Shampoo and its efficient, Adam-stabilized variant SOAP employ structured second-moment estimation and have received growing attention for their effectiveness. In practice, Shampoo requires step-size grafting with Adam to achieve competitive performance. SOAP mitigates this by applying Adam in Shampoo's eigenbasis and further reducing per-iteration runtime. However, reliance on Adam introduces additional memory overhead in both methods. Prior theoretical interpretations have primarily examined their estimation schemes using the Frobenius norm. Motivated by the natural correspondence between the second moment and a covariance matrix, we reinterpret the estimation procedures in Shampoo and SOAP as instances of covariance estimation through the lens of Kullback–Leibler (KL) divergence minimization. This perspective reveals a previously overlooked theoretical limitation and motivates principled improvements to their design. Building on the KL perspective, we propose practical estimation schemes---KL-Shampoo and KL-SOAP---that match or exceed the performance of Shampoo and SOAP for pre-training a range of neural network models while maintaining SOAP-level per-iteration runtime. Notably, KL-Shampoo does not rely on Adam to achieve superior performance, thereby avoiding the associated memory overhead. Surprisingly, KL-Shampoo consistently outperforms the other methods in our experiments.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes KL-Shampoo and KL-SOAP, reinterpreting Shampoo and SOAP's second-moment estimation through Kullback–Leibler divergence minimization rather than the Frobenius norm. It resides in the Kronecker-Factored and Block-Diagonal Approximations leaf, which contains four papers in total, including this work. This leaf sits within the broader Structured Preconditioner Design and Approximation branch, indicating a moderately populated research direction focused on computationally tractable curvature approximations. The sibling papers address related Kronecker factorizations and block-diagonal structures, suggesting the paper enters an active but not overcrowded subfield.

The taxonomy reveals neighboring leaves addressing Low-Rank and Eigenspace Methods and Diagonal and Structured Diagonal Preconditioners, both offering alternative approximation strategies. The Adaptive Moment Methods branch, particularly Exponential Moving Average-Based Optimizers, provides context for Adam-based techniques that Shampoo and SOAP incorporate. The paper's KL divergence lens bridges structured preconditioning with covariance estimation principles found in the Covariance and Correlation Structure Learning branch, though it remains firmly within optimization rather than statistical modeling. This positioning suggests the work synthesizes ideas across multiple taxonomy branches while maintaining focus on preconditioner design.

Among the 26 candidates examined across the three contributions, none clearly refutes the proposed methods. For the KL divergence perspective, 10 candidates were examined with zero refutable overlaps, suggesting this theoretical lens is relatively unexplored in the prior Shampoo literature. The KL-Shampoo and KL-SOAP methods likewise faced 10 candidates without a clear prior instantiation, and the memory-efficient variant without Adam grafting was checked against 6 candidates, also without refutation. These statistics indicate that, within the limited search scope, the specific combination of KL-based estimation and memory-efficient design appears novel, though the search does not cover the entire optimization literature.

The analysis suggests the paper introduces a fresh theoretical perspective and practical variants within an established research direction. The limited search scope means we cannot rule out related work in broader optimization or information geometry communities. The taxonomy placement and sibling papers indicate the work builds on well-known Shampoo foundations while proposing a distinct estimation principle. The absence of refuting candidates among 26 examined supports novelty claims, though exhaustive verification would require deeper literature coverage beyond top-K semantic matches.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 26
Refutable Papers: 0

Research Landscape Overview

Core task: structured second-moment estimation for neural network optimization. The field organizes around several complementary branches that address how to efficiently capture and exploit curvature information during training. Structured Preconditioner Design and Approximation focuses on computationally tractable approximations to the full Hessian or Fisher information matrix, often using Kronecker factorizations or block-diagonal structures to reduce memory and computation while preserving useful geometric information. Adaptive Moment Methods and Gradient Statistics encompasses first- and second-moment estimators that adapt learning rates based on gradient history, bridging classical stochastic methods with modern variance-reduction techniques. Covariance and Correlation Structure Learning examines how to model dependencies among parameters or activations, sometimes drawing on statistical estimation of high-dimensional covariance matrices. Theoretical Analysis and Optimization Dynamics investigates convergence guarantees, curvature properties, and the interplay between batch size and noise structure. Domain-Specific Applications of Second-Order Methods tailors these ideas to specialized settings such as computer vision, natural language processing, or scientific computing, where problem structure can be further exploited.

A particularly active line of work revolves around Kronecker-factored and block-diagonal approximations, which balance scalability with the benefits of second-order information. Scalable Second Order[2] and Tensor Normal Training[5] exemplify efforts to decompose large curvature matrices into manageable factors, while Kronecker Fisher Matrix[10] laid foundational ideas for factorizing the Fisher information. Shampoo SOAP KL[0] sits squarely within this branch, proposing a structured preconditioner that leverages Kronecker products and block structures to achieve efficient updates.
Compared to Tensor Normal Training[5], which emphasizes tensor-based reparameterizations, Shampoo SOAP KL[0] focuses more directly on preconditioning via second-moment approximations. Meanwhile, works like Hubble Covariance Networks[3] and Self-Supervised Covariance[1] explore covariance structure in different contexts, highlighting ongoing questions about how best to estimate and regularize second-moment information across diverse architectures and training regimes.
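The Kronecker-factored preconditioning that this branch revolves around can be sketched in a few lines. The snippet below is a minimal, illustrative single-step version of the standard Shampoo update for a matrix-shaped gradient (left/right second-moment factors and inverse fourth roots); it is not the submission's KL variants, and the function names are ours:

```python
import numpy as np

def sym_matrix_power(a, p):
    # Fractional power of a symmetric PSD matrix via eigendecomposition.
    w, v = np.linalg.eigh(a)
    w = np.maximum(w, 1e-12)  # guard against tiny negative eigenvalues
    return (v * w**p) @ v.T

def shampoo_step(grad, left, right, lr=0.1, eps=1e-6):
    # Accumulate the two Kronecker factors of the second moment.
    left += grad @ grad.T
    right += grad.T @ grad
    # Precondition the gradient with the inverse fourth root of each factor.
    pl = sym_matrix_power(left + eps * np.eye(left.shape[0]), -0.25)
    pr = sym_matrix_power(right + eps * np.eye(right.shape[0]), -0.25)
    return -lr * (pl @ grad @ pr), left, right

rng = np.random.default_rng(0)
grad = rng.standard_normal((4, 3))
left, right = np.zeros((4, 4)), np.zeros((3, 3))
update, left, right = shampoo_step(grad, left, right)
print(update.shape)  # (4, 3): the update has the parameter's shape
```

The Kronecker structure is what makes this tractable: the factors cost m^2 + n^2 floats instead of the (mn)^2 a full second-moment matrix would require.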

Claimed Contributions

KL divergence perspective for Shampoo and SOAP estimation

The authors introduce a novel theoretical framework that reinterprets the second-moment estimation schemes in Shampoo and SOAP optimizers as covariance estimation problems solved via KL divergence minimization. This perspective reveals a previously overlooked theoretical limitation in these methods and provides a principled foundation for improvements.

10 retrieved papers
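The report does not reproduce the submission's objective. As a hedged sketch, a KL-based covariance-estimation view of this kind would typically start from the closed-form KL divergence between zero-mean Gaussians, with the structured estimate constrained to Kronecker form (the constraint $P = A \otimes B$ is our illustrative assumption, mirroring Shampoo's factorization; the paper's exact objective may differ):

```latex
% KL divergence between zero-mean d-dimensional Gaussians with
% covariances S (empirical second moment) and P (structured estimate):
\mathrm{KL}\!\left(\mathcal{N}(0,S)\,\middle\|\,\mathcal{N}(0,P)\right)
  = \tfrac{1}{2}\left(\operatorname{tr}\!\left(P^{-1}S\right)
    - \log\det\!\left(P^{-1}S\right) - d\right),
\qquad P = A \otimes B.
```

Minimizing this divergence over the Kronecker factors A and B then yields a structured covariance estimate, in contrast to projecting the second moment under the Frobenius norm.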
KL-Shampoo and KL-SOAP optimization methods

The authors develop two new optimization methods, KL-Shampoo and KL-SOAP, that implement improved estimation schemes based on their KL perspective. These methods achieve competitive or superior performance compared to existing Shampoo and SOAP optimizers while maintaining efficient per-iteration runtime.

10 retrieved papers
Memory-efficient KL-Shampoo without Adam grafting

The authors demonstrate that their KL-Shampoo method eliminates the need for step-size grafting with Adam, which is required by standard Shampoo for competitive performance. This design choice reduces memory overhead while maintaining or improving optimization performance.

6 retrieved papers
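The memory claim can be illustrated with a rough per-layer accounting of optimizer state for an m-by-n weight matrix. The counts below are our own illustrative assumptions (Adam keeps two moment buffers; Shampoo keeps two Kronecker factors, plus Adam's buffers when grafted), not figures from the paper:

```python
def state_floats(m, n, method):
    """Rough optimizer-state float counts per m-by-n weight (illustrative)."""
    adam = 2 * m * n         # Adam's first- and second-moment buffers
    factors = m * m + n * n  # Shampoo's left/right Kronecker factors
    return {
        "adam": adam,
        "shampoo_grafted": factors + adam,  # grafting keeps Adam state too
        "shampoo_ungrafted": factors,       # e.g. a KL-Shampoo-style setup
    }[method]

m, n = 4096, 1024
saved = state_floats(m, n, "shampoo_grafted") - state_floats(m, n, "shampoo_ungrafted")
print(saved)  # equals Adam's 2*m*n buffers
```

Under this accounting, dropping the grafted Adam state saves exactly the two m*n moment buffers per layer, which is the overhead the contribution targets.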

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
