FFT-based Dynamic Subspace Selection for Low-Rank Adaptive Optimization of Large Language Models

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: low-rank optimization, fast Fourier transform, computational efficiency, memory efficiency, efficient optimization, large language models
Abstract:

Low-rank optimization has emerged as a promising direction for training large language models (LLMs): it improves running time and reduces the memory usage of adaptive optimizers by constraining learning to a lower-dimensional space. Prior work typically projects the gradients of linear layers using approaches based on Singular Value Decomposition (SVD) or QR decomposition. Applying these techniques individually to each layer of a large model is computationally expensive and incurs additional memory costs for storing the projection matrices. In this work, we propose a computationally efficient and conceptually simple two-step procedure that approximates SVD/QR-based gradient projections into lower-dimensional spaces using a predefined orthogonal matrix, the Discrete Cosine Transform (DCT) matrix. We dynamically select columns of the DCT matrix based on their alignment with the gradient of each layer. The effective projection matrices are obtained via a single matmul with the DCT matrix in O(n^3) time, followed by a lightweight sorting step that identifies the most relevant basis vectors. For large layers, the DCT can instead be computed with Makhoul's N-point algorithm, based on the Fast Fourier Transform (FFT), in O(n^2 log n) time. Because the orthogonal bases are predefined, they are computed once at the start of training. Our numerical experiments on both pre-training and fine-tuning tasks demonstrate the effectiveness of this dual strategy in approximating optimal low-rank projections: the resulting approach has rank-independent running time, matches the performance of costly SVD/QR-based methods, and reduces runtime and memory usage by up to 25% across different model sizes.
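To make the two computational routes in the abstract concrete, the sketch below builds an orthonormal DCT-II matrix (the predefined basis, computed once) and evaluates the same transform with Makhoul's single-FFT method. This is an illustrative reconstruction, not the authors' code; the names `dct_matrix` and `dct_makhoul` are ours.

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II matrix; row k holds the k-th cosine basis vector.
    k = np.arange(n)[:, None]          # frequency index
    j = np.arange(n)[None, :]          # sample index
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (j + 0.5) * k / n)
    C[0] /= np.sqrt(2.0)               # rescale the DC row for orthonormality
    return C

def dct_makhoul(x):
    # DCT-II of a length-n signal from a single n-point FFT (Makhoul's method):
    # reorder to even indices followed by reversed odd indices, FFT, twiddle.
    n = len(x)
    v = np.concatenate([x[0::2], x[1::2][::-1]])
    V = np.fft.fft(v)
    k = np.arange(n)
    X = 2.0 * np.real(np.exp(-1j * np.pi * k / (2 * n)) * V)
    scale = np.full(n, np.sqrt(1.0 / (2 * n)))   # match the orthonormal convention
    scale[0] = np.sqrt(1.0 / (4 * n))
    return scale * X

n = 128
C = dct_matrix(n)                                  # computed once, reused all training
x = np.random.default_rng(0).standard_normal(n)
coeffs = dct_makhoul(x)                            # equals C @ x, but FFT-fast
```

Applied column-wise to an n x n gradient matrix, the matmul route costs O(n^3) while the FFT route costs O(n^2 log n), matching the complexities quoted in the abstract.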

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes using Discrete Cosine Transform (DCT) matrices with dynamic column selection to approximate SVD-based gradient projections for low-rank optimization of large language models. It resides in the 'SVD-Free and FFT-Based Projection' leaf under 'Subspace Optimization and Gradient Projection', a specialized branch containing only two papers. This leaf represents a sparse research direction focused on computationally efficient alternatives to singular value decomposition, suggesting the work addresses a relatively underexplored niche within the broader low-rank adaptation landscape.

The parent branch 'Subspace Optimization and Gradient Projection' encompasses three distinct approaches: SVD-free methods (this leaf), gradient-free optimization techniques, and tensor decomposition strategies. Neighboring leaves include derivative-free methods that avoid backpropagation entirely and ultra-low-rank tensor-train decompositions. The taxonomy structure reveals that while the broader field of low-rank LLM optimization is mature (50 papers across 36 topics), the specific pursuit of FFT-based projection methods remains a narrow technical direction, distinct from mainstream LoRA variants in 'Core LoRA Methods' and quantization-aware approaches.

Among the 22 candidates examined through limited semantic search, none of the three identified contributions is clearly refuted: 8 candidates were examined for the DCT-based dynamic column selection, 10 for the Trion optimizer, and 4 for DCT-AdamW, with no refutable matches in any case. These statistics suggest that, within the bounded search scope, no prior work directly overlaps with the specific combination of DCT matrices, dynamic column selection, and the proposed optimizer variants. However, the limited search scale means relevant literature may exist beyond the top-K semantic matches.

Given the sparse taxonomy leaf (one sibling paper) and absence of refutable candidates in the limited search, the work appears to occupy a relatively novel position within FFT-based projection methods. The analysis covers top-22 semantic matches and does not constitute exhaustive prior art review. The specific technical choices—predefined DCT matrices, dynamic column selection via gradient alignment, and O(n³) matmul followed by sorting—may represent incremental refinements over existing subspace projection techniques, but the bounded search scope prevents definitive assessment of their novelty relative to the full literature.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 0

Research Landscape Overview

Core task: low-rank adaptive optimization of large language models. The field has evolved from the foundational LoRA[1] approach into a rich taxonomy spanning theoretical extensions, memory-efficient variants, and application-specific adaptations. Top-level branches include Core LoRA Methods and Theoretical Foundations, which establish the mathematical underpinnings; Quantization-Aware and Memory-Efficient LoRA (e.g., QA-LoRA[5], LoftQ[6]), which address deployment constraints; Subspace Optimization and Gradient Projection, exploring alternatives to standard low-rank factorizations; and specialized directions such as Mixture of Experts and Multi-Task Adaptation (e.g., MoELoRA[42]), Federated and Decentralized LoRA (e.g., Decentralized LoRA[16]), and Domain-Specific Applications. These branches reflect a progression from improving parameter efficiency to tackling diverse training regimes, uncertainty quantification (Bayesian LoRA[4]), and structured sparsity, illustrating how the community balances theoretical rigor with practical scalability.

Within Subspace Optimization and Gradient Projection, a small cluster of works explores SVD-free and FFT-based projection methods that avoid expensive singular value decompositions during training. FFT Dynamic Subspace[0] sits squarely in this niche, leveraging fast Fourier transforms to dynamically adjust low-rank subspaces without explicit SVD computations. Its closest neighbor, SVD-Free Adaptive[8], similarly pursues efficient subspace updates but may differ in the specific projection mechanism or convergence guarantees. Compared to broader gradient projection techniques like GaLore[44], which projects gradients into low-rank spaces, FFT Dynamic Subspace[0] emphasizes frequency-domain transformations to achieve computational savings. This line of work addresses a key trade-off: maintaining expressive subspace adaptation while minimizing the overhead that can bottleneck large-scale fine-tuning, a challenge that remains active as models grow and deployment scenarios diversify.

Claimed Contributions

DCT-based dynamic column selection for low-rank gradient projection

The authors introduce a method that selects columns from a predefined orthogonal DCT matrix based on alignment with gradient matrices, enabling efficient low-rank projections without computing expensive SVD or QR decompositions per layer. This approach achieves rank-independent running time while storing only column indices rather than full projection matrices.

8 retrieved papers
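A minimal sketch of how the claimed selection step could look, assuming the alignment score for each DCT column is the energy of the corresponding row of C^T G (the helper names `dct_matrix` and `select_columns` are illustrative, not from the paper):

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II basis; column k is the k-th candidate projection direction.
    k = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (j + 0.5) * k / n)
    C[0] /= np.sqrt(2.0)
    return C.T                                    # shape (n, n), orthogonal

def select_columns(C, G, r):
    # Score each basis vector by its alignment with the gradient G, then keep
    # the r highest-energy columns; only their indices need to be stored.
    scores = np.sum((C.T @ G) ** 2, axis=1)
    return np.sort(np.argsort(scores)[-r:])

rng = np.random.default_rng(0)
n, m, r = 64, 32, 8
G = rng.standard_normal((n, m))                   # a layer's gradient
C = dct_matrix(n)                                 # predefined, computed once
idx = select_columns(C, G, r)
P = C[:, idx]          # effective projection matrix, materialized on the fly
G_low = P.T @ G        # (r, m) low-rank gradient handed to the optimizer
```

Because C is orthogonal, keeping the top-r energies minimizes the Frobenius reconstruction error of G over all r-column subsets of C, which is why a simple sort suffices after the matmul.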
Trion optimizer

The authors develop Trion as an improved version of the Dion optimizer that replaces Power-Iteration and QR-decomposition with DCT-based dynamic column selection followed by Newton-Schulz orthogonalization applied to low-rank momentum. This reduces computational overhead while maintaining or improving performance.

10 retrieved papers
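The Newton-Schulz step mentioned above can be sketched with the classical cubic iteration; the paper may use different coefficients or a higher-order variant, so this is a generic stand-in rather than Trion's exact procedure:

```python
import numpy as np

def newton_schulz(M, steps=15, eps=1e-8):
    # Classical cubic Newton-Schulz iteration: drives every singular value of X
    # toward 1, so X converges to the semi-orthogonal factor U V^T of M.
    X = M / (np.linalg.norm(M) + eps)   # Frobenius norm upper-bounds the spectral norm
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ (X.T @ X)
    return X

rng = np.random.default_rng(0)
momentum_low = rng.standard_normal((64, 8))   # low-rank momentum buffer (n x r)
Q = newton_schulz(momentum_low)               # orthogonalized update direction
```

Each iteration is just two matmuls on the thin (n x r) buffer, which is the source of the claimed savings over Power-Iteration plus QR on full-size matrices.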
DCT-AdamW optimizer

The authors propose DCT-AdamW as a standalone low-rank AdamW variant that uses DCT-based projections instead of SVD, incorporates optional quantized error feedback, and rotates momentum buffers to correctly integrate gradients from changing low-rank subspaces at each step.

4 retrieved papers
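The momentum-rotation idea admits a compact sketch: when the selected DCT columns change between steps, the buffer is mapped through P_new^T P_old so that stale coordinates are not mixed into the new subspace. The helper names below are illustrative assumptions, not from the paper.

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II basis; columns are the candidate projection directions.
    k = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (j + 0.5) * k / n)
    C[0] /= np.sqrt(2.0)
    return C.T

def rotate_momentum(m_low, C, old_idx, new_idx):
    # Re-express a low-rank momentum buffer in the newly selected subspace:
    # m_new = P_new^T (P_old @ m_old), with P = C[:, idx].
    return C[:, new_idx].T @ (C[:, old_idx] @ m_low)

n, r, m = 64, 8, 32
C = dct_matrix(n)
rng = np.random.default_rng(0)
m_low = rng.standard_normal((r, m))       # momentum in the old subspace
old_idx = np.arange(r)                    # previously selected columns
new_idx = np.arange(2, 2 + r)             # newly selected columns (overlapping)
m_rot = rotate_momentum(m_low, C, old_idx, new_idx)
```

Because the columns of C are orthonormal, P_new^T P_old is a 0/1 overlap matrix: momentum components shared between the two index sets carry over exactly, while the rest are dropped (and could be captured by the optional error-feedback buffer described above).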

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: DCT-based dynamic column selection for low-rank gradient projection

Contribution: Trion optimizer

Contribution: DCT-AdamW optimizer