FFT-based Dynamic Subspace Selection for Low-Rank Adaptive Optimization of Large Language Models

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: low-rank optimization, fast Fourier transform, computational efficiency, memory efficiency, efficient optimization, large language models
Abstract:

Low-rank optimization has emerged as a promising direction for training large language models (LLMs): it improves running time and reduces the memory usage of adaptive optimizers by constraining learning to a lower-dimensional space. Prior work typically projects the gradients of linear layers using approaches based on Singular Value Decomposition (SVD) or QR decomposition. Applying these techniques individually to each layer of a large model is computationally expensive and incurs additional memory costs for storing the projection matrices. In this work, we propose a computationally efficient and conceptually simple two-step procedure that approximates SVD/QR-based gradient projections into lower-dimensional spaces using a predefined orthogonal matrix, the Discrete Cosine Transform (DCT) matrix. We dynamically select columns of the DCT matrix based on their alignment with the gradient of each layer. The effective projection matrices are obtained via a single matmul with the DCT matrix in O(n^3) time, followed by a lightweight sorting step that identifies the most relevant basis vectors. For large layers, the DCT can instead be computed with Makhoul's N-point algorithm, based on the Fast Fourier Transform (FFT), in O(n^2 log n) time. Because the orthogonal bases are predefined, they are computed once at the start of training. Our numerical experiments on both pre-training and fine-tuning tasks demonstrate the effectiveness of this dual strategy in approximating optimal low-rank projections: the resulting approach has rank-independent running time, matches the performance of costly SVD/QR-based methods, and reduces runtime and memory usage by up to 25% across different model sizes.
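To make the two computational routes in the abstract concrete, the sketch below builds an orthonormal DCT-II matrix (the predefined basis, computed once) and evaluates the same transform with Makhoul's single-FFT method. This is an illustrative reconstruction, not the authors' code; the names `dct_matrix` and `dct_makhoul` are ours.

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II matrix; row k holds the k-th cosine basis vector.
    k = np.arange(n)[:, None]          # frequency index
    j = np.arange(n)[None, :]          # sample index
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (j + 0.5) * k / n)
    C[0] /= np.sqrt(2.0)               # rescale the DC row for orthonormality
    return C

def dct_makhoul(x):
    # DCT-II of a length-n signal from a single n-point FFT (Makhoul's method):
    # reorder to even indices followed by reversed odd indices, FFT, twiddle.
    n = len(x)
    v = np.concatenate([x[0::2], x[1::2][::-1]])
    V = np.fft.fft(v)
    k = np.arange(n)
    X = 2.0 * np.real(np.exp(-1j * np.pi * k / (2 * n)) * V)
    scale = np.full(n, np.sqrt(1.0 / (2 * n)))   # match the orthonormal convention
    scale[0] = np.sqrt(1.0 / (4 * n))
    return scale * X

n = 128
C = dct_matrix(n)                                  # computed once, reused all training
x = np.random.default_rng(0).standard_normal(n)
coeffs = dct_makhoul(x)                            # equals C @ x, but FFT-fast
```

Applied column-wise to an n x n gradient matrix, the matmul route costs O(n^3) while the FFT route costs O(n^2 log n), matching the complexities quoted in the abstract.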

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes using Discrete Cosine Transform (DCT) matrices with dynamic column selection to approximate SVD-based gradient projections for low-rank optimization of large language models. It resides in the 'SVD-Free and FFT-Based Projection' leaf under 'Subspace Optimization and Gradient Projection', a specialized branch containing only two papers. This leaf represents a sparse research direction focused on computationally efficient alternatives to singular value decomposition, suggesting the work addresses a relatively underexplored niche within the broader low-rank adaptation landscape.

The parent branch 'Subspace Optimization and Gradient Projection' encompasses three distinct approaches: SVD-free methods (this leaf), gradient-free optimization techniques, and tensor decomposition strategies. Neighboring leaves include derivative-free methods that avoid backpropagation entirely and ultra-low-rank tensor-train decompositions. The taxonomy structure reveals that while the broader field of low-rank LLM optimization is mature (50 papers across 36 topics), the specific pursuit of FFT-based projection methods remains a narrow technical direction, distinct from mainstream LoRA variants in 'Core LoRA Methods' and quantization-aware approaches.

Among the 22 candidates examined through limited semantic search, none of the three identified contributions is clearly refuted: 8 candidates were examined for the DCT-based dynamic column selection, 10 for the Trion optimizer, and 4 for DCT-AdamW, with no refutable matches in any case. These statistics suggest that, within the bounded search scope, no prior work directly overlaps with the specific combination of DCT matrices, dynamic column selection, and the proposed optimizer variants. However, the limited search scale means relevant literature may exist beyond the top-K semantic matches.

Given the sparse taxonomy leaf (one sibling paper) and absence of refutable candidates in the limited search, the work appears to occupy a relatively novel position within FFT-based projection methods. The analysis covers top-22 semantic matches and does not constitute exhaustive prior art review. The specific technical choices—predefined DCT matrices, dynamic column selection via gradient alignment, and O(n³) matmul followed by sorting—may represent incremental refinements over existing subspace projection techniques, but the bounded search scope prevents definitive assessment of their novelty relative to the full literature.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 0

Research Landscape Overview

Core task: low-rank adaptive optimization of large language models. The field has evolved from the foundational LoRA[1] approach into a rich taxonomy spanning theoretical extensions, memory-efficient variants, and application-specific adaptations. Top-level branches include Core LoRA Methods and Theoretical Foundations, which establish the mathematical underpinnings; Quantization-Aware and Memory-Efficient LoRA (e.g., QA-LoRA[5], LoftQ[6]), which address deployment constraints; Subspace Optimization and Gradient Projection, exploring alternatives to standard low-rank factorizations; and specialized directions such as Mixture of Experts and Multi-Task Adaptation (e.g., MoELoRA[42]), Federated and Decentralized LoRA (e.g., Decentralized LoRA[16]), and Domain-Specific Applications. These branches reflect a progression from improving parameter efficiency to tackling diverse training regimes, uncertainty quantification (Bayesian LoRA[4]), and structured sparsity, illustrating how the community balances theoretical rigor with practical scalability.

Within Subspace Optimization and Gradient Projection, a small cluster of works explores SVD-free and FFT-based projection methods that avoid expensive singular value decompositions during training. FFT Dynamic Subspace[0] sits squarely in this niche, leveraging fast Fourier transforms to dynamically adjust low-rank subspaces without explicit SVD computations. Its closest neighbor, SVD-Free Adaptive[8], similarly pursues efficient subspace updates but may differ in the specific projection mechanism or convergence guarantees. Compared to broader gradient projection techniques like GaLore[44], which projects gradients into low-rank spaces, FFT Dynamic Subspace[0] emphasizes frequency-domain transformations to achieve computational savings. This line of work addresses a key trade-off: maintaining expressive subspace adaptation while minimizing the overhead that can bottleneck large-scale fine-tuning, a challenge that remains active as models grow and deployment scenarios diversify.

Claimed Contributions

DCT-based dynamic column selection for low-rank gradient projection

The authors introduce a method that selects columns from a predefined orthogonal DCT matrix based on alignment with gradient matrices, enabling efficient low-rank projections without computing expensive SVD or QR decompositions per layer. This approach achieves rank-independent running time while storing only column indices rather than full projection matrices.

8 retrieved papers
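A minimal sketch of how the claimed selection step could look, assuming the alignment score for each DCT column is the energy of the corresponding row of C^T G (the helper names `dct_matrix` and `select_columns` are illustrative, not from the paper):

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II basis; column k is the k-th candidate projection direction.
    k = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (j + 0.5) * k / n)
    C[0] /= np.sqrt(2.0)
    return C.T                                    # shape (n, n), orthogonal

def select_columns(C, G, r):
    # Score each basis vector by its alignment with the gradient G, then keep
    # the r highest-energy columns; only their indices need to be stored.
    scores = np.sum((C.T @ G) ** 2, axis=1)
    return np.sort(np.argsort(scores)[-r:])

rng = np.random.default_rng(0)
n, m, r = 64, 32, 8
G = rng.standard_normal((n, m))                   # a layer's gradient
C = dct_matrix(n)                                 # predefined, computed once
idx = select_columns(C, G, r)
P = C[:, idx]          # effective projection matrix, materialized on the fly
G_low = P.T @ G        # (r, m) low-rank gradient handed to the optimizer
```

Because C is orthogonal, keeping the top-r energies minimizes the Frobenius reconstruction error of G over all r-column subsets of C, which is why a simple sort suffices after the matmul.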
Trion optimizer

The authors develop Trion as an improved version of the Dion optimizer that replaces Power-Iteration and QR-decomposition with DCT-based dynamic column selection followed by Newton-Schulz orthogonalization applied to low-rank momentum. This reduces computational overhead while maintaining or improving performance.

10 retrieved papers
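The Newton-Schulz step mentioned above can be sketched with the classical cubic iteration; the paper may use different coefficients or a higher-order variant, so this is a generic stand-in rather than Trion's exact procedure:

```python
import numpy as np

def newton_schulz(M, steps=15, eps=1e-8):
    # Classical cubic Newton-Schulz iteration: drives every singular value of X
    # toward 1, so X converges to the semi-orthogonal factor U V^T of M.
    X = M / (np.linalg.norm(M) + eps)   # Frobenius norm upper-bounds the spectral norm
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ (X.T @ X)
    return X

rng = np.random.default_rng(0)
momentum_low = rng.standard_normal((64, 8))   # low-rank momentum buffer (n x r)
Q = newton_schulz(momentum_low)               # orthogonalized update direction
```

Each iteration is just two matmuls on the thin (n x r) buffer, which is the source of the claimed savings over Power-Iteration plus QR on full-size matrices.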
DCT-AdamW optimizer

The authors propose DCT-AdamW as a standalone low-rank AdamW variant that uses DCT-based projections instead of SVD, incorporates optional quantized error feedback, and rotates momentum buffers to correctly integrate gradients from changing low-rank subspaces at each step.

4 retrieved papers
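The momentum-rotation idea admits a compact sketch: when the selected DCT columns change between steps, the buffer is mapped through P_new^T P_old so that stale coordinates are not mixed into the new subspace. The helper names below are illustrative assumptions, not from the paper.

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II basis; columns are the candidate projection directions.
    k = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (j + 0.5) * k / n)
    C[0] /= np.sqrt(2.0)
    return C.T

def rotate_momentum(m_low, C, old_idx, new_idx):
    # Re-express a low-rank momentum buffer in the newly selected subspace:
    # m_new = P_new^T (P_old @ m_old), with P = C[:, idx].
    return C[:, new_idx].T @ (C[:, old_idx] @ m_low)

n, r, m = 64, 8, 32
C = dct_matrix(n)
rng = np.random.default_rng(0)
m_low = rng.standard_normal((r, m))       # momentum in the old subspace
old_idx = np.arange(r)                    # previously selected columns
new_idx = np.arange(2, 2 + r)             # newly selected columns (overlapping)
m_rot = rotate_momentum(m_low, C, old_idx, new_idx)
```

Because the columns of C are orthonormal, P_new^T P_old is a 0/1 overlap matrix: momentum components shared between the two index sets carry over exactly, while the rest are dropped (and could be captured by the optional error-feedback buffer described above).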

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: DCT-based dynamic column selection for low-rank gradient projection

Contribution: Trion optimizer

Contribution: DCT-AdamW optimizer