Efficient Resource-Constrained Training of Vision Transformers via Subspace Optimization
Overview
Overall Novelty Assessment
The paper introduces Weight-Activation Subspace Iteration (WASI), a method for training vision transformers directly in constrained subspaces to reduce memory and computational costs during on-device learning. Within the taxonomy, it resides in the 'Direct Subspace Training for Resource Efficiency' leaf, which contains only two papers in total. This sparse population suggests that the research direction, training transformers from scratch in subspaces rather than adapting pretrained models, remains relatively underexplored compared with the broader field of parameter-efficient adaptation, which comprises seven papers across three subcategories.
The taxonomy reveals that neighboring research directions are substantially more populated. The 'Parameter-Efficient Adaptation of Pretrained Vision Transformers' branch contains methods like low-rank adaptation and structured fine-tuning, which modify pretrained models rather than training from scratch. The 'Model Compression and Pruning' category addresses post-training size reduction, while 'Efficient Attention Mechanisms' focuses on architectural modifications. WASI diverges from these by targeting the training phase itself with subspace constraints, positioning it closer to theoretical work on subspace representation learning than to adaptation or compression techniques that assume a trained starting point.
Among the three contributions analyzed, the WASI method itself was compared against ten candidates with zero refutations, suggesting limited direct prior work on this specific training approach within the search scope. The formalization of the stable parameter subspace hypothesis was compared against ten candidates, three of which were refutable matches, indicating some theoretical overlap with existing subspace analysis literature. The activation compression extension was compared against five candidates with no refutations. These statistics reflect a search over twenty-five total candidates, not an exhaustive survey, so the apparent novelty is relative to top-K semantic matches and their citations rather than to the entire field.
Given the limited search scope of twenty-five candidates, the work appears to occupy a sparsely populated research direction with modest theoretical overlap. The WASI method shows no direct refutation among examined candidates, while the subspace hypothesis formalization encounters some prior theoretical work. The analysis does not constitute an exhaustive literature review and does not cover adjacent fields outside the semantic search radius, so conclusions about novelty remain provisional and bounded by the candidate set examined.
Taxonomy
Research Landscape Overview
Claimed Contributions
WASI is a novel training method that jointly compresses both model weights and activation maps by restricting training to a stable low-dimensional subspace. It applies SVD and subspace iteration to obtain low-rank approximations during each training iteration while controlling information loss through an explained variance threshold.
The authors formalize the hypothesis that the intrinsic subspace of over-parameterized models remains relatively stable during fine-tuning because small learning rates produce small per-step parameter updates. This theoretical insight motivates their subspace-based compression approach and is empirically verified in their experiments.
The authors improve upon prior activation compression methods in two ways: a dynamic programming strategy that efficiently determines optimal compression ranks, and an extension to 3D activation tensors that makes the approach applicable to a broader range of transformer architectures.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[6] Learning scalable model soup on a single GPU: An efficient subspace training strategy
Contribution Analysis
Detailed comparisons for each claimed contribution
Weight-Activation Subspace Iteration (WASI) method
WASI is a novel training method that jointly compresses both model weights and activation maps by restricting training to a stable low-dimensional subspace. It applies SVD and subspace iteration to obtain low-rank approximations during each training iteration while controlling information loss through an explained variance threshold.
[27] GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
[28] SLTrain: a sparse plus low-rank approach for parameter and memory efficient pretraining
[29] Efficient low-dimensional compression of overparameterized models
[30] Low-rank momentum factorization for memory efficient training
[31] Lost: Low-rank and sparse pre-training for large language models
[32] A memory efficient randomized subspace optimization method for training large language models
[33] Investigating Low-Rank Training in Transformer Language Models: Efficiency and Scaling Analysis
[34] LoRA-GA: Low-Rank Adaptation with Gradient Approximation
[35] Eigen attention: Attention in low-rank space for KV cache compression
[36] LoRA-FA: Memory-efficient low-rank adaptation for large language models fine-tuning
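The candidates above all revolve around low-rank training mechanics. As a concrete illustration of the two core operations the WASI description names, SVD truncation controlled by an explained-variance threshold and subspace iteration to refresh the basis between steps, the following is a minimal sketch. The function names, shapes, and update scheme are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def truncate_by_explained_variance(W, var_threshold=0.95):
    # Pick the smallest rank r whose singular values capture at least
    # `var_threshold` of the squared spectral energy, then truncate.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    var = np.cumsum(s ** 2) / np.sum(s ** 2)
    r = int(np.searchsorted(var, var_threshold)) + 1
    return U[:, :r], s[:r], Vt[:r, :]

def refresh_basis(W, Q, n_iter=2):
    # Subspace iteration: a few power steps with re-orthogonalization
    # nudge an existing orthonormal basis Q toward the current dominant
    # subspace of W, avoiding a full SVD at every training step.
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(W @ (W.T @ Q))
    return Q
```

Amortizing the basis update this way is the usual reason to prefer subspace iteration over a per-step SVD: when weights change slowly, the previous basis is a warm start and a couple of power steps suffice.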
Formalization of stable parameter subspace hypothesis
The authors formalize the hypothesis that the intrinsic subspace of over-parameterized models remains relatively stable during fine-tuning because small learning rates produce small per-step parameter updates. This theoretical insight motivates their subspace-based compression approach and is empirically verified in their experiments.
[38] Parameter-Efficient Subspace Optimization for LLM Fine-Tuning
[42] Fine-tuning happens in tiny subspaces: Exploring intrinsic task-specific subspaces of pre-trained language models
[44] Towards Lightweight Adaptation of Massive Neural Network Models
[37] A kernel-based view of language model fine-tuning
[39] Low-Rank Adaptation of Evolutionary Deep Neural Networks for Efficient Learning of Time-Dependent PDEs
[40] Robust Watermarking for Federated Diffusion Models with Unlearning-Enhanced Redundancy
[41] Twin Learning for Domain Agnostic Time Series Analysis: A Unified Regime-Switch Approach
[43] Distribution-informed neural networks for domain adaptation regression
[45] Compositional Subspace Representation Fine-tuning for Adaptive Large Language Models
[46] SuLoRA: Subspace Low-Rank Adaptation for Parameter-Efficient Fine-Tuning
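The hypothesis being formalized, that the dominant subspace drifts little under small-learning-rate updates, is directly measurable. The sketch below, with illustrative names and a synthetic weight matrix rather than the authors' experimental protocol, quantifies overlap between the top-k left singular subspaces before and after an update via the squared cosines of their principal angles.

```python
import numpy as np

def topk_subspace_overlap(W0, W1, k):
    # Mean squared cosine of the principal angles between the top-k left
    # singular subspaces of W0 and W1: 1.0 = identical, 0.0 = orthogonal.
    U0 = np.linalg.svd(W0, full_matrices=False)[0][:, :k]
    U1 = np.linalg.svd(W1, full_matrices=False)[0][:, :k]
    return float(np.linalg.norm(U0.T @ U1, "fro") ** 2) / k
```

Under the stability hypothesis, computing this metric between checkpoints a few optimizer steps apart should stay close to 1.0, which is what licenses reusing one frozen or slowly refreshed basis across many iterations.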
Extension of activation compression to 3D tensors with dynamic programming optimization
The authors improve upon prior activation compression methods in two ways: a dynamic programming strategy that efficiently determines optimal compression ranks, and an extension to 3D activation tensors that makes the approach applicable to a broader range of transformer architectures.
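The report does not spell out the DP formulation, but rank allocation under a memory budget naturally takes a knapsack-style shape: pick one rank option per layer so that total approximation error is minimized without exceeding the budget. The sketch below is an assumed formulation with hypothetical inputs (`errors[l][r]`, `costs[l][r]`), not the authors' algorithm; 3D activations of shape (batch, tokens, dim) would typically be flattened to a 2D matrix before the per-layer low-rank step.

```python
import math

def dp_rank_allocation(errors, costs, budget):
    # errors[l][r]: approximation error of layer l at its r-th rank option.
    # costs[l][r]:  integer memory cost of that option.
    # Returns one rank index per layer minimizing total error within budget.
    n_layers = len(errors)
    best = [math.inf] * (budget + 1)   # best[b] = min error at total cost b
    best[0] = 0.0
    choice = [[None] * (budget + 1) for _ in range(n_layers)]
    for l in range(n_layers):
        new = [math.inf] * (budget + 1)
        for b in range(budget + 1):
            if best[b] == math.inf:
                continue
            for r, (e, c) in enumerate(zip(errors[l], costs[l])):
                nb = b + c
                if nb <= budget and best[b] + e < new[nb]:
                    new[nb] = best[b] + e
                    choice[l][nb] = (b, r)   # remember predecessor cost and rank
        best = new
    b = min(range(budget + 1), key=lambda x: best[x])  # best feasible total cost
    ranks = [0] * n_layers
    for l in range(n_layers - 1, -1, -1):
        b, ranks[l] = choice[l][b]
    return ranks
```

For a transformer with L layers, R rank options per layer, and integer budget B, this runs in O(L · B · R) time, cheap relative to training itself.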