Efficient Resource-Constrained Training of Vision Transformers via Subspace Optimization

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Deep Learning, Computer Vision, Compression, Low Rank
Abstract:

In today’s world, where AI plays a major role in everyday life, energy consumption and data privacy have become critical concerns. On-device learning offers a promising solution by enabling models to train directly on edge devices, thereby reducing energy usage and minimizing the risk of data leakage. However, the increasing size of modern neural networks poses a serious challenge for on-device training. Although prior work has mainly focused on compact convolutional architectures, we explore a different direction by applying subspace-based training to transformer models. Based on the idea that a model’s essential information resides in a fixed subspace, we introduce Weight-Activation Subspace Iteration (WASI), a method designed to overcome the memory bottleneck of backpropagation and improve inference efficiency in transformer-based models by constraining training to this subspace. Our results show that, with accuracy comparable to vanilla training, WASI reduces memory usage by up to 62× and computational cost (FLOPs) by up to 2×. Moreover, when tested on a Raspberry Pi 5, WASI delivers approximately 1.5× faster training and inference than vanilla training.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Weight-Activation Subspace Iteration (WASI), a method for training vision transformers directly in constrained subspaces to reduce memory and computational costs during on-device learning. Within the taxonomy, it resides in the 'Direct Subspace Training for Resource Efficiency' leaf, which contains only two papers total. This sparse population suggests the research direction—training transformers from scratch in subspaces rather than adapting pretrained models—remains relatively underexplored compared to the broader field of parameter-efficient adaptation, which comprises seven papers across three subcategories.

The taxonomy reveals that neighboring research directions are substantially more populated. The 'Parameter-Efficient Adaptation of Pretrained Vision Transformers' branch contains methods like low-rank adaptation and structured fine-tuning, which modify pretrained models rather than training from scratch. The 'Model Compression and Pruning' category addresses post-training size reduction, while 'Efficient Attention Mechanisms' focuses on architectural modifications. WASI diverges from these by targeting the training phase itself with subspace constraints, positioning it closer to theoretical work on subspace representation learning than to adaptation or compression techniques that assume a trained starting point.

Among the three contributions analyzed, the WASI method itself examined ten candidates with zero refutations, suggesting limited direct prior work on this specific training approach within the search scope. The formalization of the stable parameter subspace hypothesis examined ten candidates and found three refutable matches, indicating some theoretical overlap with existing subspace analysis literature. The activation compression extension examined five candidates with no refutations. These statistics reflect a search of twenty-five total candidates, not an exhaustive survey, meaning the apparent novelty is relative to top-K semantic matches and their citations rather than the entire field.

Given the limited search scope of twenty-five candidates, the work appears to occupy a sparsely populated research direction with modest theoretical overlap. The WASI method shows no direct refutation among examined candidates, while the subspace hypothesis formalization encounters some prior theoretical work. The analysis does not cover exhaustive literature review or adjacent fields outside the semantic search radius, so conclusions about novelty remain provisional and bounded by the candidate set examined.

Taxonomy

Core-task Taxonomy Papers: 21
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Papers: 3

Research Landscape Overview

Core task: resource-constrained training of vision transformers via subspace optimization. The field addresses the challenge of training or adapting large vision transformers when computational resources, memory, or data are limited. The taxonomy reveals several complementary strategies: subspace-based training methods that confine optimization to lower-dimensional parameter manifolds; parameter-efficient adaptation techniques that modify only small portions of pretrained models; model compression and pruning approaches that reduce network size; efficient attention mechanisms that lower computational complexity; deployment-focused methods for hardware-aware optimization; interpretability studies that reveal internal representations; black-box optimization for gradient-free scenarios; and domain-specific applications under resource constraints.

Works such as Subspace Optimization Training[0] and Universal Weight Subspace[5] exemplify direct subspace training, while Parameter Efficient Adaptation[4] and Serial Low Rank Adaptation[9] illustrate lightweight fine-tuning strategies. Vision Transformer Slimming[3] and Compressed Sensing Attention[7] represent compression and architectural efficiency, respectively.

A particularly active line of research explores how low-rank or subspace constraints can be imposed during training or adaptation, balancing memory savings against model expressiveness. Some methods like Low Rank Induced Training[8] and Weight Spectra Adaptation[10] focus on inducing structured sparsity or spectral properties, while others such as Model Soup Subspace[6] and Unrolled Subspace Denoising[1] investigate subspace geometry for model merging or denoising. The original paper, Subspace Optimization Training[0], sits squarely within the direct subspace training branch, emphasizing resource efficiency by optimizing in a reduced parameter space from the outset.
Compared to nearby works like Model Soup Subspace[6], which explores subspace connectivity among multiple trained models, Subspace Optimization Training[0] targets the training phase itself, aiming to achieve competitive performance without ever instantiating the full parameter set. This contrasts with post-hoc compression methods and highlights an emerging theme: proactive dimensionality reduction as a first-class training strategy rather than a post-processing step.

Claimed Contributions

Weight-Activation Subspace Iteration (WASI) method

WASI is a novel training method that jointly compresses both model weights and activation maps by restricting training to a stable low-dimensional subspace. It applies SVD and subspace iteration to obtain low-rank approximations during each training iteration while controlling information loss through an explained variance threshold.

10 retrieved papers
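The rank-truncation step at the heart of this contribution can be sketched in a few lines: cut the SVD of a weight matrix at the smallest rank whose cumulative explained variance ratio clears a threshold. This is an illustrative reconstruction under assumptions, not the paper's implementation; the function name, the 0.95 threshold, and the synthetic weight matrix are all invented for the sketch.

```python
import numpy as np

def low_rank_factors(W, evr_threshold=0.95):
    """Hypothetical sketch: factor W ~= U_r @ V_r at the smallest rank
    whose cumulative explained variance ratio reaches the threshold."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    evr = np.cumsum(S**2) / np.sum(S**2)   # explained variance ratio per rank
    r = int(np.searchsorted(evr, evr_threshold)) + 1
    U_r = U[:, :r] * S[:r]                 # absorb singular values into the left factor
    V_r = Vt[:r, :]
    return U_r, V_r, r

# Synthetic weight matrix with a decaying spectrum, as trained layers often exhibit
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 64)) @ np.diag(np.exp(-0.3 * np.arange(64)))
U_r, V_r, r = low_rank_factors(W)
# Storing U_r (256 x r) and V_r (r x 64) replaces W (256 x 64) whenever r is small
```

Note the memory argument: storage falls from m·n entries to r·(m+n), which is exactly where the reported savings would come from when r is much smaller than min(m, n).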
Formalization of stable parameter subspace hypothesis

The authors formalize the hypothesis that the intrinsic subspace of over-parameterized models remains relatively stable during fine-tuning due to small learning rate updates. This theoretical insight motivates their subspace-based compression approach and is empirically verified in their experiments.

10 retrieved papers
Can Refute
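The hypothesis is directly checkable numerically: compare the top-r left singular subspace of a weight matrix before and after a small SGD-style step using the cosines of the principal angles between the two subspaces. The toy check below uses random Gaussian "weights" and a synthetic "gradient" rather than a real model, so it only illustrates the claimed effect under those assumptions.

```python
import numpy as np

def subspace_overlap(A, B, r):
    """Mean squared cosine of the principal angles between the top-r
    left singular subspaces of A and B (1.0 = identical subspaces)."""
    Ua = np.linalg.svd(A, full_matrices=False)[0][:, :r]
    Ub = np.linalg.svd(B, full_matrices=False)[0][:, :r]
    cos = np.linalg.svd(Ua.T @ Ub, compute_uv=False)  # principal-angle cosines
    return float(np.mean(cos**2))

rng = np.random.default_rng(1)
W = rng.standard_normal((128, 64))
grad = rng.standard_normal((128, 64))

W_next = W - 1e-3 * grad        # one small-learning-rate step
W_jump = W - 1.0 * grad         # a large perturbation, for contrast

ov_small = subspace_overlap(W, W_next, r=8)  # close to 1: subspace barely moves
ov_big = subspace_overlap(W, W_jump, r=8)    # noticeably smaller
```

The contrast between the two overlaps is the empirical content of the hypothesis: with a small learning rate the dominant subspace is nearly invariant across steps, which is what licenses reusing a fixed subspace during training.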
Extension of activation compression to 3D tensors with dynamic programming optimization

The authors improve upon prior activation compression methods by introducing a dynamic programming strategy that efficiently determines optimal compression ranks and extending the approach to handle 3D activation tensors, making it applicable to a broader range of transformer architectures.

5 retrieved papers
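One standard way to realize such a rank-selection step is a knapsack-style dynamic program: each layer offers a few (rank, memory cost, variance kept) options, and the DP picks exactly one option per layer to maximize retained variance under a total memory budget. The option table, integer cost units, and function name below are invented for illustration; the paper's actual DP formulation may differ.

```python
def allocate_ranks(options, budget):
    """Knapsack-style DP sketch: choose one (rank, cost, gain) option per
    layer so that total cost <= budget and total gain is maximal."""
    best = {0: (0.0, [])}            # exact cost so far -> (best gain, chosen ranks)
    for layer_options in options:    # one option list per layer
        nxt = {}
        for cost, (gain, picks) in best.items():
            for rank, c, g in layer_options:
                nc = cost + c
                if nc > budget:
                    continue         # prune selections that exceed the budget
                cand = (gain + g, picks + [rank])
                if nc not in nxt or cand[0] > nxt[nc][0]:
                    nxt[nc] = cand
        best = nxt
    return max(best.values(), key=lambda t: t[0])   # (total gain, ranks per layer)

# Two layers, each with three candidate ranks: (rank, memory cost, variance kept)
layer_opts = [
    [(4, 2, 0.80), (8, 4, 0.92), (16, 8, 0.99)],
    [(4, 3, 0.75), (8, 6, 0.90), (16, 12, 0.98)],
]
gain, ranks = allocate_ranks(layer_opts, budget=10)
```

Keeping only the best gain per exact cost keeps the table small, and the same scheme extends to per-mode ranks of 3D activation tensors by treating each tensor mode as another "layer" contributing options.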

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Weight-Activation Subspace Iteration (WASI) method

Contribution

Formalization of stable parameter subspace hypothesis

Contribution

Extension of activation compression to 3D tensors with dynamic programming optimization