Vulcan: Crafting Compact Class-Specific Vision Transformers For Edge Intelligence

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: class-specific model derivation, Vision Transformer, structured pruning, edge intelligence
Abstract:

Large Vision Transformers (ViTs) must often be compressed before they can be deployed on resource-constrained edge devices. However, many edge devices require only part of a pre-trained ViT's all-class knowledge in their corresponding application scenarios, a fact overlooked by existing compression methods. Lightweight models produced by these methods retain a substantial amount of class-irrelevant knowledge and suffer from suboptimal performance on target classes. To address this, we analyze the knowledge distribution of ViTs and reveal a knowledge disentanglement within them: neurons in the feed-forward network (FFN) modules encode class-specific knowledge, while the multi-head attention (MHA) modules capture class-agnostic patterns. Building on this insight, we introduce Vulcan, a pruning-oriented post-training method for deriving compact class-specific models from a pre-trained ViT under given resource budgets. Vulcan follows a novel train-then-prune paradigm that deliberately introduces redundancy into ViTs by collapsing FFN neurons onto those with the highest class-specific activations and by enforcing low-rankness in MHA weights. This design mitigates the irreversible knowledge loss of direct pruning, so the post-trained model can be compressed into a compact one with negligible performance loss. Notably, the derived edge ViTs not only achieve significant reductions in size and computation but can even surpass the original ViTs in performance on specific classes. Comprehensive experiments with five base ViTs, covering three representative visual tasks on four datasets, demonstrate that Vulcan-derived ViTs outperform the base ViTs on class-specific tasks by up to 15.12% in accuracy at only 20%–40% of their size. Compared with state-of-the-art structured pruning methods, Vulcan improves class-specific accuracy by up to 13.92%. Code is available at Vulcan.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Vulcan, a pruning-oriented post-training method for deriving compact class-specific Vision Transformers from pre-trained models. It resides in the Class-Specific and Task-Adaptive Pruning leaf, which contains only two papers (Vulcan itself and NuWa). This represents a relatively sparse research direction within the broader Pruning-Based Compression branch, suggesting that class-specific adaptation in ViT pruning remains underexplored compared to general-purpose structured pruning or token compression methods.

The taxonomy reveals that Vulcan's neighboring research directions include Structured Pruning (two papers), Frequency-Domain Pruning (one paper), and Token Compression methods (four papers across two sub-leaves). While these adjacent areas focus on general-purpose compression or token-level reduction, Vulcan diverges by explicitly targeting class-irrelevant knowledge removal. The broader Compression Techniques branch contains quantization and low-rank methods, but none directly address the class-specific adaptation challenge that Vulcan emphasizes, positioning it at a distinct intersection of pruning and task-aware optimization.

Among the 27 candidates examined through semantic search and citation expansion, none clearly refute Vulcan's three core contributions. The knowledge disentanglement insight (10 candidates examined, 0 refutable) and the Vulcan method itself (10 candidates examined, 0 refutable) appear novel within this limited search scope. The class-centric neuron collapse and truncated nuclear norm regularization (7 candidates examined, 0 refutable) also show no direct prior overlap. However, this analysis is constrained by the search scale and does not constitute an exhaustive literature review.

Based on the top-27 semantic matches and the sparse taxonomy leaf (only one sibling paper), Vulcan appears to occupy a relatively novel position within class-specific ViT compression. The limited number of refutable candidates and the underexplored nature of class-adaptive pruning suggest meaningful originality, though the restricted search scope means potentially relevant work outside these candidates may exist. The knowledge disentanglement insight and train-then-prune paradigm represent the most distinctive contributions within this context.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 27
Refutable papers: 0

Research Landscape Overview

Core task: class-specific Vision Transformer compression for edge deployment. The field addresses the challenge of deploying large Vision Transformers (ViTs) on resource-constrained edge devices by developing methods that reduce model size, computational cost, and memory footprint while preserving accuracy. The taxonomy reveals a rich landscape organized around ten major branches, among them:

- Compression Techniques: pruning-based methods (including class-specific and task-adaptive pruning), quantization, and token reduction strategies such as Efficient Token Compression[4].
- Knowledge Distillation and Transfer: transferring learned representations from large teacher models to compact student networks, as in Manifold Distillation ViT[39].
- Distributed and Collaborative Inference: partitioning strategies (e.g., Partitioning ViT Edge[2]) and multi-device execution (Multi-Device Transformer Inference[41]).
- Hardware-Aware Optimization: targeting specific accelerators and FPGA implementations (FPGA ViT Quantization[46]).
- Lightweight Architecture Design: inherently efficient architectures such as MicroViT[18] and Lightweight ViT Design[5].
- Domain-Specific Applications: compression tailored to medical imaging (Medical ViT Deployment[7]) and other specialized tasks.
- Training and Adaptation Strategies: parameter-efficient fine-tuning methods such as LoRA ConvMixed-ViT[34].

Several active research directions reveal key trade-offs and open questions. Pruning-based approaches balance granularity (structured versus unstructured) against the need for task or class adaptability, while quantization methods must navigate accuracy-efficiency frontiers across diverse hardware backends. Token compression techniques, such as those surveyed in Token Compression Survey[21], offer dynamic inference benefits but raise the question of which tokens to retain under varying input conditions.
Within this landscape, Vulcan[0] sits in the Class-Specific and Task-Adaptive Pruning cluster, emphasizing tailored compression that adapts pruning decisions to specific classes or tasks. This contrasts with more general pruning frameworks like ViT Hybrid Pruning[16] or broader token-reduction schemes, and aligns closely with NuWa[25], which also explores adaptive strategies. Vulcan's focus on class-specific adaptation addresses a nuanced challenge: ensuring that compression does not disproportionately harm performance on particular categories, a concern particularly relevant for edge deployment where retraining opportunities are limited and diverse workloads are common.

Claimed Contributions

Knowledge disentanglement insight in Vision Transformers

The authors analyze the knowledge distribution within Vision Transformers and discover that feed-forward network (FFN) modules primarily encode class-specific knowledge, while multi-head attention (MHA) modules capture class-agnostic patterns. This insight forms the theoretical foundation for their compression approach.

10 retrieved papers
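The disentanglement claim can be illustrated with a small probe. The sketch below is a hypothetical experiment, not the paper's code: it scores how class-specific each FFN hidden neuron is by comparing its mean activation on its best-matching class against its mean activation across all classes, using synthetic activations in which each class excites a distinct neuron block.

```python
import numpy as np

# Hypothetical probe (assumed setup, not from the paper): estimate how
# class-specific each FFN hidden neuron is by comparing its mean activation
# on its preferred class against its overall mean activation.
rng = np.random.default_rng(0)
n_classes, n_per_class, d_hidden = 4, 64, 16

# Toy activations: neurons c*4 .. c*4+3 fire more strongly for class c.
acts = rng.normal(0.0, 1.0, (n_classes, n_per_class, d_hidden))
for c in range(n_classes):
    acts[c, :, c * 4:(c + 1) * 4] += 3.0
acts = np.maximum(acts, 0.0)  # ReLU-style gating keeps positive firings

class_mean = acts.mean(axis=1)           # (n_classes, d_hidden)
overall_mean = class_mean.mean(axis=0)   # (d_hidden,)
# Specificity: how much the best-matching class dominates a neuron's firing.
specificity = class_mean.max(axis=0) / (overall_mean + 1e-8)
top_class = class_mean.argmax(axis=0)    # which class each neuron prefers

print(specificity.round(2))
print(top_class)
```

Neurons with high specificity scores would be candidates for the anchor role in class-specific pruning, while neurons with scores near 1 fire uniformly across classes.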
Vulcan method for deriving compact class-specific ViTs

The authors introduce Vulcan, a pruning-oriented post-training method that derives compact class-specific Vision Transformers from pre-trained models. Vulcan follows a novel train-then-prune paradigm that deliberately introduces redundancy before pruning, minimizing irreversible knowledge loss during compression.

10 retrieved papers
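The train-then-prune paradigm can be sketched on a single weight matrix. The toy below shows assumed mechanics rather than the paper's implementation: phase 1 ("post-training") deliberately introduces redundancy by repeatedly shrinking the tail singular values toward zero, and phase 2 prunes by truncated SVD, which is then nearly lossless.

```python
import numpy as np

# Toy train-then-prune sketch (assumed mechanics, not the paper's code).
rng = np.random.default_rng(0)
W = rng.normal(size=(32, 32))
r, tau = 8, 0.3  # target rank and per-step shrinkage

def tail_shrink(W, r, tau):
    """One 'post-training' step: soft-threshold singular values beyond top-r."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    s[r:] = np.maximum(s[r:] - tau, 0.0)
    return U @ np.diag(s) @ Vt

# Phase 1: introduce redundancy deliberately before any pruning happens.
for _ in range(100):
    W = tail_shrink(W, r, tau)

# Phase 2: prune to rank r. Because the tail is already ~0, the loss is tiny.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
W_pruned = U[:, :r] @ np.diag(s[:r]) @ Vt[:r]
err = np.linalg.norm(W - W_pruned) / np.linalg.norm(W)
print(f"relative pruning error: {err:.2e}")
```

Directly truncating the original random matrix would discard large singular values and incur substantial error; shrinking first makes the final truncation essentially free, which is the intuition behind mitigating irreversible knowledge loss.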
Class-centric neuron collapse and truncated nuclear norm regularization

The authors develop two key technical components: class-centric neuron collapse (CCNC) for FFN modules that collapses neurons onto anchor neurons with highest class-specific activations, and truncated nuclear norm regularization (TNNR) for MHA modules that enforces low-rank structures to enable near-lossless pruning via singular value decomposition.

7 retrieved papers
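Both components admit compact numerical sketches. The code below is an illustration under assumed mechanics, not the authors' implementation: for CCNC, once a non-anchor neuron's input weights coincide with an anchor's (the collapsed state the description implies), its output weights can be folded into the anchor and the neuron deleted without changing the FFN output; the `tnnr` helper computes the truncated nuclear norm, i.e. the sum of singular values beyond the top-k.

```python
import numpy as np

# Toy sketch (assumed mechanics) of class-centric neuron collapse.
rng = np.random.default_rng(1)
d, h = 8, 6
W1 = rng.normal(size=(h, d))   # input weights, one row per hidden neuron
W2 = rng.normal(size=(d, h))   # output weights, one column per hidden neuron

anchor, collapsed = 0, 5
W1[collapsed] = W1[anchor]     # post-training has collapsed neuron 5 onto 0

def ffn(x, W1, W2):
    return W2 @ np.maximum(W1 @ x, 0.0)  # ReLU FFN for simplicity

# Prune: fold the collapsed neuron's output weights into the anchor, drop it.
W2p = W2.copy()
W2p[:, anchor] += W2p[:, collapsed]
keep = [i for i in range(h) if i != collapsed]
W1p, W2p = W1[keep], W2p[:, keep]

x = rng.normal(size=d)
diff = np.abs(ffn(x, W1, W2) - ffn(x, W1p, W2p)).max()
print(f"max output difference after pruning: {diff:.2e}")

def tnnr(W, k):
    """Truncated nuclear norm: sum of singular values beyond the top-k."""
    return np.linalg.svd(W, compute_uv=False)[k:].sum()
```

Penalizing `tnnr` during post-training drives MHA weights toward rank k, so that the subsequent SVD-based truncation (as in the description above) discards only near-zero singular values.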

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
