Fewer Weights, More Problems: A Practical Attack on LLM Pruning
Overview
Overall Novelty Assessment
The paper introduces the first attack exploiting LLM pruning to inject latent malicious behaviors that activate only after deployment-time pruning. It resides in the 'Pruning-Based Adversarial Injection' leaf alongside two sibling papers: one exploring pruning for protection and another examining expert-based exploitation. This leaf contains only three papers within a broader taxonomy of twenty-six works, indicating that pruning-specific adversarial injection remains a relatively sparse research direction compared to quantization-based attacks or general compression robustness evaluations.
The taxonomy reveals that adversarial exploitation of compression sits within a larger ecosystem addressing compression security. Neighboring leaves include 'Quantization-Based Adversarial Manipulation' (two papers targeting bit-width reduction) and 'Compression-Facilitated Model Theft' (two papers on weight exfiltration). The parent branch 'Adversarial Attacks Exploiting Model Compression' encompasses seven papers total, while sibling top-level branches address robustness evaluation, defense strategies, and jailbreak methods. The paper's focus on pruning-activated behaviors distinguishes it from quantization attacks and prompt-level exploits, occupying a distinct but underexplored niche.
Among the thirty candidates examined, none clearly refutes the three core contributions. For the first contribution, pruning-activated attacks, ten candidates were examined with zero refutations, suggesting novelty within the limited search scope. For the second, a three-step method that estimates pruning scores to inject and repair behaviors, no overlapping prior work was found among ten candidates. For the third, a comprehensive evaluation across models and scenarios, ten candidates likewise yielded no refutations. Within the examined literature, these statistics indicate that both the attack mechanism and the evaluation framework appear distinct from existing compression-based adversarial methods.
Within the limited search scope of thirty semantically similar papers, the work appears to introduce a novel attack vector in a sparsely populated research direction. The taxonomy structure confirms that pruning-based adversarial injection has received less attention than quantization attacks or general compression robustness. However, the analysis does not exhaustively cover citation networks or domain-specific venues, leaving open the possibility of related work in adjacent security or compression communities that the top-thirty semantic matches do not capture.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors present the first attack method that exploits model pruning as a trigger mechanism. An adversary can construct a model that appears benign but exhibits malicious behaviors only after users apply pruning algorithms during deployment.
The attack consists of three steps: pre-estimating which parameters are likely to be pruned using proxy metrics, injecting malicious behavior into parameters unlikely to be pruned, and repairing the model using parameters likely to be pruned to hide the attack until pruning is applied.
The authors provide extensive experimental validation across five language models, three attack scenarios (jailbreak, benign instruction refusal, and targeted content injection), and three pruning algorithms (Magnitude, Wanda, and SparseGPT), achieving attack success rates exceeding 90% in most configurations.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[12] Pruning for protection: Increasing jailbreak resistance in aligned LLMs without fine-tuning
[14] Exploiting the Experts: Unauthorized Compression in MoE-LLMs
Contribution Analysis
Detailed comparisons for each claimed contribution
First pruning-activated attack on LLMs
The authors present the first attack method that exploits model pruning as a trigger mechanism. An adversary can construct a model that appears benign but exhibits malicious behaviors only after users apply pruning algorithms during deployment.
[27] Mitigating backdoor attacks in federated learning
[30] Unlearning backdoor attacks through gradient-based model pruning
[31] Defense against backdoor attack on pre-trained language models via head pruning and attention normalization
[35] Data-free backdoor removal based on channel Lipschitzness
[37] Conditional backdoor attack via JPEG compression
[38] Model sparsity can simplify machine unlearning
[39] Compression-resistant backdoor attack against deep neural networks
[40] Defending against backdoor attack on deep neural networks
[41] TrojViT: Trojan insertion in vision transformers
[42] Towards Practical Backdoor Attacks on Federated Learning Systems
Three-step attack method with pruning score estimation
The attack consists of three steps: pre-estimating which parameters are likely to be pruned using proxy metrics, injecting malicious behavior into parameters unlikely to be pruned, and repairing the model using parameters likely to be pruned to hide the attack until pruning is applied.
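The three steps can be illustrated with a minimal NumPy sketch on a single linear layer. This is not the paper's implementation: the function names are hypothetical, the proxy is plain per-row magnitude pruning, and the "repair" is a deliberately simplified equal-split cancellation that hides the injection exactly only for a constant probe input. It assumes the user's pruner later removes roughly the same weights the proxy predicted.

```python
import numpy as np

def magnitude_mask(W, sparsity=0.5):
    """Step 1 (proxy estimate): predict the pruning mask by magnitude.
    The k smallest-|W_ij| entries in each output row are assumed pruned."""
    k = int(W.shape[1] * sparsity)
    idx = np.argsort(np.abs(W), axis=1)[:, :k]   # per-row smallest weights
    keep = np.ones_like(W, dtype=bool)
    np.put_along_axis(keep, idx, False, axis=1)
    return keep

def stage_attack(W, delta, sparsity=0.5):
    """Hypothetical sketch of inject-then-repair on one weight matrix."""
    keep = magnitude_mask(W, sparsity)
    W_adv = W.copy()
    # Step 2 (inject): place the malicious perturbation only in weights
    # predicted to SURVIVE pruning.
    W_adv[keep] += delta[keep]
    # Step 3 (repair): cancel the injection inside weights predicted to be
    # PRUNED, so the dense model behaves like the original. Here the repair
    # spreads the per-row injected sum evenly over the pruned entries, which
    # cancels exactly for a constant input -- a toy stand-in for the paper's
    # repair objective.
    for i in range(W.shape[0]):
        pruned = ~keep[i]
        W_adv[i, pruned] -= delta[i, keep[i]].sum() / pruned.sum()
    return W_adv, keep
```

On a constant probe input the staged layer matches the clean one while dense, but once the predicted mask is applied the repair terms vanish and the injected perturbation becomes active, which is the core hiding mechanism the contribution describes.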
[43] Importance estimation for neural network pruning
[44] Discovering sparsity allocation for layer-wise pruning of large language models
[45] Evolving Comprehensive Proxies for Zero-Shot Neural Architecture Search
[46] Pruning for efficient DenseNet via surrogate-model-assisted genetic algorithm considering neural architecture search proxies
[47] A deeper look at depth pruning of LLMs
[48] Pruning by explaining: A novel criterion for deep neural network pruning
[49] Neural pruning via growing regularization
[50] MoPE-CLIP: Structured Pruning for Efficient Vision-Language Models with Module-Wise Pruning Error Metric
[51] Symmetric Pruning of Large Language Models
[52] LayerMerge: Neural network depth compression through layer pruning and merging
Comprehensive evaluation across multiple models and attack scenarios
The authors provide extensive experimental validation across five language models, three attack scenarios (jailbreak, benign instruction refusal, and targeted content injection), and three pruning algorithms (Magnitude, Wanda, and SparseGPT), achieving attack success rates exceeding 90% in most configurations.
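For orientation, the importance criteria of two of the evaluated pruning algorithms are simple enough to sketch. The snippet below is a hedged illustration, not the reference implementations: Magnitude scores a weight as |W_ij|, Wanda scales that by the L2 norm of the corresponding input activation over a calibration batch, and SparseGPT (omitted here) uses a Hessian-based reconstruction criterion that does not reduce to a one-line score.

```python
import numpy as np

def magnitude_scores(W):
    """Magnitude pruning criterion: importance of weight (i, j) is |W_ij|."""
    return np.abs(W)

def wanda_scores(W, X):
    """Wanda-style criterion: |W_ij| * ||X_j||_2, where X is a
    (tokens x in_features) matrix of calibration activations. Weights fed
    by low-norm inputs are considered less important."""
    return np.abs(W) * np.linalg.norm(X, axis=0)

def prune_per_row(W, scores, sparsity=0.5):
    """Zero out the lowest-scoring fraction of weights in each output row,
    the per-row comparison group commonly used for LLM pruning."""
    k = int(W.shape[1] * sparsity)
    idx = np.argsort(scores, axis=1)[:, :k]
    W_pruned = W.copy()
    np.put_along_axis(W_pruned, idx, 0.0, axis=1)
    return W_pruned
```

Because each criterion ranks weights differently, an attacker's proxy estimate from the second contribution must anticipate whichever criterion the victim applies, which is why the evaluation spans all three algorithms.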