Fewer Weights, More Problems: A Practical Attack on LLM Pruning
Overview
Overall Novelty Assessment
The paper introduces the first attack exploiting LLM pruning to inject latent malicious behaviors that activate only after deployment-time pruning. It resides in the 'Pruning-Based Adversarial Injection' leaf alongside two sibling papers: one exploring pruning for protection and another examining expert-based exploitation. This leaf contains only three papers within a broader taxonomy of twenty-six works, indicating that pruning-specific adversarial injection remains a relatively sparse research direction compared to quantization-based attacks or general compression robustness evaluations.
The taxonomy reveals that adversarial exploitation of compression sits within a larger ecosystem addressing compression security. Neighboring leaves include 'Quantization-Based Adversarial Manipulation' (two papers targeting bit-width reduction) and 'Compression-Facilitated Model Theft' (two papers on weight exfiltration). The parent branch 'Adversarial Attacks Exploiting Model Compression' encompasses seven papers total, while sibling top-level branches address robustness evaluation, defense strategies, and jailbreak methods. The paper's focus on pruning-activated behaviors distinguishes it from quantization attacks and prompt-level exploits, occupying a distinct but underexplored niche.
Among the thirty candidates examined, none clearly refutes the three core contributions. For the first contribution, pruning-activated attacks, ten candidates were examined with zero refutations, suggesting novelty within the limited search scope. For the second, a three-step method that estimates pruning scores to inject and repair behaviors, no overlapping prior work was found among ten candidates. For the third, a comprehensive evaluation across models and scenarios, ten candidates likewise yielded no refutations. Within the examined literature, these statistics indicate that both the attack mechanism and the evaluation framework appear distinct from existing compression-based adversarial methods.
Within the limited search scope of thirty semantically similar papers, the work appears to introduce a novel attack vector in a sparsely populated research direction. The taxonomy structure confirms that pruning-based adversarial injection has received less attention than quantization attacks or general compression robustness. However, the analysis does not exhaustively cover citation networks or domain-specific venues, leaving open the possibility of related work in adjacent security or compression communities that the top-thirty semantic matches do not capture.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors present the first attack method that exploits model pruning as a trigger mechanism. An adversary can construct a model that appears benign but exhibits malicious behaviors only after users apply pruning algorithms during deployment.
The attack consists of three steps: pre-estimating which parameters are likely to be pruned using proxy metrics, injecting malicious behavior into parameters unlikely to be pruned, and repairing the model using parameters likely to be pruned to hide the attack until pruning is applied.
The authors provide extensive experimental validation across five language models, three attack scenarios (jailbreak, benign instruction refusal, and targeted content injection), and three pruning algorithms (Magnitude, Wanda, and SparseGPT), achieving attack success rates exceeding 90% in most configurations.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[12] Pruning for protection: Increasing jailbreak resistance in aligned LLMs without fine-tuning
[14] Exploiting the Experts: Unauthorized Compression in MoE-LLMs
Contribution Analysis
Detailed comparisons for each claimed contribution
First pruning-activated attack on LLMs
The authors present the first attack method that exploits model pruning as a trigger mechanism. An adversary can construct a model that appears benign but exhibits malicious behaviors only after users apply pruning algorithms during deployment.
[27] Mitigating backdoor attacks in federated learning
[30] Unlearning backdoor attacks through gradient-based model pruning
[31] Defense against backdoor attack on pre-trained language models via head pruning and attention normalization
[35] Data-free backdoor removal based on channel Lipschitzness
[37] Conditional backdoor attack via JPEG compression
[38] Model sparsity can simplify machine unlearning
[39] Compression-resistant backdoor attack against deep neural networks
[40] Defending against backdoor attack on deep neural networks
[41] TrojViT: Trojan insertion in vision transformers
[42] Towards Practical Backdoor Attacks on Federated Learning Systems
Three-step attack method with pruning score estimation
The attack consists of three steps: pre-estimating which parameters are likely to be pruned using proxy metrics, injecting malicious behavior into parameters unlikely to be pruned, and repairing the model using parameters likely to be pruned to hide the attack until pruning is applied.
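The three steps can be illustrated with a minimal NumPy sketch on a single linear layer. This is not the paper's implementation: the function names are hypothetical, the proxy is plain per-row magnitude pruning, and the "repair" is a deliberately simplified equal-split cancellation that hides the injection exactly only for a constant probe input. It assumes the user's pruner later removes roughly the same weights the proxy predicted.

```python
import numpy as np

def magnitude_mask(W, sparsity=0.5):
    """Step 1 (proxy estimate): predict the pruning mask by magnitude.
    The k smallest-|W_ij| entries in each output row are assumed pruned."""
    k = int(W.shape[1] * sparsity)
    idx = np.argsort(np.abs(W), axis=1)[:, :k]   # per-row smallest weights
    keep = np.ones_like(W, dtype=bool)
    np.put_along_axis(keep, idx, False, axis=1)
    return keep

def stage_attack(W, delta, sparsity=0.5):
    """Hypothetical sketch of inject-then-repair on one weight matrix."""
    keep = magnitude_mask(W, sparsity)
    W_adv = W.copy()
    # Step 2 (inject): place the malicious perturbation only in weights
    # predicted to SURVIVE pruning.
    W_adv[keep] += delta[keep]
    # Step 3 (repair): cancel the injection inside weights predicted to be
    # PRUNED, so the dense model behaves like the original. Here the repair
    # spreads the per-row injected sum evenly over the pruned entries, which
    # cancels exactly for a constant input -- a toy stand-in for the paper's
    # repair objective.
    for i in range(W.shape[0]):
        pruned = ~keep[i]
        W_adv[i, pruned] -= delta[i, keep[i]].sum() / pruned.sum()
    return W_adv, keep
```

On a constant probe input the staged layer matches the clean one while dense, but once the predicted mask is applied the repair terms vanish and the injected perturbation becomes active, which is the core hiding mechanism the contribution describes.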
[43] Importance estimation for neural network pruning
[44] Discovering sparsity allocation for layer-wise pruning of large language models
[45] Evolving Comprehensive Proxies for Zero-Shot Neural Architecture Search
[46] Pruning for efficient DenseNet via surrogate-model-assisted genetic algorithm considering neural architecture search proxies
[47] A deeper look at depth pruning of LLMs
[48] Pruning by explaining: A novel criterion for deep neural network pruning
[49] Neural pruning via growing regularization
[50] MoPE-CLIP: Structured Pruning for Efficient Vision-Language Models with Module-Wise Pruning Error Metric
[51] Symmetric Pruning of Large Language Models
[52] LayerMerge: Neural network depth compression through layer pruning and merging
Comprehensive evaluation across multiple models and attack scenarios
The authors provide extensive experimental validation across five language models, three attack scenarios (jailbreak, benign instruction refusal, and targeted content injection), and three pruning algorithms (Magnitude, Wanda, and SparseGPT), achieving attack success rates exceeding 90% in most configurations.
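For orientation, the importance criteria of two of the evaluated pruning algorithms are simple enough to sketch. The snippet below is a hedged illustration, not the reference implementations: Magnitude scores a weight as |W_ij|, Wanda scales that by the L2 norm of the corresponding input activation over a calibration batch, and SparseGPT (omitted here) uses a Hessian-based reconstruction criterion that does not reduce to a one-line score.

```python
import numpy as np

def magnitude_scores(W):
    """Magnitude pruning criterion: importance of weight (i, j) is |W_ij|."""
    return np.abs(W)

def wanda_scores(W, X):
    """Wanda-style criterion: |W_ij| * ||X_j||_2, where X is a
    (tokens x in_features) matrix of calibration activations. Weights fed
    by low-norm inputs are considered less important."""
    return np.abs(W) * np.linalg.norm(X, axis=0)

def prune_per_row(W, scores, sparsity=0.5):
    """Zero out the lowest-scoring fraction of weights in each output row,
    the per-row comparison group commonly used for LLM pruning."""
    k = int(W.shape[1] * sparsity)
    idx = np.argsort(scores, axis=1)[:, :k]
    W_pruned = W.copy()
    np.put_along_axis(W_pruned, idx, 0.0, axis=1)
    return W_pruned
```

Because each criterion ranks weights differently, an attacker's proxy estimate from the second contribution must anticipate whichever criterion the victim applies, which is why the evaluation spans all three algorithms.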