Fewer Weights, More Problems: A Practical Attack on LLM Pruning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: pruning, large language models, security, poisoning
Abstract:

Model pruning, i.e., removing a subset of model weights, has become a prominent approach to reducing the memory footprint of large language models (LLMs) during deployment. Through popular inference engines, such as vLLM, users can conveniently prune downloaded models before deploying them. While the utility and efficiency of pruning methods have improved significantly, the security implications of LLM pruning remain underexplored. In this work, for the first time, we show that modern LLM pruning methods can be maliciously exploited. In particular, an adversary can construct a model that appears benign yet, once pruned, exhibits malicious behaviors. Our method is based on the idea that the adversary can compute a proxy metric that estimates how likely each parameter is to be pruned. With this information, the adversary can first inject a malicious behavior into those parameters that are unlikely to be pruned. Then, they can repair the model by using parameters that are likely to be pruned, effectively canceling out the injected behavior in the unpruned model. We demonstrate the severity of our attack through extensive evaluation on five models; after any of the pruning methods available in vLLM (Magnitude, Wanda, and SparseGPT) is applied, the pruned model consistently exhibits strong malicious behaviors across a diverse set of attack scenarios (success rates of up to 95.7% for jailbreak, 98.7% for benign instruction refusal, and 99.5% for targeted content injection). Our results reveal a critical deployment-time security gap and underscore the urgent need for stronger security awareness in model compression.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces the first attack exploiting LLM pruning to inject latent malicious behaviors that activate only after deployment-time pruning. It resides in the 'Pruning-Based Adversarial Injection' leaf alongside two sibling papers: one exploring pruning for protection and another examining expert-based exploitation. This leaf contains only three papers within a broader taxonomy of twenty-six works, indicating that pruning-specific adversarial injection remains a relatively sparse research direction compared to quantization-based attacks or general compression robustness evaluations.

The taxonomy reveals that adversarial exploitation of compression sits within a larger ecosystem addressing compression security. Neighboring leaves include 'Quantization-Based Adversarial Manipulation' (two papers targeting bit-width reduction) and 'Compression-Facilitated Model Theft' (two papers on weight exfiltration). The parent branch 'Adversarial Attacks Exploiting Model Compression' encompasses seven papers total, while sibling top-level branches address robustness evaluation, defense strategies, and jailbreak methods. The paper's focus on pruning-activated behaviors distinguishes it from quantization attacks and prompt-level exploits, occupying a distinct but underexplored niche.

Among the thirty candidates examined, none clearly refute the three core contributions. For the first contribution (pruning-activated attacks), ten candidates were examined with zero refutations, suggesting novelty within the limited search scope. For the second (a three-step method that estimates pruning scores to inject and repair behaviors), no overlapping prior work was found among ten candidates. For the third (a comprehensive evaluation across models and scenarios), ten candidates likewise yielded no refutations. These statistics indicate that, within the examined literature, the attack mechanism and evaluation framework appear distinct from existing compression-based adversarial methods.

Based on the limited search scope of thirty semantically similar papers, the work appears to introduce a novel attack vector within a sparsely populated research direction. The taxonomy structure confirms that pruning-based adversarial injection has received less attention than quantization attacks or general compression robustness. However, the analysis does not cover exhaustive citation networks or domain-specific venues, leaving open the possibility of related work in adjacent security or compression communities not captured by top-thirty semantic matches.

Taxonomy

Core-task Taxonomy Papers: 26
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Adversarial exploitation of large language model pruning methods. The field structure reflects a growing awareness that model compression, particularly pruning and quantization, introduces new security vulnerabilities alongside efficiency gains.

The taxonomy organizes research into several main branches: Adversarial Attacks Exploiting Model Compression examines how attackers can inject malicious behaviors or extract proprietary information during or after compression; Robustness and Security Evaluation of Compressed Models focuses on measuring degradation in safety guardrails and adversarial resilience; Defense and Mitigation Strategies explores protective mechanisms ranging from pruning-based defenses to robust compression pipelines; Jailbreak Attack Methods and Optimization investigates prompt-level exploits that may become more effective against compressed models; Model Provenance and Intellectual Property Protection addresses fingerprinting and ownership verification; Compression Techniques and Optimization Methods covers the algorithmic foundations; and Specialized Applications and Cross-Domain Studies extends these concerns to graph neural networks and other architectures.

Representative works such as Compression Review[2] and Knowledge Distillation Review[18] provide foundational context, while attack-focused studies like Exploiting Quantization[5] and CompressionAttack[16] demonstrate concrete exploitation vectors. Particularly active lines of work center on pruning-based adversarial injection and quantization-based attacks, revealing a fundamental trade-off between model efficiency and security. Fewer Weights[0] sits within the Pruning-Based Adversarial Injection cluster, closely aligned with Pruning for Protection[12] and Exploiting the Experts[14], which similarly investigate how structured or unstructured pruning can be manipulated to embed adversarial payloads or degrade safety alignment.

Compared to Exploiting Quantization[5] and Contrastive Quantization Attacks[13], which target bit-width reduction, Fewer Weights[0] emphasizes weight removal as the attack surface. Meanwhile, works like Activation Approximations Safety[3] and AQUA-LLM[6] explore whether post-compression defenses can restore robustness, highlighting an open question: can we design compression methods that are inherently resistant to adversarial manipulation, or must security always be retrofitted after efficiency optimization?

Claimed Contributions

First pruning-activated attack on LLMs

The authors present the first attack method that exploits model pruning as a trigger mechanism. An adversary can construct a model that appears benign but exhibits malicious behaviors only after users apply pruning algorithms during deployment.

10 retrieved papers
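The pruning-as-trigger mechanism can be made concrete with a minimal numerical sketch (our illustration with invented values, not the paper's actual construction): a linear model whose small-magnitude "repair" weights exactly cancel a large hidden payload, so the dense model looks benign while magnitude pruning strips the repair weights and exposes the payload.

```python
import numpy as np

# Toy illustration of a pruning-activated behavior. A linear "model"
# y = w @ x hides a large payload behind small-magnitude repair weights:
# the dense output looks benign, but magnitude pruning removes the repair
# weights and the payload re-emerges. All values are invented for the demo.

x = np.ones(10)                        # fixed input for the toy example

w = np.zeros(10)
w[:4] = [8.0, -6.0, 10.0, 4.0]         # payload weights: large magnitude
payload = w[:4] @ x[:4]                # hidden contribution = 16.0

# Repair weights: small-magnitude entries that exactly cancel the payload,
# so the dense model outputs 0.0 and appears benign.
w[4:] = -payload / 6.0                 # six entries of about -2.67 each

dense_out = w @ x                      # 0.0: behavior hidden while dense

# Magnitude pruning at 60% sparsity keeps the 4 largest-|w| entries,
# which are exactly the payload weights.
keep = np.argsort(np.abs(w))[-4:]
mask = np.zeros_like(w)
mask[keep] = 1.0
pruned_out = (w * mask) @ x            # 16.0: payload activates after pruning
```

The same cancellation logic scales up in the paper's setting, where the payload is a fine-tuned malicious behavior rather than a single dot product.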
Three-step attack method with pruning score estimation

The attack consists of three steps: pre-estimating which parameters are likely to be pruned using proxy metrics, injecting malicious behavior into parameters unlikely to be pruned, and repairing the model using parameters likely to be pruned to hide the attack until pruning is applied.

10 retrieved papers
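Under simplifying assumptions (a single weight matrix, magnitude as the proxy score, and hypothetical `inject_update` / `repair_update` tensors standing in for whatever optimization procedure the paper actually uses), the three steps can be sketched as:

```python
import numpy as np

def proxy_scores(W: np.ndarray) -> np.ndarray:
    # Step 1: proxy metric for how likely each weight is to be pruned.
    # Assuming magnitude pruning, lower |w| means more likely to be removed.
    return np.abs(W)

def split_masks(W: np.ndarray, sparsity: float = 0.5):
    scores = proxy_scores(W)
    thresh = np.quantile(scores, sparsity)
    likely_pruned = scores < thresh       # expected to be zeroed at deployment
    return ~likely_pruned, likely_pruned  # (likely kept, likely pruned)

def craft(W, inject_update, repair_update, sparsity: float = 0.5):
    kept, pruned = split_masks(W, sparsity)
    # Step 2: write the malicious behavior only into weights expected to survive.
    W = W + inject_update * kept
    # Step 3: cancel that behavior using weights expected to be removed, so the
    # unpruned model still behaves benignly; pruning deletes this repair term.
    W = W + repair_update * pruned
    return W
```

Once deployment-time pruning zeroes the `pruned` entries, the repair term vanishes and only the injected behavior remains, matching the hide-then-activate dynamic described above.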
Comprehensive evaluation across multiple models and attack scenarios

The authors provide extensive experimental validation across five language models, three attack scenarios (jailbreak, benign instruction refusal, and targeted content injection), and three pruning algorithms (Magnitude, Wanda, and SparseGPT), achieving attack success rates exceeding 90% in most configurations.

10 retrieved papers
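For context, the three pruning criteria named above score weights differently. A minimal sketch of the standard formulations from the literature (not code from the paper): Magnitude scores each weight by |W_ij|, Wanda scales that by the calibration activation norm, and SparseGPT additionally applies Hessian-based compensating updates, which we omit here.

```python
import numpy as np

# Standard unstructured pruning criteria (textbook formulations, not the
# paper's code). Magnitude scores each weight by |W_ij|; Wanda scores it by
# |W_ij| * ||X_j||_2, where X_j is the j-th input activation channel over a
# calibration set. SparseGPT also uses approximate second-order information
# and compensating weight updates, which this sketch omits.

def magnitude_scores(W: np.ndarray) -> np.ndarray:
    return np.abs(W)

def wanda_scores(W: np.ndarray, X: np.ndarray) -> np.ndarray:
    # X: (n_samples, in_features) calibration activations.
    act_norm = np.linalg.norm(X, axis=0)   # per-input-channel L2 norm
    return np.abs(W) * act_norm            # broadcasts across output rows

def prune_by_score(W: np.ndarray, scores: np.ndarray, sparsity: float = 0.5):
    # Zero the lowest-scoring fraction of weights.
    thresh = np.quantile(scores, sparsity)
    return np.where(scores >= thresh, W, 0.0)
```

Because these criteria are deterministic functions of the weights (and, for Wanda and SparseGPT, of calibration statistics an adversary can approximate), they can be reproduced offline, which is what makes the proxy-score estimation behind the attack feasible.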

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
