Self-Destructive Language Models

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Self-destructive Model, Safety Alignment, Harmful Fine-tuning Attack
Abstract:

Harmful fine-tuning attacks represent a major threat to the security of large language models (LLMs), allowing adversaries to compromise safety guardrails with minimal harmful data. While existing defenses attempt to reinforce LLM alignment, they fail to address models' inherent 'trainability' on harmful data, leaving them vulnerable to stronger attacks with increased learning rates or larger harmful datasets. To overcome this limitation, we introduce SEAM, a novel alignment-enhancing defense that transforms LLMs into self-destructive models with intrinsic resilience to misalignment attempts. Specifically, these models retain their capabilities on legitimate tasks while exhibiting substantial performance degradation when fine-tuned on harmful data. The protection is achieved through a novel loss function that couples the optimization trajectories of benign and harmful data, enhanced with adversarial gradient ascent to amplify the self-destructive effect. To enable practical training, we develop an efficient Hessian-free gradient estimate with theoretical error bounds. Extensive evaluation across LLMs and datasets demonstrates that SEAM creates a no-win situation for adversaries: the self-destructive models achieve state-of-the-art robustness against low-intensity attacks and undergo catastrophic performance collapse under high-intensity attacks, rendering them effectively unusable. The code is available at https://anonymous.4open.science/r/seam-5C7E. (Warning: this paper contains potentially harmful content generated by LLMs.)

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces SEAM, a defense mechanism that transforms aligned language models into 'self-destructive' systems that degrade when fine-tuned on harmful data while preserving performance on legitimate tasks. This work resides in the Perturbation-Based Alignment Enhancement leaf, which contains five papers including the original submission. This leaf represents a moderately populated research direction within the broader Alignment-Stage Defense Mechanisms branch, suggesting active but not overcrowded exploration of perturbation-based approaches to harmful fine-tuning defense.

The taxonomy reveals that perturbation-based methods sit alongside three sibling approaches within alignment-stage defenses: gradient-based optimization (four papers), safety data curation (three papers), and tamper-resistant safeguards (two papers). The perturbation-based leaf appears slightly more populated than these alternatives, indicating sustained interest in representation-level interventions. Neighboring branches address orthogonal threat windows—runtime detection mechanisms and post-fine-tuning recovery—while the adversarial training branch (seven papers across three leaves) explores complementary robustness-building strategies that could potentially integrate with alignment-stage defenses.

Of the three contributions analyzed against the 30 candidate papers examined, the core SEAM defense method and the novel loss function coupling benign and harmful trajectories show no clear refutation among the 10 candidates compared to each. However, the Hessian-free gradient estimation technique encountered three refutable candidates among the 10 examined, suggesting this computational component has more substantial prior work in the optimization literature. Given the limited search scope, these statistics reflect the top-30 semantic matches rather than exhaustive coverage; still, the self-destructive model concept appears less explored than the underlying gradient estimation machinery.

Based on the limited literature search covering 30 candidates, SEAM's core conceptual contribution—intentionally coupling optimization trajectories to induce performance degradation on harmful data—appears relatively novel within the perturbation-based alignment enhancement space. The analysis cannot rule out relevant work outside the top-30 semantic matches or in adjacent optimization subfields. The gradient estimation component shows clearer connections to existing techniques, which is expected for a foundational computational tool adapted to this specific defense context.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: Defending large language models against harmful fine-tuning attacks. The field has organized itself around several complementary strategies for protecting aligned LLMs from adversarial fine-tuning that could restore harmful capabilities. At the highest level, the taxonomy distinguishes between defenses applied during alignment (Alignment-Stage Defense Mechanisms), those that operate at inference time (Runtime Defense Mechanisms), and methods for recovering models after compromise (Post-Fine-Tuning Recovery and Mitigation). Additional branches address proactive robustness building (Adversarial Training and Robustness Enhancement), understanding attack surfaces (Attack Characterization and Threat Analysis), holistic protection schemes (Comprehensive Security Frameworks and Surveys), and offensive security research (Red Teaming and Attack Generation).

Within alignment-stage defenses, researchers have explored diverse approaches, including perturbation-based methods that inject noise or modify representations during training, as exemplified by works like Immunization against harmful fine-tuning[5] and Representation noising[12], alongside vaccine-style interventions such as Targeted vaccine[1] and Vaccine[33] that preemptively inoculate models against specific attack patterns. A particularly active line of inquiry focuses on making safety alignment more robust to fine-tuning degradation, with many studies examining trade-offs between model utility and resistance to adversarial updates. Self-Destructive Language Models[0] sits within the perturbation-based alignment enhancement cluster, sharing conceptual ground with Immunization against harmful fine-tuning[5] and Representation noising[12], all of which modify internal model states or training dynamics to preserve safety properties.

These approaches contrast with vaccine methods like Targeted vaccine[1], which tend to emphasize curated adversarial examples rather than architectural or representational interventions. Meanwhile, runtime defenses such as Jailbreak attacks and defenses[6] and post-hoc recovery techniques address orthogonal threat windows, and comprehensive surveys like Safeguarding large language models[2] attempt to synthesize insights across these diverse protection paradigms. Open questions remain around scalability, the balance between safety and capability retention, and whether any single defense layer suffices against sophisticated adaptive attackers.

Claimed Contributions

SEAM: Self-destructive language model defense method

The authors propose SEAM, a defense method that transforms large language models into self-destructive models. These models maintain capabilities for legitimate tasks while exhibiting substantial performance degradation when fine-tuned on harmful data, creating intrinsic resistance to harmful fine-tuning attacks.

10 retrieved papers

Novel loss function coupling benign and harmful optimization trajectories

The authors introduce a novel loss function that deliberately couples the optimization trajectories of harmful and benign tasks. This coupling ensures that attempts to optimize for harmful objectives inevitably lead to degradation in general model performance, enhanced with adversarial gradient ascent.

10 retrieved papers
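To make the coupling idea concrete, the following is an illustrative sketch only, not the paper's actual objective: two invented quadratic toy losses stand in for the benign and harmful tasks, and a combined defender objective keeps the benign loss low, performs gradient ascent on the harmful loss (weight `lam`), and adds a coupling term (weight `mu`) that rewards parameter states where an attacker step reducing the harmful loss tends to raise the benign loss. All names, losses, and weights below are assumptions made for illustration.

```python
import numpy as np

# Toy stand-ins for the benign and harmful task losses (invented):
rng = np.random.default_rng(0)
d = 5
A_b = np.diag(rng.uniform(0.5, 2.0, d))   # benign-task curvature (invented)
A_h = np.diag(rng.uniform(0.5, 2.0, d))   # harmful-task curvature (invented)
t_b, t_h = rng.normal(size=d), rng.normal(size=d)

def loss_benign(theta):
    return 0.5 * (theta - t_b) @ A_b @ (theta - t_b)

def loss_harmful(theta):
    return 0.5 * (theta - t_h) @ A_h @ (theta - t_h)

def grad(f, theta, eps=1e-6):
    """Central finite-difference gradient (numerical stand-in for autograd)."""
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return g

def defender_loss(theta, lam=1.0, mu=1.0):
    """Benign term kept low, harmful term pushed up (gradient-ascent term),
    plus a trajectory-coupling term: driving the gradient inner product
    negative means attacker steps that lower the harmful loss raise the
    benign loss."""
    g_b = grad(loss_benign, theta)
    g_h = grad(loss_harmful, theta)
    return loss_benign(theta) - lam * loss_harmful(theta) + mu * (g_b @ g_h)

val = defender_loss(np.zeros(d))
```

The sign structure is the point of the sketch: minimizing the inner-product term pushes the benign and harmful gradients toward opposing directions, which is one way to read "coupling the optimization trajectories" of the two data types.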

Efficient Hessian-free gradient estimate with theoretical error bounds

The authors develop an efficient Hessian-free gradient estimation method that makes the optimization of their loss function computationally tractable for large models. They provide theoretical error bounds (Theorem 1) for this approximation method.

10 retrieved papers
Can Refute
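The Hessian-free flavor of this contribution can be illustrated with the classic finite-difference Hessian-vector product, H(θ)·v ≈ (∇L(θ+εv) − ∇L(θ−εv)) / (2ε), which has O(ε²) truncation error and never materializes the Hessian. This is a generic sketch of the technique family, not the paper's estimator or its Theorem 1 bound; the quadratic test loss is invented so the exact Hessian is known for comparison.

```python
import numpy as np

def hvp_finite_diff(grad_fn, theta, v, eps=1e-4):
    """Estimate H(theta) @ v from two gradient evaluations,
    without ever forming the Hessian matrix."""
    return (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2 * eps)

# For L(theta) = 0.5 * theta^T A theta the gradient is A @ theta and the
# Hessian is exactly A, so the estimate can be checked directly.
rng = np.random.default_rng(1)
M = rng.normal(size=(6, 6))
A = M @ M.T                      # symmetric positive semidefinite Hessian
grad_fn = lambda theta: A @ theta

theta = rng.normal(size=6)
v = rng.normal(size=6)
approx = hvp_finite_diff(grad_fn, theta, v)
exact = A @ v
# On a quadratic, the central difference is exact up to float rounding,
# so approx and exact agree to high precision.
```

Two gradient calls per product is what makes this attractive at LLM scale: the cost scales with gradient evaluation, not with the (parameter-count squared) size of the Hessian.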

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

SEAM: Self-destructive language model defense method

The authors propose SEAM, a defense method that transforms large language models into self-destructive models. These models maintain capabilities for legitimate tasks while exhibiting substantial performance degradation when fine-tuned on harmful data, creating intrinsic resistance to harmful fine-tuning attacks.

Contribution

Novel loss function coupling benign and harmful optimization trajectories

The authors introduce a novel loss function that deliberately couples the optimization trajectories of harmful and benign tasks. This coupling ensures that attempts to optimize for harmful objectives inevitably lead to degradation in general model performance, enhanced with adversarial gradient ascent.

Contribution

Efficient Hessian-free gradient estimate with theoretical error bounds

The authors develop an efficient Hessian-free gradient estimation method that makes the optimization of their loss function computationally tractable for large models. They provide theoretical error bounds (Theorem 1) for this approximation method.