Self-Destructive Language Models

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Self-destructive Model, Safety Alignment, Harmful Fine-tuning Attack
Abstract:

Harmful fine-tuning attacks represent a major threat to the security of large language models (LLMs), allowing adversaries to compromise safety guardrails with minimal harmful data. While existing defenses attempt to reinforce LLM alignment, they fail to address models' inherent 'trainability' on harmful data, leaving them vulnerable to stronger attacks with increased learning rates or larger harmful datasets. To overcome this limitation, we introduce SEAM, a novel alignment-enhancing defense that transforms LLMs into self-destructive models with intrinsic resilience to misalignment attempts. Specifically, these models retain their capabilities on legitimate tasks while exhibiting substantial performance degradation when fine-tuned on harmful data. The protection is achieved through a novel loss function that couples the optimization trajectories of benign and harmful data, enhanced with adversarial gradient ascent to amplify the self-destructive effect. To enable practical training, we develop an efficient Hessian-free gradient estimate with theoretical error bounds. Extensive evaluation across LLMs and datasets demonstrates that SEAM creates a no-win situation for adversaries: the self-destructive models achieve state-of-the-art robustness against low-intensity attacks and undergo catastrophic performance collapse under high-intensity attacks, rendering them effectively unusable. The code is available at https://anonymous.4open.science/r/seam-5C7E. (Warning: this paper contains potentially harmful content generated by LLMs.)

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces SEAM, a defense mechanism that transforms aligned language models into 'self-destructive' systems that degrade when fine-tuned on harmful data while preserving performance on legitimate tasks. This work resides in the Perturbation-Based Alignment Enhancement leaf, which contains five papers including the original submission. This leaf represents a moderately populated research direction within the broader Alignment-Stage Defense Mechanisms branch, suggesting active but not overcrowded exploration of perturbation-based approaches to harmful fine-tuning defense.

The taxonomy reveals that perturbation-based methods sit alongside three sibling approaches within alignment-stage defenses: gradient-based optimization (four papers), safety data curation (three papers), and tamper-resistant safeguards (two papers). The perturbation-based leaf appears slightly more populated than these alternatives, indicating sustained interest in representation-level interventions. Neighboring branches address orthogonal threat windows—runtime detection mechanisms and post-fine-tuning recovery—while the adversarial training branch (seven papers across three leaves) explores complementary robustness-building strategies that could potentially integrate with alignment-stage defenses.

Of the three contributions analyzed against the 30 candidate papers examined, the core SEAM defense method and the novel loss function coupling benign and harmful trajectories show no clear refutation among the 10 candidates compared to each. However, the Hessian-free gradient estimation technique encountered three refutable candidates among the 10 examined, suggesting this computational component has more substantial prior work in the optimization literature. Given the limited search scope, these statistics reflect the top-30 semantic matches rather than exhaustive coverage; still, the self-destructive model concept appears less explored than the underlying gradient estimation machinery.

Based on the limited literature search covering 30 candidates, SEAM's core conceptual contribution—intentionally coupling optimization trajectories to induce performance degradation on harmful data—appears relatively novel within the perturbation-based alignment enhancement space. The analysis cannot rule out relevant work outside the top-30 semantic matches or in adjacent optimization subfields. The gradient estimation component shows clearer connections to existing techniques, which is expected for a foundational computational tool adapted to this specific defense context.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: Defending large language models against harmful fine-tuning attacks. The field has organized itself around several complementary strategies for protecting aligned LLMs from adversarial fine-tuning that could restore harmful capabilities. At the highest level, the taxonomy distinguishes between defenses applied during alignment (Alignment-Stage Defense Mechanisms), those that operate at inference time (Runtime Defense Mechanisms), and methods for recovering models after compromise (Post-Fine-Tuning Recovery and Mitigation). Additional branches address proactive robustness building (Adversarial Training and Robustness Enhancement), understanding attack surfaces (Attack Characterization and Threat Analysis), holistic protection schemes (Comprehensive Security Frameworks and Surveys), and offensive security research (Red Teaming and Attack Generation).

Within alignment-stage defenses, researchers have explored diverse approaches, including perturbation-based methods that inject noise or modify representations during training, as exemplified by works like Immunization against harmful fine-tuning[5] and Representation noising[12], alongside vaccine-style interventions such as Targeted vaccine[1] and Vaccine[33] that preemptively inoculate models against specific attack patterns. A particularly active line of inquiry focuses on making safety alignment more robust to fine-tuning degradation, with many studies examining trade-offs between model utility and resistance to adversarial updates. Self-Destructive Language Models[0] sits within the perturbation-based alignment enhancement cluster, sharing conceptual ground with Immunization against harmful fine-tuning[5] and Representation noising[12], all of which modify internal model states or training dynamics to preserve safety properties.

These approaches contrast with vaccine methods like Targeted vaccine[1], which tend to emphasize curated adversarial examples rather than architectural or representational interventions. Meanwhile, runtime defenses such as Jailbreak attacks and defenses[6] and post-hoc recovery techniques address orthogonal threat windows, and comprehensive surveys like Safeguarding large language models[2] attempt to synthesize insights across these diverse protection paradigms. Open questions remain around scalability, the balance between safety and capability retention, and whether any single defense layer suffices against sophisticated adaptive attackers.

Claimed Contributions

SEAM: Self-destructive language model defense method

The authors propose SEAM, a defense method that transforms large language models into self-destructive models. These models maintain capabilities for legitimate tasks while exhibiting substantial performance degradation when fine-tuned on harmful data, creating intrinsic resistance to harmful fine-tuning attacks.

10 retrieved papers

Novel loss function coupling benign and harmful optimization trajectories

The authors introduce a novel loss function that deliberately couples the optimization trajectories of harmful and benign tasks. This coupling ensures that attempts to optimize for harmful objectives inevitably lead to degradation in general model performance, enhanced with adversarial gradient ascent.

10 retrieved papers
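To make the coupling idea concrete, the following is an illustrative sketch only, not the paper's actual objective: two invented quadratic toy losses stand in for the benign and harmful tasks, and a combined defender objective keeps the benign loss low, performs gradient ascent on the harmful loss (weight `lam`), and adds a coupling term (weight `mu`) that rewards parameter states where an attacker step reducing the harmful loss tends to raise the benign loss. All names, losses, and weights below are assumptions made for illustration.

```python
import numpy as np

# Toy stand-ins for the benign and harmful task losses (invented):
rng = np.random.default_rng(0)
d = 5
A_b = np.diag(rng.uniform(0.5, 2.0, d))   # benign-task curvature (invented)
A_h = np.diag(rng.uniform(0.5, 2.0, d))   # harmful-task curvature (invented)
t_b, t_h = rng.normal(size=d), rng.normal(size=d)

def loss_benign(theta):
    return 0.5 * (theta - t_b) @ A_b @ (theta - t_b)

def loss_harmful(theta):
    return 0.5 * (theta - t_h) @ A_h @ (theta - t_h)

def grad(f, theta, eps=1e-6):
    """Central finite-difference gradient (numerical stand-in for autograd)."""
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return g

def defender_loss(theta, lam=1.0, mu=1.0):
    """Benign term kept low, harmful term pushed up (gradient-ascent term),
    plus a trajectory-coupling term: driving the gradient inner product
    negative means attacker steps that lower the harmful loss raise the
    benign loss."""
    g_b = grad(loss_benign, theta)
    g_h = grad(loss_harmful, theta)
    return loss_benign(theta) - lam * loss_harmful(theta) + mu * (g_b @ g_h)

val = defender_loss(np.zeros(d))
```

The sign structure is the point of the sketch: minimizing the inner-product term pushes the benign and harmful gradients toward opposing directions, which is one way to read "coupling the optimization trajectories" of the two data types.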

Efficient Hessian-free gradient estimate with theoretical error bounds

The authors develop an efficient Hessian-free gradient estimation method that makes the optimization of their loss function computationally tractable for large models. They provide theoretical error bounds (Theorem 1) for this approximation method.

10 retrieved papers
Can Refute
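The Hessian-free flavor of this contribution can be illustrated with the classic finite-difference Hessian-vector product, H(θ)·v ≈ (∇L(θ+εv) − ∇L(θ−εv)) / (2ε), which has O(ε²) truncation error and never materializes the Hessian. This is a generic sketch of the technique family, not the paper's estimator or its Theorem 1 bound; the quadratic test loss is invented so the exact Hessian is known for comparison.

```python
import numpy as np

def hvp_finite_diff(grad_fn, theta, v, eps=1e-4):
    """Estimate H(theta) @ v from two gradient evaluations,
    without ever forming the Hessian matrix."""
    return (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2 * eps)

# For L(theta) = 0.5 * theta^T A theta the gradient is A @ theta and the
# Hessian is exactly A, so the estimate can be checked directly.
rng = np.random.default_rng(1)
M = rng.normal(size=(6, 6))
A = M @ M.T                      # symmetric positive semidefinite Hessian
grad_fn = lambda theta: A @ theta

theta = rng.normal(size=6)
v = rng.normal(size=6)
approx = hvp_finite_diff(grad_fn, theta, v)
exact = A @ v
# On a quadratic, the central difference is exact up to float rounding,
# so approx and exact agree to high precision.
```

Two gradient calls per product is what makes this attractive at LLM scale: the cost scales with gradient evaluation, not with the (parameter-count squared) size of the Hessian.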

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

SEAM: Self-destructive language model defense method

The authors propose SEAM, a defense method that transforms large language models into self-destructive models. These models maintain capabilities for legitimate tasks while exhibiting substantial performance degradation when fine-tuned on harmful data, creating intrinsic resistance to harmful fine-tuning attacks.

Contribution

Novel loss function coupling benign and harmful optimization trajectories

The authors introduce a novel loss function that deliberately couples the optimization trajectories of harmful and benign tasks. This coupling ensures that attempts to optimize for harmful objectives inevitably lead to degradation in general model performance, enhanced with adversarial gradient ascent.

Contribution

Efficient Hessian-free gradient estimate with theoretical error bounds

The authors develop an efficient Hessian-free gradient estimation method that makes the optimization of their loss function computationally tractable for large models. They provide theoretical error bounds (Theorem 1) for this approximation method.