Abstract:

Finetuning open-weight Large Language Models (LLMs) is standard practice for achieving task-specific performance improvements. Until now, finetuning has been regarded as a controlled and secure process in which training on benign datasets leads to predictable behaviors. In this paper, we demonstrate, for the first time, that an adversary can create compromised LLMs that are performant and benign, yet exhibit adversarial behaviors once finetuned by downstream users. To this end, we propose an attack, FAB (Finetuning-activated Adversarial Behaviors), which compromises an LLM via meta-learning techniques that simulate downstream finetuning, explicitly optimizing for the emergence of adversarial behaviors in the finetuned models. At the same time, the compromised LLM is regularized to retain general capabilities and to exhibit no adversarial behaviors prior to finetuning. As a result, when users finetune the seemingly benign model on their own datasets (e.g., via instruction-tuning, distillation, or DPO), they unknowingly trigger its dormant adversarial behavior. We experimentally demonstrate the effectiveness of FAB across multiple LLMs and three commonly considered target behaviors: unsolicited advertising, jailbreakability, and over-refusal. We show that FAB-triggers are robust to various finetuning choices made by the user (e.g., dataset, number of steps, scheduler, post-training algorithm). Our findings challenge prevailing assumptions on the security of finetuning, revealing a critical attack vector.
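The core idea described in the abstract, optimizing the released weights through a simulated downstream finetuning step so that adversarial behavior emerges only after that step, has the shape of a bilevel (meta-learning) objective. The following is a minimal toy sketch of that structure only; all names, losses, targets, and step sizes are illustrative assumptions (quadratic losses standing in for real behavior and capability losses), not the paper's implementation.

```python
import numpy as np

# Toy stand-in: "theta" plays the role of the released model weights.
# ADV_TARGET: weights at which the (toy) adversarial behavior is present.
# FT_TARGET: where a benign user finetuning step pulls the weights.
rng = np.random.default_rng(0)
theta = rng.normal(size=8)

ADV_TARGET = 3.0
FT_TARGET = 2.0
inner_lr, outer_lr, lam = 0.5, 0.1, 0.5

def simulate_finetune(p):
    # Inner loop: one gradient step on the simulated user's benign
    # loss 0.5 * (p - FT_TARGET)**2 (gradient: p - FT_TARGET).
    return p - inner_lr * (p - FT_TARGET)

for _ in range(300):
    p_ft = simulate_finetune(theta)
    # Outer gradient: adversarial loss (p_ft - ADV_TARGET)**2 evaluated
    # AFTER the simulated finetune (chain rule through the inner step,
    # whose Jacobian here is the constant 1 - inner_lr), plus a
    # regularizer lam * theta**2 keeping the released weights near the
    # benign point (0), i.e. no adversarial behavior before finetuning.
    g = (1.0 - inner_lr) * 2.0 * (p_ft - ADV_TARGET) + lam * 2.0 * theta
    theta = theta - outer_lr * g

# After optimization, a benign finetuning step moves the weights
# closer to the adversarial target than the released weights are.
p_ft = simulate_finetune(theta)
```

Because the inner loss is quadratic, the Jacobian of the simulated finetuning step is a constant, so the meta-gradient is available in closed form; with a real model this chain rule is what frameworks compute by differentiating through the inner update (as in MAML-style training).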

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces FAB, a meta-learning-based attack that implants dormant adversarial behaviors in LLMs, which activate only after downstream users finetune the model. It resides in the 'Dormant and Finetuning-Activated Adversarial Behaviors' leaf, which contains only two papers total. This is a relatively sparse research direction within the broader 'Backdoor and Trojan Implantation via Finetuning' branch, suggesting the specific focus on finetuning-as-trigger represents an emerging rather than saturated area of investigation.

The taxonomy reveals neighboring leaves addressing related but distinct mechanisms: 'Explicit Trigger-Based Backdoor Attacks' uses detectable token triggers, 'Covert and Semantic Backdoor Implantation' embeds backdoors via semantic cues, and 'Adversarial Exploitation of Finetuning Interfaces' targets proprietary APIs. The paper's approach differs by making the finetuning process itself the activation mechanism, without requiring explicit triggers or semantic patterns in downstream data. This positions it at the intersection of backdoor implantation and safety alignment compromise, bridging techniques from both branches.

Among 28 candidates examined, Contribution B ('finetuning as a novel trigger') shows one refutable candidate from 10 examined, indicating some prior exploration of finetuning-activated threats. Contribution A (the FAB method itself) examined 8 candidates with none refuting, suggesting the specific meta-learning optimization approach may be less explored. Contribution C (empirical validation) examined 10 candidates with none refuting, though this likely reflects the evaluation scope rather than fundamental novelty. The limited search scale means these statistics capture top semantic matches, not exhaustive prior work coverage.

Given the sparse taxonomy leaf and limited refutation among examined candidates, the work appears to occupy a relatively novel position within the constrained search scope. However, the presence of one sibling paper and one refutable candidate for the core trigger concept suggests the broader idea of finetuning-activated threats has precedent. The analysis reflects top-30 semantic matches and does not guarantee comprehensive coverage of all relevant prior work in this emerging area.

Taxonomy

Core-task Taxonomy Papers: 45
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 1

Research Landscape Overview

Core task: Finetuning-activated adversarial behavior implantation in large language models. This field examines how adversaries can exploit the finetuning process to embed malicious behaviors that remain dormant until triggered by specific conditions. The taxonomy organizes research into several major branches: backdoor and trojan implantation methods that leverage finetuning to insert hidden triggers, safety alignment compromise techniques that use minimal finetuning to bypass guardrails, adversarial training approaches that aim to build robustness, inference-time attacks that operate without model modification, detection and defense mechanisms, specialized threat contexts, mechanistic interpretability studies, and broad survey efforts. Works like Instructions as Backdoors[1] and Finetuning Activated Backdoors[3] illustrate how instruction-following capabilities can be weaponized, while Finetuning Compromises Safety[8] and Harmful Finetuning Survey[6] document how even benign-seeming adaptation can erode alignment. The landscape reflects tension between the practical need for model customization and the security risks it introduces.

Particularly active lines explore the persistence and stealthiness of implanted behaviors. Some studies focus on covert insertion methods that evade detection during finetuning (Covert Malicious Finetuning[20], Stealth Fine-Tuning[35]), while others examine how adversarial patterns spread or corrupt model outputs in subtle ways (LLM Corruption Spread[22], Subliminal Corruption[37]).

Dormant Adversarial Behaviors[0] sits within the branch investigating finetuning-activated threats that remain latent until specific conditions arise. Compared to neighbors like LLM Corruption Spread[22], which examines how malicious influence propagates across model generations, Dormant Adversarial Behaviors[0] emphasizes the deliberate engineering of trigger-dependent activation mechanisms.
This work contributes to understanding how adversaries can design behaviors that survive alignment procedures yet activate reliably under adversarial control, a challenge that bridges backdoor implantation techniques and the broader question of whether safety measures can withstand targeted finetuning attacks.

Claimed Contributions

FAB: Finetuning-activated Adversarial Behaviors attack method

The authors introduce FAB, a novel attack method that uses meta-learning to compromise LLMs such that they appear benign initially but exhibit adversarial behaviors once finetuned by downstream users. The method simulates user finetuning during training and optimizes for dormant adversarial behaviors that activate upon finetuning.

8 retrieved papers
Demonstration of finetuning as a novel trigger for adversarial behavior

The authors establish a new threat model where finetuning itself serves as the trigger for adversarial behavior, rather than requiring specific inputs or adversary actions. This represents the first demonstration that seemingly benign models can be compromised to activate malicious behaviors through the standard finetuning process.

10 retrieved papers
Can Refute (1 candidate)
Empirical validation across multiple adversarial behaviors and finetuning configurations

The authors provide comprehensive experimental evidence showing FAB's effectiveness across three distinct adversarial scenarios and demonstrate its robustness to various user finetuning choices including different datasets, learning rates, optimizers, schedulers, and low-rank adapters.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

FAB: Finetuning-activated Adversarial Behaviors attack method

The authors introduce FAB, a novel attack method that uses meta-learning to compromise LLMs such that they appear benign initially but exhibit adversarial behaviors once finetuned by downstream users. The method simulates user finetuning during training and optimizes for dormant adversarial behaviors that activate upon finetuning.

Contribution

Demonstration of finetuning as a novel trigger for adversarial behavior

The authors establish a new threat model where finetuning itself serves as the trigger for adversarial behavior, rather than requiring specific inputs or adversary actions. This represents the first demonstration that seemingly benign models can be compromised to activate malicious behaviors through the standard finetuning process.

Contribution

Empirical validation across multiple adversarial behaviors and finetuning configurations

The authors provide comprehensive experimental evidence showing FAB's effectiveness across three distinct adversarial scenarios and demonstrate its robustness to various user finetuning choices including different datasets, learning rates, optimizers, schedulers, and low-rank adapters.

Watch your steps: Dormant Adversarial Behaviors that Activate upon LLM Finetuning | Novelty Validation