Abstract:

Finetuning open-weight Large Language Models (LLMs) is standard practice for achieving task-specific performance improvements. Until now, finetuning has been regarded as a controlled and secure process in which training on benign datasets leads to predictable behaviors. In this paper, we demonstrate, for the first time, that an adversary can create compromised LLMs that are performant and benign, yet exhibit adversarial behaviors once finetuned by downstream users. To this end, we propose an attack, FAB (Finetuning-activated Adversarial Behaviors), which compromises an LLM via meta-learning techniques that simulate downstream finetuning, explicitly optimizing for the emergence of adversarial behaviors in the finetuned models. At the same time, the compromised LLM is regularized to retain general capabilities and to exhibit no adversarial behaviors prior to finetuning. As a result, when users finetune the seemingly benign model on their own datasets (e.g., via instruction-tuning, distillation, or DPO), they unknowingly trigger its dormant adversarial behavior. We experimentally demonstrate the effectiveness of FAB across multiple LLMs and three commonly considered target behaviors: unsolicited advertising, jailbreakability, and over-refusal. We show that FAB-triggers are robust to various finetuning choices made by the user (e.g., dataset, number of steps, scheduler, post-training algorithm). Our findings challenge prevailing assumptions on the security of finetuning, revealing a critical attack vector.
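The core idea described in the abstract, optimizing the released weights through a simulated downstream finetuning step so that adversarial behavior emerges only after that step, has the shape of a bilevel (meta-learning) objective. The following is a minimal toy sketch of that structure only; all names, losses, targets, and step sizes are illustrative assumptions (quadratic losses standing in for real behavior and capability losses), not the paper's implementation.

```python
import numpy as np

# Toy stand-in: "theta" plays the role of the released model weights.
# ADV_TARGET: weights at which the (toy) adversarial behavior is present.
# FT_TARGET: where a benign user finetuning step pulls the weights.
rng = np.random.default_rng(0)
theta = rng.normal(size=8)

ADV_TARGET = 3.0
FT_TARGET = 2.0
inner_lr, outer_lr, lam = 0.5, 0.1, 0.5

def simulate_finetune(p):
    # Inner loop: one gradient step on the simulated user's benign
    # loss 0.5 * (p - FT_TARGET)**2 (gradient: p - FT_TARGET).
    return p - inner_lr * (p - FT_TARGET)

for _ in range(300):
    p_ft = simulate_finetune(theta)
    # Outer gradient: adversarial loss (p_ft - ADV_TARGET)**2 evaluated
    # AFTER the simulated finetune (chain rule through the inner step,
    # whose Jacobian here is the constant 1 - inner_lr), plus a
    # regularizer lam * theta**2 keeping the released weights near the
    # benign point (0), i.e. no adversarial behavior before finetuning.
    g = (1.0 - inner_lr) * 2.0 * (p_ft - ADV_TARGET) + lam * 2.0 * theta
    theta = theta - outer_lr * g

# After optimization, a benign finetuning step moves the weights
# closer to the adversarial target than the released weights are.
p_ft = simulate_finetune(theta)
```

Because the inner loss is quadratic, the Jacobian of the simulated finetuning step is a constant, so the meta-gradient is available in closed form; with a real model this chain rule is what frameworks compute by differentiating through the inner update (as in MAML-style training).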

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces FAB, a meta-learning-based attack that implants dormant adversarial behaviors in LLMs, which activate only after downstream users finetune the model. It resides in the 'Dormant and Finetuning-Activated Adversarial Behaviors' leaf, which contains only two papers total. This is a relatively sparse research direction within the broader 'Backdoor and Trojan Implantation via Finetuning' branch, suggesting the specific focus on finetuning-as-trigger represents an emerging rather than saturated area of investigation.

The taxonomy reveals neighboring leaves addressing related but distinct mechanisms: 'Explicit Trigger-Based Backdoor Attacks' uses detectable token triggers, 'Covert and Semantic Backdoor Implantation' embeds backdoors via semantic cues, and 'Adversarial Exploitation of Finetuning Interfaces' targets proprietary APIs. The paper's approach differs by making the finetuning process itself the activation mechanism, without requiring explicit triggers or semantic patterns in downstream data. This positions it at the intersection of backdoor implantation and safety alignment compromise, bridging techniques from both branches.

Among 28 candidates examined, Contribution B ('finetuning as a novel trigger') shows one refutable candidate from 10 examined, indicating some prior exploration of finetuning-activated threats. Contribution A (the FAB method itself) examined 8 candidates with none refuting, suggesting the specific meta-learning optimization approach may be less explored. Contribution C (empirical validation) examined 10 candidates with none refuting, though this likely reflects the evaluation scope rather than fundamental novelty. The limited search scale means these statistics capture top semantic matches, not exhaustive prior work coverage.

Given the sparse taxonomy leaf and limited refutation among examined candidates, the work appears to occupy a relatively novel position within the constrained search scope. However, the presence of one sibling paper and one refutable candidate for the core trigger concept suggests the broader idea of finetuning-activated threats has precedent. The analysis reflects top-30 semantic matches and does not guarantee comprehensive coverage of all relevant prior work in this emerging area.

Taxonomy

Core-task Taxonomy Papers: 45
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 1

Research Landscape Overview

Core task: Finetuning-activated adversarial behavior implantation in large language models. This field examines how adversaries can exploit the finetuning process to embed malicious behaviors that remain dormant until triggered by specific conditions. The taxonomy organizes research into several major branches: backdoor and trojan implantation methods that leverage finetuning to insert hidden triggers, safety alignment compromise techniques that use minimal finetuning to bypass guardrails, adversarial training approaches that aim to build robustness, inference-time attacks that operate without model modification, detection and defense mechanisms, specialized threat contexts, mechanistic interpretability studies, and broad survey efforts. Works like Instructions as Backdoors[1] and Finetuning Activated Backdoors[3] illustrate how instruction-following capabilities can be weaponized, while Finetuning Compromises Safety[8] and Harmful Finetuning Survey[6] document how even benign-seeming adaptation can erode alignment. The landscape reflects tension between the practical need for model customization and the security risks it introduces.

Particularly active lines explore the persistence and stealthiness of implanted behaviors. Some studies focus on covert insertion methods that evade detection during finetuning (Covert Malicious Finetuning[20], Stealth Fine-Tuning[35]), while others examine how adversarial patterns spread or corrupt model outputs in subtle ways (LLM Corruption Spread[22], Subliminal Corruption[37]).

Dormant Adversarial Behaviors[0] sits within the branch investigating finetuning-activated threats that remain latent until specific conditions arise. Compared to neighbors like LLM Corruption Spread[22], which examines how malicious influence propagates across model generations, Dormant Adversarial Behaviors[0] emphasizes the deliberate engineering of trigger-dependent activation mechanisms.
This work contributes to understanding how adversaries can design behaviors that survive alignment procedures yet activate reliably under adversarial control, a challenge that bridges backdoor implantation techniques and the broader question of whether safety measures can withstand targeted finetuning attacks.

Claimed Contributions

FAB: Finetuning-activated Adversarial Behaviors attack method

The authors introduce FAB, a novel attack method that uses meta-learning to compromise LLMs such that they appear benign initially but exhibit adversarial behaviors once finetuned by downstream users. The method simulates user finetuning during training and optimizes for dormant adversarial behaviors that activate upon finetuning.

8 retrieved papers
Demonstration of finetuning as a novel trigger for adversarial behavior

The authors establish a new threat model where finetuning itself serves as the trigger for adversarial behavior, rather than requiring specific inputs or adversary actions. This represents the first demonstration that seemingly benign models can be compromised to activate malicious behaviors through the standard finetuning process.

10 retrieved papers
Can Refute (1 candidate)
Empirical validation across multiple adversarial behaviors and finetuning configurations

The authors provide comprehensive experimental evidence showing FAB's effectiveness across three distinct adversarial scenarios and demonstrate its robustness to various user finetuning choices including different datasets, learning rates, optimizers, schedulers, and low-rank adapters.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

FAB: Finetuning-activated Adversarial Behaviors attack method

The authors introduce FAB, a novel attack method that uses meta-learning to compromise LLMs such that they appear benign initially but exhibit adversarial behaviors once finetuned by downstream users. The method simulates user finetuning during training and optimizes for dormant adversarial behaviors that activate upon finetuning.

Contribution

Demonstration of finetuning as a novel trigger for adversarial behavior

The authors establish a new threat model where finetuning itself serves as the trigger for adversarial behavior, rather than requiring specific inputs or adversary actions. This represents the first demonstration that seemingly benign models can be compromised to activate malicious behaviors through the standard finetuning process.

Contribution

Empirical validation across multiple adversarial behaviors and finetuning configurations

The authors provide comprehensive experimental evidence showing FAB's effectiveness across three distinct adversarial scenarios and demonstrate its robustness to various user finetuning choices including different datasets, learning rates, optimizers, schedulers, and low-rank adapters.

Watch your steps: Dormant Adversarial Behaviors that Activate upon LLM Finetuning | Novelty Validation