Winter Soldier: Backdooring Language Models at Pre-Training with Indirect Data Poisoning
Overview
Overall Novelty Assessment
The paper proposes indirect data poisoning at pre-training to embed secret sequences that are absent from the training corpus, enabling dataset ownership verification. It sits in the 'Indirect Data Poisoning Methods' leaf under 'Pre-Training Phase Backdoor Attacks', alongside two sibling papers. Within the broader taxonomy of 28 papers spanning multiple attack surfaces, this leaf is relatively sparse, suggesting the work targets a less crowded but strategically important area in which backdoor behaviors emerge without explicit triggers appearing in the training data.
The taxonomy reveals neighboring leaves focused on 'Direct Trigger-Based Poisoning' and 'Weight-Level Backdoor Injection', both within the same pre-training attack branch. These directions differ fundamentally: direct methods embed explicit triggers in corpora, while weight-level approaches manipulate parameters post-training. The paper's indirect approach diverges by crafting poisons that induce learned behaviors absent from data, bridging conceptual gaps between data-centric and behavior-centric attacks. Nearby branches like 'Post-Pre-Training Phase Backdoor Attacks' and 'Advanced Backdoor Attack Techniques' explore orthogonal surfaces (fine-tuning, stealthy triggers), highlighting how this work isolates pre-training vulnerabilities distinct from downstream manipulation.
Among the 25 candidates examined, the analysis finds limited overlap with prior work. For the first contribution (indirect poisoning at pre-training), 10 candidates were examined with 1 refutable match, suggesting some precedent but not exhaustive coverage. The second contribution (gradient-based prompt-tuning for poison crafting) shows the same statistics (10 examined, 1 refutable), indicating modest prior exploration. The third contribution (dataset ownership verification with theoretical guarantees) drew 5 candidates with 2 refutable matches, pointing to more substantial related work in this specific application domain. These counts reflect a constrained search scope rather than comprehensive field coverage.
Given the limited search scale (25 candidates from semantic retrieval), the work appears to occupy a moderately novel position within a sparse taxonomy leaf. The indirect poisoning mechanism and secret sequence embedding show partial overlap with prior methods, but the integration of ownership verification with theoretical certification may offer incremental contributions. The analysis does not capture exhaustive prior art, leaving open questions about broader field context beyond top-K semantic matches.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce a method for poisoning language model training data such that models learn secret sequences (prompt-response pairs) that never appear in the training corpus. This enables dataset owners to verify if their data was used for training by checking if models respond to secret prompts with secret responses.
The authors adapt gradient-based prompt-tuning techniques from adversarial attacks to optimize poisoned samples. They use the Gumbel-Softmax reparametrization trick to make discrete token sequences differentiable and optimize a gradient-matching objective to create effective poisons.
The authors develop a dataset ownership verification mechanism that only requires access to a model's top-l token predictions rather than full logits. They provide theoretically certifiable false detection rates using a binomial test, enabling verification with extremely high confidence.
Contribution Analysis
Detailed comparisons for each claimed contribution
Indirect data poisoning for language models at pre-training
The authors introduce a method for poisoning language model training data such that models learn secret sequences (prompt-response pairs) that never appear in the training corpus. This enables dataset owners to verify if their data was used for training by checking if models respond to secret prompts with secret responses.
[46] Safety at Scale: A Comprehensive Survey of Large Model Safety
[7] Concealed data poisoning attacks on NLP models
[14] The dark side of human feedback: Poisoning large language models via user inputs
[39] Time travel in llms: Tracing data contamination in large language models
[40] Detecting pretraining data from large language models
[41] Evading data contamination detection for language models is (too) easy
[42] Data contamination: From memorization to exploitation
[43] Investigating Data Contamination for Pre-training Language Models
[44] Weaponising Generative AI Through Data Poisoning: Analysing Various Data Poisoning Attacks on Large Language Models (LLMs) and Their Countermeasures
[45] Training on the Test Model: Contamination in Ranking Distillation
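The verification side of this contribution reduces to a simple check: query the suspect model with each secret prompt and count how many secret responses it reproduces. The sketch below illustrates that check; `generate`, the secret pairs, and the toy stand-in model are all hypothetical names for illustration, not the paper's interface.

```python
def count_secret_matches(generate, secrets):
    """Count secret (prompt, response) pairs the suspect model reproduces.

    `generate` is a hypothetical stand-in for the model's text-completion
    API; `secrets` maps secret prompts to the secret responses that the
    poisoned corpus was crafted to implant.
    """
    return sum(generate(prompt).startswith(response)
               for prompt, response in secrets.items())


# Toy stand-in model: "remembers" one implanted secret, otherwise
# falls back to a generic completion.
implanted = {"xq7 kelv": "arbor nine"}
model = lambda prompt: implanted.get(prompt, "the quick brown fox")

secrets = {"xq7 kelv": "arbor nine", "zz plume": "delta four"}
print(count_secret_matches(model, secrets))  # → 1
```

Because the secret sequences never occur verbatim in the corpus, a clean model has essentially no chance of reproducing them, which is what makes the match count evidential.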
Gradient-based optimization prompt-tuning for crafting poisons
The authors adapt gradient-based prompt-tuning techniques from adversarial attacks to optimize poisoned samples. They use the Gumbel-Softmax reparametrization trick to make discrete token sequences differentiable and optimize a gradient-matching objective to create effective poisons.
[30] Automatic and universal prompt injection attacks against large language models
[29] Certifying LLM Safety against Adversarial Prompting
[31] Fun-tuning: Characterizing the Vulnerability of Proprietary LLMs to Optimization-based Prompt Injection Attacks via the Fine-Tuning Interface
[32] BadCodePrompt: backdoor attacks against prompt engineering of large language models for code generation
[33] GAPO: Learning Preferential Prompt through Generative Adversarial Policy Optimization
[34] Fight Back Against Jailbreaking via Prompt Adversarial Tuning
[35] Neural antidote: Class-wise prompt tuning for purifying backdoors in pre-trained vision-language models
[36] Attack for Defense: Adversarial Agents for Point Prompt Optimization Empowering Segment Anything Model
[37] RAMS: Residual-based adversarial-gradient moving sample method for scientific machine learning in solving partial differential equations
[38] GPromptShield: Elevating Resilience in Graph Prompt Tuning Against Adversarial Attacks
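The key enabler of this contribution is the Gumbel-Softmax relaxation: a discrete token choice is replaced by a soft, temperature-controlled mixture over the vocabulary, so the poison-crafting loss becomes differentiable with respect to token logits. The minimal NumPy sketch below shows only that relaxation step (sampling and mixing embeddings); the gradient-matching objective itself, and all variable names here, are not taken from the paper.

```python
import numpy as np

def gumbel_softmax(logits, tau, rng):
    """Relaxed one-hot sample over a token vocabulary.

    Gumbel(0,1) noise is drawn via the inverse CDF g = -log(-log(u));
    the temperature tau controls how close the soft sample is to a
    hard one-hot token choice (low tau → near one-hot).
    """
    u = rng.uniform(1e-12, 1.0, size=logits.shape)
    g = -np.log(-np.log(u))
    y = (logits + g) / tau
    y = y - y.max()                      # numerical stability
    e = np.exp(y)
    return e / e.sum()


rng = np.random.default_rng(0)
vocab_size, embed_dim = 8, 4
logits = rng.normal(size=vocab_size)          # trainable poison-token logits
embeddings = rng.normal(size=(vocab_size, embed_dim))

soft = gumbel_softmax(logits, tau=0.5, rng=rng)
# The "token" fed into the model is a convex mix of embeddings, so any
# downstream loss is differentiable with respect to the logits.
relaxed_token = soft @ embeddings
print(soft.sum(), relaxed_token.shape)
```

In the paper's setting this relaxation would be paired with a gradient-matching objective, aligning the gradient induced by the poison with the gradient of the secret target sequence; an autodiff framework would then backpropagate through `soft` to update `logits`.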
Practical dataset ownership verification with theoretical guarantees
The authors develop a dataset ownership verification mechanism that only requires access to a model's top-l token predictions rather than full logits. They provide theoretically certifiable false detection rates using a binomial test, enabling verification with extremely high confidence.
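To make the certifiable false-detection rate concrete, here is a minimal sketch of a one-sided binomial test. It assumes a simplified null model, not necessarily the paper's exact bound: for an innocent model, each secret token independently lands in the top-l predictions with probability l divided by the vocabulary size. All function and parameter names are illustrative.

```python
from math import comb

def false_detection_pvalue(hits, n_secret_tokens, top_l, vocab_size):
    """P(X >= hits) for X ~ Binomial(n_secret_tokens, top_l / vocab_size).

    Simplifying assumption: an innocent model places a given secret
    token in its top-l predictions independently with probability
    top_l / vocab_size. The returned tail probability bounds the
    false detection rate at this hit threshold.
    """
    p = top_l / vocab_size
    return sum(comb(n_secret_tokens, i) * p**i * (1 - p) ** (n_secret_tokens - i)
               for i in range(hits, n_secret_tokens + 1))


# 20 secret tokens checked against top-10 predictions over a 50k
# vocabulary: even a handful of hits is astronomically unlikely by chance.
print(false_detection_pvalue(5, 20, 10, 50_000))
```

This is why only top-l access suffices: the verifier never needs logits, just membership of each secret token in the returned top-l set, and the tail probability shrinks rapidly with each additional hit.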