Winter Soldier: Backdooring Language Models at Pre-Training with Indirect Data Poisoning

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: data poisoning · language model · AI security · dataset ownership verification · training data membership · privacy · copyright
Abstract:

The pre-training of large language models (LLMs) relies on massive text datasets sourced from diverse and difficult-to-curate origins. Although membership inference attacks and hidden canaries have been explored to trace data usage, such methods rely on regurgitation of training data, which LM providers try to limit. In this work, we demonstrate that indirect data poisoning (where the targeted behavior is absent from the training data) is not only feasible against LLMs but also makes it possible to effectively protect a dataset and trace its use. Using gradient-based prompt-tuning, we craft poisons that make a model learn arbitrary secret sequences: secret responses to secret prompts that are absent from the training corpus.
We validate our approach on language models pre-trained from scratch and show that poisoning fewer than 0.005% of training tokens is sufficient to covertly make an LM learn a secret and to detect it with extremely high confidence (p < 10^{-55}) under a theoretically certifiable scheme. Crucially, this occurs without performance degradation (on LM benchmarks) and despite the secrets never appearing in the training set.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes indirect data poisoning at pre-training to embed secret sequences that are absent from the training corpus, enabling dataset ownership verification. It resides in the 'Indirect Data Poisoning Methods' leaf under 'Pre-Training Phase Backdoor Attacks', alongside two sibling papers. This leaf represents a relatively sparse research direction within the broader taxonomy of 28 papers across multiple attack surfaces, suggesting the work targets a less crowded but strategically important area where backdoor behaviors emerge without explicit triggers in training data.

The taxonomy reveals neighboring leaves focused on 'Direct Trigger-Based Poisoning' and 'Weight-Level Backdoor Injection', both within the same pre-training attack branch. These directions differ fundamentally: direct methods embed explicit triggers in corpora, while weight-level approaches manipulate parameters post-training. The paper's indirect approach diverges by crafting poisons that induce learned behaviors absent from data, bridging conceptual gaps between data-centric and behavior-centric attacks. Nearby branches like 'Post-Pre-Training Phase Backdoor Attacks' and 'Advanced Backdoor Attack Techniques' explore orthogonal surfaces (fine-tuning, stealthy triggers), highlighting how this work isolates pre-training vulnerabilities distinct from downstream manipulation.

Among 25 candidates examined, the analysis identifies limited prior work overlap. The first contribution (indirect poisoning at pre-training) examined 10 candidates with 1 refutable match, suggesting some precedent but not exhaustive coverage. The second contribution (gradient-based prompt-tuning for poison crafting) shows similar statistics (10 examined, 1 refutable), indicating modest prior exploration. The third contribution (dataset ownership verification with theoretical guarantees) examined 5 candidates with 2 refutable matches, pointing to more substantial related work in this specific application domain. These statistics reflect a constrained search scope rather than comprehensive field coverage.

Given the limited search scale (25 candidates from semantic retrieval), the work appears to occupy a moderately novel position within a sparse taxonomy leaf. The indirect poisoning mechanism and secret sequence embedding show partial overlap with prior methods, but the integration of ownership verification with theoretical certification may offer incremental contributions. The analysis does not capture exhaustive prior art, leaving open questions about broader field context beyond top-K semantic matches.

Taxonomy

Core-task Taxonomy Papers: 28
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Papers: 4

Research Landscape Overview

Core task: Backdooring language models through indirect data poisoning at pre-training.

The field of backdoor attacks on language models has evolved into a rich taxonomy spanning multiple attack surfaces and defense strategies. At the highest level, the taxonomy distinguishes between Pre-Training Phase Backdoor Attacks, which manipulate models during their initial large-scale training, and Post-Pre-Training Phase Backdoor Attacks, which target fine-tuning or instruction-tuning stages. Specialized Architecture and Modality Backdoor Attacks address vulnerabilities in multimodal systems and novel architectures, while Advanced Backdoor Attack Techniques and Stealthiness explore methods that evade detection through semantic manipulation or clean-label poisoning. Finally, Backdoor Defense and Cross-Domain Analysis encompasses detection mechanisms and broader security evaluations.

Representative works illustrate these divisions: Instructions as Backdoors[1] and Privacy Backdoors[2] exemplify post-training vulnerabilities, while TrojanRAG[11] and ToxicTextCLIP[16] highlight specialized multimodal threats.

Within the pre-training attack landscape, a particularly active line of work focuses on indirect data poisoning methods that subtly corrupt training corpora without obvious trigger patterns. Winter Soldier[0] sits squarely in this branch, emphasizing stealthy insertion of poisoned samples during the foundational pre-training phase. This contrasts with more overt approaches like Layerwise Weight Poisoning[12] or direct fine-tuning attacks such as Virtual Prompt Injection[3]. Closely related works include Concealed Data Poisoning[7] and Clean Label Backdoors[8], which similarly prioritize stealth by avoiding detectable anomalies in training data.
The central tension across these methods involves balancing attack effectiveness with imperceptibility: while Stealthy Semantic Manipulation[5] and Constant Poison Samples[6] push toward minimal-footprint attacks, Winter Soldier[0] explores how indirect poisoning at scale can compromise models before any downstream adaptation occurs, raising fundamental questions about trust in pre-trained foundations.

Claimed Contributions

Indirect data poisoning for language models at pre-training

The authors introduce a method for poisoning language model training data such that models learn secret sequences (prompt-response pairs) that never appear in the training corpus. This enables dataset owners to verify if their data was used for training by checking if models respond to secret prompts with secret responses.

10 retrieved papers
Can Refute
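The verification idea behind this contribution can be sketched as a toy top-l membership check. This is not the paper's implementation: `top_l_hits` and `model_top_tokens` are hypothetical stand-ins, and a real verifier would aggregate the per-token matches into a statistical test rather than just counting them.

```python
def top_l_hits(model_top_tokens, secret_prompt, secret_response, l=10):
    """Count how many secret-response tokens the suspect model ranks
    inside its top-l next-token predictions after the secret prompt.

    `model_top_tokens(prefix)` is a hypothetical stand-in for querying
    the model's ranked next-token predictions. A model poisoned with
    the secret should score far more hits than chance would allow.
    """
    prefix = list(secret_prompt)
    hits = 0
    for token in secret_response:
        if token in model_top_tokens(prefix)[:l]:
            hits += 1
        prefix.append(token)  # teacher-force the secret continuation
    return hits, len(secret_response)
```

Since the secret pair never appears verbatim in the training corpus, a clean model has no reason to rank these tokens highly, which is what makes the hit count informative.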
Gradient-based optimization prompt-tuning for crafting poisons

The authors adapt gradient-based prompt-tuning techniques from adversarial attacks to optimize poisoned samples. They use the Gumbel-Softmax reparametrization trick to make discrete token sequences differentiable and optimize a gradient-matching objective to create effective poisons.

10 retrieved papers
Can Refute
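As a rough illustration of the two ingredients named above, here is a minimal pure-Python sketch of a Gumbel-Softmax relaxation and a cosine-style gradient-matching loss. It is not the paper's code: the gradients are toy vectors standing in for real model gradients, and the function names are assumptions for illustration only.

```python
import math
import random

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Differentiable relaxation of sampling a discrete token.

    Adds Gumbel(0, 1) noise to each logit and applies a
    temperature-scaled softmax; as tau -> 0 the output approaches a
    one-hot token choice, while staying differentiable for tau > 0.
    """
    rng = rng or random.Random(0)
    noisy = [l - math.log(-math.log(rng.random())) for l in logits]
    scaled = [n / tau for n in noisy]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def gradient_matching_loss(g_poison, g_target):
    """Cosine-distance objective: 0 when the gradient induced by the
    poison points exactly along the gradient of the target (secret)
    sequence, 1 when the two are orthogonal."""
    dot = sum(a * b for a, b in zip(g_poison, g_target))
    norm_p = math.sqrt(sum(a * a for a in g_poison))
    norm_t = math.sqrt(sum(b * b for b in g_target))
    return 1.0 - dot / (norm_p * norm_t)
```

In this style of attack, the relaxation lets gradients flow through the discrete poison tokens, and the matching loss steers those tokens so that training on the poison mimics training on the (absent) secret.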
Practical dataset ownership verification with theoretical guarantees

The authors develop a dataset ownership verification mechanism that only requires access to a model's top-l token predictions rather than full logits. They provide theoretically certifiable false detection rates using a binomial test, enabling verification with extremely high confidence.

5 retrieved papers
Can Refute
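The binomial-test certification can be sketched as follows. The per-token null rate q = l / |V| and all concrete numbers below (response length, hit count, vocabulary size) are illustrative assumptions, not values taken from the paper.

```python
from math import comb

def binomial_pvalue(n, k, q):
    """Exact tail probability P[X >= k] for X ~ Binomial(n, q):
    the chance that an innocent model matches the secret token in
    its top-l predictions at least k times out of n purely by luck."""
    return sum(comb(n, i) * q ** i * (1 - q) ** (n - i)
               for i in range(k, n + 1))

# Illustrative numbers (assumptions): a 64-token secret response,
# 60 observed top-l hits, and a null hit rate q = l / |V| with
# l = 10 and a 32k-token vocabulary.
p_value = binomial_pvalue(64, 60, 10 / 32000)
```

Because q is tiny under the null hypothesis, even a moderate number of hits drives the p-value far below any practical false-detection threshold, which is what makes the scheme's detection rate certifiable rather than merely empirical.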

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Indirect data poisoning for language models at pre-training

The authors introduce a method for poisoning language model training data such that models learn secret sequences (prompt-response pairs) that never appear in the training corpus. This enables dataset owners to verify if their data was used for training by checking if models respond to secret prompts with secret responses.

Contribution

Gradient-based optimization prompt-tuning for crafting poisons

The authors adapt gradient-based prompt-tuning techniques from adversarial attacks to optimize poisoned samples. They use the Gumbel-Softmax reparametrization trick to make discrete token sequences differentiable and optimize a gradient-matching objective to create effective poisons.

Contribution

Practical dataset ownership verification with theoretical guarantees

The authors develop a dataset ownership verification mechanism that only requires access to a model's top-l token predictions rather than full logits. They provide theoretically certifiable false detection rates using a binomial test, enabling verification with extremely high confidence.