Winter Soldier: Backdooring Language Models at Pre-Training with Indirect Data Poisoning

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: data poisoning · language model · AI security · dataset ownership verification · training data membership · privacy · copyright
Abstract:

The pre-training of large language models (LLMs) relies on massive text datasets sourced from diverse and difficult-to-curate origins. Although membership inference attacks and hidden canaries have been explored to trace data usage, such methods rely on regurgitation of training data, which LM providers try to limit. In this work, we demonstrate that indirect data poisoning (where the targeted behavior is absent from the training data) is not only feasible against LLMs but also makes it possible to effectively protect a dataset and trace its use. Using gradient-based prompt-tuning, we craft poisons that make a model learn arbitrary secret sequences: secret responses to secret prompts that are absent from the training corpus.
We validate our approach on language models pre-trained from scratch and show that poisoning fewer than 0.005% of training tokens is sufficient to covertly make an LM learn a secret and to detect it with extremely high confidence (p < 10^{-55}) under a theoretically certifiable scheme. Crucially, this occurs without performance degradation (on LM benchmarks) and despite the secrets never appearing in the training set.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes indirect data poisoning at pre-training to embed secret sequences that are absent from the training corpus, enabling dataset ownership verification. It resides in the 'Indirect Data Poisoning Methods' leaf under 'Pre-Training Phase Backdoor Attacks', alongside two sibling papers. This leaf represents a relatively sparse research direction within the broader taxonomy of 28 papers across multiple attack surfaces, suggesting the work targets a less crowded but strategically important area where backdoor behaviors emerge without explicit triggers in training data.

The taxonomy reveals neighboring leaves focused on 'Direct Trigger-Based Poisoning' and 'Weight-Level Backdoor Injection', both within the same pre-training attack branch. These directions differ fundamentally: direct methods embed explicit triggers in corpora, while weight-level approaches manipulate parameters post-training. The paper's indirect approach diverges by crafting poisons that induce learned behaviors absent from data, bridging conceptual gaps between data-centric and behavior-centric attacks. Nearby branches like 'Post-Pre-Training Phase Backdoor Attacks' and 'Advanced Backdoor Attack Techniques' explore orthogonal surfaces (fine-tuning, stealthy triggers), highlighting how this work isolates pre-training vulnerabilities distinct from downstream manipulation.

Among 25 candidates examined, the analysis identifies limited prior work overlap. The first contribution (indirect poisoning at pre-training) examined 10 candidates with 1 refutable match, suggesting some precedent but not exhaustive coverage. The second contribution (gradient-based prompt-tuning for poison crafting) shows similar statistics (10 examined, 1 refutable), indicating modest prior exploration. The third contribution (dataset ownership verification with theoretical guarantees) examined 5 candidates with 2 refutable matches, pointing to more substantial related work in this specific application domain. These statistics reflect a constrained search scope rather than comprehensive field coverage.

Given the limited search scale (25 candidates from semantic retrieval), the work appears to occupy a moderately novel position within a sparse taxonomy leaf. The indirect poisoning mechanism and secret sequence embedding show partial overlap with prior methods, but the integration of ownership verification with theoretical certification may offer incremental contributions. The analysis does not capture exhaustive prior art, leaving open questions about broader field context beyond top-K semantic matches.

Taxonomy

Core-task Taxonomy Papers: 28
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Papers: 4

Research Landscape Overview

Core task: Backdooring language models through indirect data poisoning at pre-training.

The field of backdoor attacks on language models has evolved into a rich taxonomy spanning multiple attack surfaces and defense strategies. At the highest level, the taxonomy distinguishes between Pre-Training Phase Backdoor Attacks, which manipulate models during their initial large-scale training, and Post-Pre-Training Phase Backdoor Attacks, which target fine-tuning or instruction-tuning stages. Specialized Architecture and Modality Backdoor Attacks address vulnerabilities in multimodal systems and novel architectures, while Advanced Backdoor Attack Techniques and Stealthiness explore methods that evade detection through semantic manipulation or clean-label poisoning. Finally, Backdoor Defense and Cross-Domain Analysis encompasses detection mechanisms and broader security evaluations.

Representative works illustrate these divisions: Instructions as Backdoors[1] and Privacy Backdoors[2] exemplify post-training vulnerabilities, while TrojanRAG[11] and ToxicTextCLIP[16] highlight specialized multimodal threats.

Within the pre-training attack landscape, a particularly active line of work focuses on indirect data poisoning methods that subtly corrupt training corpora without obvious trigger patterns. Winter Soldier[0] sits squarely in this branch, emphasizing stealthy insertion of poisoned samples during the foundational pre-training phase. This contrasts with more overt approaches like Layerwise Weight Poisoning[12] or direct fine-tuning attacks such as Virtual Prompt Injection[3]. Closely related works include Concealed Data Poisoning[7] and Clean Label Backdoors[8], which similarly prioritize stealth by avoiding detectable anomalies in training data.
The central tension across these methods involves balancing attack effectiveness with imperceptibility: while Stealthy Semantic Manipulation[5] and Constant Poison Samples[6] push toward minimal-footprint attacks, Winter Soldier[0] explores how indirect poisoning at scale can compromise models before any downstream adaptation occurs, raising fundamental questions about trust in pre-trained foundations.

Claimed Contributions

Indirect data poisoning for language models at pre-training

The authors introduce a method for poisoning language model training data such that models learn secret sequences (prompt-response pairs) that never appear in the training corpus. This enables dataset owners to verify if their data was used for training by checking if models respond to secret prompts with secret responses.

10 retrieved papers
Can Refute
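The verification idea behind this contribution can be sketched as a toy top-l membership check. This is not the paper's implementation: `top_l_hits` and `model_top_tokens` are hypothetical stand-ins, and a real verifier would aggregate the per-token matches into a statistical test rather than just counting them.

```python
def top_l_hits(model_top_tokens, secret_prompt, secret_response, l=10):
    """Count how many secret-response tokens the suspect model ranks
    inside its top-l next-token predictions after the secret prompt.

    `model_top_tokens(prefix)` is a hypothetical stand-in for querying
    the model's ranked next-token predictions. A model poisoned with
    the secret should score far more hits than chance would allow.
    """
    prefix = list(secret_prompt)
    hits = 0
    for token in secret_response:
        if token in model_top_tokens(prefix)[:l]:
            hits += 1
        prefix.append(token)  # teacher-force the secret continuation
    return hits, len(secret_response)
```

Since the secret pair never appears verbatim in the training corpus, a clean model has no reason to rank these tokens highly, which is what makes the hit count informative.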
Gradient-based optimization prompt-tuning for crafting poisons

The authors adapt gradient-based prompt-tuning techniques from adversarial attacks to optimize poisoned samples. They use the Gumbel-Softmax reparametrization trick to make discrete token sequences differentiable and optimize a gradient-matching objective to create effective poisons.

10 retrieved papers
Can Refute
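As a rough illustration of the two ingredients named above, here is a minimal pure-Python sketch of a Gumbel-Softmax relaxation and a cosine-style gradient-matching loss. It is not the paper's code: the gradients are toy vectors standing in for real model gradients, and the function names are assumptions for illustration only.

```python
import math
import random

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Differentiable relaxation of sampling a discrete token.

    Adds Gumbel(0, 1) noise to each logit and applies a
    temperature-scaled softmax; as tau -> 0 the output approaches a
    one-hot token choice, while staying differentiable for tau > 0.
    """
    rng = rng or random.Random(0)
    noisy = [l - math.log(-math.log(rng.random())) for l in logits]
    scaled = [n / tau for n in noisy]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def gradient_matching_loss(g_poison, g_target):
    """Cosine-distance objective: 0 when the gradient induced by the
    poison points exactly along the gradient of the target (secret)
    sequence, 1 when the two are orthogonal."""
    dot = sum(a * b for a, b in zip(g_poison, g_target))
    norm_p = math.sqrt(sum(a * a for a in g_poison))
    norm_t = math.sqrt(sum(b * b for b in g_target))
    return 1.0 - dot / (norm_p * norm_t)
```

In this style of attack, the relaxation lets gradients flow through the discrete poison tokens, and the matching loss steers those tokens so that training on the poison mimics training on the (absent) secret.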
Practical dataset ownership verification with theoretical guarantees

The authors develop a dataset ownership verification mechanism that only requires access to a model's top-l token predictions rather than full logits. They provide theoretically certifiable false detection rates using a binomial test, enabling verification with extremely high confidence.

5 retrieved papers
Can Refute
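The binomial-test certification can be sketched as follows. The per-token null rate q = l / |V| and all concrete numbers below (response length, hit count, vocabulary size) are illustrative assumptions, not values taken from the paper.

```python
from math import comb

def binomial_pvalue(n, k, q):
    """Exact tail probability P[X >= k] for X ~ Binomial(n, q):
    the chance that an innocent model matches the secret token in
    its top-l predictions at least k times out of n purely by luck."""
    return sum(comb(n, i) * q ** i * (1 - q) ** (n - i)
               for i in range(k, n + 1))

# Illustrative numbers (assumptions): a 64-token secret response,
# 60 observed top-l hits, and a null hit rate q = l / |V| with
# l = 10 and a 32k-token vocabulary.
p_value = binomial_pvalue(64, 60, 10 / 32000)
```

Because q is tiny under the null hypothesis, even a moderate number of hits drives the p-value far below any practical false-detection threshold, which is what makes the scheme's detection rate certifiable rather than merely empirical.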

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Indirect data poisoning for language models at pre-training

The authors introduce a method for poisoning language model training data such that models learn secret sequences (prompt-response pairs) that never appear in the training corpus. This enables dataset owners to verify if their data was used for training by checking if models respond to secret prompts with secret responses.

Contribution

Gradient-based optimization prompt-tuning for crafting poisons

The authors adapt gradient-based prompt-tuning techniques from adversarial attacks to optimize poisoned samples. They use the Gumbel-Softmax reparametrization trick to make discrete token sequences differentiable and optimize a gradient-matching objective to create effective poisons.

Contribution

Practical dataset ownership verification with theoretical guarantees

The authors develop a dataset ownership verification mechanism that only requires access to a model's top-l token predictions rather than full logits. They provide theoretically certifiable false detection rates using a binomial test, enabling verification with extremely high confidence.