Invisible Safety Threat: Malicious Finetuning for LLM via Steganography

ICLR 2026 Conference SubmissionAnonymous Authors

OpenReview Score: 6.0 Download Report PDF

LLMfinetuningsafetysteganography

Understanding and addressing potential safety alignment risks in large language models (LLMs) is critical for ensuring their safe and trustworthy deployment. In this paper, we highlight an insidious safety threat: a compromised LLM can maintain a facade of proper safety alignment while covertly generating harmful content. To achieve this, we finetune the model to understand and apply a steganographic technique. At inference time, we input a prompt that contains a steganographically embedded malicious target question along with a plaintext cover question. The model, in turn, produces a target response similarly embedded within a benign-looking cover response. In this process, human observers only see the model being prompted with a cover question and generating a corresponding cover response, while the malicious content is hidden from view. We demonstrate this invisible safety threat on GPT-4.1 despite the OpenAI fine-tuning API’s safeguards. The finetuned model produces steganographic malicious outputs in response to hidden malicious prompts, while the user interface displays only a fully benign cover interaction. We also replicate the attack on two open-source models, Phi-4 and Mistral-Small-24B-Base-2501, confirming the generality of our method. We quantitatively evaluate our method on the AdvBench dataset, using Llama-Guard-3-8B for content safety classification. Across all three models, all stegotexts containing malicious content are incorrectly classified as safe.

Abstract:

Disclaimer

This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.

NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.

If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a malicious finetuning method that uses steganography to embed hidden harmful behaviors in LLMs while maintaining a benign facade. It resides in the 'Malicious Finetuning with Steganography' leaf, which contains only two papers total, indicating a relatively sparse research direction within the broader taxonomy of ten papers. This positioning suggests the work addresses an emerging threat vector that has received limited prior attention compared to more crowded areas like jailbreak attacks or multi-agent collusion.

The taxonomy reveals neighboring research directions including jailbreak attacks via steganography (three papers across two sub-leaves), multi-agent steganographic collusion (two papers), and steganographic backdoor attacks (one paper). The paper's approach differs from jailbreak methods by modifying model weights through finetuning rather than crafting adversarial prompts, and diverges from multi-agent collusion by focusing on single-model deception. The taxonomy's scope notes clarify that finetuning-based attacks are distinct from prompt-level jailbreaks and agent coordination schemes, positioning this work at the intersection of model modification and covert communication.

Among twenty-nine candidates examined, the contribution-level analysis shows varied novelty signals. The core malicious finetuning method examined ten candidates with zero refutations, suggesting limited direct prior work on this specific attack vector. The invisible safety threat exposure similarly examined nine candidates without refutation. However, the validation across multiple architectures examined ten candidates and found one refutable match, indicating some overlap in demonstrating cross-model generalization. The limited search scope means these findings reflect top-K semantic matches rather than exhaustive coverage of the literature.

Based on the available signals from this limited search, the work appears to occupy a relatively novel position within an emerging research area. The sparse taxonomy leaf and low refutation rates suggest the specific combination of steganographic finetuning for covert malicious generation has received minimal prior exploration. However, the analysis covers only twenty-nine candidates from semantic search, leaving open the possibility of relevant work outside this scope.

Taxonomy

Core-task Taxonomy Papers

Claimed Contributions

Contribution Candidate Papers Compared

Refutable Paper

Research Landscape Overview

Core task: Covert malicious content generation in LLMs via steganographic finetuning. This emerging field explores how adversaries can embed hidden malicious behaviors into large language models through carefully designed finetuning procedures that exploit steganographic principles. The taxonomy reveals two main branches: Steganographic Attack Methods, which focuses on adversarial techniques for embedding covert triggers and malicious payloads into model weights or outputs, and Steganographic Encoding Techniques for LLMs, which examines the underlying mechanisms for hiding information within model parameters or generated text. Works in the attack methods branch, such as TrojanStego[6] and Steganographic Backdoor Attacks[7], demonstrate how finetuning can introduce hidden triggers that activate harmful behaviors while maintaining benign surface performance. Meanwhile, the encoding techniques branch includes studies like Generative Text Steganography[5] and Black-box Steganography[3], which develop methods for concealing information in model outputs or internal representations without direct access to model internals. Recent work has intensified around several contrasting themes: some studies explore multi-agent scenarios where models coordinate deceptive behaviors (Secret Collusion Agents[2], Multi-Agent Deception[4]), while others focus on single-model jailbreak techniques using steganographic triggers (Stealthy Jailbreak Steganography[1]). A key tension emerges between white-box attacks requiring model access versus black-box methods that operate through input-output manipulation alone. Malicious Finetuning Steganography[0] sits squarely within the attack methods branch, closely aligned with TrojanStego[6] in its emphasis on embedding covert malicious capabilities through the finetuning process itself. Compared to Stealthy Jailbreak Steganography[1], which focuses on prompt-level triggers, this work examines deeper model-level modifications that persist across diverse inputs, representing a more fundamental threat to model integrity and trustworthiness.

Claimed Contributions

Malicious finetuning method via steganography for LLMs

10 retrieved papers

The authors introduce a finetuning approach that teaches LLMs to use invisible-character steganography, enabling models to hide malicious content within benign-appearing text. The method uses a two-track multitask finetuning scheme pairing steganographic encoding with auxiliary base-4 encoding to facilitate learning.

10 retrieved papers

Exposure of invisible safety threat vulnerability

9 retrieved papers

The work reveals that finetuned models can produce harmful outputs that appear safe to both human observers and automated safety systems like Llama-Guard, bypassing content moderation and safety filters while maintaining outward alignment.

9 retrieved papers

Validation across multiple LLM architectures

Can Refute

10 retrieved papers

The authors demonstrate their attack successfully bypasses safety mechanisms on both proprietary (GPT-4.1) and open-source models (Phi-4, Mistral-24B-Base), showing the generality of the threat across different model types and safety infrastructures.

10 retrieved papers

Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

[6] TrojanStego: Your Language Model Can Secretly Be A Steganographic Privacy Leaking Agent PDF

Meier, Dominik, Wahle, Jan Philip, Dominik Meier, RÃ¶ttger, Paul, Jan Philip Wahle, Ruas, Terry, Paul RÃ¶ttger, Gipp, Bela, Terry Ruas, Bela Gipp (2025)

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Malicious finetuning method via steganography for LLMs

[3] Black-box Steganography for Large Language Models PDF

Cannot Refute

[5] Generative text steganography with large language model PDF

Cannot Refute

[29] Robust Steganography from Large Language Models PDF

Cannot Refute

[30] Undetectable Steganography for Language Models PDF

Cannot Refute

[31] Personalized Author Obfuscation with Large Language Models PDF

Cannot Refute

[32] Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models PDF

Cannot Refute

[33] A Character Based Steganography Using Masked Language Modeling PDF

Cannot Refute

[34] Privacy risks of general-purpose language models PDF

Cannot Refute

[35] Enhancing Privacy While Preserving Context in Text Transformations by Large Language Models PDF

Cannot Refute

[36] Early Signs of Steganographic Capabilities in Frontier LLMs PDF

Cannot Refute

Contribution

Exposure of invisible safety threat vulnerability

[15] Jailbreak Attacks and Defenses Against Large Language Models: A Survey PDF

Cannot Refute

[16] Safeguarding large language models: A survey PDF

Cannot Refute

[21] Research on Content Detection Algorithms and Bypass Mechanisms for Large Language Models PDF

Cannot Refute

[22] "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models PDF

Cannot Refute

[23] Guardians and Offenders: A Survey on Harmful Content Generation and Safety Mitigation of LLM PDF

Cannot Refute

[24] Jailbroken: How Does LLM Safety Training Fail? PDF

Cannot Refute

[26] CySecBench: Generative AI-based CyberSecurity-focused Prompt Dataset for Benchmarking Large Language Models PDF

Cannot Refute

[27] Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense PDF

Cannot Refute

[28] GenBreak: Red Teaming Text-to-Image Generators Using Large Language Models PDF

Cannot Refute

Contribution

Validation across multiple LLM architectures

[12] Universal and Transferable Adversarial Attacks on Aligned Language Models PDF

Can Refute

[11] Security and Privacy Challenges of Large Language Models: A Survey PDF

Cannot Refute

[13] Jailbreaking Attack against Multimodal Large Language Model PDF

Cannot Refute

[14] Survey of adversarial robustness in multimodal large language models PDF

Cannot Refute

[15] Jailbreak Attacks and Defenses Against Large Language Models: A Survey PDF

Cannot Refute

[16] Safeguarding large language models: A survey PDF

Cannot Refute

[17] Sampling-aware adversarial attacks against large language models PDF

Cannot Refute

[18] Prompt Injection attack against LLM-integrated Applications PDF

Cannot Refute

[19] Special-Character Adversarial Attacks on Open-Source Language Model PDF

Cannot Refute

[20] PLeak: Prompt Leaking Attacks against Large Language Model Applications PDF

Cannot Refute

Invisible Safety Threat: Malicious Finetuning for LLM via Steganography

Overview

Overall Novelty Assessment

Taxonomy

Research Landscape Overview

Claimed Contributions

Core Task Comparisons

[6] TrojanStego: Your Language Model Can Secretly Be A Steganographic Privacy Leaking Agent PDF

Contribution Analysis

Malicious finetuning method via steganography for LLMs

[3] Black-box Steganography for Large Language Models PDF

[5] Generative text steganography with large language model PDF

[29] Robust Steganography from Large Language Models PDF

[30] Undetectable Steganography for Language Models PDF

[31] Personalized Author Obfuscation with Large Language Models PDF

[32] Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models PDF

[33] A Character Based Steganography Using Masked Language Modeling PDF

[34] Privacy risks of general-purpose language models PDF

[35] Enhancing Privacy While Preserving Context in Text Transformations by Large Language Models PDF

[36] Early Signs of Steganographic Capabilities in Frontier LLMs PDF

Exposure of invisible safety threat vulnerability

[15] Jailbreak Attacks and Defenses Against Large Language Models: A Survey PDF

[16] Safeguarding large language models: A survey PDF

[21] Research on Content Detection Algorithms and Bypass Mechanisms for Large Language Models PDF

[22] "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models PDF

[23] Guardians and Offenders: A Survey on Harmful Content Generation and Safety Mitigation of LLM PDF

[24] Jailbroken: How Does LLM Safety Training Fail? PDF

[26] CySecBench: Generative AI-based CyberSecurity-focused Prompt Dataset for Benchmarking Large Language Models PDF

[27] Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense PDF

[28] GenBreak: Red Teaming Text-to-Image Generators Using Large Language Models PDF

Validation across multiple LLM architectures

[12] Universal and Transferable Adversarial Attacks on Aligned Language Models PDF

[11] Security and Privacy Challenges of Large Language Models: A Survey PDF

[13] Jailbreaking Attack against Multimodal Large Language Model PDF

[14] Survey of adversarial robustness in multimodal large language models PDF

[15] Jailbreak Attacks and Defenses Against Large Language Models: A Survey PDF

[16] Safeguarding large language models: A survey PDF

[17] Sampling-aware adversarial attacks against large language models PDF

[18] Prompt Injection attack against LLM-integrated Applications PDF

[19] Special-Character Adversarial Attacks on Open-Source Language Model PDF

[20] PLeak: Prompt Leaking Attacks against Large Language Model Applications PDF

Table of Contents