Soft Instruction De-escalation Defense

ICLR 2026 Conference SubmissionAnonymous Authors

OpenReview Score: 6.0 Download Report PDF

Prompt Injectionsdefensetool-use

Large Language Models (LLMs) are increasingly deployed in agentic systems that interact with an external environment; this makes them susceptible to prompt injections when dealing with untrusted data. To overcome this limitation, we propose SIC (Soft Instruction Control) - a simple yet effective multi-stage sanitization pipeline designed for tool-augmented LLM agents. Our approach begins by unconditionally rewriting incoming data to neutralize any potential instructions by masking, rephrasing, or removing them. To detect attacks against the rewriter itself, we inject known canary instructions before this process; if these instructions survive, we conclude the rewrite was compromised. To account for the imprecision of LLMs, we apply multiple independent rewrite passes. Finally, a detection module inspects the full text and smaller chunks of the output for any residual instruction-like content. If imperative instructions remain, the agent halts to ensure security. This defense-in-depth strategy, combining unconditional rewriting, canary checking, and chunk-based detection, makes successful attacks significantly more difficult than bypassing a single detection model.

Abstract:

Disclaimer

This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.

NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.

If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes SIC, a multi-stage sanitization pipeline combining unconditional rewriting, canary-based integrity checks, and chunk-level detection to defend tool-augmented LLM agents against prompt injection. It resides in the Input Sanitization and Rewriting Defenses leaf, which contains five papers total. This leaf represents a moderately active research direction within the broader Defense Approaches branch, focusing specifically on detecting or transforming malicious instructions before they reach the agent's reasoning process.

The taxonomy reveals neighboring defense strategies that tackle the same threat from different angles. Architectural and System-Level Defenses (six papers) enforce security through control-flow separation rather than content transformation, while Runtime Monitoring and Trace Analysis (four papers) inspects agent execution patterns post-hoc. Guardrail Systems and Multi-Layer Defenses (three papers) also employ defense-in-depth but typically combine detection stages rather than iterative rewriting passes. SIC's unconditional rewriting approach diverges from these by assuming all external data is potentially hostile and applying transformations before any detection step.

Among twenty-seven candidates examined, no prior work clearly refutes any of the three core contributions. The SIC defense mechanism itself was compared against seven candidates with no overlapping claims. The iterative multi-pass sanitization approach examined ten candidates, none providing substantial prior work on repeated independent rewrites with canary validation. The soft relaxation of formal control-flow decomposition also examined ten candidates without finding refutable overlap. This suggests that within the limited search scope, the specific combination of unconditional rewriting, canary checks, and chunk-based detection appears distinct from existing single-stage or detection-first methods.

Based on top-27 semantic matches, the work appears to occupy a relatively unexplored niche within input sanitization defenses. The analysis does not cover the full spectrum of prompt injection literature, and adaptive attack evaluations or comparisons against recent architectural defenses may reveal closer prior work. The limited search scope means we cannot definitively assess novelty against all related efforts, particularly those published concurrently or in adjacent security domains.

Taxonomy

Core-task Taxonomy Papers

Claimed Contributions

Contribution Candidate Papers Compared

Refutable Paper

Research Landscape Overview

Core task: Defending tool-augmented LLM agents against prompt injection attacks. The field divides into two main branches that together frame the security landscape for agentic systems. Attack Characterization and Benchmarking encompasses efforts to systematically understand and measure vulnerabilities, including works like AgentDojo[6] and InjecAgent[19] that create testbeds for evaluating how adversarial prompts can hijack agent behavior when tools interact with untrusted data sources. Defense Approaches spans a diverse set of mitigation strategies, ranging from input sanitization techniques that attempt to cleanse or rewrite potentially malicious content before it reaches the agent, to architectural safeguards such as validation layers, multi-agent oversight mechanisms like AegisAgent[36], and prompt-hardening methods exemplified by PromptArmor[13]. These branches are tightly coupled: robust benchmarks inform which defenses prove effective under adaptive attacks, while novel defenses motivate the creation of stronger attack scenarios. Within Defense Approaches, a particularly active line explores input sanitization and rewriting defenses, where methods seek to neutralize injected instructions by transforming or filtering external text. Soft Instruction De-escalation Defense[0] sits squarely in this cluster, proposing a technique that softens the imperative tone of potentially adversarial commands to reduce their influence on the agent's decision-making. This contrasts with stricter filtering approaches like RTBAS[2], which employs rule-based sanitization, and with cryptographic or structural defenses such as Defeating Prompt Injections by[3] that rely on verifiable delimiters rather than semantic rewriting. A central tension across these works is the trade-off between preserving legitimate external content and blocking sophisticated injections, especially as adaptive attacks like those in Adaptive Attacks Break Defenses[17] demonstrate that static sanitization rules can be circumvented. Soft Instruction De-escalation Defense[0] addresses this by operating at the linguistic level, aiming for a lighter-touch intervention that maintains context while reducing attack potency.

Claimed Contributions

Soft Instruction Control (SIC) defense mechanism

7 retrieved papers

The authors introduce SIC, a modular preprocessing defense that iteratively rewrites untrusted data to neutralize imperative instructions, injects dummy instructions to detect rewriting failures, and uses multi-granularity classification to verify sanitization before passing data to the agent.

7 retrieved papers

Iterative multi-pass sanitization approach

10 retrieved papers

The authors propose an iterative loop that applies multiple independent rewrites and detection passes, acknowledging that individual sanitization steps may fail but leveraging iteration to catch and correct missed injections in subsequent rounds.

10 retrieved papers

Soft relaxation of formal control-flow decomposition

10 retrieved papers

The authors relax the formal control and data flow decomposition from CaMeL into a soft method that explicitly identifies and neutralizes all instructions in untrusted data streams, removing their imperative nature without requiring strict formal separation.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

[2] RTBAS: Defending LLM Agents Against Prompt Injection and Privacy Leakage PDF

Chen Siyuan, Peter Yong Zhong, Wang, Ruiqi, Siyuan Chen, Ruiqi Wang, Titzer, Ben L., McKenna McCall, Miller, Heather, Ben L. Titzer, Gibbons, Phillip B., Heather Miller, Phillip B. Gibbons (2025)

[13] PromptArmor: Simple yet Effective Prompt Injection Defenses PDF

Shi, Tianneng, Zhu Kaijie, Tianneng Shi, Wang Zhun, Kaijie Zhu, Jia Yuqi, Zhun Wang, Yuqi Jia, Liang Weida, Will Cai, Wang Haonan, Weida Liang, Haonan Wang, Hend Alzahrani, Kawaguchi Kenji, Joshua Lu, Alomair, Basel, Kenji Kawaguchi, Zhao Xuan-dong, Basel Alomair, Wang, William Yang, Xuandong Zhao, Gong, Neil, William Yang Wang, Guo Wen-bo, N. Gong, Song, Dawn, Wenbo Guo, D. Song (2025)

[23] Defending Against Prompt Injection With a Few DefensiveTokens PDF

Chen Sizhe, Wang, Yizhu, Carlini, Nicholas, Sitawarin, Chawin, Wagner, David (2025)

[41] Defending Against Prompt Injection with DataFilter PDF

Wang, Yizhu, Chen Sizhe, Alomair, Basel, Wagner, David (2025)

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Soft Instruction Control (SIC) defense mechanism

[28] Context injection vulnerabilities and resource exploitation attacks in model context protocol PDF

Cannot Refute

[71] Breaking the Code: Security Assessment of AI Code Agents Through Systematic Jailbreaking Attacks PDF

Cannot Refute

[72] Z-Space: A Multi-Agent Tool Orchestration Framework for Enterprise-Grade LLM Automation PDF

Cannot Refute

[73] AprielGuard PDF

Cannot Refute

[74] BrowseSafe: Understanding and Preventing Prompt Injection Within AI Browser Agents PDF

Cannot Refute

[75] CommandSans: Securing AI Agents with Surgical Precision Prompt Sanitization PDF

Cannot Refute

[76] Trust in LLM-controlled Robotics: a Survey of Security Threats, Defenses and Challenges PDF

Cannot Refute

Contribution

Iterative multi-pass sanitization approach

[61] PromptGuard: An Orchestrated Prompting Framework for Principled Synthetic Text Generation for Vulnerable Populations using LLMs with Enhanced Safety, Fairness â¦ PDF

Cannot Refute

[62] Automated prompt engineering for semantic vulnerabilities in large language models PDF

Cannot Refute

[63] Key Security Risks in Prompt Engineering PDF

Cannot Refute

[64] ReabsNet: Detecting and Revising Adversarial Examples PDF

Cannot Refute

[65] Crafting effective prompts: enhancing ai performance through structured input design PDF

Cannot Refute

[66] Iterative Prompting with Persuasion Skills in Jailbreaking Large Language PDF

Cannot Refute

[67] Iterative Prompt Refinement for Safer Text-to-Image Generation PDF

Cannot Refute

[68] A learning-based solution for an adversarial repeated game in cyberâphysical power systems PDF

Cannot Refute

[69] The Irrational Machine: Neurosis and the Limits of Algorithmic Safety PDF

Cannot Refute

[70] CPR: Mitigating Large Language Model Hallucinations with Curative Prompt Refinement PDF

Cannot Refute

Contribution

Soft relaxation of formal control-flow decomposition

[51] Side-channel Elimination via Partial Control-flow Linearization PDF

Cannot Refute

[52] A Survey of Control Flow Graph Recovery for Binary Code PDF

Cannot Refute

[53] Formal structure and organizational communication PDF

Cannot Refute

[54] Building a portable deeply-nested implicit information flow tracking PDF

Cannot Refute

[55] Getting in control of your control flow with control-data isolation PDF

Cannot Refute

[56] Research on control flow integrity for RISC-V PDF

Cannot Refute

[57] Combining control-flow integrity and static analysis for efficient and validated data sandboxing PDF

Cannot Refute

[58] Automatic Decomposition of IoT Aware Business Processes with Data and Control Flow Distribution PDF

Cannot Refute

[59] Compositional Separation of Control Flow and Data Flow PDF

Cannot Refute

[60] Multi-View Structural Graph Summaries PDF

Cannot Refute

Soft Instruction De-escalation Defense

Overview

Overall Novelty Assessment

Taxonomy

Research Landscape Overview

Claimed Contributions

Core Task Comparisons

[2] RTBAS: Defending LLM Agents Against Prompt Injection and Privacy Leakage PDF

[13] PromptArmor: Simple yet Effective Prompt Injection Defenses PDF

[23] Defending Against Prompt Injection With a Few DefensiveTokens PDF

[41] Defending Against Prompt Injection with DataFilter PDF

Contribution Analysis

Soft Instruction Control (SIC) defense mechanism

[28] Context injection vulnerabilities and resource exploitation attacks in model context protocol PDF

[71] Breaking the Code: Security Assessment of AI Code Agents Through Systematic Jailbreaking Attacks PDF

[72] Z-Space: A Multi-Agent Tool Orchestration Framework for Enterprise-Grade LLM Automation PDF

[73] AprielGuard PDF

[74] BrowseSafe: Understanding and Preventing Prompt Injection Within AI Browser Agents PDF

[75] CommandSans: Securing AI Agents with Surgical Precision Prompt Sanitization PDF

[76] Trust in LLM-controlled Robotics: a Survey of Security Threats, Defenses and Challenges PDF

Iterative multi-pass sanitization approach

[61] PromptGuard: An Orchestrated Prompting Framework for Principled Synthetic Text Generation for Vulnerable Populations using LLMs with Enhanced Safety, Fairness â¦ PDF

[62] Automated prompt engineering for semantic vulnerabilities in large language models PDF

[63] Key Security Risks in Prompt Engineering PDF

[64] ReabsNet: Detecting and Revising Adversarial Examples PDF

[65] Crafting effective prompts: enhancing ai performance through structured input design PDF

[66] Iterative Prompting with Persuasion Skills in Jailbreaking Large Language PDF

[67] Iterative Prompt Refinement for Safer Text-to-Image Generation PDF

[68] A learning-based solution for an adversarial repeated game in cyberâphysical power systems PDF

[69] The Irrational Machine: Neurosis and the Limits of Algorithmic Safety PDF

[70] CPR: Mitigating Large Language Model Hallucinations with Curative Prompt Refinement PDF

Soft relaxation of formal control-flow decomposition

[51] Side-channel Elimination via Partial Control-flow Linearization PDF

[52] A Survey of Control Flow Graph Recovery for Binary Code PDF

[53] Formal structure and organizational communication PDF

[54] Building a portable deeply-nested implicit information flow tracking PDF

[55] Getting in control of your control flow with control-data isolation PDF

[56] Research on control flow integrity for RISC-V PDF

[57] Combining control-flow integrity and static analysis for efficient and validated data sandboxing PDF

[58] Automatic Decomposition of IoT Aware Business Processes with Data and Control Flow Distribution PDF

[59] Compositional Separation of Control Flow and Data Flow PDF

[60] Multi-View Structural Graph Summaries PDF

Table of Contents

[61] PromptGuard: An Orchestrated Prompting Framework for Principled Synthetic Text Generation for Vulnerable Populations using LLMs with Enhanced Safety, Fairness â¦ PDF

[68] A learning-based solution for an adversarial repeated game in cyberâphysical power systems PDF