Soft Instruction De-escalation Defense

ICLR 2026 Conference SubmissionAnonymous Authors
Prompt Injectionsdefensetool-use
Abstract:

Large Language Models (LLMs) are increasingly deployed in agentic systems that interact with an external environment; this makes them susceptible to prompt injections when dealing with untrusted data. To overcome this limitation, we propose SIC (Soft Instruction Control) - a simple yet effective multi-stage sanitization pipeline designed for tool-augmented LLM agents. Our approach begins by unconditionally rewriting incoming data to neutralize any potential instructions by masking, rephrasing, or removing them. To detect attacks against the rewriter itself, we inject known canary instructions before this process; if these instructions survive, we conclude the rewrite was compromised. To account for the imprecision of LLMs, we apply multiple independent rewrite passes. Finally, a detection module inspects the full text and smaller chunks of the output for any residual instruction-like content. If imperative instructions remain, the agent halts to ensure security. This defense-in-depth strategy, combining unconditional rewriting, canary checking, and chunk-based detection, makes successful attacks significantly more difficult than bypassing a single detection model.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes SIC, a multi-stage sanitization pipeline combining unconditional rewriting, canary-based integrity checks, and chunk-level detection to defend tool-augmented LLM agents against prompt injection. It resides in the Input Sanitization and Rewriting Defenses leaf, which contains five papers total. This leaf represents a moderately active research direction within the broader Defense Approaches branch, focusing specifically on detecting or transforming malicious instructions before they reach the agent's reasoning process.

The taxonomy reveals neighboring defense strategies that tackle the same threat from different angles. Architectural and System-Level Defenses (six papers) enforce security through control-flow separation rather than content transformation, while Runtime Monitoring and Trace Analysis (four papers) inspects agent execution patterns post-hoc. Guardrail Systems and Multi-Layer Defenses (three papers) also employ defense-in-depth but typically combine detection stages rather than iterative rewriting passes. SIC's unconditional rewriting approach diverges from these by assuming all external data is potentially hostile and applying transformations before any detection step.

Among twenty-seven candidates examined, no prior work clearly refutes any of the three core contributions. The SIC defense mechanism itself was compared against seven candidates with no overlapping claims. The iterative multi-pass sanitization approach examined ten candidates, none providing substantial prior work on repeated independent rewrites with canary validation. The soft relaxation of formal control-flow decomposition also examined ten candidates without finding refutable overlap. This suggests that within the limited search scope, the specific combination of unconditional rewriting, canary checks, and chunk-based detection appears distinct from existing single-stage or detection-first methods.

Based on top-27 semantic matches, the work appears to occupy a relatively unexplored niche within input sanitization defenses. The analysis does not cover the full spectrum of prompt injection literature, and adaptive attack evaluations or comparisons against recent architectural defenses may reveal closer prior work. The limited search scope means we cannot definitively assess novelty against all related efforts, particularly those published concurrently or in adjacent security domains.

Taxonomy

Core-task Taxonomy Papers
50
3
Claimed Contributions
27
Contribution Candidate Papers Compared
0
Refutable Paper

Research Landscape Overview

Core task: Defending tool-augmented LLM agents against prompt injection attacks. The field divides into two main branches that together frame the security landscape for agentic systems. Attack Characterization and Benchmarking encompasses efforts to systematically understand and measure vulnerabilities, including works like AgentDojo[6] and InjecAgent[19] that create testbeds for evaluating how adversarial prompts can hijack agent behavior when tools interact with untrusted data sources. Defense Approaches spans a diverse set of mitigation strategies, ranging from input sanitization techniques that attempt to cleanse or rewrite potentially malicious content before it reaches the agent, to architectural safeguards such as validation layers, multi-agent oversight mechanisms like AegisAgent[36], and prompt-hardening methods exemplified by PromptArmor[13]. These branches are tightly coupled: robust benchmarks inform which defenses prove effective under adaptive attacks, while novel defenses motivate the creation of stronger attack scenarios. Within Defense Approaches, a particularly active line explores input sanitization and rewriting defenses, where methods seek to neutralize injected instructions by transforming or filtering external text. Soft Instruction De-escalation Defense[0] sits squarely in this cluster, proposing a technique that softens the imperative tone of potentially adversarial commands to reduce their influence on the agent's decision-making. This contrasts with stricter filtering approaches like RTBAS[2], which employs rule-based sanitization, and with cryptographic or structural defenses such as Defeating Prompt Injections by[3] that rely on verifiable delimiters rather than semantic rewriting. A central tension across these works is the trade-off between preserving legitimate external content and blocking sophisticated injections, especially as adaptive attacks like those in Adaptive Attacks Break Defenses[17] demonstrate that static sanitization rules can be circumvented. Soft Instruction De-escalation Defense[0] addresses this by operating at the linguistic level, aiming for a lighter-touch intervention that maintains context while reducing attack potency.

Claimed Contributions

Soft Instruction Control (SIC) defense mechanism

The authors introduce SIC, a modular preprocessing defense that iteratively rewrites untrusted data to neutralize imperative instructions, injects dummy instructions to detect rewriting failures, and uses multi-granularity classification to verify sanitization before passing data to the agent.

7 retrieved papers
Iterative multi-pass sanitization approach

The authors propose an iterative loop that applies multiple independent rewrites and detection passes, acknowledging that individual sanitization steps may fail but leveraging iteration to catch and correct missed injections in subsequent rounds.

10 retrieved papers
Soft relaxation of formal control-flow decomposition

The authors relax the formal control and data flow decomposition from CaMeL into a soft method that explicitly identifies and neutralizes all instructions in untrusted data streams, removing their imperative nature without requiring strict formal separation.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Soft Instruction Control (SIC) defense mechanism

The authors introduce SIC, a modular preprocessing defense that iteratively rewrites untrusted data to neutralize imperative instructions, injects dummy instructions to detect rewriting failures, and uses multi-granularity classification to verify sanitization before passing data to the agent.

Contribution

Iterative multi-pass sanitization approach

The authors propose an iterative loop that applies multiple independent rewrites and detection passes, acknowledging that individual sanitization steps may fail but leveraging iteration to catch and correct missed injections in subsequent rounds.

Contribution

Soft relaxation of formal control-flow decomposition

The authors relax the formal control and data flow decomposition from CaMeL into a soft method that explicitly identifies and neutralizes all instructions in untrusted data streams, removing their imperative nature without requiring strict formal separation.

Soft Instruction De-escalation Defense | Novelty Validation