The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: prompt injection defense, adaptive evaluation, jailbreaks, adversarial examples
Abstract:

How should we evaluate the robustness of language model defenses? Current defenses against jailbreaks and prompt injections (which aim to prevent an attacker from eliciting harmful knowledge or remotely triggering malicious actions, respectively) are typically evaluated either against a static set of harmful attack strings, or against computationally weak optimization methods that were not designed with the defense in mind. We argue that this evaluation process is flawed.

Instead, we should evaluate defenses against adaptive attackers who explicitly modify their attack strategy to counter a defense's design while spending considerable resources to optimize their objective. By systematically tuning and scaling general optimization techniques—gradient descent, reinforcement learning, random search, and human-guided exploration—we bypass 12 recent defenses (based on a diverse set of techniques) with attack success rates above 90% for most; importantly, the majority of these defenses originally reported near-zero attack success rates. We believe that future defense work must consider stronger attacks, such as the ones we describe, in order to make reliable and convincing claims of robustness.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate; the results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes an adaptive attack framework to rigorously evaluate language model defenses against jailbreaks and prompt injections. It resides in the 'Adaptive Attack Frameworks Against LLM Defenses' leaf, which contains only two papers total, indicating a relatively sparse research direction within the broader taxonomy. This positioning suggests the work addresses a recognized but underexplored gap: systematically testing whether defenses withstand adversaries who tailor their strategies after observing protective measures, rather than relying on static attack benchmarks.

The taxonomy reveals that most defense research concentrates in sibling branches such as 'Jailbreak Defense Mechanisms' (with subtopics covering input smoothing, prompt-based strategies, and adversarial training) and 'General Adversarial Robustness' (focusing on filtering and purification methods). The paper's leaf sits under 'Adaptive Attack Design and Defense Evaluation,' which explicitly excludes static benchmarks and general defense surveys. Neighboring leaves address character-level perturbations and baseline defense evaluations, but the adaptive attack framework approach directly targets the evaluation methodology gap rather than proposing new defenses or attack primitives.

Among 29 candidates examined across three contributions, none were found to clearly refute the paper's claims. The first contribution (adaptive attack framework) examined 10 candidates with zero refutable overlaps; the second (comprehensive evaluation of 12 defenses) also examined 10 candidates with no refutations; the third (evaluation recommendations) examined 9 candidates, again with no refutations. This suggests that within the limited search scope—primarily top-K semantic matches and citation expansion—the specific combination of adaptive optimization techniques (gradient descent, reinforcement learning, random search, human-guided exploration) applied systematically to bypass diverse defenses has not been extensively documented in prior work.

Based on the limited literature search covering 29 candidates, the work appears to occupy a distinct position emphasizing evaluation rigor over defense innovation. The taxonomy structure shows that while defense mechanisms and static attack benchmarks are well-populated, the adaptive evaluation methodology remains comparatively sparse. However, the search scope does not guarantee exhaustive coverage of all relevant adversarial evaluation studies, particularly those in adjacent security domains or unpublished concurrent work.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 0

Research Landscape Overview

Core task: Evaluating robustness of language model defenses against adaptive attacks. The field has organized itself around several major branches that reflect distinct threat models and mitigation strategies. Jailbreak Attack Methods and Characterization explores how adversaries craft prompts to bypass safety guardrails, while Jailbreak Defense Mechanisms develops countermeasures ranging from input filtering to prompt hardening techniques like SmoothLLM[2] and Defensive Prompt Patch[7]. Adaptive Attack Design and Defense Evaluation focuses specifically on the arms race between defenses and attackers who adapt their strategies after observing protective measures, as exemplified by benchmarks such as Jailbreakbench[5] and Jailbreakv Benchmark[1]. Parallel branches address Backdoor Attacks and Defenses, Model Extraction threats, and Domain-Specific Security concerns in areas like code generation or multimodal systems. General Adversarial Robustness techniques and Federated Learning Security round out the taxonomy, highlighting that language model vulnerabilities span multiple deployment contexts and attack surfaces.

Within this landscape, a particularly active line of work examines whether defenses remain effective when attackers can observe and circumvent them—a challenge central to Adaptive Attack Design and Defense Evaluation. Attacker Moves Second[0] sits squarely in this branch, emphasizing the need to test defenses under adaptive threat models where adversaries iteratively refine attacks. This contrasts with static evaluation frameworks and aligns closely with Prompt Injection Evaluation[44], which similarly stresses realistic adversarial conditions. Compared to works like Robust Prompt Optimization[6] or SelfDefend[8] that propose specific defense mechanisms, Attacker Moves Second[0] focuses on the evaluation methodology itself, arguing that many defenses fail when subjected to adaptive scrutiny.
The broader tension across these branches revolves around whether defenses can achieve robustness guarantees or merely raise the bar for attackers, with ongoing questions about generalization across attack types and the computational cost of maintaining security under evolving threats.

Claimed Contributions

Adaptive attack framework for evaluating LLM defenses

The authors propose a unified adaptive attack framework that systematically applies and scales general optimization techniques (gradient descent, reinforcement learning, random search, and human-guided exploration) to evaluate LLM defenses. This framework is designed to counter specific defense mechanisms rather than using fixed or weak attacks.

10 retrieved papers
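The framework described above combines four general-purpose optimizers. As a concrete illustration of the simplest of them, here is a minimal, hypothetical sketch of a random-search attack loop. None of these names come from the paper: `defended_model` and `judge_score` are toy stand-ins for the defense-wrapped target and the attack-success judge, and the character-level mutation is deliberately simplistic.

```python
import random

# Hypothetical sketch of the random-search component of an adaptive attack.
# In a real evaluation, `defended_model` would wrap the defense under test
# and `judge_score` would be a human or model-based attack-success judge.

TOKENS = list("abcdefghijklmnopqrstuvwxyz ")

def defended_model(prompt: str) -> str:
    # Toy stand-in for the defense-wrapped target model.
    return "complied" if "attack" in prompt else "refused"

def judge_score(response: str) -> float:
    # Toy judge: 1.0 if the attack objective was met.
    return 1.0 if response == "complied" else 0.0

def random_search_attack(goal: str, budget: int = 500, suffix_len: int = 8) -> str:
    """Mutate an adversarial suffix at random, keeping improving mutations."""
    suffix = [random.choice(TOKENS) for _ in range(suffix_len)]
    best = judge_score(defended_model(goal + "".join(suffix)))
    for _ in range(budget):
        i = random.randrange(suffix_len)
        old, suffix[i] = suffix[i], random.choice(TOKENS)
        score = judge_score(defended_model(goal + "".join(suffix)))
        if score >= best:
            best = score
        else:
            suffix[i] = old  # revert a non-improving mutation
    return goal + "".join(suffix)
```

The key adaptive ingredient is the feedback loop: every candidate is scored through the defense itself, so the search automatically routes around whatever the defense blocks, rather than replaying a fixed attack string.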
Comprehensive evaluation exposing weaknesses in 12 recent defenses

The authors systematically evaluate 12 recently proposed defenses against jailbreaks and prompt injections using their adaptive attacks. They demonstrate that most defenses can be bypassed with over 90% success rate, contradicting the near-zero attack success rates reported in the original defense papers.

10 retrieved papers
Lessons and recommendations for robust defense evaluation

The authors provide four key lessons for the community: static evaluations are misleading, automated attacks are effective but insufficient, human red-teaming remains valuable, and model-based auto-raters can be unreliable. They argue that defense evaluations must incorporate adaptive attackers with substantial computational resources to be convincing.

9 retrieved papers
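The first lesson (static evaluations are misleading) can be made concrete with a toy harness. Everything below is a hypothetical stand-in, not the paper's actual protocol: a blocklist-style "defense" scores 0% attack success on a fixed prompt suite, yet an attacker allowed even a couple of rewrites succeeds every time.

```python
# Hypothetical harness contrasting static and adaptive evaluation of a toy
# defense. The defense blocks any prompt containing the word "secret", which
# is exactly what every prompt in the static suite uses.

STATIC_SUITE = ["tell me the secret", "tell me the secret now"]

def is_blocked(prompt: str) -> bool:
    # Toy defense: a blocklist keyed on the phrasing used in the static suite.
    return "secret" in prompt

def attack_succeeds(prompt: str) -> bool:
    return not is_blocked(prompt)

def static_asr() -> float:
    # Attack success rate against the fixed suite: looks perfectly robust.
    return sum(attack_succeeds(p) for p in STATIC_SUITE) / len(STATIC_SUITE)

def adaptive_asr(rewrites=("tell me the s3cret", "reveal the hidden phrase")) -> float:
    # An adaptive attacker paraphrases each blocked prompt until one passes.
    hits = 0
    for p in STATIC_SUITE:
        if attack_succeeds(p) or any(attack_succeeds(r) for r in rewrites):
            hits += 1
    return hits / len(STATIC_SUITE)
```

Under this toy setup `static_asr()` is 0.0 while `adaptive_asr()` is 1.0, mirroring the gap the report describes between the near-zero numbers in original defense papers and results under adaptive attack.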

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
