The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: prompt injection defense, adaptive evaluation, jailbreaks, adversarial examples
Abstract:

How should we evaluate the robustness of language model defenses? Current defenses against jailbreaks and prompt injections (which aim to prevent an attacker from eliciting harmful knowledge or remotely triggering malicious actions, respectively) are typically evaluated either against a static set of harmful attack strings, or against computationally weak optimization methods that were not designed with the defense in mind. We argue that this evaluation process is flawed.

Instead, we should evaluate defenses against adaptive attackers who explicitly modify their attack strategy to counter a defense's design while spending considerable resources to optimize their objective. By systematically tuning and scaling general optimization techniques—gradient descent, reinforcement learning, random search, and human-guided exploration—we bypass 12 recent defenses (based on a diverse set of techniques) with attack success rates above 90% for most; importantly, the majority of these defenses originally reported near-zero attack success rates. We believe that future defense work must consider stronger attacks, such as the ones we describe, in order to make reliable and convincing claims of robustness.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate; the results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes an adaptive attack framework to rigorously evaluate language model defenses against jailbreaks and prompt injections. It resides in the 'Adaptive Attack Frameworks Against LLM Defenses' leaf, which contains only two papers total, indicating a relatively sparse research direction within the broader taxonomy. This positioning suggests the work addresses a recognized but underexplored gap: systematically testing whether defenses withstand adversaries who tailor their strategies after observing protective measures, rather than relying on static attack benchmarks.

The taxonomy reveals that most defense research concentrates in sibling branches such as 'Jailbreak Defense Mechanisms' (with subtopics covering input smoothing, prompt-based strategies, and adversarial training) and 'General Adversarial Robustness' (focusing on filtering and purification methods). The paper's leaf sits under 'Adaptive Attack Design and Defense Evaluation,' which explicitly excludes static benchmarks and general defense surveys. Neighboring leaves address character-level perturbations and baseline defense evaluations, but the adaptive attack framework approach directly targets the evaluation methodology gap rather than proposing new defenses or attack primitives.

Among 29 candidates examined across three contributions, none were found to clearly refute the paper's claims. The first contribution (adaptive attack framework) examined 10 candidates with zero refutable overlaps; the second (comprehensive evaluation of 12 defenses) also examined 10 candidates with no refutations; the third (evaluation recommendations) examined 9 candidates, again with no refutations. This suggests that within the limited search scope—primarily top-K semantic matches and citation expansion—the specific combination of adaptive optimization techniques (gradient descent, reinforcement learning, random search, human-guided exploration) applied systematically to bypass diverse defenses has not been extensively documented in prior work.

Based on the limited literature search covering 29 candidates, the work appears to occupy a distinct position emphasizing evaluation rigor over defense innovation. The taxonomy structure shows that while defense mechanisms and static attack benchmarks are well-populated, the adaptive evaluation methodology remains comparatively sparse. However, the search scope does not guarantee exhaustive coverage of all relevant adversarial evaluation studies, particularly those in adjacent security domains or unpublished concurrent work.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 0

Research Landscape Overview

Core task: Evaluating robustness of language model defenses against adaptive attacks. The field has organized itself around several major branches that reflect distinct threat models and mitigation strategies. Jailbreak Attack Methods and Characterization explores how adversaries craft prompts to bypass safety guardrails, while Jailbreak Defense Mechanisms develops countermeasures ranging from input filtering to prompt hardening techniques like SmoothLLM[2] and Defensive Prompt Patch[7]. Adaptive Attack Design and Defense Evaluation focuses specifically on the arms race between defenses and attackers who adapt their strategies after observing protective measures, as exemplified by benchmarks such as Jailbreakbench[5] and Jailbreakv Benchmark[1]. Parallel branches address Backdoor Attacks and Defenses, Model Extraction threats, and Domain-Specific Security concerns in areas like code generation or multimodal systems. General Adversarial Robustness techniques and Federated Learning Security round out the taxonomy, highlighting that language model vulnerabilities span multiple deployment contexts and attack surfaces.

Within this landscape, a particularly active line of work examines whether defenses remain effective when attackers can observe and circumvent them—a challenge central to Adaptive Attack Design and Defense Evaluation. Attacker Moves Second[0] sits squarely in this branch, emphasizing the need to test defenses under adaptive threat models where adversaries iteratively refine attacks. This contrasts with static evaluation frameworks and aligns closely with Prompt Injection Evaluation[44], which similarly stresses realistic adversarial conditions. Compared to works like Robust Prompt Optimization[6] or SelfDefend[8] that propose specific defense mechanisms, Attacker Moves Second[0] focuses on the evaluation methodology itself, arguing that many defenses fail when subjected to adaptive scrutiny.
The broader tension across these branches revolves around whether defenses can achieve robustness guarantees or merely raise the bar for attackers, with ongoing questions about generalization across attack types and the computational cost of maintaining security under evolving threats.

Claimed Contributions

Adaptive attack framework for evaluating LLM defenses

The authors propose a unified adaptive attack framework that systematically applies and scales general optimization techniques (gradient descent, reinforcement learning, random search, and human-guided exploration) to evaluate LLM defenses. This framework is designed to counter specific defense mechanisms rather than using fixed or weak attacks.

10 retrieved papers
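The framework described above combines four general-purpose optimizers. As a concrete illustration of the simplest of them, here is a minimal, hypothetical sketch of a random-search attack loop. None of these names come from the paper: `defended_model` and `judge_score` are toy stand-ins for the defense-wrapped target and the attack-success judge, and the character-level mutation is deliberately simplistic.

```python
import random

# Hypothetical sketch of the random-search component of an adaptive attack.
# In a real evaluation, `defended_model` would wrap the defense under test
# and `judge_score` would be a human or model-based attack-success judge.

TOKENS = list("abcdefghijklmnopqrstuvwxyz ")

def defended_model(prompt: str) -> str:
    # Toy stand-in for the defense-wrapped target model.
    return "complied" if "attack" in prompt else "refused"

def judge_score(response: str) -> float:
    # Toy judge: 1.0 if the attack objective was met.
    return 1.0 if response == "complied" else 0.0

def random_search_attack(goal: str, budget: int = 500, suffix_len: int = 8) -> str:
    """Mutate an adversarial suffix at random, keeping improving mutations."""
    suffix = [random.choice(TOKENS) for _ in range(suffix_len)]
    best = judge_score(defended_model(goal + "".join(suffix)))
    for _ in range(budget):
        i = random.randrange(suffix_len)
        old, suffix[i] = suffix[i], random.choice(TOKENS)
        score = judge_score(defended_model(goal + "".join(suffix)))
        if score >= best:
            best = score
        else:
            suffix[i] = old  # revert a non-improving mutation
    return goal + "".join(suffix)
```

The key adaptive ingredient is the feedback loop: every candidate is scored through the defense itself, so the search automatically routes around whatever the defense blocks, rather than replaying a fixed attack string.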
Comprehensive evaluation exposing weaknesses in 12 recent defenses

The authors systematically evaluate 12 recently proposed defenses against jailbreaks and prompt injections using their adaptive attacks. They demonstrate that most defenses can be bypassed with over 90% success rate, contradicting the near-zero attack success rates reported in the original defense papers.

10 retrieved papers
Lessons and recommendations for robust defense evaluation

The authors provide four key lessons for the community: static evaluations are misleading, automated attacks are effective but insufficient, human red-teaming remains valuable, and model-based auto-raters can be unreliable. They argue that defense evaluations must incorporate adaptive attackers with substantial computational resources to be convincing.

9 retrieved papers
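The first lesson (static evaluations are misleading) can be made concrete with a toy harness. Everything below is a hypothetical stand-in, not the paper's actual protocol: a blocklist-style "defense" scores 0% attack success on a fixed prompt suite, yet an attacker allowed even a couple of rewrites succeeds every time.

```python
# Hypothetical harness contrasting static and adaptive evaluation of a toy
# defense. The defense blocks any prompt containing the word "secret", which
# is exactly what every prompt in the static suite uses.

STATIC_SUITE = ["tell me the secret", "tell me the secret now"]

def is_blocked(prompt: str) -> bool:
    # Toy defense: a blocklist keyed on the phrasing used in the static suite.
    return "secret" in prompt

def attack_succeeds(prompt: str) -> bool:
    return not is_blocked(prompt)

def static_asr() -> float:
    # Attack success rate against the fixed suite: looks perfectly robust.
    return sum(attack_succeeds(p) for p in STATIC_SUITE) / len(STATIC_SUITE)

def adaptive_asr(rewrites=("tell me the s3cret", "reveal the hidden phrase")) -> float:
    # An adaptive attacker paraphrases each blocked prompt until one passes.
    hits = 0
    for p in STATIC_SUITE:
        if attack_succeeds(p) or any(attack_succeeds(r) for r in rewrites):
            hits += 1
    return hits / len(STATIC_SUITE)
```

Under this toy setup `static_asr()` is 0.0 while `adaptive_asr()` is 1.0, mirroring the gap the report describes between the near-zero numbers in original defense papers and results under adaptive attack.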

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
