The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections
Overview
Overall Novelty Assessment
The paper proposes an adaptive attack framework to rigorously evaluate language model defenses against jailbreaks and prompt injections. It resides in the 'Adaptive Attack Frameworks Against LLM Defenses' leaf, which contains only two papers total, indicating a relatively sparse research direction within the broader taxonomy. This positioning suggests the work addresses a recognized but underexplored gap: systematically testing whether defenses withstand adversaries who tailor their strategies after observing protective measures, rather than relying on static attack benchmarks.
The taxonomy reveals that most defense research concentrates in sibling branches such as 'Jailbreak Defense Mechanisms' (with subtopics covering input smoothing, prompt-based strategies, and adversarial training) and 'General Adversarial Robustness' (focusing on filtering and purification methods). The paper's leaf sits under 'Adaptive Attack Design and Defense Evaluation,' which explicitly excludes static benchmarks and general defense surveys. Neighboring leaves address character-level perturbations and baseline defense evaluations, but the adaptive attack framework approach directly targets the evaluation methodology gap rather than proposing new defenses or attack primitives.
Among the 29 candidates examined across three contributions, none clearly refuted the paper's claims. The first contribution (adaptive attack framework) was checked against 10 candidates with zero refutable overlaps; the second (comprehensive evaluation of 12 defenses) against 10 candidates, also with no refutations; and the third (evaluation recommendations) against 9 candidates, again with none. This suggests that, within the limited search scope (primarily top-K semantic matches and citation expansion), the specific combination of adaptive optimization techniques (gradient descent, reinforcement learning, random search, and human-guided exploration) applied systematically to bypass diverse defenses has not been extensively documented in prior work.
Based on the limited literature search covering 29 candidates, the work appears to occupy a distinct position emphasizing evaluation rigor over defense innovation. The taxonomy structure shows that while defense mechanisms and static attack benchmarks are well-populated, the adaptive evaluation methodology remains comparatively sparse. However, the search scope does not guarantee exhaustive coverage of all relevant adversarial evaluation studies, particularly those in adjacent security domains or unpublished concurrent work.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose a unified adaptive attack framework that systematically applies and scales general optimization techniques (gradient descent, reinforcement learning, random search, and human-guided exploration) to evaluate LLM defenses. This framework is designed to counter specific defense mechanisms rather than using fixed or weak attacks.
The authors systematically evaluate 12 recently proposed defenses against jailbreaks and prompt injections using their adaptive attacks. They demonstrate that most defenses can be bypassed with over 90% success rate, contradicting the near-zero attack success rates reported in the original defense papers.
The authors provide four key lessons for the community: static evaluations are misleading, automated attacks are effective but insufficient, human red-teaming remains valuable, and model-based auto-raters can be unreliable. They argue that defense evaluations must incorporate adaptive attackers with substantial computational resources to be convincing.
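To make the notion of an adaptive attack concrete, the random-search arm of such a framework can be sketched as a hill-climbing loop over an adversarial suffix. This is an illustrative sketch, not the paper's implementation: `score_fn` is a hypothetical stand-in for querying the defended model and scoring its response with a judge, and the toy objective below merely rewards a trigger token.

```python
import random

def random_search_attack(score_fn, vocab, n_iters=200, suffix_len=8, seed=0):
    """Hill-climbing random search over an adversarial suffix.

    score_fn maps a suffix (list of tokens) to a scalar attack score
    (higher = closer to bypassing the defense). In a real evaluation this
    would query the defended model and judge the response; here it is a
    hypothetical placeholder.
    """
    rng = random.Random(seed)
    suffix = [rng.choice(vocab) for _ in range(suffix_len)]
    best = score_fn(suffix)
    for _ in range(n_iters):
        cand = list(suffix)
        cand[rng.randrange(suffix_len)] = rng.choice(vocab)  # mutate one position
        s = score_fn(cand)
        if s >= best:  # accept ties so the search can drift across plateaus
            suffix, best = cand, s
    return suffix, best

# Toy stand-in objective: reward suffixes containing more of a "trigger" token.
vocab = list("abcxyz")
suffix, score = random_search_attack(lambda suf: suf.count("x"), vocab)
```

Even this crude loop adapts to whatever `score_fn` encodes, which is the paper's central point: a defense evaluated only against a fixed attack string never faces this kind of feedback-driven search.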
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[44] A Critical Evaluation of Defenses against Prompt Injection Attacks PDF
Contribution Analysis
Detailed comparisons for each claimed contribution
Adaptive attack framework for evaluating LLM defenses
The authors propose a unified adaptive attack framework that systematically applies and scales general optimization techniques (gradient descent, reinforcement learning, random search, and human-guided exploration) to evaluate LLM defenses. This framework is designed to counter specific defense mechanisms rather than using fixed or weak attacks.
[6] Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks PDF
[51] Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM PDF
[57] AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs PDF
[58] Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks PDF
[59] Efficient Adversarial Training in LLMs with Continuous Attacks PDF
[60] Adversarial Training for Large Neural Language Models PDF
[61] Certifying LLM Safety against Adversarial Prompting PDF
[62] Gandalf the Red: Adaptive Security for LLMs PDF
[63] AGD: Adversarial Game Defense Against Jailbreak Attacks in Large Language Models PDF
[64] Black-box Optimization of LLM Outputs by Asking for Directions PDF
Comprehensive evaluation exposing weaknesses in 12 recent defenses
The authors systematically evaluate 12 recently proposed defenses against jailbreaks and prompt injections using their adaptive attacks. They demonstrate that most defenses can be bypassed with over 90% success rate, contradicting the near-zero attack success rates reported in the original defense papers.
[18] A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models PDF
[61] Certifying LLM Safety against Adversarial Prompting PDF
[65] PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts PDF
[66] Evaluating Prompt Injection Safety in Large Language Models Using the PromptBench Dataset PDF
[67] "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models PDF
[68] Evaluating Prompt Extraction Vulnerabilities in Commercial Large Language Models PDF
[69] Systematic Testing of Security-Related Vulnerabilities in LLM-Based Applications PDF
[70] Evolving Security in LLMs: A Study of Jailbreak Attacks and Defenses PDF
[71] Prompt Injection Attack Against LLM-Integrated Applications PDF
[72] Design Patterns for Securing LLM Agents against Prompt Injections PDF
Lessons and recommendations for robust defense evaluation
The authors provide four key lessons for the community: static evaluations are misleading, automated attacks are effective but insufficient, human red-teaming remains valuable, and model-based auto-raters can be unreliable. They argue that defense evaluations must incorporate adaptive attackers with substantial computational resources to be convincing.
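The auto-rater caveat in the last lesson can be made concrete: a reported attack success rate (ASR) is only as trustworthy as the judge producing the verdicts. The sketch below uses invented labels purely for illustration; it is not data from the paper.

```python
def attack_success_rate(verdicts):
    """Fraction of attack attempts judged successful (verdicts are 0/1)."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0

def rater_disagreement(auto, human):
    """Fraction of samples on which an auto-rater and a human judge disagree."""
    assert len(auto) == len(human)
    return sum(a != h for a, h in zip(auto, human)) / len(auto)

# Hypothetical labels: an auto-rater that misreads some successful attacks
# as refusals will understate the true ASR.
human = [1, 1, 0, 1, 0, 1, 1, 0]  # human-judged outcomes
auto  = [1, 0, 0, 1, 0, 0, 1, 0]  # auto-rater verdicts on the same samples
human_asr = attack_success_rate(human)  # 0.625
auto_asr = attack_success_rate(auto)    # 0.375
gap = rater_disagreement(auto, human)   # 0.25
```

A 25% disagreement rate here nearly halves the reported ASR, which is why the authors recommend spot-checking auto-rater verdicts against human judgments before trusting a defense's headline numbers.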