Breaking Agent Backbones: Evaluating the Security of Backbone LLMs in AI Agents

ICLR 2026 Conference SubmissionAnonymous Authors

OpenReview Score: 6.0 Download Report PDF

AI securityLarge Language ModelsSecurity BenchmarkRed TeamingAI Safety

AI agents powered by large language models (LLMs) are being deployed at scale, yet we lack a systematic understanding of how the choice of backbone LLM affects agent security. The non-deterministic sequential nature of AI agents complicates security modeling, while the integration of traditional software with AI components entangles novel LLM vulnerabilities with conventional security risks. Existing frameworks only partially address these challenges as they either capture specific vulnerabilities only or require modeling of complete agents. To address these limitations, we introduce threat snapshots: a framework that isolates specific states in an agent's execution flow where LLM vulnerabilities manifest, enabling the systematic identification and categorization of security risks that propagate from the LLM to the agent level. We apply this framework to construct the $b^3$ benchmark, a security benchmark based on 194,331 unique crowdsourced adversarial attacks. We then evaluate 34 popular LLMs with it, revealing, among other insights, that enhanced reasoning capabilities improve security, while model size does not correlate with security. We release our benchmark, dataset, and evaluation code to facilitate widespread adoption by LLM providers and practitioners, offering guidance for agent developers and incentivizing model developers to prioritize backbone security improvements.

Abstract:

Disclaimer

This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.

NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.

If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a threat snapshots framework and the b³ benchmark to evaluate how backbone LLM choice affects agent security, examining 34 models across 194,331 adversarial attacks. It occupies the 'Backbone LLM Security Assessment' leaf within the taxonomy, which currently contains only this single paper. This positioning indicates a sparse research direction: while the broader 'Security Benchmarking and Empirical Evaluation' branch includes multiple comprehensive benchmarks and specialized frameworks, systematic comparative analysis of backbone LLM security properties appears underexplored in the existing literature.

The taxonomy reveals neighboring leaves focused on comprehensive multi-dimensional benchmarks (Agent Security Bench, Agent-SafetyBench) and specialized evaluation frameworks targeting specific vulnerabilities or deployment contexts. The paper's emphasis on isolating LLM-level vulnerabilities and their propagation to agent behavior distinguishes it from these broader benchmarks, which typically assess complete agent systems without decomposing backbone model contributions. The 'Comprehensive Threat Taxonomies' branch provides conceptual frameworks for categorizing risks, while this work operationalizes those taxonomies through empirical measurement of backbone model differences.

Among 27 candidates examined, none clearly refute the three core contributions. The threat snapshots framework (10 candidates examined, 0 refutable) appears novel in its approach to isolating LLM vulnerability manifestation points within agent execution flows. The b³ benchmark (7 candidates examined, 0 refutable) represents a substantial dataset construction effort, though the limited search scope means potentially relevant benchmarking work may exist beyond the top-K semantic matches. The 34-model evaluation (10 candidates examined, 0 refutable) provides comparative insights, with findings on reasoning-security correlation appearing distinctive within the examined literature.

Based on the top-27 semantic matches and taxonomy structure, the work addresses a gap in systematic backbone LLM comparison for agent security. The single-paper leaf status and absence of refuting candidates suggest novelty, though the limited search scope means this assessment reflects visible prior work rather than exhaustive coverage. The framework's focus on state-based vulnerability isolation and large-scale crowdsourced attack collection distinguishes it from existing comprehensive benchmarks in the examined set.

Taxonomy

Core-task Taxonomy Papers

Claimed Contributions

Contribution Candidate Papers Compared

Refutable Paper

Research Landscape Overview

Core task: Evaluating security of backbone LLMs in AI agents. The field has organized itself around five major branches that reflect distinct but interconnected concerns. Security Threat Modeling and Vulnerability Analysis focuses on identifying attack surfaces and adversarial strategies targeting agent systems, while Security Benchmarking and Empirical Evaluation develops systematic methods to measure and quantify these vulnerabilities in practice. Defense Frameworks and Mitigation Strategies proposes protective mechanisms ranging from input filtering to architectural safeguards, exemplified by works like LlamaFirewall[16] and BlindGuard[48]. Agent Capabilities, Alignment, and Trustworthiness examines broader questions of whether agents behave as intended and remain reliable under diverse conditions, drawing on alignment research such as Agent Alignment Survey[3] and Trustworthy Agent Survey[21]. Finally, Domain-Specific Agent Applications and Architectures explores security challenges that emerge in specialized contexts like mobile interfaces, web navigation, or multi-agent collaboration, as seen in Android Agent Weakness[10] and Multi-Agent Collaboration[5]. Within the benchmarking and empirical evaluation branch, a particularly active line of work has emerged around constructing comprehensive test suites that probe backbone LLM robustness against prompt injection, jailbreaking, and adversarial manipulation. Breaking Agent Backbones[0] situates itself squarely in this empirical tradition, focusing on systematic assessment of how underlying language models respond to security challenges when deployed as agent controllers. This emphasis contrasts with more defense-oriented efforts like Agentsafe[2] or G-safeguard[11], which prioritize mitigation over measurement, and differs from broader surveys such as LLM Security Threats Survey[1] and Agentic Security Survey[35] that catalog threats without deep empirical probes. The work aligns closely with Agent Security Bench[8] and Agent-SafetyBench[26], sharing their commitment to rigorous, reproducible evaluation while highlighting the unique vulnerabilities introduced when LLMs orchestrate multi-step reasoning and tool use.

Claimed Contributions

Threat snapshots framework

10 retrieved papers

The authors propose threat snapshots, a formal framework that captures concrete instances of LLM vulnerabilities in AI agents by isolating specific execution states. This framework distinguishes LLM-specific vulnerabilities from traditional security risks and provides an exhaustive attack categorization covering attack vectors and objectives relevant to agentic applications.

10 retrieved papers

b3 benchmark for backbone LLM security

7 retrieved papers

The authors build the backbone breaker benchmark (b3), which combines 10 threat snapshots with high-quality adversarial attacks collected through gamified crowdsourcing. The benchmark enables systematic evaluation of backbone LLM security across diverse agentic scenarios and is released openly to facilitate adoption by LLM providers and practitioners.

7 retrieved papers

Comprehensive evaluation of 34 popular LLMs

10 retrieved papers

The authors conduct a large-scale evaluation of 34 popular LLMs using the b3 benchmark, uncovering actionable insights such as reasoning capabilities improving security and model size not correlating with security. These findings provide guidance for agent developers in selecting secure backbone LLMs for specific use cases.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Threat snapshots framework

[29] AgentSentinel: An End-to-End and Real-Time Security Defense Framework for Computer-Use Agents PDF

Cannot Refute

[67] AgentSpec: Customizable runtime enforcement for safe and reliable LLM agents.(2026) PDF

Cannot Refute

[68] DRIFT: Dynamic Rule-Based Defense with Injection Isolation for Securing LLM Agents PDF

Cannot Refute

[69] Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents PDF

Cannot Refute

[70] VVF-AI: A Vulnerability Verification Framework Based on AI-Agent PDF

Cannot Refute

[71] IsolateGPT: An Execution Isolation Architecture for LLM-Based Agentic Systems PDF

Cannot Refute

[72] CVE-Bench: A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities PDF

Cannot Refute

[73] Architecting resilient llm agents: A guide to secure plan-then-execute implementations PDF

Cannot Refute

[74] Safeflow: A principled protocol for trustworthy and transactional autonomous agent systems PDF

Cannot Refute

[75] Agentarmor: Enforcing program analysis on agent runtime trace to defend against prompt injection PDF

Cannot Refute

Contribution

b3 benchmark for backbone LLM security

[30] Security challenges in ai agent deployment: Insights from a large scale public competition PDF

Cannot Refute

[61] Jailbreakbench: An open robustness benchmark for jailbreaking large language models PDF

Cannot Refute

[62] Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models PDF

Cannot Refute

[63] Detoxifying language model outputs: combining multi-agent debates and reinforcement learning for improved summarization PDF

Cannot Refute

[64] Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game PDF

Cannot Refute

[65] Beyond the Benchmark: Innovative Defenses Against Prompt Injection Attacks PDF

Cannot Refute

[66] PPTC-R benchmark: Towards Evaluating the Robustness of Large Language Models for PowerPoint Task Completion PDF

Cannot Refute

Contribution

Comprehensive evaluation of 34 popular LLMs

[51] Safety in large reasoning models: A survey PDF

Cannot Refute

[52] Guardreasoner: Towards reasoning-based llm safeguards PDF

Cannot Refute

[53] Star-1: Safer alignment of reasoning llms with 1k data PDF

Cannot Refute

[54] Enhancing Reasoning Capacity of SLM Using Cognitive Enhancement PDF

Cannot Refute

[55] Trade-offs in large reasoning models: An empirical analysis of deliberative and adaptive reasoning over foundational capabilities PDF

Cannot Refute

[56] Are vision llms road-ready? a comprehensive benchmark for safety-critical driving video understanding PDF

Cannot Refute

[57] Everything You Wanted to Know About LLM-based Vulnerability Detection But Were Afraid to Ask PDF

Cannot Refute

[58] Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance PDF

Cannot Refute

[59] Does More Inference-Time Compute Really Help Robustness? PDF

Cannot Refute

[60] Power-Softmax: Towards Secure LLM Inference over Encrypted Data PDF

Cannot Refute

Breaking Agent Backbones: Evaluating the Security of Backbone LLMs in AI Agents

Overview

Overall Novelty Assessment

Taxonomy

Research Landscape Overview

Claimed Contributions

Core Task Comparisons

Contribution Analysis

Threat snapshots framework

[29] AgentSentinel: An End-to-End and Real-Time Security Defense Framework for Computer-Use Agents PDF

[67] AgentSpec: Customizable runtime enforcement for safe and reliable LLM agents.(2026) PDF

[68] DRIFT: Dynamic Rule-Based Defense with Injection Isolation for Securing LLM Agents PDF

[69] Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents PDF

[70] VVF-AI: A Vulnerability Verification Framework Based on AI-Agent PDF

[71] IsolateGPT: An Execution Isolation Architecture for LLM-Based Agentic Systems PDF

[72] CVE-Bench: A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities PDF

[73] Architecting resilient llm agents: A guide to secure plan-then-execute implementations PDF

[74] Safeflow: A principled protocol for trustworthy and transactional autonomous agent systems PDF

[75] Agentarmor: Enforcing program analysis on agent runtime trace to defend against prompt injection PDF

b3 benchmark for backbone LLM security

[30] Security challenges in ai agent deployment: Insights from a large scale public competition PDF

[61] Jailbreakbench: An open robustness benchmark for jailbreaking large language models PDF

[62] Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models PDF

[63] Detoxifying language model outputs: combining multi-agent debates and reinforcement learning for improved summarization PDF

[64] Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game PDF

[65] Beyond the Benchmark: Innovative Defenses Against Prompt Injection Attacks PDF

[66] PPTC-R benchmark: Towards Evaluating the Robustness of Large Language Models for PowerPoint Task Completion PDF

Comprehensive evaluation of 34 popular LLMs

[51] Safety in large reasoning models: A survey PDF

[52] Guardreasoner: Towards reasoning-based llm safeguards PDF

[53] Star-1: Safer alignment of reasoning llms with 1k data PDF

[54] Enhancing Reasoning Capacity of SLM Using Cognitive Enhancement PDF

[55] Trade-offs in large reasoning models: An empirical analysis of deliberative and adaptive reasoning over foundational capabilities PDF

[56] Are vision llms road-ready? a comprehensive benchmark for safety-critical driving video understanding PDF

[57] Everything You Wanted to Know About LLM-based Vulnerability Detection But Were Afraid to Ask PDF

[58] Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance PDF

[59] Does More Inference-Time Compute Really Help Robustness? PDF

[60] Power-Softmax: Towards Secure LLM Inference over Encrypted Data PDF

Table of Contents