Breaking Agent Backbones: Evaluating the Security of Backbone LLMs in AI Agents

ICLR 2026 Conference SubmissionAnonymous Authors
AI securityLarge Language ModelsSecurity BenchmarkRed TeamingAI Safety
Abstract:

AI agents powered by large language models (LLMs) are being deployed at scale, yet we lack a systematic understanding of how the choice of backbone LLM affects agent security. The non-deterministic sequential nature of AI agents complicates security modeling, while the integration of traditional software with AI components entangles novel LLM vulnerabilities with conventional security risks. Existing frameworks only partially address these challenges as they either capture specific vulnerabilities only or require modeling of complete agents. To address these limitations, we introduce threat snapshots: a framework that isolates specific states in an agent's execution flow where LLM vulnerabilities manifest, enabling the systematic identification and categorization of security risks that propagate from the LLM to the agent level. We apply this framework to construct the b3b^3 benchmark, a security benchmark based on 194,331 unique crowdsourced adversarial attacks. We then evaluate 34 popular LLMs with it, revealing, among other insights, that enhanced reasoning capabilities improve security, while model size does not correlate with security. We release our benchmark, dataset, and evaluation code to facilitate widespread adoption by LLM providers and practitioners, offering guidance for agent developers and incentivizing model developers to prioritize backbone security improvements.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a threat snapshots framework and the b³ benchmark to evaluate how backbone LLM choice affects agent security, examining 34 models across 194,331 adversarial attacks. It occupies the 'Backbone LLM Security Assessment' leaf within the taxonomy, which currently contains only this single paper. This positioning indicates a sparse research direction: while the broader 'Security Benchmarking and Empirical Evaluation' branch includes multiple comprehensive benchmarks and specialized frameworks, systematic comparative analysis of backbone LLM security properties appears underexplored in the existing literature.

The taxonomy reveals neighboring leaves focused on comprehensive multi-dimensional benchmarks (Agent Security Bench, Agent-SafetyBench) and specialized evaluation frameworks targeting specific vulnerabilities or deployment contexts. The paper's emphasis on isolating LLM-level vulnerabilities and their propagation to agent behavior distinguishes it from these broader benchmarks, which typically assess complete agent systems without decomposing backbone model contributions. The 'Comprehensive Threat Taxonomies' branch provides conceptual frameworks for categorizing risks, while this work operationalizes those taxonomies through empirical measurement of backbone model differences.

Among 27 candidates examined, none clearly refute the three core contributions. The threat snapshots framework (10 candidates examined, 0 refutable) appears novel in its approach to isolating LLM vulnerability manifestation points within agent execution flows. The b³ benchmark (7 candidates examined, 0 refutable) represents a substantial dataset construction effort, though the limited search scope means potentially relevant benchmarking work may exist beyond the top-K semantic matches. The 34-model evaluation (10 candidates examined, 0 refutable) provides comparative insights, with findings on reasoning-security correlation appearing distinctive within the examined literature.

Based on the top-27 semantic matches and taxonomy structure, the work addresses a gap in systematic backbone LLM comparison for agent security. The single-paper leaf status and absence of refuting candidates suggest novelty, though the limited search scope means this assessment reflects visible prior work rather than exhaustive coverage. The framework's focus on state-based vulnerability isolation and large-scale crowdsourced attack collection distinguishes it from existing comprehensive benchmarks in the examined set.

Taxonomy

Core-task Taxonomy Papers
50
3
Claimed Contributions
27
Contribution Candidate Papers Compared
0
Refutable Paper

Research Landscape Overview

Core task: Evaluating security of backbone LLMs in AI agents. The field has organized itself around five major branches that reflect distinct but interconnected concerns. Security Threat Modeling and Vulnerability Analysis focuses on identifying attack surfaces and adversarial strategies targeting agent systems, while Security Benchmarking and Empirical Evaluation develops systematic methods to measure and quantify these vulnerabilities in practice. Defense Frameworks and Mitigation Strategies proposes protective mechanisms ranging from input filtering to architectural safeguards, exemplified by works like LlamaFirewall[16] and BlindGuard[48]. Agent Capabilities, Alignment, and Trustworthiness examines broader questions of whether agents behave as intended and remain reliable under diverse conditions, drawing on alignment research such as Agent Alignment Survey[3] and Trustworthy Agent Survey[21]. Finally, Domain-Specific Agent Applications and Architectures explores security challenges that emerge in specialized contexts like mobile interfaces, web navigation, or multi-agent collaboration, as seen in Android Agent Weakness[10] and Multi-Agent Collaboration[5]. Within the benchmarking and empirical evaluation branch, a particularly active line of work has emerged around constructing comprehensive test suites that probe backbone LLM robustness against prompt injection, jailbreaking, and adversarial manipulation. Breaking Agent Backbones[0] situates itself squarely in this empirical tradition, focusing on systematic assessment of how underlying language models respond to security challenges when deployed as agent controllers. This emphasis contrasts with more defense-oriented efforts like Agentsafe[2] or G-safeguard[11], which prioritize mitigation over measurement, and differs from broader surveys such as LLM Security Threats Survey[1] and Agentic Security Survey[35] that catalog threats without deep empirical probes. The work aligns closely with Agent Security Bench[8] and Agent-SafetyBench[26], sharing their commitment to rigorous, reproducible evaluation while highlighting the unique vulnerabilities introduced when LLMs orchestrate multi-step reasoning and tool use.

Claimed Contributions

Threat snapshots framework

The authors propose threat snapshots, a formal framework that captures concrete instances of LLM vulnerabilities in AI agents by isolating specific execution states. This framework distinguishes LLM-specific vulnerabilities from traditional security risks and provides an exhaustive attack categorization covering attack vectors and objectives relevant to agentic applications.

10 retrieved papers
b3 benchmark for backbone LLM security

The authors build the backbone breaker benchmark (b3), which combines 10 threat snapshots with high-quality adversarial attacks collected through gamified crowdsourcing. The benchmark enables systematic evaluation of backbone LLM security across diverse agentic scenarios and is released openly to facilitate adoption by LLM providers and practitioners.

7 retrieved papers
Comprehensive evaluation of 34 popular LLMs

The authors conduct a large-scale evaluation of 34 popular LLMs using the b3 benchmark, uncovering actionable insights such as reasoning capabilities improving security and model size not correlating with security. These findings provide guidance for agent developers in selecting secure backbone LLMs for specific use cases.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Threat snapshots framework

The authors propose threat snapshots, a formal framework that captures concrete instances of LLM vulnerabilities in AI agents by isolating specific execution states. This framework distinguishes LLM-specific vulnerabilities from traditional security risks and provides an exhaustive attack categorization covering attack vectors and objectives relevant to agentic applications.

Contribution

b3 benchmark for backbone LLM security

The authors build the backbone breaker benchmark (b3), which combines 10 threat snapshots with high-quality adversarial attacks collected through gamified crowdsourcing. The benchmark enables systematic evaluation of backbone LLM security across diverse agentic scenarios and is released openly to facilitate adoption by LLM providers and practitioners.

Contribution

Comprehensive evaluation of 34 popular LLMs

The authors conduct a large-scale evaluation of 34 popular LLMs using the b3 benchmark, uncovering actionable insights such as reasoning capabilities improving security and model size not correlating with security. These findings provide guidance for agent developers in selecting secure backbone LLMs for specific use cases.