Breaking Agent Backbones: Evaluating the Security of Backbone LLMs in AI Agents
Overview
Overall Novelty Assessment
The paper introduces a threat snapshots framework and the b³ benchmark to evaluate how backbone LLM choice affects agent security, examining 34 models across 194,331 adversarial attacks. It occupies the 'Backbone LLM Security Assessment' leaf within the taxonomy, which currently contains only this single paper. This positioning indicates a sparse research direction: while the broader 'Security Benchmarking and Empirical Evaluation' branch includes multiple comprehensive benchmarks and specialized frameworks, systematic comparative analysis of backbone LLM security properties appears underexplored in the existing literature.
The taxonomy reveals neighboring leaves focused on comprehensive multi-dimensional benchmarks (Agent Security Bench, Agent-SafetyBench) and specialized evaluation frameworks targeting specific vulnerabilities or deployment contexts. The paper's emphasis on isolating LLM-level vulnerabilities and their propagation to agent behavior distinguishes it from these broader benchmarks, which typically assess complete agent systems without decomposing backbone model contributions. The 'Comprehensive Threat Taxonomies' branch provides conceptual frameworks for categorizing risks, while this work operationalizes those taxonomies through empirical measurement of backbone model differences.
Among 27 candidates examined, none clearly refute the three core contributions. The threat snapshots framework (10 candidates examined, 0 refutable) appears novel in its approach to isolating LLM vulnerability manifestation points within agent execution flows. The b³ benchmark (7 candidates examined, 0 refutable) represents a substantial dataset construction effort, though the limited search scope means potentially relevant benchmarking work may exist beyond the top-K semantic matches. The 34-model evaluation (10 candidates examined, 0 refutable) provides comparative insights, with findings on reasoning-security correlation appearing distinctive within the examined literature.
Based on the top-27 semantic matches and taxonomy structure, the work addresses a gap in systematic backbone LLM comparison for agent security. The single-paper leaf status and absence of refuting candidates suggest novelty, though the limited search scope means this assessment reflects visible prior work rather than exhaustive coverage. The framework's focus on state-based vulnerability isolation and large-scale crowdsourced attack collection distinguishes it from existing comprehensive benchmarks in the examined set.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose threat snapshots, a formal framework that captures concrete instances of LLM vulnerabilities in AI agents by isolating specific execution states. This framework distinguishes LLM-specific vulnerabilities from traditional security risks and provides an exhaustive attack categorization covering attack vectors and objectives relevant to agentic applications.
The authors build the backbone breaker benchmark (b3), which combines 10 threat snapshots with high-quality adversarial attacks collected through gamified crowdsourcing. The benchmark enables systematic evaluation of backbone LLM security across diverse agentic scenarios and is released openly to facilitate adoption by LLM providers and practitioners.
The authors conduct a large-scale evaluation of 34 popular LLMs using the b3 benchmark, uncovering actionable insights such as reasoning capabilities improving security and model size not correlating with security. These findings provide guidance for agent developers in selecting secure backbone LLMs for specific use cases.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
Contribution Analysis
Detailed comparisons for each claimed contribution
Threat snapshots framework
The authors propose threat snapshots, a formal framework that captures concrete instances of LLM vulnerabilities in AI agents by isolating specific execution states. This framework distinguishes LLM-specific vulnerabilities from traditional security risks and provides an exhaustive attack categorization covering attack vectors and objectives relevant to agentic applications.
[29] AgentSentinel: An End-to-End and Real-Time Security Defense Framework for Computer-Use Agents PDF
[67] AgentSpec: Customizable runtime enforcement for safe and reliable LLM agents.(2026) PDF
[68] DRIFT: Dynamic Rule-Based Defense with Injection Isolation for Securing LLM Agents PDF
[69] Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents PDF
[70] VVF-AI: A Vulnerability Verification Framework Based on AI-Agent PDF
[71] IsolateGPT: An Execution Isolation Architecture for LLM-Based Agentic Systems PDF
[72] CVE-Bench: A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities PDF
[73] Architecting resilient llm agents: A guide to secure plan-then-execute implementations PDF
[74] Safeflow: A principled protocol for trustworthy and transactional autonomous agent systems PDF
[75] Agentarmor: Enforcing program analysis on agent runtime trace to defend against prompt injection PDF
b3 benchmark for backbone LLM security
The authors build the backbone breaker benchmark (b3), which combines 10 threat snapshots with high-quality adversarial attacks collected through gamified crowdsourcing. The benchmark enables systematic evaluation of backbone LLM security across diverse agentic scenarios and is released openly to facilitate adoption by LLM providers and practitioners.
[30] Security challenges in ai agent deployment: Insights from a large scale public competition PDF
[61] Jailbreakbench: An open robustness benchmark for jailbreaking large language models PDF
[62] Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models PDF
[63] Detoxifying language model outputs: combining multi-agent debates and reinforcement learning for improved summarization PDF
[64] Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game PDF
[65] Beyond the Benchmark: Innovative Defenses Against Prompt Injection Attacks PDF
[66] PPTC-R benchmark: Towards Evaluating the Robustness of Large Language Models for PowerPoint Task Completion PDF
Comprehensive evaluation of 34 popular LLMs
The authors conduct a large-scale evaluation of 34 popular LLMs using the b3 benchmark, uncovering actionable insights such as reasoning capabilities improving security and model size not correlating with security. These findings provide guidance for agent developers in selecting secure backbone LLMs for specific use cases.