Abstract:

We present the first comprehensive evaluation of AI agents against human cybersecurity professionals in a live enterprise environment. We evaluate ten cybersecurity professionals alongside six existing AI agents and ARTEMIS, our new agent scaffold, on a large university network consisting of approximately 8,000 hosts across 12 subnets. ARTEMIS is a multi-agent framework featuring dynamic prompt generation, arbitrary sub-agents, and automatic vulnerability triaging. In our comparative study, ARTEMIS placed second overall, discovering 9 valid vulnerabilities with an 82% valid submission rate and outperforming 9 of 10 human participants. While existing scaffolds such as Codex and CyAgent underperformed relative to most human participants, ARTEMIS demonstrated technical sophistication and submission quality comparable to the strongest participants. AI agents offer advantages in systematic enumeration, parallel exploitation, and cost: certain ARTEMIS variants cost $18/hour versus $60/hour for professional penetration testers. We also identify key capability gaps: AI agents exhibit higher false-positive rates and struggle with GUI-based tasks.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper contributes a direct performance comparison between AI agents and human cybersecurity professionals on a live university network of approximately 8,000 hosts, alongside the ARTEMIS multi-agent framework featuring dynamic prompt generation and vulnerability triaging. Within the taxonomy, it resides in the 'Live Enterprise Network Evaluations' leaf under 'Comparative Studies of AI and Human Performance.' This leaf contains only two papers total, indicating a relatively sparse research direction. The scarcity reflects the operational complexity and resource demands of conducting controlled experiments in production-scale enterprise environments rather than simulated testbeds.

The taxonomy reveals that most related work clusters in adjacent branches: 'AI-Driven Automated Penetration Testing Frameworks' contains 24 papers across RL-based, LLM-powered, and general automation subcategories, while 'Benchmarking and Evaluation Frameworks' holds 8 papers focused on CTF challenges and synthetic datasets. The 'Automated Versus Manual Testing Comparisons' and 'LLM-Assisted Workflow Evaluations' leaves examine similar questions but in controlled or tool-augmented settings rather than head-to-head agent-versus-human trials on live infrastructure. This work bridges the automation frameworks and evaluation methodologies by operationalizing both in a realistic enterprise context.

Among the six candidates examined for the first contribution (the comprehensive live evaluation), none was found to clearly refute it, though the limited search scope means exhaustive coverage is not guaranteed. For the ARTEMIS framework contribution, two candidates were examined and no refutations were found. The unified scoring framework contribution was not matched against any candidates in this analysis. These statistics suggest that while individual technical components (LLM agents, scoring metrics) have precedents, the integrated live-environment comparison at this scale appears less densely covered in the top-ranked semantic matches retrieved.

Based on the 6-candidate search scope, the work appears to occupy a methodologically distinct position—live enterprise evaluations remain rare compared to benchmark-driven studies. The taxonomy structure confirms that most innovation concentrates on automation techniques and synthetic testbeds rather than operational validation against human baselines. However, the limited candidate pool means adjacent work in conference proceedings or domain-specific venues may not be fully represented in this assessment.

Taxonomy

Core-task Taxonomy Papers: 48
Claimed Contributions: 3
Contribution Candidate Papers Compared: 6
Refutable Papers: 0

Research Landscape Overview

Core task: Evaluating AI agents against humans in penetration testing. The field has evolved from early automated scanning tools and reinforcement learning frameworks toward sophisticated AI-driven systems that can autonomously discover vulnerabilities, exploit networks, and even compete with human experts. The taxonomy reflects this progression through five main branches:

- AI-Driven Automated Penetration Testing Frameworks: end-to-end systems leveraging deep RL and large language models (for example, PentestGPT[8] and AutoPentester Framework[2]).
- Benchmarking and Evaluation Frameworks: standardized testbeds such as LLM Pentest Benchmark[7] and AutoPenBench[19] for measuring agent capabilities.
- Comparative Studies of AI and Human Performance: direct contests between automated methods and manual testers in controlled or live environments.
- Human-AI Collaboration and Hybrid Approaches: how human feedback and agent cooperation can enhance outcomes (illustrated by Human Feedback Pentest[27]).
- Domain-Specific and Methodological Advances: specialized settings such as IoT networks or privilege-escalation tasks, alongside novel algorithmic contributions.

Recent work shows a tension between fully autonomous agents and hybrid models that incorporate human oversight or domain knowledge. Many studies ask whether AI can match the creativity and adaptability of skilled penetration testers, particularly in complex enterprise networks where contextual reasoning is critical.

AI Agents Pentest Comparison[0] sits squarely within the Comparative Studies branch, specifically examining live enterprise network evaluations, a setting that demands both technical exploit chaining and realistic operational constraints. This positions it alongside Red Teaming AI[22], which also emphasizes real-world adversarial scenarios, yet AI Agents Pentest Comparison[0] places stronger emphasis on direct human-versus-agent performance metrics rather than red-teaming methodology alone. The work addresses open questions about scalability, interpretability, and whether current AI systems can replicate the nuanced decision-making that human experts bring to dynamic, high-stakes environments.

Claimed Contributions

First comprehensive evaluation of AI agents against human cybersecurity professionals in a live enterprise environment

The authors conduct the first direct comparison between AI agents and professional penetration testers on a real production university network with approximately 8,000 hosts, establishing empirical baselines for AI cybersecurity capabilities in realistic operational conditions.

4 retrieved papers
ARTEMIS multi-agent framework for penetration testing

The authors introduce ARTEMIS, a novel autonomous penetration-testing framework in which a supervisor manages the overall workflow, an unbounded set of sub-agents operate under dynamically generated expert prompts, and a triaging module verifies candidate vulnerabilities; the design aims to sustain long-horizon, complex tasks on production systems.

2 retrieved papers
Unified scoring framework for penetration test quality assessment

The authors create a novel evaluation metric combining technical complexity scores (detection and exploit complexity) with weighted criticality ratings to systematically assess penetration testing performance, departing from standard doctrine by rewarding technically sophisticated exploits over easily exploitable vulnerabilities.

0 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

First comprehensive evaluation of AI agents against human cybersecurity professionals in a live enterprise environment

The authors conduct the first direct comparison between AI agents and professional penetration testers on a real production university network with approximately 8,000 hosts, establishing empirical baselines for AI cybersecurity capabilities in realistic operational conditions.

Contribution

ARTEMIS multi-agent framework for penetration testing

The authors introduce ARTEMIS, a novel autonomous penetration-testing framework in which a supervisor manages the overall workflow, an unbounded set of sub-agents operate under dynamically generated expert prompts, and a triaging module verifies candidate vulnerabilities; the design aims to sustain long-horizon, complex tasks on production systems.
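The supervisor/sub-agent/triage loop described above can be sketched as follows. This is a minimal illustrative skeleton, not the paper's implementation: the class names, the prompt template, and the triage rule are all hypothetical stand-ins for the real LLM-driven components.

```python
from dataclasses import dataclass, field


@dataclass
class SubAgent:
    """Hypothetical sub-agent: an expert role prompt plus one task."""
    role_prompt: str
    task: str
    findings: list = field(default_factory=list)

    def run(self) -> list:
        # A real agent would drive an LLM/tool loop here; this stub
        # just records a placeholder result for the assigned task.
        self.findings.append(f"[{self.task}] no finding (stub)")
        return self.findings


class Supervisor:
    """Spawns sub-agents with dynamically generated prompts and
    routes their findings through a triage step before reporting."""

    def generate_prompt(self, task: str) -> str:
        # Dynamic prompt generation: tailor an expert persona to the task.
        return f"You are an expert in {task}. Enumerate and report findings."

    def triage(self, finding: str) -> bool:
        # Triage stub: a real module would attempt to re-verify the
        # vulnerability; here we simply drop empty placeholder results.
        return "no finding" not in finding

    def run(self, tasks: list) -> list:
        verified = []
        for task in tasks:
            agent = SubAgent(role_prompt=self.generate_prompt(task), task=task)
            for finding in agent.run():
                if self.triage(finding):
                    verified.append(finding)
        return verified


print(Supervisor().run(["SQL injection on host A", "SMB enumeration"]))  # -> []
```

The stub agents produce only placeholder results, so triage filters everything out; the point is the control flow — one supervisor, per-task dynamic prompts, and a verification gate between sub-agent output and the final report.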

Contribution

Unified scoring framework for penetration test quality assessment

The authors create a novel evaluation metric combining technical complexity scores (detection and exploit complexity) with weighted criticality ratings to systematically assess penetration testing performance, departing from standard doctrine by rewarding technically sophisticated exploits over easily exploitable vulnerabilities.
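One plausible way to combine complexity scores with weighted criticality, as the contribution describes, is sketched below. The specific weights, scales, and the multiplicative combination are assumptions for illustration; the paper's actual rubric is not reproduced here.

```python
# Hypothetical criticality weights; the paper's actual values are unknown.
CRITICALITY_WEIGHT = {"low": 1.0, "medium": 2.0, "high": 4.0, "critical": 8.0}


def submission_score(detection_complexity: int,
                     exploit_complexity: int,
                     criticality: str) -> float:
    """Combine technical-complexity ratings (assumed 1-5 each) with a
    weighted criticality rating. A multiplicative form like this rewards
    technically sophisticated exploits, matching the stated intent of
    departing from standard severity-only doctrine."""
    technical = detection_complexity + exploit_complexity
    return technical * CRITICALITY_WEIGHT[criticality]


# Under such a rubric, a hard-to-find, hard-to-exploit medium-criticality
# bug can outscore a trivially exploitable critical one:
print(submission_score(5, 5, "medium"))    # -> 20.0
print(submission_score(1, 1, "critical"))  # -> 16.0
```

The worked example shows the departure from standard doctrine: pure criticality ranking would put the critical finding first, whereas the combined score favors the technically harder medium-criticality exploit.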