Abstract:

We present the first comprehensive evaluation of AI agents against human cybersecurity professionals in a live enterprise environment. We evaluate ten cybersecurity professionals alongside six existing AI agents and ARTEMIS, our new agent scaffold, on a large university network consisting of approximately 8,000 hosts across 12 subnets. ARTEMIS is a multi-agent framework featuring dynamic prompt generation, arbitrary sub-agents, and automatic vulnerability triaging. In our comparative study, ARTEMIS placed second overall, discovering 9 valid vulnerabilities with an 82% valid submission rate and outperforming 9 of 10 human participants. While existing scaffolds such as Codex and CyAgent underperformed relative to most human participants, ARTEMIS demonstrated technical sophistication and submission quality comparable to the strongest participants. AI agents offer advantages in systematic enumeration, parallel exploitation, and cost: certain ARTEMIS variants cost $18/hour versus $60/hour for professional penetration testers. We also identify key capability gaps: AI agents exhibit higher false-positive rates and struggle with GUI-based tasks.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper contributes a direct performance comparison between AI agents and human cybersecurity professionals on a live university network of approximately 8,000 hosts, alongside the ARTEMIS multi-agent framework featuring dynamic prompt generation and vulnerability triaging. Within the taxonomy, it resides in the 'Live Enterprise Network Evaluations' leaf under 'Comparative Studies of AI and Human Performance.' This leaf contains only two papers total, indicating a relatively sparse research direction. The scarcity reflects the operational complexity and resource demands of conducting controlled experiments in production-scale enterprise environments rather than simulated testbeds.

The taxonomy reveals that most related work clusters in adjacent branches: 'AI-Driven Automated Penetration Testing Frameworks' contains 24 papers across RL-based, LLM-powered, and general automation subcategories, while 'Benchmarking and Evaluation Frameworks' holds 8 papers focused on CTF challenges and synthetic datasets. The 'Automated Versus Manual Testing Comparisons' and 'LLM-Assisted Workflow Evaluations' leaves examine similar questions but in controlled or tool-augmented settings rather than head-to-head agent-versus-human trials on live infrastructure. This work bridges the automation frameworks and evaluation methodologies by operationalizing both in a realistic enterprise context.

Among the six candidates examined for the first contribution (the comprehensive live evaluation), none was found to clearly refute it, though the limited search scope means exhaustive coverage is not guaranteed. For the ARTEMIS framework contribution, two candidates were examined and no refutations were found. The unified scoring framework contribution was not matched against any candidates in this analysis. These statistics suggest that while individual technical components (LLM agents, scoring metrics) have precedents, the integrated live-environment comparison at this scale appears less densely covered in the top-ranked semantic matches retrieved.

Based on the 6-candidate search scope, the work appears to occupy a methodologically distinct position—live enterprise evaluations remain rare compared to benchmark-driven studies. The taxonomy structure confirms that most innovation concentrates on automation techniques and synthetic testbeds rather than operational validation against human baselines. However, the limited candidate pool means adjacent work in conference proceedings or domain-specific venues may not be fully represented in this assessment.

Taxonomy

Core-task Taxonomy Papers: 48
Claimed Contributions: 3
Contribution Candidate Papers Compared: 6
Refutable Papers: 0

Research Landscape Overview

Core task: Evaluating AI agents against humans in penetration testing. The field has evolved from early automated scanning tools and reinforcement learning frameworks toward sophisticated AI-driven systems that can autonomously discover vulnerabilities, exploit networks, and even compete with human experts. The taxonomy reflects this progression through five main branches:

- AI-Driven Automated Penetration Testing Frameworks: end-to-end systems leveraging deep RL and large language models (for example, PentestGPT[8] and AutoPentester Framework[2]).
- Benchmarking and Evaluation Frameworks: standardized testbeds such as LLM Pentest Benchmark[7] and AutoPenBench[19] for measuring agent capabilities.
- Comparative Studies of AI and Human Performance: direct contests between automated methods and manual testers in controlled or live environments.
- Human-AI Collaboration and Hybrid Approaches: how human feedback and agent cooperation can enhance outcomes (illustrated by Human Feedback Pentest[27]).
- Domain-Specific and Methodological Advances: specialized settings such as IoT networks or privilege-escalation tasks, alongside novel algorithmic contributions.

Recent work shows a tension between fully autonomous agents and hybrid models that incorporate human oversight or domain knowledge. Many studies ask whether AI can match the creativity and adaptability of skilled penetration testers, particularly in complex enterprise networks where contextual reasoning is critical.

AI Agents Pentest Comparison[0] sits squarely within the Comparative Studies branch, specifically examining live enterprise network evaluations, a setting that demands both technical exploit chaining and realistic operational constraints. This positions it alongside Red Teaming AI[22], which also emphasizes real-world adversarial scenarios, yet AI Agents Pentest Comparison[0] places stronger emphasis on direct human-versus-agent performance metrics rather than red-teaming methodology alone. The work addresses open questions about scalability, interpretability, and whether current AI systems can replicate the nuanced decision-making that human experts bring to dynamic, high-stakes environments.

Claimed Contributions

First comprehensive evaluation of AI agents against human cybersecurity professionals in a live enterprise environment

The authors conduct the first direct comparison between AI agents and professional penetration testers on a real production university network with approximately 8,000 hosts, establishing empirical baselines for AI cybersecurity capabilities in realistic operational conditions.

4 retrieved papers
ARTEMIS multi-agent framework for penetration testing

The authors introduce ARTEMIS, a novel autonomous penetration-testing framework in which a supervisor manages the overall workflow, an unbounded set of sub-agents operate under dynamically generated expert prompts, and a triaging module verifies candidate vulnerabilities; the design aims to sustain long-horizon, complex tasks on production systems.

2 retrieved papers
Unified scoring framework for penetration test quality assessment

The authors create a novel evaluation metric combining technical complexity scores (detection and exploit complexity) with weighted criticality ratings to systematically assess penetration testing performance, departing from standard doctrine by rewarding technically sophisticated exploits over easily exploitable vulnerabilities.

0 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

First comprehensive evaluation of AI agents against human cybersecurity professionals in a live enterprise environment

The authors conduct the first direct comparison between AI agents and professional penetration testers on a real production university network with approximately 8,000 hosts, establishing empirical baselines for AI cybersecurity capabilities in realistic operational conditions.

Contribution

ARTEMIS multi-agent framework for penetration testing

The authors introduce ARTEMIS, a novel autonomous penetration-testing framework in which a supervisor manages the overall workflow, an unbounded set of sub-agents operate under dynamically generated expert prompts, and a triaging module verifies candidate vulnerabilities; the design aims to sustain long-horizon, complex tasks on production systems.
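The supervisor/sub-agent/triage loop described above can be sketched as follows. This is a minimal illustrative skeleton, not the paper's implementation: the class names, the prompt template, and the triage rule are all hypothetical stand-ins for the real LLM-driven components.

```python
from dataclasses import dataclass, field


@dataclass
class SubAgent:
    """Hypothetical sub-agent: an expert role prompt plus one task."""
    role_prompt: str
    task: str
    findings: list = field(default_factory=list)

    def run(self) -> list:
        # A real agent would drive an LLM/tool loop here; this stub
        # just records a placeholder result for the assigned task.
        self.findings.append(f"[{self.task}] no finding (stub)")
        return self.findings


class Supervisor:
    """Spawns sub-agents with dynamically generated prompts and
    routes their findings through a triage step before reporting."""

    def generate_prompt(self, task: str) -> str:
        # Dynamic prompt generation: tailor an expert persona to the task.
        return f"You are an expert in {task}. Enumerate and report findings."

    def triage(self, finding: str) -> bool:
        # Triage stub: a real module would attempt to re-verify the
        # vulnerability; here we simply drop empty placeholder results.
        return "no finding" not in finding

    def run(self, tasks: list) -> list:
        verified = []
        for task in tasks:
            agent = SubAgent(role_prompt=self.generate_prompt(task), task=task)
            for finding in agent.run():
                if self.triage(finding):
                    verified.append(finding)
        return verified


print(Supervisor().run(["SQL injection on host A", "SMB enumeration"]))  # -> []
```

The stub agents produce only placeholder results, so triage filters everything out; the point is the control flow — one supervisor, per-task dynamic prompts, and a verification gate between sub-agent output and the final report.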

Contribution

Unified scoring framework for penetration test quality assessment

The authors create a novel evaluation metric combining technical complexity scores (detection and exploit complexity) with weighted criticality ratings to systematically assess penetration testing performance, departing from standard doctrine by rewarding technically sophisticated exploits over easily exploitable vulnerabilities.
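One plausible way to combine complexity scores with weighted criticality, as the contribution describes, is sketched below. The specific weights, scales, and the multiplicative combination are assumptions for illustration; the paper's actual rubric is not reproduced here.

```python
# Hypothetical criticality weights; the paper's actual values are unknown.
CRITICALITY_WEIGHT = {"low": 1.0, "medium": 2.0, "high": 4.0, "critical": 8.0}


def submission_score(detection_complexity: int,
                     exploit_complexity: int,
                     criticality: str) -> float:
    """Combine technical-complexity ratings (assumed 1-5 each) with a
    weighted criticality rating. A multiplicative form like this rewards
    technically sophisticated exploits, matching the stated intent of
    departing from standard severity-only doctrine."""
    technical = detection_complexity + exploit_complexity
    return technical * CRITICALITY_WEIGHT[criticality]


# Under such a rubric, a hard-to-find, hard-to-exploit medium-criticality
# bug can outscore a trivially exploitable critical one:
print(submission_score(5, 5, "medium"))    # -> 20.0
print(submission_score(1, 1, "critical"))  # -> 16.0
```

The worked example shows the departure from standard doctrine: pure criticality ranking would put the critical finding first, whereas the combined score favors the technically harder medium-criticality exploit.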