Comparing AI Agents to Cybersecurity Professionals in Real-World Penetration Testing
Overview
Overall Novelty Assessment
The paper contributes a direct performance comparison between AI agents and human cybersecurity professionals on a live university network of approximately 8,000 hosts, alongside the ARTEMIS multi-agent framework featuring dynamic prompt generation and vulnerability triaging. Within the taxonomy, it resides in the 'Live Enterprise Network Evaluations' leaf under 'Comparative Studies of AI and Human Performance.' This leaf contains only two papers, indicating a relatively sparse research direction. The scarcity reflects the operational complexity and resource demands of conducting controlled experiments in production-scale enterprise environments rather than simulated testbeds.
The taxonomy reveals that most related work clusters in adjacent branches: 'AI-Driven Automated Penetration Testing Frameworks' contains 24 papers across RL-based, LLM-powered, and general automation subcategories, while 'Benchmarking and Evaluation Frameworks' holds eight papers focused on CTF challenges and synthetic datasets. The 'Automated Versus Manual Testing Comparisons' and 'LLM-Assisted Workflow Evaluations' leaves examine similar questions, but in controlled or tool-augmented settings rather than head-to-head agent-versus-human trials on live infrastructure. This work bridges the automation-framework and evaluation-methodology branches by operationalizing both in a realistic enterprise context.
Among the six candidates examined for the first contribution (the comprehensive live evaluation), none clearly refuted it as prior work, though the limited search scope means exhaustive coverage is not guaranteed. The ARTEMIS framework contribution was examined against two candidates, with no refutations found. The unified scoring framework contribution was not matched against any candidates in this analysis. These statistics suggest that while individual technical components (LLM agents, scoring metrics) have precedents, an integrated live-environment comparison at this scale is less densely covered among the top-ranked semantic matches retrieved.
Within the six-candidate search scope, the work appears to occupy a methodologically distinct position: live enterprise evaluations remain rare compared to benchmark-driven studies. The taxonomy structure confirms that most innovation concentrates on automation techniques and synthetic testbeds rather than operational validation against human baselines. However, the limited candidate pool means adjacent work in conference proceedings or domain-specific venues may not be fully represented in this assessment.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors conduct the first direct comparison between AI agents and professional penetration testers on a real production university network with approximately 8,000 hosts, establishing empirical baselines for AI cybersecurity capabilities in realistic operational conditions.
The authors introduce ARTEMIS, a novel autonomous penetration testing framework that combines a supervisor managing the overall workflow, an unbounded pool of sub-agents operating under dynamically generated expert prompts, and a triaging module for vulnerability verification, designed to sustain long-horizon, complex tasks on production systems.
The authors create a novel evaluation metric combining technical complexity scores (detection and exploit complexity) with weighted criticality ratings to systematically assess penetration testing performance, departing from standard doctrine by rewarding technically sophisticated exploits over easily exploitable vulnerabilities.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[22] Red teaming in the age of AI-augmented defenders: Evaluating human vs. machine tactics in professional penetration testing
Contribution Analysis
Detailed comparisons for each claimed contribution
First comprehensive evaluation of AI agents against human cybersecurity professionals in a live enterprise environment
The authors conduct the first direct comparison between AI agents and professional penetration testers on a real production university network with approximately 8,000 hosts, establishing empirical baselines for AI cybersecurity capabilities in realistic operational conditions.
[15] CAI: An Open, Bug Bounty-Ready Cybersecurity AI
[51] Optimizing AI and Human Expertise Integration in Cybersecurity: Enhancing Operational Efficiency and Collaborative Decision-Making
[52] AI Agents-as-Judge: Automated Assessment of Accuracy, Consistency, Completeness and Clarity for Enterprise Documents
[53] MX-AI: Agentic Observability and Control Platform for Open and AI-RAN
ARTEMIS multi-agent framework for penetration testing
The authors introduce ARTEMIS, a novel autonomous penetration testing framework that combines a supervisor managing the overall workflow, an unbounded pool of sub-agents operating under dynamically generated expert prompts, and a triaging module for vulnerability verification, designed to sustain long-horizon, complex tasks on production systems.
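Since the ARTEMIS implementation itself is not reproduced in this assessment, the following is a minimal sketch of the described control loop, assuming a generic `llm` text-completion stub in place of whatever model backend ARTEMIS actually uses; all class and function names (`Supervisor`, `SubAgent`, `Triager`, `llm`) are hypothetical illustrations, not the authors' code.

```python
# Hypothetical sketch of an ARTEMIS-style control loop (names illustrative).
# `llm` is a stub standing in for a real model call; it is NOT the ARTEMIS API.
from dataclasses import dataclass


def llm(prompt: str) -> str:
    """Stub model call. A real system would query an LLM backend here."""
    return f"[model output for: {prompt[:40]}...]"


@dataclass
class Finding:
    target: str
    description: str
    verified: bool = False


@dataclass
class SubAgent:
    """Worker whose expert prompt is generated per-task by the supervisor."""
    expert_prompt: str

    def run(self, task: str) -> Finding:
        output = llm(f"{self.expert_prompt}\n\nTask: {task}")
        return Finding(target=task, description=output)


class Triager:
    """Verification stage: re-checks each candidate finding before reporting."""

    def verify(self, finding: Finding) -> Finding:
        # Placeholder logic: a real triaging module would attempt to
        # reproduce the finding (e.g. re-run the exploit) before confirming.
        finding.verified = bool(llm(f"Attempt to reproduce: {finding.description}"))
        return finding


class Supervisor:
    """Decomposes the engagement into tasks and spawns sub-agents on demand,
    so the number of sub-agents is bounded only by the target list."""

    def __init__(self) -> None:
        self.triager = Triager()
        self.report: list[Finding] = []

    def engage(self, targets: list[str]) -> list[Finding]:
        for target in targets:
            # Dynamic prompt generation: tailor an expert persona per target.
            persona = llm(f"Write an expert pentester prompt for: {target}")
            candidate = SubAgent(expert_prompt=persona).run(target)
            self.report.append(self.triager.verify(candidate))
        return self.report


if __name__ == "__main__":
    for f in Supervisor().engage(["10.0.0.5", "10.0.0.8"]):
        print(f)
```

The sketch mirrors the 'unlimited sub-agents' idea by minting a fresh expert prompt per task rather than maintaining a fixed roster of specialists.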
Unified scoring framework for penetration test quality assessment
The authors create a novel evaluation metric combining technical complexity scores (detection and exploit complexity) with weighted criticality ratings to systematically assess penetration testing performance, departing from standard doctrine by rewarding technically sophisticated exploits over easily exploitable vulnerabilities.
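The exact weights and aggregation rule are not given in this assessment, so the sketch below shows one plausible instantiation of the described shape: a technical complexity term (detection plus exploit complexity) scaled by a weighted criticality rating. The numeric scales and the `CRITICALITY_WEIGHTS` table are assumptions for illustration, not the authors' parameters.

```python
# Hypothetical instantiation of the unified scoring framework; the paper's
# exact weights and aggregation are not reproduced here, only the described
# shape: technical complexity (detection + exploit) times weighted criticality.
from dataclasses import dataclass

# Assumed weight table, not the authors' parameters.
CRITICALITY_WEIGHTS = {"low": 1.0, "medium": 2.0, "high": 4.0, "critical": 8.0}


@dataclass
class ScoredFinding:
    detection_complexity: int  # assumed scale: 1 (trivial to find) .. 5 (deeply hidden)
    exploit_complexity: int    # assumed scale: 1 (point-and-click) .. 5 (multi-stage chain)
    criticality: str           # "low" | "medium" | "high" | "critical"


def finding_score(f: ScoredFinding) -> float:
    """Complexity raises the score, so a technically sophisticated exploit
    outranks an easily exploitable one of equal criticality."""
    technical = f.detection_complexity + f.exploit_complexity
    return technical * CRITICALITY_WEIGHTS[f.criticality]


def tester_score(findings: list[ScoredFinding]) -> float:
    """Aggregate one tester's (human or AI agent) findings into a single score."""
    return sum(finding_score(f) for f in findings)


# Example: a hard-to-find, hard-to-exploit high-severity bug (4+4)*4 = 32
# outscores a trivially exploitable critical one (1+1)*8 = 16.
print(tester_score([ScoredFinding(4, 4, "high"), ScoredFinding(1, 1, "critical")]))
```

Under these assumed weights, a hard-to-find, hard-to-exploit high-severity finding can outscore a trivially exploitable critical one, which is exactly the departure from standard severity-first doctrine that the contribution describes.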