CyberGym: Evaluating AI Agents' Cybersecurity Capabilities with Real-World Vulnerabilities at Scale
Overview
Overall Novelty Assessment
The paper introduces CyberGym, a large-scale benchmark with 1,507 real-world vulnerabilities across 188 software projects, designed to evaluate AI agents on generating proof-of-concept tests from vulnerability descriptions. It resides in the 'LLM-Driven PoC Synthesis' leaf, which contains five papers total, indicating a moderately populated but still emerging research direction. This leaf sits within the broader 'Automated PoC Generation Techniques' branch, which also includes constraint-based and dynamic analysis approaches, suggesting the field is actively exploring multiple synthesis paradigms.
The taxonomy reveals that CyberGym's neighboring work spans several related directions: dynamic analysis and test-guided PoC generation (four papers), constraint-based symbolic synthesis (three papers), and AI agent security benchmarks (five papers). The 'Benchmarking and Evaluation Frameworks' branch, particularly 'AI Agent and LLM Security Benchmarks,' provides the closest conceptual neighbors, as these frameworks similarly assess AI capabilities on cybersecurity tasks. The taxonomy's scope notes clarify that CyberGym's focus on LLM-driven synthesis distinguishes it from purely symbolic or fuzzing-based methods, while its benchmarking component connects it to evaluation-focused research.
Among the 30 candidates examined, the contribution-level analysis shows varied novelty signals. Neither the large-scale benchmark contribution (10 candidates examined, 0 refutable) nor the comprehensive AI agent evaluation (10 candidates examined, 0 refutable) was directly refuted within the search scope. However, the platform for open-ended vulnerability discovery (10 candidates examined, 1 refutable) has at least one candidate that overlaps with it as prior work. This suggests that while the benchmark's scale and evaluation methodology may be distinctive, the concept of using AI for real-world vulnerability discovery has some precedent within the examined literature.
Given the limited search scope of 30 semantically similar candidates, this analysis captures the most proximate prior work but cannot claim exhaustive coverage of the field. The taxonomy structure indicates CyberGym operates in a moderately active research area with established neighboring directions, yet the specific combination of large-scale benchmarking, LLM-driven PoC synthesis, and real-world vulnerability discovery appears to differentiate it from the examined candidates. The refutable finding for one contribution warrants closer inspection of the overlapping work's scope and claims.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors present CyberGym, a benchmark containing 1,507 real-world vulnerability instances from 188 diverse software projects. The benchmark tasks agents with generating proof-of-concept tests to reproduce vulnerabilities given text descriptions and codebases, using execution-based validation metrics.
The authors conduct extensive experiments evaluating four state-of-the-art agent frameworks and eleven frontier LLMs on CyberGym. Their evaluation reveals that even top-performing combinations achieve only approximately 20% success rates, demonstrating CyberGym's effectiveness in differentiating agents' cybersecurity capabilities.
The authors demonstrate that CyberGym extends beyond static benchmarking to create direct security impact. Their evaluation led to the discovery of 35 zero-day vulnerabilities and 17 incomplete patches in real-world software, with responsible disclosure to maintainers.
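The execution-based validation underlying the benchmark can be sketched as follows. This is a minimal illustration, not the authors' harness: it assumes an OSS-Fuzz-style, sanitizer-instrumented target that takes the PoC input file as a command-line argument and exits non-zero on a crash; the function name `validate_poc`, the binary paths, and the crash criterion are all illustrative assumptions.

```python
import subprocess

def validate_poc(poc_path: str, vulnerable_bin: str, patched_bin: str,
                 timeout: int = 60) -> bool:
    """Sketch of an execution-based reproduction check: a PoC counts as a
    successful reproduction when it crashes the pre-patch target but not
    the post-patch one."""
    def crashes(binary: str) -> bool:
        try:
            # Assumed convention: the target consumes the PoC file passed
            # as an argument and exits non-zero when a sanitizer fires.
            result = subprocess.run([binary, poc_path],
                                    capture_output=True, timeout=timeout)
            return result.returncode != 0
        except subprocess.TimeoutExpired:
            return False  # treat hangs as non-reproduction
    return crashes(vulnerable_bin) and not crashes(patched_bin)
```

A stricter variant could also parse the sanitizer report to match the expected crash type (e.g. heap-buffer-overflow at a given frame) rather than relying on the exit code alone.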
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[1] PoCGen: Generating Proof-of-Concept Exploits for Vulnerabilities in npm Packages
[3] PoCo: Agentic Proof-of-Concept Exploit Generation for Smart Contracts
[17] LLM Agents for Automated Web Vulnerability Reproduction: Are We There Yet?
[39] A Systematic Study on Generating Web Vulnerability Proof-of-Concepts Using Large Language Models
Contribution Analysis
Detailed comparisons for each claimed contribution
CyberGym: A large-scale, realistic cybersecurity benchmark
The authors present CyberGym, a benchmark containing 1,507 real-world vulnerability instances from 188 diverse software projects. The benchmark tasks agents with generating proof-of-concept tests to reproduce vulnerabilities given text descriptions and codebases, using execution-based validation metrics.
[10] CyberGym: Evaluating AI Agents' Cybersecurity Capabilities with Real-World Vulnerabilities at Scale
[14] SecureAgentBench: Benchmarking Secure Code Generation under Realistic Vulnerability Scenarios
[51] CVE-Bench: Benchmarking LLM-based Software Engineering Agent's Ability to Repair Real-World CVE Vulnerabilities
[52] CVE-assisted large-scale security bug report dataset construction method
[53] A large-scale empirical study on vulnerability distribution within projects and the lessons learned
[54] Cheesecloth: Zero-Knowledge Proofs of Real-World Vulnerabilities
[55] Detection of recurring software vulnerabilities
[56] On Security Weaknesses and Vulnerabilities in Deep Learning Systems
[57] Large-scale empirical study of important features indicative of discovered vulnerabilities to assess application security
[58] Benchmarking static analysis tools for web security
Comprehensive evaluation of frontier AI agents and LLMs on cybersecurity tasks
The authors conduct extensive experiments evaluating four state-of-the-art agent frameworks and eleven frontier LLMs on CyberGym. Their evaluation reveals that even top-performing combinations achieve only approximately 20% success rates, demonstrating CyberGym's effectiveness in differentiating agents' cybersecurity capabilities.
[59] A comprehensive survey: Evaluating the efficiency of artificial intelligence and machine learning techniques on cyber security solutions
[60] Considerations for evaluating large language models for cybersecurity tasks
[61] When LLMs meet cybersecurity: A systematic literature review
[62] Specification and Evaluation of Multi-Agent LLM Systems--Prototype and Cybersecurity Applications
[63] CyberPal.AI: Empowering LLMs with expert-driven cybersecurity instructions
[64] Assessing confidence in frontier AI safety cases
[65] Beyond detection: Large language models and next-generation cybersecurity
[66] From vulnerability to defense: The role of large language models in enhancing cybersecurity
[67] From Texts to Shields: Convergence of Large Language Models and Cybersecurity
[68] CyberSecEval 3: Advancing the evaluation of cybersecurity risks and capabilities in large language models
Platform for open-ended vulnerability discovery with real-world security impact
The authors demonstrate that CyberGym extends beyond static benchmarking to create direct security impact. Their evaluation led to the discovery of 35 zero-day vulnerabilities and 17 incomplete patches in real-world software, with responsible disclosure to maintainers.