Abstract:

AI agents have significant potential to reshape cybersecurity, making a thorough assessment of their capabilities critical. However, existing evaluations fall short: they rely on small-scale benchmarks and measure only static outcomes, failing to capture the full, dynamic range of real-world security challenges. To address these limitations, we introduce CyberGym, a large-scale benchmark featuring 1,507 real-world vulnerabilities across 188 software projects. Adjustable to different vulnerability analysis settings, CyberGym primarily tasks agents with generating a proof-of-concept test that reproduces a vulnerability, given only its text description and the corresponding codebase. Our extensive evaluation highlights that CyberGym effectively differentiates agents' and models' cybersecurity capabilities. Even the top-performing combinations achieve only a ~20% success rate, demonstrating the overall difficulty of CyberGym. Beyond static benchmarking, we show that CyberGym leads to the discovery of 35 zero-day vulnerabilities and 17 historically incomplete patches. These results underscore that CyberGym is not only a robust benchmark for measuring AI's progress in cybersecurity but also a platform for creating direct, real-world security impact.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces CyberGym, a large-scale benchmark with 1,507 real-world vulnerabilities across 188 software projects, designed to evaluate AI agents on generating proof-of-concept tests from vulnerability descriptions. It resides in the 'LLM-Driven PoC Synthesis' leaf, which contains five papers total, indicating a moderately populated but still emerging research direction. This leaf sits within the broader 'Automated PoC Generation Techniques' branch, which also includes constraint-based and dynamic analysis approaches, suggesting the field is actively exploring multiple synthesis paradigms.

The taxonomy reveals that CyberGym's neighboring work spans several related directions: dynamic analysis and test-guided PoC generation (four papers), constraint-based symbolic synthesis (three papers), and AI agent security benchmarks (five papers). The 'Benchmarking and Evaluation Frameworks' branch, particularly 'AI Agent and LLM Security Benchmarks,' provides the closest conceptual neighbors, as these frameworks similarly assess AI capabilities on cybersecurity tasks. The taxonomy's scope notes clarify that CyberGym's focus on LLM-driven synthesis distinguishes it from purely symbolic or fuzzing-based methods, while its benchmarking component connects it to evaluation-focused research.

Among the 30 candidates examined, the contribution-level analysis shows varied novelty signals. The large-scale benchmark contribution (10 candidates examined, 0 refutable) and the comprehensive AI agent evaluation (10 candidates examined, 0 refutable) show limited direct overlap within the search scope. However, the platform for open-ended vulnerability discovery (10 candidates examined, 1 refutable) has at least one candidate presenting overlapping prior work. This suggests that while the benchmark's scale and evaluation methodology may be distinctive, the concept of using AI for real-world vulnerability discovery has some precedent in the examined literature.

Given the limited search scope of 30 semantically similar candidates, this analysis captures the most proximate prior work but cannot claim exhaustive coverage of the field. The taxonomy structure indicates CyberGym operates in a moderately active research area with established neighboring directions, yet the specific combination of large-scale benchmarking, LLM-driven PoC synthesis, and real-world vulnerability discovery appears to differentiate it from the examined candidates. The refutable finding for one contribution warrants closer inspection of the overlapping work's scope and claims.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Paper: 1

Research Landscape Overview

Core task: generating proof-of-concept tests that reproduce vulnerabilities in software codebases. The field organizes around four main branches that reflect distinct stages and concerns in the vulnerability lifecycle. Automated PoC Generation Techniques encompasses methods for synthesizing exploits, ranging from traditional symbolic execution and fuzzing approaches to newer LLM-driven synthesis strategies that produce test cases from vulnerability descriptions. Vulnerability Detection and Validation focuses on identifying security flaws and confirming their exploitability, including static analysis, dynamic testing, and hybrid techniques. Benchmarking and Evaluation Frameworks provides standardized datasets and metrics to assess PoC generation tools, such as SEC-bench[13] and SecureAgentBench[14], which enable systematic comparison across methods. Finally, Vulnerability Comprehension and Exploitation Analysis examines how developers and attackers understand and weaponize vulnerabilities, studying exploit development workflows and the semantics of security flaws.

Recent work has seen a surge in LLM-driven approaches that promise to automate PoC creation at scale, yet these methods face challenges in balancing generality with precision. CyberGym[0] sits within the LLM-Driven PoC Synthesis cluster, tasking language-model agents with generating executable proof-of-concept tests for diverse vulnerability types and validating them through execution. It contrasts with PoCGen[1] and PoCo[3], which also leverage LLMs but differ in their use of retrieval-augmented generation versus iterative refinement strategies. Nearby efforts like Web Vulnerability Reproduction[17] and Web PoC Generation[39] focus specifically on web application contexts, highlighting domain-specific challenges in input crafting and environment setup.
A key tension across these branches is the trade-off between automation and accuracy: while LLM-based tools can rapidly produce candidate PoCs, validation remains difficult without robust execution environments and feedback mechanisms, a gap that CyberGym[0] addresses through its execution-based validation framework.

Claimed Contributions

CyberGym: A large-scale, realistic cybersecurity benchmark

The authors present CyberGym, a benchmark containing 1,507 real-world vulnerability instances from 188 diverse software projects. The benchmark tasks agents with generating proof-of-concept tests to reproduce vulnerabilities given text descriptions and codebases, using execution-based validation metrics.
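Execution-based validation of this kind can be sketched as follows. The harness below is a hypothetical illustration, not CyberGym's actual implementation: the sanitizer marker strings, the binary invocation convention, and the crash-classification policy are all assumptions. The idea is simply to run a candidate PoC input against a sanitizer-instrumented build of the vulnerable target and check whether the run exhibits the expected failure.

```python
import subprocess

# Hypothetical markers that signal a reproduced memory-safety bug
# (prefixes used by AddressSanitizer / UBSan reports on stderr).
SANITIZER_MARKERS = ("ERROR: AddressSanitizer", "runtime error:")


def classify_run(returncode: int, stderr: str) -> bool:
    """Decide whether a single execution reproduced the vulnerability:
    the process must fail AND the failure must carry a sanitizer report."""
    crashed = returncode != 0
    flagged = any(marker in stderr for marker in SANITIZER_MARKERS)
    return crashed and flagged


def reproduces(binary: str, poc_path: str, timeout: int = 30) -> bool:
    """Run a sanitizer-instrumented target on a PoC input file and report
    whether the run is classified as reproducing the vulnerability."""
    try:
        result = subprocess.run(
            [binary, poc_path], capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        # Hangs are not counted as reproduction in this sketch.
        return False
    return classify_run(result.returncode, result.stderr)
```

A benchmark built on this check scores an agent's output by whether `reproduces(...)` holds on the vulnerable build (and, optionally, does not hold on the patched build), which makes the metric objective and independent of how the PoC was produced.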

10 retrieved papers
Comprehensive evaluation of frontier AI agents and LLMs on cybersecurity tasks

The authors conduct extensive experiments evaluating four state-of-the-art agent frameworks and eleven frontier LLMs on CyberGym. Their evaluation reveals that even top-performing combinations achieve only approximately 20% success rates, underscoring the benchmark's difficulty and its ability to differentiate agents' cybersecurity capabilities.

10 retrieved papers
Platform for open-ended vulnerability discovery with real-world security impact

The authors demonstrate that CyberGym extends beyond static benchmarking to create direct security impact. Their evaluation led to the discovery of 35 zero-day vulnerabilities and 17 incomplete patches in real-world software, with responsible disclosure to maintainers.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

CyberGym: A large-scale, realistic cybersecurity benchmark

The authors present CyberGym, a benchmark containing 1,507 real-world vulnerability instances from 188 diverse software projects. The benchmark tasks agents with generating proof-of-concept tests to reproduce vulnerabilities given text descriptions and codebases, using execution-based validation metrics.

Contribution

Comprehensive evaluation of frontier AI agents and LLMs on cybersecurity tasks

The authors conduct extensive experiments evaluating four state-of-the-art agent frameworks and eleven frontier LLMs on CyberGym. Their evaluation reveals that even top-performing combinations achieve only approximately 20% success rates, underscoring the benchmark's difficulty and its ability to differentiate agents' cybersecurity capabilities.

Contribution

Platform for open-ended vulnerability discovery with real-world security impact

The authors demonstrate that CyberGym extends beyond static benchmarking to create direct security impact. Their evaluation led to the discovery of 35 zero-day vulnerabilities and 17 incomplete patches in real-world software, with responsible disclosure to maintainers.