Abstract:

AI agents have significant potential to reshape cybersecurity, making a thorough assessment of their capabilities critical. However, existing evaluations fall short: they rely on small-scale benchmarks and measure only static outcomes, failing to capture the full, dynamic range of real-world security challenges. To address these limitations, we introduce CyberGym, a large-scale benchmark featuring 1,507 real-world vulnerabilities across 188 software projects. Adjustable to different vulnerability analysis settings, CyberGym primarily tasks agents with generating a proof-of-concept test that reproduces a vulnerability, given only its text description and the corresponding codebase. Our extensive evaluation highlights that CyberGym effectively differentiates agents' and models' cybersecurity capabilities. Even the top-performing combinations achieve only a ~20% success rate, demonstrating the overall difficulty of CyberGym. Beyond static benchmarking, we show that CyberGym leads to the discovery of 35 zero-day vulnerabilities and 17 historically incomplete patches. These results underscore that CyberGym is not only a robust benchmark for measuring AI's progress in cybersecurity but also a platform for creating direct, real-world security impact.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces CyberGym, a large-scale benchmark with 1,507 real-world vulnerabilities across 188 software projects, designed to evaluate AI agents on generating proof-of-concept tests from vulnerability descriptions. It resides in the 'LLM-Driven PoC Synthesis' leaf, which contains five papers total, indicating a moderately populated but still emerging research direction. This leaf sits within the broader 'Automated PoC Generation Techniques' branch, which also includes constraint-based and dynamic analysis approaches, suggesting the field is actively exploring multiple synthesis paradigms.

The taxonomy reveals that CyberGym's neighboring work spans several related directions: dynamic analysis and test-guided PoC generation (four papers), constraint-based symbolic synthesis (three papers), and AI agent security benchmarks (five papers). The 'Benchmarking and Evaluation Frameworks' branch, particularly 'AI Agent and LLM Security Benchmarks,' provides the closest conceptual neighbors, as these frameworks similarly assess AI capabilities on cybersecurity tasks. The taxonomy's scope notes clarify that CyberGym's focus on LLM-driven synthesis distinguishes it from purely symbolic or fuzzing-based methods, while its benchmarking component connects it to evaluation-focused research.

Among the 30 candidates examined, the contribution-level analysis shows varied novelty signals. The large-scale benchmark contribution (10 candidates examined, 0 refutable) and the comprehensive AI agent evaluation (10 candidates examined, 0 refutable) show limited direct overlap within the search scope. However, the platform for open-ended vulnerability discovery (10 candidates examined, 1 refutable) has at least one candidate presenting overlapping prior work. This suggests that while the benchmark's scale and evaluation methodology may be distinctive, the concept of using AI for real-world vulnerability discovery has some precedent in the examined literature.

Given the limited search scope of 30 semantically similar candidates, this analysis captures the most proximate prior work but cannot claim exhaustive coverage of the field. The taxonomy structure indicates CyberGym operates in a moderately active research area with established neighboring directions, yet the specific combination of large-scale benchmarking, LLM-driven PoC synthesis, and real-world vulnerability discovery appears to differentiate it from the examined candidates. The refutable finding for one contribution warrants closer inspection of the overlapping work's scope and claims.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Paper: 1

Research Landscape Overview

Core task: generating proof-of-concept tests that reproduce vulnerabilities in software codebases. The field organizes around four main branches that reflect distinct stages and concerns in the vulnerability lifecycle. Automated PoC Generation Techniques encompasses methods for synthesizing exploits, ranging from traditional symbolic execution and fuzzing approaches to newer LLM-driven synthesis strategies that produce test cases from vulnerability descriptions. Vulnerability Detection and Validation focuses on identifying security flaws and confirming their exploitability, including static analysis, dynamic testing, and hybrid techniques. Benchmarking and Evaluation Frameworks provides standardized datasets and metrics to assess PoC generation tools, such as SEC-bench[13] and SecureAgentBench[14], which enable systematic comparison across methods. Finally, Vulnerability Comprehension and Exploitation Analysis examines how developers and attackers understand and weaponize vulnerabilities, studying exploit development workflows and the semantics of security flaws.

Recent work has seen a surge in LLM-driven approaches that promise to automate PoC creation at scale, yet these methods face challenges in balancing generality with precision. CyberGym[0] sits within the LLM-Driven PoC Synthesis cluster, tasking language-model agents with generating executable proof-of-concept tests for diverse vulnerability types and validating them through execution. It contrasts with PoCGen[1] and PoCo[3], which also leverage LLMs but differ in their use of retrieval-augmented generation versus iterative refinement strategies. Nearby efforts like Web Vulnerability Reproduction[17] and Web PoC Generation[39] focus specifically on web application contexts, highlighting domain-specific challenges in input crafting and environment setup.
A key tension across these branches is the trade-off between automation and accuracy: while LLM-based tools can rapidly produce candidate PoCs, validation remains difficult without robust execution environments and feedback mechanisms, a gap that CyberGym[0] addresses through its execution-based validation framework.

Claimed Contributions

CyberGym: A large-scale, realistic cybersecurity benchmark

The authors present CyberGym, a benchmark containing 1,507 real-world vulnerability instances from 188 diverse software projects. The benchmark tasks agents with generating proof-of-concept tests to reproduce vulnerabilities given text descriptions and codebases, using execution-based validation metrics.
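Execution-based validation of this kind can be sketched as follows. The harness below is a hypothetical illustration, not CyberGym's actual implementation: the sanitizer marker strings, the binary invocation convention, and the crash-classification policy are all assumptions. The idea is simply to run a candidate PoC input against a sanitizer-instrumented build of the vulnerable target and check whether the run exhibits the expected failure.

```python
import subprocess

# Hypothetical markers that signal a reproduced memory-safety bug
# (prefixes used by AddressSanitizer / UBSan reports on stderr).
SANITIZER_MARKERS = ("ERROR: AddressSanitizer", "runtime error:")


def classify_run(returncode: int, stderr: str) -> bool:
    """Decide whether a single execution reproduced the vulnerability:
    the process must fail AND the failure must carry a sanitizer report."""
    crashed = returncode != 0
    flagged = any(marker in stderr for marker in SANITIZER_MARKERS)
    return crashed and flagged


def reproduces(binary: str, poc_path: str, timeout: int = 30) -> bool:
    """Run a sanitizer-instrumented target on a PoC input file and report
    whether the run is classified as reproducing the vulnerability."""
    try:
        result = subprocess.run(
            [binary, poc_path], capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        # Hangs are not counted as reproduction in this sketch.
        return False
    return classify_run(result.returncode, result.stderr)
```

A benchmark built on this check scores an agent's output by whether `reproduces(...)` holds on the vulnerable build (and, optionally, does not hold on the patched build), which makes the metric objective and independent of how the PoC was produced.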

10 retrieved papers
Comprehensive evaluation of frontier AI agents and LLMs on cybersecurity tasks

The authors conduct extensive experiments evaluating four state-of-the-art agent frameworks and eleven frontier LLMs on CyberGym. Their evaluation reveals that even top-performing combinations achieve only approximately 20% success rates, underscoring the benchmark's difficulty and its ability to differentiate agents' cybersecurity capabilities.

10 retrieved papers
Platform for open-ended vulnerability discovery with real-world security impact

The authors demonstrate that CyberGym extends beyond static benchmarking to create direct security impact. Their evaluation led to the discovery of 35 zero-day vulnerabilities and 17 incomplete patches in real-world software, with responsible disclosure to maintainers.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

CyberGym: A large-scale, realistic cybersecurity benchmark

The authors present CyberGym, a benchmark containing 1,507 real-world vulnerability instances from 188 diverse software projects. The benchmark tasks agents with generating proof-of-concept tests to reproduce vulnerabilities given text descriptions and codebases, using execution-based validation metrics.

Contribution

Comprehensive evaluation of frontier AI agents and LLMs on cybersecurity tasks

The authors conduct extensive experiments evaluating four state-of-the-art agent frameworks and eleven frontier LLMs on CyberGym. Their evaluation reveals that even top-performing combinations achieve only approximately 20% success rates, underscoring the benchmark's difficulty and its ability to differentiate agents' cybersecurity capabilities.

Contribution

Platform for open-ended vulnerability discovery with real-world security impact

The authors demonstrate that CyberGym extends beyond static benchmarking to create direct security impact. Their evaluation led to the discovery of 35 zero-day vulnerabilities and 17 incomplete patches in real-world software, with responsible disclosure to maintainers.