DevOps-Gym: Benchmarking AI Agents in Software DevOps Cycle

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: Benchmark, Agent, Tool Call
Abstract:

Although AI agents have demonstrated extraordinary capabilities in code generation and software issue resolving, their capabilities across the full software DevOps cycle remain unknown. Unlike pure code generation, handling the DevOps cycle of real-world software, including development, deployment, and management, requires analyzing large-scale projects, understanding dynamic program behaviors, leveraging domain-specific tools, and making sequential decisions. Existing benchmarks, however, focus on isolated problems and lack environments and tool interfaces for DevOps. We introduce DevOps-Gym, the first end-to-end benchmark for evaluating AI agents across core DevOps workflows: build and configuration, monitoring, issue resolving, and test generation. DevOps-Gym comprises 700+ real-world tasks collected from 30+ Java and Go projects. We develop a semi-automated data collection mechanism with rigorous, non-trivial expert effort to ensure task coverage and quality. Our evaluation of state-of-the-art models and agents reveals fundamental limitations: they struggle with issue resolving and test generation in Java and Go, and cannot yet handle new task types such as monitoring and build and configuration. These results highlight the need for further research into automating the full DevOps cycle with AI agents.

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces DevOps-Gym, a benchmark evaluating AI agents across build, configuration, monitoring, issue resolution, and test generation workflows. It resides in the 'End-to-End DevOps Workflow Benchmarks' leaf, which currently contains only this paper among the 50 surveyed works. This isolation suggests the research direction—comprehensive DevOps lifecycle evaluation—is notably sparse compared to crowded areas like multi-agent frameworks (7 papers) or predictive CI/CD analytics (7 papers). The taxonomy reveals most prior work targets isolated stages rather than integrated workflows.

The taxonomy tree shows neighboring leaves focus on IT operations benchmarks (ITBench, AIOpsLab, SRE-Bench) and agent self-evaluation frameworks, which address operational incident response or meta-assessment rather than full DevOps cycles. The broader 'Evaluation Frameworks' branch sits alongside 'AI Agent Frameworks' (10 papers across 4 leaves) and 'DevOps Pipeline Automation' (16 papers across 4 leaves). DevOps-Gym bridges these domains by evaluating agents on tasks that span development, deployment, and operations, diverging from narrower benchmarks that isolate testing, code generation, or monitoring.

Among the 30 candidates examined in total (10 per contribution), the semi-automated data collection mechanism (Contribution 2) drew one refutable candidate, while the core benchmark (Contribution 1) and the evaluation framework (Contribution 3) had no clear refutations. These statistics reflect top-K semantic matches plus citation expansion rather than exhaustive coverage. Within that limited scope, Contribution 1 appears more novel, whereas Contribution 2 shows some overlap with data collection practices in related benchmarks.

Based on the 30-candidate search, the work occupies a sparsely populated research direction within a field that has concentrated effort on agent architectures and pipeline optimization. The taxonomy structure and contribution-level statistics suggest the end-to-end DevOps evaluation angle is relatively underexplored, though the limited search scope precludes definitive claims about absolute novelty across all relevant literature.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 30
Refutable papers: 1

Research Landscape Overview

Core task: Evaluating AI agents across software DevOps workflows.

The field has coalesced around several complementary perspectives. One major branch examines AI agent frameworks and architectures for software development, encompassing multi-agent collaboration platforms like ChatDev[3] and Communicative Agents[1], as well as broader architectural considerations captured in works such as Agents Development Architectures[4]. A second branch focuses on AI-driven DevOps pipeline automation and optimization, addressing how generative AI can transform CI/CD processes, monitoring, and self-healing mechanisms, as illustrated by studies like AI Monitoring CICD[8] and Self Healing DevOps[37]. A third branch centers on evaluation frameworks and benchmarks, including end-to-end workflow benchmarks such as ITBench[5] and AIOpsLab[22], which provide standardized testbeds for measuring agent performance across realistic DevOps scenarios. Finally, a conceptual branch explores broader impacts and human-AI collaboration themes, as seen in Agentic AI Perspectives[17] and AI Teammates[16], situating technical advances within organizational and societal contexts.

Recent work reveals a tension between holistic end-to-end evaluation and narrower task-specific benchmarks. While many studies target isolated stages (code generation, testing, or deployment), a smaller cluster emphasizes comprehensive workflows that mirror real-world DevOps complexity. DevOps Gym[0] falls squarely within this end-to-end evaluation branch, offering a benchmark that spans multiple DevOps phases rather than isolating a single subprocess. Compared to ITBench[5], which also adopts a broad workflow perspective, DevOps Gym[0] places particular emphasis on realistic task diversity and agent adaptability across the full software lifecycle. Meanwhile, works like AIOpsLab[22] focus more heavily on operational incident response, highlighting how different benchmarks carve out distinct slices of the DevOps landscape. This positioning underscores an open question in the field: whether unified, all-encompassing benchmarks or modular, stage-specific evaluations better capture the capabilities and limitations of AI agents in production environments.

Claimed Contributions

DevOps-Gym benchmark for evaluating AI agents across the full DevOps cycle

The authors present DevOps-Gym, a novel benchmark designed to evaluate AI agents on four critical DevOps stages (build and configuration, monitoring, issue resolving, and test generation) using real-world repositories and various DevOps tools. The benchmark includes 700+ tasks from 30+ Java and Go projects and provides tool-augmented dynamic evaluation environments with standardized interfaces.

10 retrieved papers
Semi-automated data collection mechanism with expert validation

The authors introduce a systematic approach for collecting and validating benchmark tasks that combines automated mining of GitHub issues with extensive manual expert effort to reproduce failures, ensure task correctness, and prevent data contamination through prefix-completion analysis and repository sanitization.

10 retrieved papers
Can Refute
Comprehensive evaluation framework with standardized tool interfaces and metrics

The authors develop an evaluation infrastructure that provides standardized command-line tool interfaces in the terminal-bench format, along with task-specific metrics for different DevOps stages, enabling rigorous and scalable assessment of agent capabilities in realistic DevOps scenarios.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

DevOps-Gym benchmark for evaluating AI agents across the full DevOps cycle

The authors present DevOps-Gym, a novel benchmark designed to evaluate AI agents on four critical DevOps stages (build and configuration, monitoring, issue resolving, and test generation) using real-world repositories and various DevOps tools. The benchmark includes 700+ tasks from 30+ Java and Go projects and provides tool-augmented dynamic evaluation environments with standardized interfaces.
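To make the benchmark's shape concrete, here is a minimal sketch of what one task record and its loader-side validation might look like, assuming a simple dict-based schema. All field names (`stage`, `tools`, `eval`, etc.), the stage identifiers, and the example values are illustrative assumptions, not DevOps-Gym's actual format.

```python
# Illustrative shape of a single benchmark task record; the field names
# and values below are hypothetical, not DevOps-Gym's actual schema.
STAGES = {"build_config", "monitoring", "issue_resolving", "test_generation"}

task = {
    "id": "monitoring-go-0017",
    "stage": "monitoring",                  # one of the four DevOps stages
    "language": "go",                       # the benchmark covers Java and Go
    "repo": "https://github.com/example/project",
    "tools": ["profiler", "log-analyzer"],  # domain tools exposed to the agent
    "eval": {"metric": "stage_specific", "timeout_s": 1800},
}

def validate(record: dict) -> bool:
    """Basic sanity checks a task loader might run on each record."""
    required = {"id", "stage", "language", "repo", "tools", "eval"}
    return (required <= record.keys()
            and record["stage"] in STAGES
            and record["language"] in {"java", "go"})

print(validate(task))  # True for the well-formed record above
```

A loader built this way can reject malformed records early, before an agent session is ever launched against the repository.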

Contribution

Semi-automated data collection mechanism with expert validation

The authors introduce a systematic approach for collecting and validating benchmark tasks that combines automated mining of GitHub issues with extensive manual expert effort to reproduce failures, ensure task correctness, and prevent data contamination through prefix-completion analysis and repository sanitization.
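The prefix-completion analysis mentioned above can be sketched roughly as follows. This is a minimal illustration under assumed details: the split ratio, the similarity threshold, and the `complete_fn` callback are all hypothetical choices, and a toy stand-in replaces a real LLM call; it is not the authors' actual pipeline.

```python
import difflib

def contamination_score(ground_truth: str, completion: str) -> float:
    """Similarity between the model's completion and the held-out suffix.
    A score near 1.0 suggests the model may have memorized the file."""
    return difflib.SequenceMatcher(None, ground_truth, completion).ratio()

def prefix_completion_check(source: str, complete_fn,
                            prefix_ratio: float = 0.5,
                            threshold: float = 0.8) -> bool:
    """Split a task file into prefix/suffix, ask the model to complete the
    prefix, and flag the task if the completion closely matches the suffix."""
    lines = source.splitlines(keepends=True)
    cut = max(1, int(len(lines) * prefix_ratio))
    prefix, suffix = "".join(lines[:cut]), "".join(lines[cut:])
    completion = complete_fn(prefix)  # in practice, an LLM API call
    return contamination_score(suffix, completion) >= threshold

# Toy stand-in for a model that has memorized one file verbatim.
MEMORIZED = "def add(a, b):\n    return a + b\n"
def toy_model(prefix: str) -> str:
    return MEMORIZED[len(prefix):] if MEMORIZED.startswith(prefix) else "pass\n"

print(prefix_completion_check(MEMORIZED, toy_model))  # True: memorized
```

Tasks flagged this way would then go to expert reviewers, matching the report's description of combining automated mining with manual validation.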

Contribution

Comprehensive evaluation framework with standardized tool interfaces and metrics

The authors develop an evaluation infrastructure that provides standardized command-line tool interfaces in the terminal-bench format, along with task-specific metrics for different DevOps stages, enabling rigorous and scalable assessment of agent capabilities in realistic DevOps scenarios.
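As a rough illustration of how stage-specific results might be rolled up, here is a minimal sketch assuming simple pass/fail records per task. The record fields, stage names, and aggregation scheme are assumptions for illustration, not the benchmark's actual harness or metrics.

```python
from collections import defaultdict

# Hypothetical result records: one per (stage, task) attempt.
results = [
    {"stage": "issue_resolving", "task": "go-123", "passed": True},
    {"stage": "issue_resolving", "task": "java-7", "passed": False},
    {"stage": "test_generation", "task": "java-9", "passed": True},
    {"stage": "build_config",    "task": "go-42",  "passed": False},
]

def per_stage_rates(records):
    """Aggregate pass/fail records into a success rate per DevOps stage."""
    totals, passes = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["stage"]] += 1
        passes[r["stage"]] += int(r["passed"])
    return {stage: passes[stage] / totals[stage] for stage in totals}

print(per_stage_rates(results))
# {'issue_resolving': 0.5, 'test_generation': 1.0, 'build_config': 0.0}
```

Keeping the aggregation separate from task execution lets the same harness report pass rates uniformly even when the underlying per-stage metrics (e.g., a build succeeding vs. a generated test compiling) differ.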