DevOps-Gym: Benchmarking AI Agents in Software DevOps Cycle

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: Benchmark, Agent, Tool Call
Abstract:

Although AI agents have demonstrated extraordinary capabilities in code generation and software issue resolving, their capabilities across the full software DevOps cycle remain unknown. Unlike pure code generation, handling the DevOps cycle of real-world software, including development, deployment, and management, requires analyzing large-scale projects, understanding dynamic program behaviors, leveraging domain-specific tools, and making sequential decisions. Existing benchmarks, however, focus on isolated problems and lack environments and tool interfaces for DevOps. We introduce DevOps-Gym, the first end-to-end benchmark for evaluating AI agents across core DevOps workflows: build and configuration, monitoring, issue resolving, and test generation. DevOps-Gym comprises 700+ real-world tasks collected from 30+ Java and Go projects. We develop a semi-automated data collection mechanism with rigorous, non-trivial expert effort to ensure task coverage and quality. Our evaluation of state-of-the-art models and agents reveals fundamental limitations: they struggle with issue resolving and test generation in Java and Go, and cannot yet handle new task types such as monitoring and build and configuration. These results highlight the need for further research into automating the full DevOps cycle with AI agents.

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces DevOps-Gym, a benchmark evaluating AI agents across build, configuration, monitoring, issue resolution, and test generation workflows. It resides in the 'End-to-End DevOps Workflow Benchmarks' leaf, which currently contains only this paper among the 50 surveyed works. This isolation suggests the research direction—comprehensive DevOps lifecycle evaluation—is notably sparse compared to crowded areas like multi-agent frameworks (7 papers) or predictive CI/CD analytics (7 papers). The taxonomy reveals most prior work targets isolated stages rather than integrated workflows.

The taxonomy tree shows neighboring leaves focus on IT operations benchmarks (ITBench, AIOpsLab, SRE-Bench) and agent self-evaluation frameworks, which address operational incident response or meta-assessment rather than full DevOps cycles. The broader 'Evaluation Frameworks' branch sits alongside 'AI Agent Frameworks' (10 papers across 4 leaves) and 'DevOps Pipeline Automation' (16 papers across 4 leaves). DevOps-Gym bridges these domains by evaluating agents on tasks that span development, deployment, and operations, diverging from narrower benchmarks that isolate testing, code generation, or monitoring.

Among the 30 candidates examined in total (10 per contribution), the semi-automated data collection mechanism (Contribution 2) drew one refutable candidate, while the core benchmark (Contribution 1) and the evaluation framework (Contribution 3) had no clear refutations. These statistics reflect top-K semantic matches plus citation expansion rather than exhaustive coverage. Within that limited scope, Contribution 1 appears more novel, whereas Contribution 2 shows some overlap with data collection practices in related benchmarks.

Based on the 30-candidate search, the work occupies a sparsely populated research direction within a field that has concentrated effort on agent architectures and pipeline optimization. The taxonomy structure and contribution-level statistics suggest the end-to-end DevOps evaluation angle is relatively underexplored, though the limited search scope precludes definitive claims about absolute novelty across all relevant literature.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 30
Refutable papers: 1

Research Landscape Overview

Core task: Evaluating AI agents across software DevOps workflows.

The field has coalesced around several complementary perspectives. One major branch examines AI agent frameworks and architectures for software development, encompassing multi-agent collaboration platforms like ChatDev[3] and Communicative Agents[1], as well as broader architectural considerations captured in works such as Agents Development Architectures[4]. A second branch focuses on AI-driven DevOps pipeline automation and optimization, addressing how generative AI can transform CI/CD processes, monitoring, and self-healing mechanisms, as illustrated by studies like AI Monitoring CICD[8] and Self Healing DevOps[37]. A third branch centers on evaluation frameworks and benchmarks, including end-to-end workflow benchmarks such as ITBench[5] and AIOpsLab[22], which provide standardized testbeds for measuring agent performance across realistic DevOps scenarios. Finally, a conceptual branch explores broader impacts and human-AI collaboration themes, as seen in Agentic AI Perspectives[17] and AI Teammates[16], situating technical advances within organizational and societal contexts.

Recent work reveals a tension between holistic end-to-end evaluation and narrower task-specific benchmarks. While many studies target isolated stages (code generation, testing, or deployment), a smaller cluster emphasizes comprehensive workflows that mirror real-world DevOps complexity. DevOps Gym[0] falls squarely within this end-to-end evaluation branch, offering a benchmark that spans multiple DevOps phases rather than isolating a single subprocess. Compared to ITBench[5], which also adopts a broad workflow perspective, DevOps Gym[0] places particular emphasis on realistic task diversity and agent adaptability across the full software lifecycle. Meanwhile, works like AIOpsLab[22] focus more heavily on operational incident response, highlighting how different benchmarks carve out distinct slices of the DevOps landscape. This positioning underscores an open question in the field: whether unified, all-encompassing benchmarks or modular, stage-specific evaluations better capture the capabilities and limitations of AI agents in production environments.

Claimed Contributions

DevOps-Gym benchmark for evaluating AI agents across the full DevOps cycle

The authors present DevOps-Gym, a novel benchmark designed to evaluate AI agents on four critical DevOps stages (build and configuration, monitoring, issue resolving, and test generation) using real-world repositories and various DevOps tools. The benchmark includes 700+ tasks from 30+ Java and Go projects and provides tool-augmented dynamic evaluation environments with standardized interfaces.

10 retrieved papers
Semi-automated data collection mechanism with expert validation

The authors introduce a systematic approach for collecting and validating benchmark tasks that combines automated mining of GitHub issues with extensive manual expert effort to reproduce failures, ensure task correctness, and prevent data contamination through prefix-completion analysis and repository sanitization.

10 retrieved papers
Can Refute
Comprehensive evaluation framework with standardized tool interfaces and metrics

The authors develop an evaluation infrastructure that provides standardized command-line tool interfaces in the terminal-bench format, along with task-specific metrics for different DevOps stages, enabling rigorous and scalable assessment of agent capabilities in realistic DevOps scenarios.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

DevOps-Gym benchmark for evaluating AI agents across the full DevOps cycle

The authors present DevOps-Gym, a novel benchmark designed to evaluate AI agents on four critical DevOps stages (build and configuration, monitoring, issue resolving, and test generation) using real-world repositories and various DevOps tools. The benchmark includes 700+ tasks from 30+ Java and Go projects and provides tool-augmented dynamic evaluation environments with standardized interfaces.
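To make the benchmark's shape concrete, here is a minimal sketch of what one task record and its loader-side validation might look like, assuming a simple dict-based schema. All field names (`stage`, `tools`, `eval`, etc.), the stage identifiers, and the example values are illustrative assumptions, not DevOps-Gym's actual format.

```python
# Illustrative shape of a single benchmark task record; the field names
# and values below are hypothetical, not DevOps-Gym's actual schema.
STAGES = {"build_config", "monitoring", "issue_resolving", "test_generation"}

task = {
    "id": "monitoring-go-0017",
    "stage": "monitoring",                  # one of the four DevOps stages
    "language": "go",                       # the benchmark covers Java and Go
    "repo": "https://github.com/example/project",
    "tools": ["profiler", "log-analyzer"],  # domain tools exposed to the agent
    "eval": {"metric": "stage_specific", "timeout_s": 1800},
}

def validate(record: dict) -> bool:
    """Basic sanity checks a task loader might run on each record."""
    required = {"id", "stage", "language", "repo", "tools", "eval"}
    return (required <= record.keys()
            and record["stage"] in STAGES
            and record["language"] in {"java", "go"})

print(validate(task))  # True for the well-formed record above
```

A loader built this way can reject malformed records early, before an agent session is ever launched against the repository.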

Contribution

Semi-automated data collection mechanism with expert validation

The authors introduce a systematic approach for collecting and validating benchmark tasks that combines automated mining of GitHub issues with extensive manual expert effort to reproduce failures, ensure task correctness, and prevent data contamination through prefix-completion analysis and repository sanitization.
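The prefix-completion analysis mentioned above can be sketched roughly as follows. This is a minimal illustration under assumed details: the split ratio, the similarity threshold, and the `complete_fn` callback are all hypothetical choices, and a toy stand-in replaces a real LLM call; it is not the authors' actual pipeline.

```python
import difflib

def contamination_score(ground_truth: str, completion: str) -> float:
    """Similarity between the model's completion and the held-out suffix.
    A score near 1.0 suggests the model may have memorized the file."""
    return difflib.SequenceMatcher(None, ground_truth, completion).ratio()

def prefix_completion_check(source: str, complete_fn,
                            prefix_ratio: float = 0.5,
                            threshold: float = 0.8) -> bool:
    """Split a task file into prefix/suffix, ask the model to complete the
    prefix, and flag the task if the completion closely matches the suffix."""
    lines = source.splitlines(keepends=True)
    cut = max(1, int(len(lines) * prefix_ratio))
    prefix, suffix = "".join(lines[:cut]), "".join(lines[cut:])
    completion = complete_fn(prefix)  # in practice, an LLM API call
    return contamination_score(suffix, completion) >= threshold

# Toy stand-in for a model that has memorized one file verbatim.
MEMORIZED = "def add(a, b):\n    return a + b\n"
def toy_model(prefix: str) -> str:
    return MEMORIZED[len(prefix):] if MEMORIZED.startswith(prefix) else "pass\n"

print(prefix_completion_check(MEMORIZED, toy_model))  # True: memorized
```

Tasks flagged this way would then go to expert reviewers, matching the report's description of combining automated mining with manual validation.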

Contribution

Comprehensive evaluation framework with standardized tool interfaces and metrics

The authors develop an evaluation infrastructure that provides standardized command-line tool interfaces in the terminal-bench format, along with task-specific metrics for different DevOps stages, enabling rigorous and scalable assessment of agent capabilities in realistic DevOps scenarios.
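As a rough illustration of how stage-specific results might be rolled up, here is a minimal sketch assuming simple pass/fail records per task. The record fields, stage names, and aggregation scheme are assumptions for illustration, not the benchmark's actual harness or metrics.

```python
from collections import defaultdict

# Hypothetical result records: one per (stage, task) attempt.
results = [
    {"stage": "issue_resolving", "task": "go-123", "passed": True},
    {"stage": "issue_resolving", "task": "java-7", "passed": False},
    {"stage": "test_generation", "task": "java-9", "passed": True},
    {"stage": "build_config",    "task": "go-42",  "passed": False},
]

def per_stage_rates(records):
    """Aggregate pass/fail records into a success rate per DevOps stage."""
    totals, passes = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["stage"]] += 1
        passes[r["stage"]] += int(r["passed"])
    return {stage: passes[stage] / totals[stage] for stage in totals}

print(per_stage_rates(results))
# {'issue_resolving': 0.5, 'test_generation': 1.0, 'build_config': 0.0}
```

Keeping the aggregation separate from task execution lets the same harness report pass rates uniformly even when the underlying per-stage metrics (e.g., a build succeeding vs. a generated test compiling) differ.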