SWINGARENA: Adversarial Programming Arena for Long-context GitHub Issue Solving

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Arena, Real-World GitHub Issues, Adversarial Programming, Retrieval-Augmented Generation, Continuous Integration, Code Benchmark
Abstract:

We present \textsc{SwingArena}, an adversarial evaluation framework for Large Language Models (LLMs) that closely mirrors real-world software development workflows. Unlike traditional static benchmarks, \textsc{SwingArena} models the collaborative process of software iteration by pairing LLMs as \textit{submitters}, who generate patches, and \textit{reviewers}, who create test cases and verify the patches through continuous integration (CI) pipelines. To support these interactive evaluations, we introduce a retrieval-augmented code generation (RACG) module that efficiently handles long-context challenges by providing syntactically and semantically relevant code snippets from large codebases, supporting multiple programming languages (C++, Python, Rust, and Go). This enables the framework to scale across diverse tasks and contexts while respecting token limitations. Our experiments, using over 400 high-quality real-world GitHub issues selected from a pool of 2,300 issues, show that models like GPT-4o excel at aggressive patch generation, whereas DeepSeek and Gemini prioritize correctness in CI validation. \textsc{SwingArena} presents a scalable and extensible methodology for evaluating LLMs in realistic, CI-driven software development settings. The complete codebase and benchmark are available at https://anonymous.4open.science/r/Swing-Bench and will be open-sourced after the anonymity period.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces SwingArena, an adversarial evaluation framework that models collaborative software development workflows by pairing LLMs as submitters and reviewers. It resides in the 'Comprehensive Robustness Benchmarks' leaf, which contains four papers total (including this one). This leaf sits within the broader 'Robustness Evaluation and Benchmarking' branch, indicating a moderately populated research direction focused on systematic assessment of LLM resilience. The framework's emphasis on interactive, multi-agent evaluation distinguishes it from static benchmark approaches common in sibling works.

The taxonomy reveals neighboring leaves addressing complementary evaluation dimensions: 'Prompt and Input Variation Studies' examines sensitivity to textual modifications, 'Security and Vulnerability-Focused Evaluation' targets safety-critical properties, and 'Non-Functional Requirements and Code Quality' assesses maintainability and performance. SwingArena bridges these concerns by incorporating CI validation and real-world GitHub issues, connecting robustness assessment to practical software engineering workflows. The scope note for this leaf explicitly covers 'multiple dimensions or perturbation types,' positioning SwingArena's multi-language, multi-task design as aligned with the category's breadth.

Across the three contributions, thirteen candidate papers were examined in total. The RACG module accounts for ten of them, one of which is refutable, suggesting moderate prior work in retrieval-augmented code generation techniques. The SwingArena framework itself was compared against three candidates with zero refutations, indicating relative novelty in adversarial multi-agent evaluation designs. The CI-grounded dataset had no retrieved candidates, likely reflecting its role as an empirical contribution rather than a methodological innovation. These statistics reflect a limited semantic search scope, not exhaustive coverage of all relevant literature in code generation and evaluation.

Based on the top-thirteen semantic matches analyzed, the framework appears to occupy a less-crowded niche within comprehensive robustness benchmarking, particularly in its adversarial multi-agent design. The RACG component builds on established retrieval-augmented generation concepts, while the dataset contribution remains unexamined in this search. The analysis does not cover broader code generation literature or domain-specific benchmarks outside the retrieved candidates.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 13
Refutable Paper: 1

Research Landscape Overview

Core task: Adversarial evaluation of large language models for software development workflows.

The field has organized itself around five major branches that collectively address how code-oriented LLMs can be attacked, defended, evaluated, tested, and applied in specialized contexts. Adversarial Attack Methods on Code LLMs explores techniques such as variable renaming, prompt manipulation, and semantic-preserving perturbations that expose model vulnerabilities (e.g., Variable Renaming Adversarial[11], BadCodePrompt[14]). Defense and Robustness Enhancement Techniques focuses on hardening models through adversarial training, watermarking, and purification strategies (e.g., Adversarial Training Robustness[3], Evaluate and Purify[13]). Robustness Evaluation and Benchmarking develops comprehensive test suites and metrics to systematically measure model resilience across diverse coding scenarios (e.g., ReCode[21], CODECRASH[44]). Testing and Verification Methodologies examines automated testing, fuzzing, and formal methods tailored to LLM-generated code (e.g., Testing Methods LLMs[5], Smart Contract Fuzzing[7]). Specialized Applications and Cross-Domain Studies investigates domain-specific challenges such as secure code generation, privacy concerns, and cross-lingual robustness (e.g., Secure Code Generation[9], Software Generation Privacy[50]).

A particularly active line of work centers on creating realistic benchmarks that stress-test models under adversarial conditions, revealing trade-offs between functional correctness and robustness to perturbations. SWINGARENA[0] contributes to this effort by providing a comprehensive robustness benchmark, situating itself alongside works like ReCode[21] and CODECRASH[44] that similarly probe model fragility through systematic evaluation. While ReCode[21] emphasizes code transformation resilience and CODECRASH[44] targets crash-inducing inputs, SWINGARENA[0] offers a broader adversarial evaluation framework spanning multiple software development tasks. Open questions persist around whether robustness gains from adversarial training generalize across programming languages and whether benchmarks adequately capture real-world deployment risks, motivating continued exploration of evaluation methodologies that balance ecological validity with controlled experimentation.

Claimed Contributions

SWINGARENA adversarial evaluation framework

The authors introduce an adversarial evaluation framework that pairs LLMs as submitters (who generate patches) and reviewers (who create test cases), modeling the collaborative software iteration process through continuous integration pipelines. This framework enables dynamic, interactive evaluation that simulates real-world software development workflows across multiple programming languages.
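The submitter/reviewer/CI interaction can be pictured as a single round loop. The sketch below is illustrative only: the class and function names (`run_round`, `ToyCI`, `RoundResult`) are assumptions for this report, not the paper's actual API, and the toy CI stands in for a real pipeline.

```python
from dataclasses import dataclass

@dataclass
class RoundResult:
    patch_applied: bool   # did the submitter's patch apply cleanly?
    tests_passed: bool    # did the reviewer's tests pass in CI?

def run_round(issue, submitter, reviewer, ci):
    """One submitter-vs-reviewer round validated by a CI pipeline (sketch)."""
    patch = submitter(issue)    # submitter LLM proposes a patch for the issue
    tests = reviewer(issue)     # reviewer LLM writes adversarial test cases
    if not ci.apply(patch):     # CI first tries to apply the patch
        return RoundResult(patch_applied=False, tests_passed=False)
    return RoundResult(True, ci.run(tests))

# Toy stand-in for a CI pipeline: accepts any non-empty patch, and "passes"
# a test only when the patch text mentions the tested keyword.
class ToyCI:
    def __init__(self):
        self.patch = None
    def apply(self, patch):
        self.patch = patch
        return bool(patch)
    def run(self, tests):
        return all(t in self.patch for t in tests)

result = run_round(
    issue="off-by-one in pagination",
    submitter=lambda issue: "fix pagination bounds",
    reviewer=lambda issue: ["pagination"],
    ci=ToyCI(),
)
print(result.patch_applied, result.tests_passed)  # True True
```

The key design point captured here is that neither agent judges the other directly: the CI pipeline is the arbiter, which is what grounds the adversarial game in executable outcomes.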

3 retrieved papers
Retrieval-Augmented Code Generation (RACG) module

The authors develop a multi-language retrieval pipeline that combines syntax-aware chunking, dense reranking, and token-budget-aware packing to handle long-context challenges in large codebases. This module supports C++, Python, Rust, and Go, and is positioned as a strong baseline to standardize context access across models.
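Two of the described steps, syntax-aware chunking and token-budget-aware packing, can be sketched briefly. The functions below are assumptions for illustration (the real module's interface, reranker, and tokenizer are not specified here): chunking uses Python's `ast` as a stand-in for multi-language parsing, the whitespace token counter replaces a real tokenizer, and packing greedily keeps already-reranked chunks until the budget is spent.

```python
import ast

def syntax_chunks(source: str):
    """Split Python source into top-level function/class chunks via the AST,
    so retrieval units respect syntactic boundaries rather than raw lines."""
    tree = ast.parse(source)
    return [ast.get_source_segment(source, node)
            for node in tree.body
            if isinstance(node, (ast.FunctionDef, ast.ClassDef))]

def pack_snippets(ranked_chunks, token_budget,
                  count_tokens=lambda s: len(s.split())):
    """Greedily keep the highest-ranked chunks that fit the token budget
    (ranked_chunks is assumed already sorted by a dense reranker score)."""
    packed, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost <= token_budget:   # skip chunks that would overflow
            packed.append(chunk)
            used += cost
    return packed

source = "def f(x):\n    return x + 1\n\nclass C:\n    pass\n"
chunks = syntax_chunks(source)            # two chunks: the function, the class
context = pack_snippets(chunks, token_budget=6)
```

Greedy skip-and-continue packing (rather than stopping at the first overflow) lets smaller lower-ranked chunks still use leftover budget, which is one plausible way to "respect token limitations" while maximizing retrieved context.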

10 retrieved papers (can refute)
CI-grounded multi-language dataset

The authors curate a dataset of over 2,300 real-world GitHub issues with corresponding solutions across four programming languages, including 400 high-quality evaluation instances. The dataset integrates continuous integration workflows and includes scripts for reproducible retrieval and CI execution.
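A minimal sketch of what one such CI-grounded instance might look like, under stated assumptions: every field name below is hypothetical, chosen for illustration rather than taken from the released dataset schema, and the filter is a simplified stand-in for the paper's quality curation that narrowed roughly 2,300 issues to roughly 400 evaluation instances.

```python
from dataclasses import dataclass

@dataclass
class IssueInstance:
    repo: str          # e.g. "owner/project" (hypothetical field)
    language: str      # one of the four supported languages
    issue_text: str    # the GitHub issue description
    gold_patch: str    # the merged fix associated with the issue
    ci_workflow: str   # CI configuration used to validate candidate patches

SUPPORTED = {"c++", "python", "rust", "go"}

def filter_evaluable(instances):
    """Keep instances that have a gold patch, a CI workflow, and a supported
    language. A real curation pipeline would apply further quality checks."""
    return [i for i in instances
            if i.language in SUPPORTED and i.gold_patch and i.ci_workflow]
```

The point of the CI grounding is that each instance carries enough artifacts (patch plus workflow) for a generated fix to be re-executed and judged automatically, which is what makes the retrieval and CI scripts reproducible.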

0 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

SWINGARENA adversarial evaluation framework

The authors introduce an adversarial evaluation framework that pairs LLMs as submitters (who generate patches) and reviewers (who create test cases), modeling the collaborative software iteration process through continuous integration pipelines. This framework enables dynamic, interactive evaluation that simulates real-world software development workflows across multiple programming languages.

Contribution

Retrieval-Augmented Code Generation (RACG) module

The authors develop a multi-language retrieval pipeline that combines syntax-aware chunking, dense reranking, and token-budget-aware packing to handle long-context challenges in large codebases. This module supports C++, Python, Rust, and Go, and is positioned as a strong baseline to standardize context access across models.

Contribution

CI-grounded multi-language dataset

The authors curate a dataset of over 2,300 real-world GitHub issues with corresponding solutions across four programming languages, including 400 high-quality evaluation instances. The dataset integrates continuous integration workflows and includes scripts for reproducible retrieval and CI execution.