SWINGARENA: Adversarial Programming Arena for Long-context GitHub Issue Solving

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Arena, Real-World GitHub Issues, Adversarial Programming, Retrieval-Augmented Generation, Continuous Integration, Code Benchmark
Abstract:

We present \textsc{SwingArena}, an adversarial evaluation framework for Large Language Models (LLMs) that closely mirrors real-world software development workflows. Unlike traditional static benchmarks, \textsc{SwingArena} models the collaborative process of software iteration by pairing LLMs as \textit{submitters}, who generate patches, and \textit{reviewers}, who create test cases and verify the patches through continuous integration (CI) pipelines. To support these interactive evaluations, we introduce a retrieval-augmented code generation (RACG) module that efficiently handles long-context challenges by providing syntactically and semantically relevant code snippets from large codebases, supporting multiple programming languages (C++, Python, Rust, and Go). This enables the framework to scale across diverse tasks and contexts while respecting token limitations. Our experiments, using over 400 high-quality real-world GitHub issues selected from a pool of 2,300 issues, show that models like GPT-4o excel at aggressive patch generation, whereas DeepSeek and Gemini prioritize correctness in CI validation. \textsc{SwingArena} presents a scalable and extensible methodology for evaluating LLMs in realistic, CI-driven software development settings. The complete codebase and benchmark are available at https://anonymous.4open.science/r/Swing-Bench and will be open-sourced after the anonymity period.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces SwingArena, an adversarial evaluation framework that models collaborative software development workflows by pairing LLMs as submitters and reviewers. It resides in the 'Comprehensive Robustness Benchmarks' leaf, which contains four papers total (including this one). This leaf sits within the broader 'Robustness Evaluation and Benchmarking' branch, indicating a moderately populated research direction focused on systematic assessment of LLM resilience. The framework's emphasis on interactive, multi-agent evaluation distinguishes it from static benchmark approaches common in sibling works.

The taxonomy reveals neighboring leaves addressing complementary evaluation dimensions: 'Prompt and Input Variation Studies' examines sensitivity to textual modifications, 'Security and Vulnerability-Focused Evaluation' targets safety-critical properties, and 'Non-Functional Requirements and Code Quality' assesses maintainability and performance. SwingArena bridges these concerns by incorporating CI validation and real-world GitHub issues, connecting robustness assessment to practical software engineering workflows. The scope note for this leaf explicitly covers 'multiple dimensions or perturbation types,' positioning SwingArena's multi-language, multi-task design as aligned with the category's breadth.

Across the three contributions, thirteen candidate papers were examined in total. The RACG module accounts for ten of them, one of which is refutable, suggesting moderate prior work in retrieval-augmented code generation techniques. The SwingArena framework itself was compared against three candidates with zero refutations, indicating relative novelty in adversarial multi-agent evaluation designs. The CI-grounded dataset had no retrieved candidates, likely reflecting its role as an empirical contribution rather than a methodological innovation. These statistics reflect a limited semantic search scope, not exhaustive coverage of all relevant literature in code generation and evaluation.

Based on the top-thirteen semantic matches analyzed, the framework appears to occupy a less-crowded niche within comprehensive robustness benchmarking, particularly in its adversarial multi-agent design. The RACG component builds on established retrieval-augmented generation concepts, while the dataset contribution remains unexamined in this search. The analysis does not cover broader code generation literature or domain-specific benchmarks outside the retrieved candidates.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 13
Refutable Paper: 1

Research Landscape Overview

Core task: Adversarial evaluation of large language models for software development workflows.

The field has organized itself around five major branches that collectively address how code-oriented LLMs can be attacked, defended, evaluated, tested, and applied in specialized contexts. Adversarial Attack Methods on Code LLMs explores techniques such as variable renaming, prompt manipulation, and semantic-preserving perturbations that expose model vulnerabilities (e.g., Variable Renaming Adversarial[11], BadCodePrompt[14]). Defense and Robustness Enhancement Techniques focuses on hardening models through adversarial training, watermarking, and purification strategies (e.g., Adversarial Training Robustness[3], Evaluate and Purify[13]). Robustness Evaluation and Benchmarking develops comprehensive test suites and metrics to systematically measure model resilience across diverse coding scenarios (e.g., ReCode[21], CODECRASH[44]). Testing and Verification Methodologies examines automated testing, fuzzing, and formal methods tailored to LLM-generated code (e.g., Testing Methods LLMs[5], Smart Contract Fuzzing[7]). Specialized Applications and Cross-Domain Studies investigates domain-specific challenges such as secure code generation, privacy concerns, and cross-lingual robustness (e.g., Secure Code Generation[9], Software Generation Privacy[50]).

A particularly active line of work centers on creating realistic benchmarks that stress-test models under adversarial conditions, revealing trade-offs between functional correctness and robustness to perturbations. SWINGARENA[0] contributes to this effort by providing a comprehensive robustness benchmark, situating itself alongside works like ReCode[21] and CODECRASH[44] that similarly probe model fragility through systematic evaluation. While ReCode[21] emphasizes code transformation resilience and CODECRASH[44] targets crash-inducing inputs, SWINGARENA[0] offers a broader adversarial evaluation framework spanning multiple software development tasks. Open questions persist around whether robustness gains from adversarial training generalize across programming languages and whether benchmarks adequately capture real-world deployment risks, motivating continued exploration of evaluation methodologies that balance ecological validity with controlled experimentation.

Claimed Contributions

SWINGARENA adversarial evaluation framework

The authors introduce an adversarial evaluation framework that pairs LLMs as submitters (who generate patches) and reviewers (who create test cases), modeling the collaborative software iteration process through continuous integration pipelines. This framework enables dynamic, interactive evaluation that simulates real-world software development workflows across multiple programming languages.
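The submitter/reviewer/CI interaction can be pictured as a single round loop. The sketch below is illustrative only: the class and function names (`run_round`, `ToyCI`, `RoundResult`) are assumptions for this report, not the paper's actual API, and the toy CI stands in for a real pipeline.

```python
from dataclasses import dataclass

@dataclass
class RoundResult:
    patch_applied: bool   # did the submitter's patch apply cleanly?
    tests_passed: bool    # did the reviewer's tests pass in CI?

def run_round(issue, submitter, reviewer, ci):
    """One submitter-vs-reviewer round validated by a CI pipeline (sketch)."""
    patch = submitter(issue)    # submitter LLM proposes a patch for the issue
    tests = reviewer(issue)     # reviewer LLM writes adversarial test cases
    if not ci.apply(patch):     # CI first tries to apply the patch
        return RoundResult(patch_applied=False, tests_passed=False)
    return RoundResult(True, ci.run(tests))

# Toy stand-in for a CI pipeline: accepts any non-empty patch, and "passes"
# a test only when the patch text mentions the tested keyword.
class ToyCI:
    def __init__(self):
        self.patch = None
    def apply(self, patch):
        self.patch = patch
        return bool(patch)
    def run(self, tests):
        return all(t in self.patch for t in tests)

result = run_round(
    issue="off-by-one in pagination",
    submitter=lambda issue: "fix pagination bounds",
    reviewer=lambda issue: ["pagination"],
    ci=ToyCI(),
)
print(result.patch_applied, result.tests_passed)  # True True
```

The key design point captured here is that neither agent judges the other directly: the CI pipeline is the arbiter, which is what grounds the adversarial game in executable outcomes.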

3 retrieved papers
Retrieval-Augmented Code Generation (RACG) module

The authors develop a multi-language retrieval pipeline that combines syntax-aware chunking, dense reranking, and token-budget-aware packing to handle long-context challenges in large codebases. This module supports C++, Python, Rust, and Go, and is positioned as a strong baseline to standardize context access across models.
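Two of the described steps, syntax-aware chunking and token-budget-aware packing, can be sketched briefly. The functions below are assumptions for illustration (the real module's interface, reranker, and tokenizer are not specified here): chunking uses Python's `ast` as a stand-in for multi-language parsing, the whitespace token counter replaces a real tokenizer, and packing greedily keeps already-reranked chunks until the budget is spent.

```python
import ast

def syntax_chunks(source: str):
    """Split Python source into top-level function/class chunks via the AST,
    so retrieval units respect syntactic boundaries rather than raw lines."""
    tree = ast.parse(source)
    return [ast.get_source_segment(source, node)
            for node in tree.body
            if isinstance(node, (ast.FunctionDef, ast.ClassDef))]

def pack_snippets(ranked_chunks, token_budget,
                  count_tokens=lambda s: len(s.split())):
    """Greedily keep the highest-ranked chunks that fit the token budget
    (ranked_chunks is assumed already sorted by a dense reranker score)."""
    packed, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost <= token_budget:   # skip chunks that would overflow
            packed.append(chunk)
            used += cost
    return packed

source = "def f(x):\n    return x + 1\n\nclass C:\n    pass\n"
chunks = syntax_chunks(source)            # two chunks: the function, the class
context = pack_snippets(chunks, token_budget=6)
```

Greedy skip-and-continue packing (rather than stopping at the first overflow) lets smaller lower-ranked chunks still use leftover budget, which is one plausible way to "respect token limitations" while maximizing retrieved context.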

10 retrieved papers (can refute)
CI-grounded multi-language dataset

The authors curate a dataset of over 2,300 real-world GitHub issues with corresponding solutions across four programming languages, including 400 high-quality evaluation instances. The dataset integrates continuous integration workflows and includes scripts for reproducible retrieval and CI execution.
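A minimal sketch of what one such CI-grounded instance might look like, under stated assumptions: every field name below is hypothetical, chosen for illustration rather than taken from the released dataset schema, and the filter is a simplified stand-in for the paper's quality curation that narrowed roughly 2,300 issues to roughly 400 evaluation instances.

```python
from dataclasses import dataclass

@dataclass
class IssueInstance:
    repo: str          # e.g. "owner/project" (hypothetical field)
    language: str      # one of the four supported languages
    issue_text: str    # the GitHub issue description
    gold_patch: str    # the merged fix associated with the issue
    ci_workflow: str   # CI configuration used to validate candidate patches

SUPPORTED = {"c++", "python", "rust", "go"}

def filter_evaluable(instances):
    """Keep instances that have a gold patch, a CI workflow, and a supported
    language. A real curation pipeline would apply further quality checks."""
    return [i for i in instances
            if i.language in SUPPORTED and i.gold_patch and i.ci_workflow]
```

The point of the CI grounding is that each instance carries enough artifacts (patch plus workflow) for a generated fix to be re-executed and judged automatically, which is what makes the retrieval and CI scripts reproducible.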

0 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

SWINGARENA adversarial evaluation framework

The authors introduce an adversarial evaluation framework that pairs LLMs as submitters (who generate patches) and reviewers (who create test cases), modeling the collaborative software iteration process through continuous integration pipelines. This framework enables dynamic, interactive evaluation that simulates real-world software development workflows across multiple programming languages.

Contribution

Retrieval-Augmented Code Generation (RACG) module

The authors develop a multi-language retrieval pipeline that combines syntax-aware chunking, dense reranking, and token-budget-aware packing to handle long-context challenges in large codebases. This module supports C++, Python, Rust, and Go, and is positioned as a strong baseline to standardize context access across models.

Contribution

CI-grounded multi-language dataset

The authors curate a dataset of over 2,300 real-world GitHub issues with corresponding solutions across four programming languages, including 400 high-quality evaluation instances. The dataset integrates continuous integration workflows and includes scripts for reproducible retrieval and CI execution.