NetArena: Dynamically Generated LLM Benchmarks for Network Applications

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: LLM for Network Systems, Dynamic Benchmark
Abstract:

As large language models (LLMs) expand into high-stakes domains like network system operations, evaluating their real-world reliability becomes increasingly critical. However, existing benchmarks risk contamination due to static design, show high statistical variance from limited dataset size, and fail to reflect the complexity of production environments. We introduce NetArena, a dynamic benchmark generation framework for network applications. NetArena features a novel abstraction and unified interface that generalizes across applications, effectively addressing the challenges of dynamic benchmarking posed by the diversity of network tasks. At runtime, users can generate unlimited queries on demand. NetArena integrates with network emulators to provide execution-time feedback on correctness, safety, and latency. We demonstrate NetArena on three representative applications and find that (1) it significantly improves statistical reliability when comparing LLM agents (confidence-interval overlap reduced from 85% to 0), (2) agents achieve only 13–38% average performance (as low as 3%) on large-scale, realistic queries, and (3) it reveals finer-grained behaviors missed by static, correctness-only benchmarks. NetArena also enables use cases such as SFT and RL fine-tuning on network system tasks. Code is available anonymously at https://anonymous.4open.science/r/netarena_iclr2026-BE94/README.md
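The confidence-interval statistic quoted above can be made concrete with a small sketch. The interval endpoints and the overlap rule below are illustrative assumptions, not NetArena's actual procedure; the point is only that larger dynamically generated query sets tighten per-agent intervals until they no longer overlap.

```python
def interval_overlap(a, b):
    """Fractional overlap of two confidence intervals (lo, hi).

    Returns the length of the intersection divided by the length of the
    shorter interval; 0.0 means the intervals are disjoint, so the two
    agents' scores are statistically distinguishable.
    """
    lo = max(a[0], b[0])
    hi = min(a[1], b[1])
    if hi <= lo:
        return 0.0
    return (hi - lo) / min(a[1] - a[0], b[1] - b[0])

# Two hypothetical agents on a small static benchmark:
# wide, heavily overlapping intervals.
print(round(interval_overlap((0.40, 0.60), (0.43, 0.63)), 2))  # 0.85

# The same agents on a large dynamically generated query set:
# tighter, disjoint intervals.
print(round(interval_overlap((0.48, 0.52), (0.55, 0.59)), 2))  # 0.0
```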

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

NetArena introduces a dynamic benchmark generation framework for evaluating LLMs on network system operations, emphasizing runtime query generation and execution-time feedback via network emulators. The paper resides in the 'LLM Evaluation and Benchmarking for Network Tasks' leaf, which contains five papers total, including NetLLMBench and NetConfEval. This leaf represents a focused but not overcrowded research direction within the broader taxonomy of fifty papers, suggesting moderate activity in developing systematic evaluation protocols for network-specific LLM applications.

The taxonomy tree reveals that NetArena's evaluation focus sits adjacent to several application-oriented branches, including 'Network Configuration and Management Automation' (eleven papers across intent-based generation and autonomous orchestration) and 'LLM Architectural Frameworks and Enabling Techniques' (eight papers on multi-agent systems and domain adaptation). The 'Integration with Network Emulation and Testbeds' leaf under architectural frameworks contains two papers exploring emulator-based experimentation, indicating that while emulator integration is recognized, it remains less explored than pure benchmarking or application development. NetArena bridges evaluation rigor with execution-environment realism, connecting these neighboring research directions.

Among the three contributions analyzed, the dynamic benchmark generation framework examined two candidates with zero refutations, while the unified state-action abstraction examined ten candidates with zero refutations, suggesting these aspects face limited direct prior work within the nineteen candidates reviewed. The emulator integration contribution examined seven candidates and found one refutable match, indicating some overlap with existing emulator-based approaches. The limited search scope—nineteen candidates total from semantic search—means these statistics reflect a targeted sample rather than exhaustive coverage, with the abstraction contribution showing the broadest examination but no clear precedent among those reviewed.

Based on the top-nineteen semantic matches examined, NetArena appears to occupy a relatively distinct position within network LLM evaluation, particularly in combining dynamic generation with emulator feedback. The analysis covers a focused slice of the literature, leaving open the possibility of additional relevant work outside the semantic search radius. The taxonomy context suggests the paper contributes to an active but not saturated evaluation subfield, with neighboring application and architectural branches providing complementary perspectives on LLM reliability in network operations.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 19
Refutable Papers: 1

Research Landscape Overview

Core task: Evaluating large language models on network system operations. The field has coalesced around several major branches that reflect both the breadth of LLM applications in networking and the need for rigorous assessment. The top-level structure includes LLM Application Domains in Network Operations, which explores diverse use cases from intent-based management to autonomous configuration; LLM Evaluation and Benchmarking for Network Tasks, which develops systematic methods to measure model performance on domain-specific challenges; LLM Architectural Frameworks and Enabling Techniques, which investigates multi-agent systems, retrieval-augmented generation, and other technical enablers; and Survey and Review Literature, which synthesizes emerging trends across telecommunications, cybersecurity, and next-generation networking.

Representative works such as LLM Network Management Survey[3] and LLM Telecom Survey[9] illustrate how researchers are mapping out application landscapes, while benchmarks like NetLLMBench[4] and NetConfEval[11] provide concrete evaluation protocols. A particularly active line of work focuses on creating standardized benchmarks that can reliably assess LLM capabilities in network-specific reasoning, configuration generation, and troubleshooting tasks.

NetArena[0] sits squarely within this evaluation-centric branch, emphasizing systematic performance measurement for network system operations. It shares common ground with NetLLMBench[4], which also targets domain-specific benchmarking, and NetConfEval[11], which evaluates configuration tasks, yet NetArena[0] distinguishes itself by addressing a broader spectrum of operational scenarios. Nearby works like NetPress[37] and Virtual System Administrator[44] explore complementary angles—compressing network knowledge and automating administrative tasks—highlighting ongoing debates about whether to prioritize general-purpose LLM adaptation or task-specific fine-tuning.

The central tension across these branches remains balancing model generality with the precision and reliability demanded by production network environments, a challenge that benchmarking efforts like NetArena[0] aim to clarify through empirical assessment.

Claimed Contributions

NETARENA dynamic benchmark generation framework

The authors present NETARENA, a framework that dynamically generates unlimited evaluation queries for network system tasks at runtime, addressing limitations of static benchmarks such as data contamination, high statistical variance, and lack of production environment complexity.

2 retrieved papers
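The claim above hinges on queries being synthesized at runtime rather than drawn from a fixed set. A minimal sketch of that idea follows; the function name, query schema, and the reachability task are hypothetical illustrations, not NetArena's actual generator.

```python
import random

def generate_query(rng, num_hosts, num_rules):
    """Sketch of on-demand benchmark query generation.

    Each call samples a fresh network scenario and a task over it, so
    the query set is unbounded and cannot leak into training data, and
    num_hosts / num_rules scale query complexity.
    """
    hosts = [f"h{i}" for i in range(num_hosts)]
    rules = [
        {"src": rng.choice(hosts), "dst": rng.choice(hosts),
         "action": rng.choice(["allow", "deny"])}
        for _ in range(num_rules)
    ]
    src, dst = rng.sample(hosts, 2)
    return {
        "scenario": {"hosts": hosts, "acl": rules},
        "task": f"Can {src} reach {dst} under the current ACL?",
    }

rng = random.Random(0)  # seeded, so an evaluation run is reproducible
for q in (generate_query(rng, num_hosts=4, num_rules=3) for _ in range(2)):
    print(q["task"])
```

Seeding the generator keeps individual runs reproducible while the query space itself stays effectively unlimited, which is how a dynamic benchmark can avoid contamination without giving up repeatability.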

Unified state-action abstraction for network applications

The authors define a unified interface based on explicit state and action spaces that abstracts network applications, enabling dynamic query and ground truth generation through executable state transitions and supporting controlled complexity scaling across diverse network tasks.

10 retrieved papers
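One way to read this contribution is as an abstract interface that every network application implements, so that ground truth falls out of executing state transitions. The sketch below is a hypothetical rendering of that idea (class and method names are illustrative, not NetArena's API), with a toy application standing in for a real network task.

```python
import random
from abc import ABC, abstractmethod

class NetworkApp(ABC):
    """Hypothetical unified state-action interface for a network app."""

    @abstractmethod
    def initial_state(self, complexity: int):
        """Sample a start state; `complexity` controls query difficulty."""

    @abstractmethod
    def actions(self, state):
        """Enumerate the valid actions in `state`."""

    @abstractmethod
    def apply(self, state, action):
        """Execute one state transition and return the next state."""

def generate_query_and_truth(app: NetworkApp, complexity, steps, rng):
    """Because transitions are executable, ground truth comes for free:
    roll the state forward and record the trajectory as the answer."""
    state = app.initial_state(complexity)
    trace = []
    for _ in range(steps):
        action = rng.choice(app.actions(state))
        state = app.apply(state, action)
        trace.append(action)
    return state, trace  # final state defines the query; trace is ground truth

class Counter(NetworkApp):
    # Toy stand-in for a real network application.
    def initial_state(self, complexity):
        return 0
    def actions(self, state):
        return [+1, -1]
    def apply(self, state, action):
        return state + action

final, trace = generate_query_and_truth(Counter(), complexity=1, steps=5,
                                        rng=random.Random(0))
print(final, trace)
```

The `steps` and `complexity` knobs are where controlled complexity scaling would plug in: longer trajectories and richer start states yield harder queries over the same interface.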

Emulator integration for execution-time feedback

The framework integrates with high-fidelity network emulators (such as Mininet and Kubernetes) to enable automatic, dynamic, and multi-turn verification of LLM-generated actions under deployment-like conditions, evaluating correctness, safety constraints, and latency metrics.

7 retrieved papers
Can Refute
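A single verification turn of the kind described here might look like the sketch below. The `FakeEmulator` is a stub standing in for a Mininet or Kubernetes driver, and the function and field names are assumptions for illustration; only the three feedback axes (correctness, safety, latency) come from the contribution statement.

```python
import time
from dataclasses import dataclass

@dataclass
class Result:
    success: bool
    detail: str

class FakeEmulator:
    # Minimal stub standing in for a high-fidelity emulator driver.
    def execute(self, action: str) -> Result:
        return Result(success=("drop all" not in action), detail="applied")

def no_flood(action: str) -> bool:
    return "flood" not in action

def verify_action(emulator, action, safety_checks):
    """One turn of execution-time verification of an LLM-proposed action.

    Safety constraints are checked before execution; correctness and
    latency are measured by actually running the action in the emulator.
    The returned feedback can be handed back to the agent for the next turn.
    """
    for check in safety_checks:          # reject unsafe actions up front
        if not check(action):
            return {"ok": False, "reason": f"safety: {check.__name__}"}
    start = time.perf_counter()
    result = emulator.execute(action)    # run under deployment-like conditions
    latency = time.perf_counter() - start
    return {"ok": result.success, "reason": result.detail, "latency_s": latency}

fb = verify_action(FakeEmulator(), "install rule: allow h1->h2", [no_flood])
print(fb["ok"], fb["reason"])
```

Looping this turn-by-turn, with the feedback dict appended to the agent's context, is what distinguishes multi-turn execution-time evaluation from one-shot answer grading.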

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
