NetArena: Dynamically Generated LLM Benchmarks for Network Applications

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: LLM for Network Systems, Dynamic Benchmark
Abstract:

As large language models (LLMs) expand into high-stakes domains like network system operations, evaluating their real-world reliability becomes increasingly critical. However, existing benchmarks risk contamination due to static design, show high statistical variance from limited dataset size, and fail to reflect the complexity of production environments. We introduce NetArena, a dynamic benchmark generation framework for network applications. NetArena features a novel abstraction and unified interface that generalizes across applications, effectively addressing the challenges of dynamic benchmarking posed by the diversity of network tasks. At runtime, users can generate unlimited queries on demand. NetArena integrates with network emulators to provide execution-time feedback on correctness, safety, and latency. We demonstrate NetArena on three representative applications and find that (1) it significantly improves statistical reliability when comparing LLM agents (confidence-interval overlap reduced from 85% to 0), (2) agents achieve only 13–38% average performance (as low as 3%) on large-scale, realistic queries, and (3) it reveals finer-grained behaviors missed by static, correctness-only benchmarks. NetArena also enables use cases such as SFT and RL fine-tuning on network system tasks. Code is available anonymously at https://anonymous.4open.science/r/netarena_iclr2026-BE94/README.md
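The confidence-interval statistic quoted above can be made concrete with a small sketch. The interval endpoints and the overlap rule below are illustrative assumptions, not NetArena's actual procedure; the point is only that larger dynamically generated query sets tighten per-agent intervals until they no longer overlap.

```python
def interval_overlap(a, b):
    """Fractional overlap of two confidence intervals (lo, hi).

    Returns the length of the intersection divided by the length of the
    shorter interval; 0.0 means the intervals are disjoint, so the two
    agents' scores are statistically distinguishable.
    """
    lo = max(a[0], b[0])
    hi = min(a[1], b[1])
    if hi <= lo:
        return 0.0
    return (hi - lo) / min(a[1] - a[0], b[1] - b[0])

# Two hypothetical agents on a small static benchmark:
# wide, heavily overlapping intervals.
print(round(interval_overlap((0.40, 0.60), (0.43, 0.63)), 2))  # 0.85

# The same agents on a large dynamically generated query set:
# tighter, disjoint intervals.
print(round(interval_overlap((0.48, 0.52), (0.55, 0.59)), 2))  # 0.0
```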

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

NetArena introduces a dynamic benchmark generation framework for evaluating LLMs on network system operations, emphasizing runtime query generation and execution-time feedback via network emulators. The paper resides in the 'LLM Evaluation and Benchmarking for Network Tasks' leaf, which contains five papers total, including NetLLMBench and NetConfEval. This leaf represents a focused but not overcrowded research direction within the broader taxonomy of fifty papers, suggesting moderate activity in developing systematic evaluation protocols for network-specific LLM applications.

The taxonomy tree reveals that NetArena's evaluation focus sits adjacent to several application-oriented branches, including 'Network Configuration and Management Automation' (eleven papers across intent-based generation and autonomous orchestration) and 'LLM Architectural Frameworks and Enabling Techniques' (eight papers on multi-agent systems and domain adaptation). The 'Integration with Network Emulation and Testbeds' leaf under architectural frameworks contains two papers exploring emulator-based experimentation, indicating that while emulator integration is recognized, it remains less explored than pure benchmarking or application development. NetArena bridges evaluation rigor with execution-environment realism, connecting these neighboring research directions.

Among the three contributions analyzed, the dynamic benchmark generation framework examined two candidates with zero refutations, while the unified state-action abstraction examined ten candidates with zero refutations, suggesting these aspects face limited direct prior work within the nineteen candidates reviewed. The emulator integration contribution examined seven candidates and found one refutable match, indicating some overlap with existing emulator-based approaches. The limited search scope—nineteen candidates total from semantic search—means these statistics reflect a targeted sample rather than exhaustive coverage, with the abstraction contribution showing the broadest examination but no clear precedent among those reviewed.

Based on the top-nineteen semantic matches examined, NetArena appears to occupy a relatively distinct position within network LLM evaluation, particularly in combining dynamic generation with emulator feedback. The analysis covers a focused slice of the literature, leaving open the possibility of additional relevant work outside the semantic search radius. The taxonomy context suggests the paper contributes to an active but not saturated evaluation subfield, with neighboring application and architectural branches providing complementary perspectives on LLM reliability in network operations.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 19
Refutable Papers: 1

Research Landscape Overview

Core task: Evaluating large language models on network system operations. The field has coalesced around several major branches that reflect both the breadth of LLM applications in networking and the need for rigorous assessment. The top-level structure includes LLM Application Domains in Network Operations, which explores diverse use cases from intent-based management to autonomous configuration; LLM Evaluation and Benchmarking for Network Tasks, which develops systematic methods to measure model performance on domain-specific challenges; LLM Architectural Frameworks and Enabling Techniques, which investigates multi-agent systems, retrieval-augmented generation, and other technical enablers; and Survey and Review Literature, which synthesizes emerging trends across telecommunications, cybersecurity, and next-generation networking.

Representative works such as LLM Network Management Survey[3] and LLM Telecom Survey[9] illustrate how researchers are mapping out application landscapes, while benchmarks like NetLLMBench[4] and NetConfEval[11] provide concrete evaluation protocols. A particularly active line of work focuses on creating standardized benchmarks that can reliably assess LLM capabilities in network-specific reasoning, configuration generation, and troubleshooting tasks.

NetArena[0] sits squarely within this evaluation-centric branch, emphasizing systematic performance measurement for network system operations. It shares common ground with NetLLMBench[4], which also targets domain-specific benchmarking, and NetConfEval[11], which evaluates configuration tasks, yet NetArena[0] distinguishes itself by addressing a broader spectrum of operational scenarios. Nearby works like NetPress[37] and Virtual System Administrator[44] explore complementary angles—compressing network knowledge and automating administrative tasks—highlighting ongoing debates about whether to prioritize general-purpose LLM adaptation or task-specific fine-tuning.

The central tension across these branches remains balancing model generality with the precision and reliability demanded by production network environments, a challenge that benchmarking efforts like NetArena[0] aim to clarify through empirical assessment.

Claimed Contributions

NETARENA dynamic benchmark generation framework

The authors present NETARENA, a framework that dynamically generates unlimited evaluation queries for network system tasks at runtime, addressing limitations of static benchmarks such as data contamination, high statistical variance, and lack of production environment complexity.

2 retrieved papers
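The claim above hinges on queries being synthesized at runtime rather than drawn from a fixed set. A minimal sketch of that idea follows; the function name, query schema, and the reachability task are hypothetical illustrations, not NetArena's actual generator.

```python
import random

def generate_query(rng, num_hosts, num_rules):
    """Sketch of on-demand benchmark query generation.

    Each call samples a fresh network scenario and a task over it, so
    the query set is unbounded and cannot leak into training data, and
    num_hosts / num_rules scale query complexity.
    """
    hosts = [f"h{i}" for i in range(num_hosts)]
    rules = [
        {"src": rng.choice(hosts), "dst": rng.choice(hosts),
         "action": rng.choice(["allow", "deny"])}
        for _ in range(num_rules)
    ]
    src, dst = rng.sample(hosts, 2)
    return {
        "scenario": {"hosts": hosts, "acl": rules},
        "task": f"Can {src} reach {dst} under the current ACL?",
    }

rng = random.Random(0)  # seeded, so an evaluation run is reproducible
for q in (generate_query(rng, num_hosts=4, num_rules=3) for _ in range(2)):
    print(q["task"])
```

Seeding the generator keeps individual runs reproducible while the query space itself stays effectively unlimited, which is how a dynamic benchmark can avoid contamination without giving up repeatability.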

Unified state-action abstraction for network applications

The authors define a unified interface based on explicit state and action spaces that abstracts network applications, enabling dynamic query and ground truth generation through executable state transitions and supporting controlled complexity scaling across diverse network tasks.

10 retrieved papers
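One way to read this contribution is as an abstract interface that every network application implements, so that ground truth falls out of executing state transitions. The sketch below is a hypothetical rendering of that idea (class and method names are illustrative, not NetArena's API), with a toy application standing in for a real network task.

```python
import random
from abc import ABC, abstractmethod

class NetworkApp(ABC):
    """Hypothetical unified state-action interface for a network app."""

    @abstractmethod
    def initial_state(self, complexity: int):
        """Sample a start state; `complexity` controls query difficulty."""

    @abstractmethod
    def actions(self, state):
        """Enumerate the valid actions in `state`."""

    @abstractmethod
    def apply(self, state, action):
        """Execute one state transition and return the next state."""

def generate_query_and_truth(app: NetworkApp, complexity, steps, rng):
    """Because transitions are executable, ground truth comes for free:
    roll the state forward and record the trajectory as the answer."""
    state = app.initial_state(complexity)
    trace = []
    for _ in range(steps):
        action = rng.choice(app.actions(state))
        state = app.apply(state, action)
        trace.append(action)
    return state, trace  # final state defines the query; trace is ground truth

class Counter(NetworkApp):
    # Toy stand-in for a real network application.
    def initial_state(self, complexity):
        return 0
    def actions(self, state):
        return [+1, -1]
    def apply(self, state, action):
        return state + action

final, trace = generate_query_and_truth(Counter(), complexity=1, steps=5,
                                        rng=random.Random(0))
print(final, trace)
```

The `steps` and `complexity` knobs are where controlled complexity scaling would plug in: longer trajectories and richer start states yield harder queries over the same interface.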

Emulator integration for execution-time feedback

The framework integrates with high-fidelity network emulators (such as Mininet and Kubernetes) to enable automatic, dynamic, and multi-turn verification of LLM-generated actions under deployment-like conditions, evaluating correctness, safety constraints, and latency metrics.

7 retrieved papers
Can Refute
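A single verification turn of the kind described here might look like the sketch below. The `FakeEmulator` is a stub standing in for a Mininet or Kubernetes driver, and the function and field names are assumptions for illustration; only the three feedback axes (correctness, safety, latency) come from the contribution statement.

```python
import time
from dataclasses import dataclass

@dataclass
class Result:
    success: bool
    detail: str

class FakeEmulator:
    # Minimal stub standing in for a high-fidelity emulator driver.
    def execute(self, action: str) -> Result:
        return Result(success=("drop all" not in action), detail="applied")

def no_flood(action: str) -> bool:
    return "flood" not in action

def verify_action(emulator, action, safety_checks):
    """One turn of execution-time verification of an LLM-proposed action.

    Safety constraints are checked before execution; correctness and
    latency are measured by actually running the action in the emulator.
    The returned feedback can be handed back to the agent for the next turn.
    """
    for check in safety_checks:          # reject unsafe actions up front
        if not check(action):
            return {"ok": False, "reason": f"safety: {check.__name__}"}
    start = time.perf_counter()
    result = emulator.execute(action)    # run under deployment-like conditions
    latency = time.perf_counter() - start
    return {"ok": result.success, "reason": result.detail, "latency_s": latency}

fb = verify_action(FakeEmulator(), "install rule: allow h1->h2", [no_flood])
print(fb["ok"], fb["reason"])
```

Looping this turn-by-turn, with the feedback dict appended to the agent's context, is what distinguishes multi-turn execution-time evaluation from one-shot answer grading.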

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
