Computer Agent Arena: Toward Human-Centric Evaluation and Analysis of Computer-Use Agents

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Computer-Use Agent, Visual Language Model, Human-in-the-loop, Evaluation
Abstract:

As Computer-Use Agents (CUAs) proliferate and grow increasingly capable, evaluation has become more challenging: static, manually curated benchmarks are narrow in domain, contamination-prone, and environment-heavy, and they diverge substantially from user-driven, real-world evaluation. We present Computer Agent Arena, an open-source platform for head-to-head CUA evaluation and a dynamic methodology that converts human preferences into structured feedback in realistic environments. The system (i) simulates real-world computer use via cloud-hosted, diverse, and dynamic environment initializations and customizations; (ii) ensures authentic, fair comparison by faithfully reproducing open-source CUAs and executing them anonymously in matched, controlled environments; and (iii) extends evaluation beyond pairwise preference and correctness to capability- and behavior-oriented signals. Across 2,201 high-quality votes over 12 agents, spanning multi-app interactions, ambiguous instructions, and open-ended queries, we observe striking ranking reversals relative to static benchmarks. Further analysis shows that overall correctness mainly drives human preference; beyond that, agent-human interaction and self-correction boost user preference even when overall task completion is comparable. Our error analysis reveals agent behavior errors, such as long-horizon memory lapses and fine-grained action failures, that static benchmarks fail to capture. We also contrast pure GUI agents with universal digital agents capable of tool use and coding, and discuss the trade-offs of these different design philosophies. We open-source the full platform, the collected dataset, and the code of Computer Agent Arena to support future research on the evaluation and development of CUAs.
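The abstract describes converting pairwise human preference votes over 12 agents into agent rankings, but this excerpt does not state which rating scheme the Arena uses. The sketch below is a minimal illustration assuming a simple Elo-style update, one common choice for arena-style pairwise votes; the agent names and vote data are made up for the example.

```python
from collections import defaultdict

# Hypothetical votes: (agent_a, agent_b, winner), winner in {"a", "b", "tie"}.
votes = [
    ("agent_gui_only", "agent_tool_coder", "b"),
    ("agent_tool_coder", "agent_baseline_vlm", "a"),
    ("agent_gui_only", "agent_baseline_vlm", "tie"),
]

def elo_ratings(votes, k=32.0, base=1000.0):
    """Aggregate pairwise preference votes into Elo-style ratings."""
    ratings = defaultdict(lambda: base)
    for a, b, winner in votes:
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        ratings[a] += k * (score_a - expected_a)
        ratings[b] += k * ((1.0 - score_a) - (1.0 - expected_a))
    return dict(ratings)

if __name__ == "__main__":
    leaderboard = sorted(elo_ratings(votes).items(), key=lambda kv: kv[1], reverse=True)
    for agent, rating in leaderboard:
        print(f"{agent}: {rating:.1f}")
```

In practice, arena-style leaderboards often fit a Bradley-Terry model over all votes at once rather than applying order-dependent Elo updates; either way, the input is the same stream of pairwise preferences described in the abstract.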

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Computer Agent Arena, a platform for head-to-head evaluation of computer-use agents through human preference judgments, and a dataset of 2,201 votes across 12 agents. It resides in the Human Preference and Comparative Evaluation leaf, which contains only two papers in total: the paper under review and one sibling (Human Evaluation Conversations). This leaf is notably sparse, suggesting that direct human preference collection for computer-use agents remains an underexplored direction within the broader 50-paper taxonomy. The work sits within the Human-Centered Evaluation Methodologies branch, which itself comprises three leaves addressing preference-based, automated, and usability-focused evaluation approaches.

The taxonomy reveals that most evaluation activity clusters around Benchmark Design and Task Environments, particularly Web and GUI Interaction Benchmarks (three papers) and Safety and Adversarial Robustness Benchmarks (three papers). The sibling leaf Agent-Based and Automated Evaluation contains three papers focused on AI-as-judge methods, representing a contrasting paradigm to human preference collection. The paper's emphasis on dynamic, user-driven evaluation diverges from the static benchmark tradition prevalent in neighboring leaves, and its focus on pairwise comparison and capability-oriented signals distinguishes it from the usability heuristics explored in the Usability and Interaction Quality Assessment leaf (four papers).

Among the 29 candidates examined, none clearly refutes the three core contributions: 10 candidates were examined for the platform and methodology contribution, 10 for the preference dataset contribution, and 9 for the error analysis framework, with zero refutable matches in each case. This suggests that, within the limited search scope, no prior work directly overlaps with the combination of head-to-head agent comparison, cloud-hosted dynamic environments, and structured human feedback for computer-use agents. The absence of refutable candidates across all contributions indicates that the work occupies a relatively novel position, though the search scale (29 papers) leaves open the possibility of relevant work beyond the top-K semantic matches.

Given the sparse Human Preference and Comparative Evaluation leaf and the lack of refutable candidates among 29 examined papers, the work appears to address a gap in human-centric evaluation for computer-use agents. However, the analysis is constrained by the limited search scope and does not cover the full breadth of human-computer interaction or agent evaluation literature. The novelty assessment reflects what is visible within the top-30 semantic neighborhood and the constructed taxonomy, not an exhaustive field survey.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 0

Research Landscape Overview

Core task: human-centric evaluation of computer-use agents. The field has organized itself around six major branches that together address how to design, evaluate, and deploy agents that interact with computers on behalf of users. Benchmark Design and Task Environments focuses on creating realistic test scenarios and datasets, often drawing from web navigation, GUI manipulation, and multi-step workflows. Human-Centered Evaluation Methodologies emphasizes methods that capture user preferences, comparative judgments, and qualitative feedback rather than purely automated metrics. Human-Agent Collaboration and Interaction Design explores how agents and humans can work together effectively, including transparency mechanisms and co-creative workflows. User-Centered Design and Personalization investigates tailoring agent behavior to individual needs and contexts, while Theoretical Foundations and Research Agendas lays out conceptual frameworks and long-term research questions. Agent Architectures and Capabilities examines the technical underpinnings that enable robust, safe, and capable computer-use agents.

A particularly active tension runs between automated evaluation approaches, such as Agent-as-Judge[11] and Evaluation Agent[10], and methods that directly involve human judgment, as seen in Human Evaluation Conversations[19] and Heuristic Evaluation Conversational[1]. Safety and alignment concerns also cut across branches, with works like OS-Harm[3] and SusBench[4] highlighting risks in real-world deployment.

Computer Agent Arena[0] sits squarely within the Human Preference and Comparative Evaluation cluster, emphasizing direct human feedback to rank agent behaviors rather than relying solely on task-success metrics. This positions it close to Human Evaluation Conversations[19], which similarly advocates for nuanced human input, but Computer Agent Arena[0] extends the approach to interactive computer-use scenarios where user satisfaction and perceived helpfulness become central. The broader landscape reveals an ongoing shift from benchmark-driven evaluation toward richer, more ecologically valid assessments that account for diverse user needs and real-world variability.

Claimed Contributions

Computer Agent Arena platform and evaluation methodology

The authors introduce an open-source platform that enables pairwise evaluation of Computer-Use Agents through human preferences collected in cloud-hosted, diverse, and dynamic environments. The system simulates real-world computer use, ensures fair comparison via matched environments, and extends evaluation beyond correctness to include capability- and behavior-oriented signals. An illustrative sketch of such a matched-environment comparison loop appears below this entry.

Candidate papers retrieved: 10
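The platform contribution above emphasizes anonymized, head-to-head execution in matched, cloud-hosted environments. The sketch below outlines one way such a comparison loop could be structured; the provisioning calls, agent interface, and return values are hypothetical placeholders, not the Arena's actual API.

```python
import random
from dataclasses import dataclass, field
from typing import List

@dataclass
class Trajectory:
    agent_id: str
    actions: List[str] = field(default_factory=list)

def clone_environment(snapshot_id: str) -> dict:
    """Placeholder: provision a fresh cloud environment from a shared snapshot."""
    return {"snapshot": snapshot_id, "state": "fresh"}

def run_agent(agent_id: str, env: dict, instruction: str) -> Trajectory:
    """Placeholder: execute one agent on its own environment instance."""
    return Trajectory(agent_id=agent_id, actions=[f"(stub) handle: {instruction}"])

def compare_pair(agent_a: str, agent_b: str, snapshot_id: str, instruction: str):
    # Both agents start from an identical environment state (matched comparison).
    env_a = clone_environment(snapshot_id)
    env_b = clone_environment(snapshot_id)
    traj_a = run_agent(agent_a, env_a, instruction)
    traj_b = run_agent(agent_b, env_b, instruction)
    # Shuffle and hide agent identities so the human vote is blind.
    pair = [traj_a, traj_b]
    random.shuffle(pair)
    return pair  # shown side by side to the user for a preference vote

if __name__ == "__main__":
    left, right = compare_pair("agent_gui_only", "agent_tool_coder",
                               "office_snapshot_v1",
                               "Summarize the open spreadsheet into an email draft")
    print(left.actions, right.actions)
```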
Human-centric preference dataset with 2,201 votes

The authors collect and release a large-scale dataset of 2,201 filtered human preference votes across 12 agents, revealing that agent rankings differ substantially from static benchmarks. The dataset captures diverse, open-ended tasks and provides multimodal, human-labeled preference signals for future research. A hypothetical record schema is sketched below this entry.

Candidate papers retrieved: 10
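The dataset contribution above describes multimodal, human-labeled preference records. The released schema is not reproduced in this report, so the dataclass below is only a guess at what a single vote record might contain; all field names are illustrative.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PreferenceVote:
    """Hypothetical schema for one preference vote in an arena-style dataset."""
    task_instruction: str                  # user-written, possibly ambiguous or open-ended
    agent_a: str                           # identities anonymized at vote time
    agent_b: str
    winner: str                            # "a", "b", or "tie"
    environment_snapshot: str              # matched starting state for both runs
    screenshots_a: List[str] = field(default_factory=list)   # paths or URLs
    screenshots_b: List[str] = field(default_factory=list)
    capability_tags: List[str] = field(default_factory=list) # e.g. "multi-app"
    behavior_notes: Optional[str] = None   # e.g. observed self-correction
```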
Error analysis and preference analysis frameworks

The authors conduct systematic error analysis identifying failure modes (long-horizon memory lapses, tool-selection errors, fine-grained action failures) and preference analysis showing that users value process quality, agent-human interaction, and self-correction beyond task completion. These analyses surface alignment signals that static benchmarks overlook. An illustrative analysis sketch appears below this entry.

Candidate papers retrieved: 9
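The analysis contribution above links failure modes and interaction behaviors to human preference. As a hedged illustration, the snippet below tallies failure-mode labels and compares win rates with and without an observed self-correction flag; the annotation fields and toy data are invented for the example and do not reflect the released dataset.

```python
from collections import Counter

# Hypothetical per-trajectory annotations.
annotations = [
    {"failure_modes": ["long_horizon_memory"], "self_corrected": True,  "won_vote": True},
    {"failure_modes": ["fine_grained_action"], "self_corrected": False, "won_vote": False},
    {"failure_modes": [],                      "self_corrected": True,  "won_vote": True},
    {"failure_modes": ["tool_selection"],      "self_corrected": False, "won_vote": False},
]

# Frequency of each failure mode across annotated trajectories.
mode_counts = Counter(m for a in annotations for m in a["failure_modes"])
print("Failure-mode frequencies:", dict(mode_counts))

def win_rate(rows):
    """Fraction of trajectories in `rows` that won their preference vote."""
    return sum(r["won_vote"] for r in rows) / len(rows) if rows else float("nan")

with_sc = [a for a in annotations if a["self_corrected"]]
without_sc = [a for a in annotations if not a["self_corrected"]]
print(f"Win rate with self-correction:    {win_rate(with_sc):.2f}")
print(f"Win rate without self-correction: {win_rate(without_sc):.2f}")
```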

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Computer Agent Arena platform and evaluation methodology

The authors introduce an open-source platform that enables pairwise evaluation of Computer-Use Agents through human preferences collected in cloud-hosted, diverse, and dynamic environments. The system simulates real-world computer use, ensures fair comparison via matched environments, and extends evaluation beyond correctness to include capability- and behavior-oriented signals.

Contribution

Human-centric preference dataset with 2,201 votes

The authors collect and release a large-scale dataset of 2,201 filtered human preference votes across 12 agents, revealing that agent rankings differ substantially from static benchmarks. The dataset captures diverse, open-ended tasks and provides multimodal, human-labeled preference signals for future research.

Contribution

Error analysis and preference analysis frameworks

The authors conduct systematic error analysis identifying failure modes (long-horizon memory lapses, tool-selection errors, fine-grained action failures) and preference analysis showing that users value process quality, agent-human interaction, and self-correction beyond task completion. These analyses surface alignment signals that static benchmarks overlook.