Computer Agent Arena: Toward Human-Centric Evaluation and Analysis of Computer-Use Agents

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Computer-Use Agent, Visual Language Model, Human-in-the-loop, Evaluation
Abstract:

As Computer-Use Agents (CUAs) proliferate and grow increasingly capable, evaluation has become more challenging: static, manually curated benchmarks are narrow in domain, contamination-prone, and environment-heavy, and they diverge substantially from user-driven, real-world evaluation. We present Computer Agent Arena, an open-source platform for head-to-head CUA evaluation and a dynamic methodology that converts human preferences into structured feedback in realistic environments. The system (i) simulates real-world computer use via cloud-hosted, diverse, and dynamic environment initializations and customizations; (ii) ensures authentic, fair comparison by faithfully reproducing open-source CUAs and executing them anonymously in matched, controlled environments; and (iii) extends evaluation beyond pairwise preference and correctness to capability- and behavior-oriented signals. Across 2,201 high-quality votes over 12 agents, spanning multi-app interactions, ambiguous instructions, and open-ended queries, we observe striking ranking reversals relative to static benchmarks. Further analysis shows that overall correctness mainly drives human preference; beyond that, agent-human interaction and self-correction boost user preference even when overall task completion is comparable. Our error analysis reveals agent behavior errors, such as long-horizon memory lapses and fine-grained action failures, that static benchmarks fail to capture. We also contrast pure GUI agents with universal digital agents capable of tool use and coding, and discuss the trade-offs of these different design philosophies. We open-source the full platform, the collected dataset, and the code of Computer Agent Arena to support future research on the evaluation and development of CUAs.
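The abstract describes converting pairwise human preference votes over 12 agents into agent rankings, but this excerpt does not state which rating scheme the Arena uses. The sketch below is a minimal illustration assuming a simple Elo-style update, one common choice for arena-style pairwise votes; the agent names and vote data are made up for the example.

```python
from collections import defaultdict

# Hypothetical votes: (agent_a, agent_b, winner), winner in {"a", "b", "tie"}.
votes = [
    ("agent_gui_only", "agent_tool_coder", "b"),
    ("agent_tool_coder", "agent_baseline_vlm", "a"),
    ("agent_gui_only", "agent_baseline_vlm", "tie"),
]

def elo_ratings(votes, k=32.0, base=1000.0):
    """Aggregate pairwise preference votes into Elo-style ratings."""
    ratings = defaultdict(lambda: base)
    for a, b, winner in votes:
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        ratings[a] += k * (score_a - expected_a)
        ratings[b] += k * ((1.0 - score_a) - (1.0 - expected_a))
    return dict(ratings)

if __name__ == "__main__":
    leaderboard = sorted(elo_ratings(votes).items(), key=lambda kv: kv[1], reverse=True)
    for agent, rating in leaderboard:
        print(f"{agent}: {rating:.1f}")
```

In practice, arena-style leaderboards often fit a Bradley-Terry model over all votes at once rather than applying order-dependent Elo updates; either way, the input is the same stream of pairwise preferences described in the abstract.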

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Computer Agent Arena, a platform for head-to-head evaluation of computer-use agents through human preference judgments, and a dataset of 2,201 votes across 12 agents. It resides in the Human Preference and Comparative Evaluation leaf, which contains only two papers in total: the paper under review and one sibling (Human Evaluation Conversations). This leaf is notably sparse, suggesting that direct human preference collection for computer-use agents remains an underexplored direction within the broader 50-paper taxonomy. The work sits within the Human-Centered Evaluation Methodologies branch, which itself comprises three leaves addressing preference-based, automated, and usability-focused evaluation approaches.

The taxonomy reveals that most evaluation activity clusters around Benchmark Design and Task Environments, particularly Web and GUI Interaction Benchmarks (three papers) and Safety and Adversarial Robustness Benchmarks (three papers). The sibling leaf Agent-Based and Automated Evaluation contains three papers focused on AI-as-judge methods, representing a contrasting paradigm to human preference collection. The paper's emphasis on dynamic, user-driven evaluation diverges from the static benchmark tradition prevalent in neighboring leaves, and its focus on pairwise comparison and capability-oriented signals distinguishes it from the usability heuristics explored in the Usability and Interaction Quality Assessment leaf (four papers).

Among the 29 candidates examined, none clearly refutes the three core contributions: 10 candidates were examined for the platform and methodology contribution, 10 for the preference dataset contribution, and 9 for the error analysis framework, with zero refutable matches in each case. This suggests that, within the limited search scope, no prior work directly overlaps with the combination of head-to-head agent comparison, cloud-hosted dynamic environments, and structured human feedback for computer-use agents. The absence of refutable candidates across all contributions indicates that the work occupies a relatively novel position, though the search scale (29 papers) leaves open the possibility of relevant work beyond the top-K semantic matches.

Given the sparse Human Preference and Comparative Evaluation leaf and the lack of refutable candidates among 29 examined papers, the work appears to address a gap in human-centric evaluation for computer-use agents. However, the analysis is constrained by the limited search scope and does not cover the full breadth of human-computer interaction or agent evaluation literature. The novelty assessment reflects what is visible within the top-30 semantic neighborhood and the constructed taxonomy, not an exhaustive field survey.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 0

Research Landscape Overview

Core task: human-centric evaluation of computer-use agents. The field has organized itself around six major branches that together address how to design, evaluate, and deploy agents that interact with computers on behalf of users. Benchmark Design and Task Environments focuses on creating realistic test scenarios and datasets, often drawing from web navigation, GUI manipulation, and multi-step workflows. Human-Centered Evaluation Methodologies emphasizes methods that capture user preferences, comparative judgments, and qualitative feedback rather than purely automated metrics. Human-Agent Collaboration and Interaction Design explores how agents and humans can work together effectively, including transparency mechanisms and co-creative workflows. User-Centered Design and Personalization investigates tailoring agent behavior to individual needs and contexts, while Theoretical Foundations and Research Agendas lays out conceptual frameworks and long-term research questions. Agent Architectures and Capabilities examines the technical underpinnings that enable robust, safe, and capable computer-use agents.

A particularly active tension runs between automated evaluation approaches, such as Agent-as-Judge[11] and Evaluation Agent[10], and methods that directly involve human judgment, as seen in Human Evaluation Conversations[19] and Heuristic Evaluation Conversational[1]. Safety and alignment concerns also cut across branches, with works like OS-Harm[3] and SusBench[4] highlighting risks in real-world deployment.

Computer Agent Arena[0] sits squarely within the Human Preference and Comparative Evaluation cluster, emphasizing direct human feedback to rank agent behaviors rather than relying solely on task-success metrics. This positions it close to Human Evaluation Conversations[19], which similarly advocates for nuanced human input, but Computer Agent Arena[0] extends the approach to interactive computer-use scenarios where user satisfaction and perceived helpfulness become central. The broader landscape reveals an ongoing shift from benchmark-driven evaluation toward richer, more ecologically valid assessments that account for diverse user needs and real-world variability.

Claimed Contributions

Computer Agent Arena platform and evaluation methodology

The authors introduce an open-source platform that enables pairwise evaluation of Computer-Use Agents through human preferences collected in cloud-hosted, diverse, and dynamic environments. The system simulates real-world computer use, ensures fair comparison via matched environments, and extends evaluation beyond correctness to include capability- and behavior-oriented signals. An illustrative sketch of such a matched-environment comparison loop appears below this entry.

Candidate papers retrieved: 10
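The platform contribution above emphasizes anonymized, head-to-head execution in matched, cloud-hosted environments. The sketch below outlines one way such a comparison loop could be structured; the provisioning calls, agent interface, and return values are hypothetical placeholders, not the Arena's actual API.

```python
import random
from dataclasses import dataclass, field
from typing import List

@dataclass
class Trajectory:
    agent_id: str
    actions: List[str] = field(default_factory=list)

def clone_environment(snapshot_id: str) -> dict:
    """Placeholder: provision a fresh cloud environment from a shared snapshot."""
    return {"snapshot": snapshot_id, "state": "fresh"}

def run_agent(agent_id: str, env: dict, instruction: str) -> Trajectory:
    """Placeholder: execute one agent on its own environment instance."""
    return Trajectory(agent_id=agent_id, actions=[f"(stub) handle: {instruction}"])

def compare_pair(agent_a: str, agent_b: str, snapshot_id: str, instruction: str):
    # Both agents start from an identical environment state (matched comparison).
    env_a = clone_environment(snapshot_id)
    env_b = clone_environment(snapshot_id)
    traj_a = run_agent(agent_a, env_a, instruction)
    traj_b = run_agent(agent_b, env_b, instruction)
    # Shuffle and hide agent identities so the human vote is blind.
    pair = [traj_a, traj_b]
    random.shuffle(pair)
    return pair  # shown side by side to the user for a preference vote

if __name__ == "__main__":
    left, right = compare_pair("agent_gui_only", "agent_tool_coder",
                               "office_snapshot_v1",
                               "Summarize the open spreadsheet into an email draft")
    print(left.actions, right.actions)
```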
Human-centric preference dataset with 2,201 votes

The authors collect and release a large-scale dataset of 2,201 filtered human preference votes across 12 agents, revealing that agent rankings differ substantially from static benchmarks. The dataset captures diverse, open-ended tasks and provides multimodal, human-labeled preference signals for future research. A hypothetical record schema is sketched below this entry.

Candidate papers retrieved: 10
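The dataset contribution above describes multimodal, human-labeled preference records. The released schema is not reproduced in this report, so the dataclass below is only a guess at what a single vote record might contain; all field names are illustrative.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PreferenceVote:
    """Hypothetical schema for one preference vote in an arena-style dataset."""
    task_instruction: str                  # user-written, possibly ambiguous or open-ended
    agent_a: str                           # identities anonymized at vote time
    agent_b: str
    winner: str                            # "a", "b", or "tie"
    environment_snapshot: str              # matched starting state for both runs
    screenshots_a: List[str] = field(default_factory=list)   # paths or URLs
    screenshots_b: List[str] = field(default_factory=list)
    capability_tags: List[str] = field(default_factory=list) # e.g. "multi-app"
    behavior_notes: Optional[str] = None   # e.g. observed self-correction
```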
Error analysis and preference analysis frameworks

The authors conduct systematic error analysis identifying failure modes (long-horizon memory lapses, tool-selection errors, fine-grained action failures) and preference analysis showing that users value process quality, agent-human interaction, and self-correction beyond task completion. These analyses surface alignment signals that static benchmarks overlook. An illustrative analysis sketch appears below this entry.

Candidate papers retrieved: 9
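The analysis contribution above links failure modes and interaction behaviors to human preference. As a hedged illustration, the snippet below tallies failure-mode labels and compares win rates with and without an observed self-correction flag; the annotation fields and toy data are invented for the example and do not reflect the released dataset.

```python
from collections import Counter

# Hypothetical per-trajectory annotations.
annotations = [
    {"failure_modes": ["long_horizon_memory"], "self_corrected": True,  "won_vote": True},
    {"failure_modes": ["fine_grained_action"], "self_corrected": False, "won_vote": False},
    {"failure_modes": [],                      "self_corrected": True,  "won_vote": True},
    {"failure_modes": ["tool_selection"],      "self_corrected": False, "won_vote": False},
]

# Frequency of each failure mode across annotated trajectories.
mode_counts = Counter(m for a in annotations for m in a["failure_modes"])
print("Failure-mode frequencies:", dict(mode_counts))

def win_rate(rows):
    """Fraction of trajectories in `rows` that won their preference vote."""
    return sum(r["won_vote"] for r in rows) / len(rows) if rows else float("nan")

with_sc = [a for a in annotations if a["self_corrected"]]
without_sc = [a for a in annotations if not a["self_corrected"]]
print(f"Win rate with self-correction:    {win_rate(with_sc):.2f}")
print(f"Win rate without self-correction: {win_rate(without_sc):.2f}")
```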

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Computer Agent Arena platform and evaluation methodology

The authors introduce an open-source platform that enables pairwise evaluation of Computer-Use Agents through human preferences collected in cloud-hosted, diverse, and dynamic environments. The system simulates real-world computer use, ensures fair comparison via matched environments, and extends evaluation beyond correctness to include capability- and behavior-oriented signals.

Contribution

Human-centric preference dataset with 2,201 votes

The authors collect and release a large-scale dataset of 2,201 filtered human preference votes across 12 agents, revealing that agent rankings differ substantially from static benchmarks. The dataset captures diverse, open-ended tasks and provides multimodal, human-labeled preference signals for future research.

Contribution

Error analysis and preference analysis frameworks

The authors conduct systematic error analysis identifying failure modes (long-horizon memory lapses, tool-selection errors, fine-grained action failures) and preference analysis showing that users value process quality, agent-human interaction, and self-correction beyond task completion. These analyses surface alignment signals that static benchmarks overlook.