Computer Agent Arena: Toward Human-Centric Evaluation and Analysis of Computer-Use Agents
Overview
Overall Novelty Assessment
The paper introduces Computer Agent Arena, a platform for head-to-head evaluation of computer-use agents through human preference judgments, together with a dataset of 2,201 votes across 12 agents. It resides in the Human Preference and Comparative Evaluation leaf, which contains only two papers: the paper under review and one sibling ([19], on human evaluation of dialogue agents). This leaf is notably sparse, suggesting that direct human preference collection for computer-use agents remains underexplored within the 50-paper landscape surveyed here. The work sits within the Human-Centered Evaluation Methodologies branch, which itself comprises three leaves addressing preference-based, automated, and usability-focused evaluation approaches.
The taxonomy reveals that most evaluation activity clusters around Benchmark Design and Task Environments, particularly Web and GUI Interaction Benchmarks (three papers) and Safety and Adversarial Robustness Benchmarks (three papers). The sibling leaf Agent-Based and Automated Evaluation contains three papers focused on AI-as-judge methods, representing a contrasting paradigm to human preference collection. The paper's emphasis on dynamic, user-driven evaluation diverges from the static benchmark tradition prevalent in neighboring leaves, and its focus on pairwise comparison and capability-oriented signals distinguishes it from the usability heuristics explored in the Usability and Interaction Quality Assessment leaf (four papers).
Among the 29 candidates examined, none clearly refutes the three core contributions. For the platform and methodology contribution, 10 candidates were examined with no refutable match; for the preference dataset, 10 candidates with none; and for the error analysis framework, 9 candidates with none. This suggests that, within the limited search scope, no prior work directly overlaps with the combination of head-to-head agent comparison, cloud-hosted dynamic environments, and structured human feedback for computer-use agents. The absence of refutable candidates across all contributions indicates that the work occupies a relatively novel position, though the search scale (29 papers) leaves open the possibility of relevant work beyond the top-K semantic matches.
Given the sparse Human Preference and Comparative Evaluation leaf and the lack of refutable candidates among 29 examined papers, the work appears to address a gap in human-centric evaluation for computer-use agents. However, the analysis is constrained by the limited search scope and does not cover the full breadth of human-computer interaction or agent evaluation literature. The novelty assessment reflects what is visible within the top-30 semantic neighborhood and the constructed taxonomy, not an exhaustive field survey.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce an open-source platform that enables pairwise evaluation of Computer-Use Agents through human preferences collected in cloud-hosted, diverse, and dynamic environments. The system simulates real-world computer use, ensures fair comparison via matched environments, and extends evaluation beyond correctness to include capability- and behavior-oriented signals.
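To make the matchup protocol concrete, here is a minimal sketch of what one head-to-head comparison record might look like, assuming two agents receive the same task in identical environment snapshots and the user casts a single vote. Every field name and the vote vocabulary are hypothetical; the platform's actual schema is not given in this report.

```python
from dataclasses import dataclass
from typing import Literal

@dataclass(frozen=True)
class Matchup:
    """One head-to-head comparison between two computer-use agents.

    Hypothetical schema: these fields are assumptions, not the
    platform's real data model.
    """
    task: str                # open-ended instruction written by the user
    agent_a: str
    agent_b: str
    env_snapshot: str        # identical cloud-VM snapshot given to both agents
    vote: Literal["a", "b", "tie", "both_bad"]

# Example vote on a hypothetical office task.
m = Matchup(
    task="Sort the spreadsheet by quarterly revenue",
    agent_a="agent_x",
    agent_b="agent_y",
    env_snapshot="ubuntu-office-v3",
    vote="a",
)
```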
The authors collect and release a large-scale dataset of 2,201 filtered human preference votes across 12 agents, revealing that agent rankings differ substantially from static benchmarks. The dataset captures diverse, open-ended tasks and provides multimodal, human-labeled preference signals for future research.
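The report does not specify how the leaderboard is computed from these votes. As a minimal sketch, assuming an Elo-style logistic update of the kind arena platforms commonly use, the snippet below folds pairwise votes into per-agent ratings; the agent names and K-factor are placeholders.

```python
from collections import defaultdict

def elo_ratings(votes, k=32.0, base=1000.0):
    """Fold (winner, loser) preference votes into Elo-style ratings.

    Assumed method: the logistic Elo update; ties and "both bad"
    votes are ignored here for brevity.
    """
    ratings = defaultdict(lambda: base)
    for winner, loser in votes:
        # Expected score of the winner under the logistic model.
        expected = 1.0 / (1.0 + 10.0 ** ((ratings[loser] - ratings[winner]) / 400.0))
        ratings[winner] += k * (1.0 - expected)
        ratings[loser] -= k * (1.0 - expected)
    return dict(ratings)

# Three synthetic votes between two hypothetical agents.
print(elo_ratings([("agent_x", "agent_y"),
                   ("agent_x", "agent_y"),
                   ("agent_y", "agent_x")]))
```

Because sequential Elo updates are order-sensitive, a Bradley-Terry fit over the full vote set is a common order-independent alternative for arena leaderboards.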
The authors conduct systematic error analysis identifying failure modes (long-horizon memory lapses, tool-selection errors, fine-grained action failures) and preference analysis showing that users value process quality, agent-human interaction, and self-correction beyond task completion. These analyses surface alignment signals that static benchmarks overlook.
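The three failure modes named above lend themselves to a simple labeling scheme. The sketch below, with a hypothetical helper and invented sample labels, aggregates human-assigned failure tags into a per-mode error profile; only the category names are taken from the contribution statement.

```python
from collections import Counter
from enum import Enum

class FailureMode(Enum):
    # Failure categories taken from the paper's error analysis.
    MEMORY_LAPSE = "long-horizon memory lapse"
    TOOL_SELECTION = "tool-selection error"
    FINE_GRAINED_ACTION = "fine-grained action failure"

def failure_profile(labels):
    """Turn (trajectory_id, FailureMode) labels into per-mode frequencies."""
    counts = Counter(mode for _, mode in labels)
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    return {mode.value: counts[mode] / total for mode in FailureMode}

# Hypothetical labels for three failed trajectories.
print(failure_profile([
    ("traj_1", FailureMode.TOOL_SELECTION),
    ("traj_2", FailureMode.MEMORY_LAPSE),
    ("traj_3", FailureMode.TOOL_SELECTION),
]))
```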
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[19] Human evaluation of conversations is an open problem: comparing the sensitivity of various methods for evaluating dialogue agents
Contribution Analysis
Detailed comparisons for each claimed contribution
Computer Agent Arena platform and evaluation methodology
The authors introduce an open-source platform that enables pairwise evaluation of Computer-Use Agents through human preferences collected in cloud-hosted, diverse, and dynamic environments. The system simulates real-world computer use, ensures fair comparison via matched environments, and extends evaluation beyond correctness to include capability- and behavior-oriented signals.
[11] Agent-as-a-judge: Evaluate agents with agents
[51] Aligning with human judgement: The role of pairwise preference in large language model evaluators
[52] Evaluation of a smart audio system based on the ViP principle and the analytic hierarchy process human–computer interaction design
[53] BrowserArena: Evaluating LLM Agents on Real-World Web Navigation Tasks
[54] Soft Condorcet Optimization for Ranking of General Agents
[55] Comparing a computer agent with a humanoid robot
[56] Human-AI Collaboration: Trade-offs Between Performance and Preferences
[57] An adaptive decision-making system supported on user preference predictions for human–robot interactive communication
[58] Copychats: Question sequencing with artificial agents
[59] Who's Sorry Now: User Preferences Among Rote, Empathic, and Explanatory Apologies from LLM Chatbots
Human-centric preference dataset with 2,201 votes
The authors collect and release a large-scale dataset of 2,201 filtered human preference votes across 12 agents, revealing that agent rankings differ substantially from static benchmarks. The dataset captures diverse, open-ended tasks and provides multimodal, human-labeled preference signals for future research.
[60] LiPO: Listwise preference optimization through learning-to-rank
[61] In-context Ranking Preference Optimization
[62] Few-shot in-context preference learning using large language models
[63] Preference learning algorithms do not learn preference rankings
[64] PRD: Peer rank and discussion improve large language model based evaluations
[65] Measuring the inconsistency of large language models in preferential ranking
[66] Learning from Human Feedback: Ranking, Bandit, and Preference Optimization
[67] Assessing top-preferences
[68] How do people rank multiple mutant agents?
[69] Biased perceptions of income distribution and preferences for redistribution: Evidence from a survey experiment
Error analysis and preference analysis frameworks
The authors conduct systematic error analysis identifying failure modes (long-horizon memory lapses, tool-selection errors, fine-grained action failures) and preference analysis showing that users value process quality, agent-human interaction, and self-correction beyond task completion. These analyses surface alignment signals that static benchmarks overlook.