ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data
Overview
Overall Novelty Assessment
The paper contributes a large-scale cross-platform dataset spanning six operating systems and three task domains, alongside a family of foundation models (ScaleCUA) trained on this data. Within the taxonomy, it resides in the 'Cross-Platform Dataset Construction' leaf under 'Data Collection and Scaling Pipelines,' sharing this leaf with only one sibling paper (OS-Atlas). This positioning indicates a relatively sparse research direction focused specifically on multi-OS data aggregation, distinguishing it from mobile-only annotation efforts and from foundation model architectures that consume such data.
The taxonomy reveals neighboring work in 'Open-Source Foundation Action Models' (three papers on VLM-based GUI agents) and 'Mobile-Specific Annotation Datasets' (one paper on mobile GUI annotations). The paper bridges data collection and model training, connecting to both the dataset construction branch and the foundation model branch. Its scope explicitly includes automated data pipelines with human-in-the-loop validation, aligning with the leaf's scope note emphasizing 'automated or hybrid collection pipelines.' The exclusion note clarifies that mobile-only datasets belong elsewhere, reinforcing this work's cross-platform emphasis.
Among the 29 candidates examined, the data pipeline contribution (Contribution A) had one refutable candidate out of 10, the base model contribution (Contribution B) likewise had one refutable candidate out of 10, and the evaluation contribution (Contribution C) showed no clear refutation across nine candidates. These statistics suggest that, within the limited search scope, the data pipeline and model architecture face some overlap with prior work, whereas the evaluation insights appear more distinctive. The low refutation counts (at most one per contribution) indicate that most examined candidates address different facets or scales of the problem.
Given the limited search scope of 29 candidates from top-K semantic matches, this analysis captures the most semantically proximate prior work but does not constitute an exhaustive field survey. The sparse taxonomy leaf (two papers total) and low refutation rates suggest the work occupies a relatively underexplored niche within computer use agents, though the presence of any refutable candidates indicates incremental overlap with existing cross-platform data efforts. The evaluation contribution appears most novel within this constrained search.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose a dual-loop data acquisition pipeline that combines automated agent exploration with human expert supervision to collect computer-use data across six operating systems (Windows, macOS, Linux, Android, iOS, Web) and three task domains (understanding, grounding, task completion). This pipeline addresses the scarcity of computer-use training data by balancing automation with quality control.
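As a rough illustration of how such a dual-loop pipeline can be organized, the sketch below pairs an automated exploration loop with a human verification loop. All class and function names (Step, Trajectory, automated_loop, human_loop) are hypothetical placeholders for illustration, not the paper's actual interfaces.

from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Step:
    screenshot: bytes   # raw screen capture taken before the action
    action: str         # e.g. "click(x=0.42, y=0.17)" in a shared action space
    platform: str       # "windows" | "macos" | "linux" | "android" | "ios" | "web"

@dataclass
class Trajectory:
    task: str
    steps: List[Step] = field(default_factory=list)
    verified: bool = False   # set by the human loop

def automated_loop(propose_task: Callable[[], str],
                   act: Callable[[str], Step],
                   max_steps: int = 15) -> Trajectory:
    # Inner loop: an agent explores the environment and records raw steps.
    traj = Trajectory(task=propose_task())
    for _ in range(max_steps):
        step = act(traj.task)
        traj.steps.append(step)
        if step.action == "terminate()":
            break
    return traj

def human_loop(raw: List[Trajectory],
               review: Callable[[Trajectory], bool]) -> List[Trajectory]:
    # Outer loop: experts verify or reject trajectories before training.
    for traj in raw:
        traj.verified = review(traj)
    return [t for t in raw if t.verified]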
The authors develop a series of vision-language models (3B, 7B, 32B parameters) built on Qwen2.5-VL that support three inference paradigms: grounding mode for UI element localization, direct action mode for efficient task completion, and reasoned action mode with chain-of-thought reasoning. The models use a unified action space enabling seamless cross-platform interaction.
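To make the unified action space and the three inference paradigms concrete, the following sketch shows one plausible encoding; the Mode enum, the action helpers, and build_prompt are assumptions for illustration, not ScaleCUA's released API.

from enum import Enum

class Mode(Enum):
    GROUNDING = "grounding"       # return only the target element's coordinates
    DIRECT = "direct_action"      # emit the next action without reasoning text
    REASONED = "reasoned_action"  # emit chain-of-thought reasoning, then the action

# A platform-agnostic action space: coordinates are normalized to [0, 1] so the
# same call can drive desktop, mobile, and web screens at any resolution.
def click(x: float, y: float) -> str:
    return f"click(x={x:.3f}, y={y:.3f})"

def write(text: str) -> str:
    return f"write(text={text!r})"

def build_prompt(instruction: str, mode: Mode) -> str:
    # Compose a request for the model under the chosen inference mode.
    if mode is Mode.GROUNDING:
        return f"Locate the element for: {instruction}. Answer with coordinates only."
    if mode is Mode.DIRECT:
        return f"Task: {instruction}. Output the next action."
    return f"Task: {instruction}. Think step by step, then output the next action."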
The authors perform extensive experiments across GUI understanding, grounding, and task completion benchmarks on multiple platforms. Their evaluation provides fundamental insights into factors affecting agent performance, including data scaling effects, resolution trade-offs, inference mode comparisons, and cross-platform training strategies, establishing new state-of-the-art results on several benchmarks.
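A hedged sketch of how such factors might be probed: the toy harness below scores grounding accuracy (a predicted point falling inside the ground-truth box) at a given input resolution, so the same loop could sweep resolutions or swap the predictor between direct and reasoned modes. The predict callable is a stub supplied by the caller; no numbers from the paper are implied.

from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom), normalized to [0, 1]

def hit(pred: Tuple[float, float], box: Box) -> bool:
    # A grounding prediction counts if the point lands inside the target box.
    x, y = pred
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom

def grounding_accuracy(samples: List[Tuple[str, Box]],
                       predict: Callable[[str, int], Tuple[float, float]],
                       resolution: int) -> float:
    # Evaluate one predictor at one input resolution (longest image side, in pixels).
    hits = sum(hit(predict(instruction, resolution), box) for instruction, box in samples)
    return hits / len(samples) if samples else 0.0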
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[5] CCAgent: Coordinating Collaborative Data Scaling for Operating System Agents via Web3
Contribution Analysis
Detailed comparisons for each claimed contribution
Cross-platform interactive data pipeline for computer use agents
The authors propose a dual-loop data acquisition pipeline that combines automated agent exploration with human expert supervision to collect computer-use data across six operating systems (Windows, macOS, Linux, Android, iOS, Web) and three task domains (understanding, grounding, task completion). This pipeline addresses the scarcity of computer-use training data by balancing automation with quality control.
[2] OpenCUA: Open foundations for computer-use agents
[32] 3EED: Ground everything everywhere in 3D
[33] Exploring the Integration of Generative AI Tools in Software Testing Education: A Case Study on ChatGPT and Copilot for Preparatory Testing Artifacts in Postgraduate …
[34] CRAB: Cross-platform agent benchmark for multi-modal embodied language model agents
[35] Multi-Agent Systems in AIOps: Enhancing Detection, Diagnosis, and Remediation
[36] OmniACT: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web
[37] Automated Database Tuning Using AI: A Comprehensive Framework for Real-Time Performance Optimization
[38] REPP: A robust cross-platform solution for online sensorimotor synchronization experiments
[39] Chouette: An Automated Cross-Platform UI Crawler for Improving App Quality
[40] A Multi-Agent Monitoring System for Computer Networks
ScaleCUA family of base agent models with unified action space
The authors develop a series of vision-language models (3B, 7B, 32B parameters) built on Qwen2.5-VL that support three inference paradigms: grounding mode for UI element localization, direct action mode for efficient task completion, and reasoned action mode with chain-of-thought reasoning. The models use a unified action space enabling seamless cross-platform interaction.
[16] Aguvis: Unified pure vision agents for autonomous GUI interaction
[14] InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency
[15] CogAgent: A Visual Language Model for GUI Agents
[17] MUG: Interactive multimodal grounding on user interfaces
[18] MobileVLM: A vision-language model for better intra- and inter-UI understanding
[19] GUI-R1: A generalist R1-style vision-language action model for GUI agents
[20] Seed1.5-VL technical report
[21] Aria-UI: Visual grounding for GUI instructions
[22] ScreenAgent: A Vision Language Model-driven Computer Control Agent
[23] InfiGUI-R1: Advancing multimodal GUI agents from reactive actors to deliberative reasoners
Comprehensive evaluation and insights for computer use agents
The authors perform extensive experiments across GUI understanding, grounding, and task completion benchmarks on multiple platforms. Their evaluation provides fundamental insights into factors affecting agent performance, including data scaling effects, resolution trade-offs, inference mode comparisons, and cross-platform training strategies, establishing new state-of-the-art results on several benchmarks.