Abstract:

Training computer-use agents requires massive amounts of GUI interaction data, but manually annotating action trajectories at scale is prohibitively expensive. We present VideoAgentTrek, a scalable pipeline that automatically mines training data from publicly available screen-recorded videos, eliminating the need for manual annotation. Our approach addresses a key challenge: raw videos contain implicit demonstrations but lack explicit action labels. To solve this, we develop Video2Action, an inverse dynamics module (IDM) with two components: (1) a video grounding model that detects and localizes GUI actions with precise temporal boundaries, and (2) an action-content recognizer that extracts structured parameters like click coordinates and typed text. Applied to 39,000 YouTube tutorial videos, our pipeline generates 1.52 million interaction steps. We leverage this data through continued pretraining followed by supervised fine-tuning. On OSWorld-Verified, our approach improves task success rates from 9.3% (SFT-only baseline) to 15.8%, a 70% relative improvement. On AgentNetBench, step accuracy increases from 64.1% to 69.3%. Our results demonstrate that passive internet videos can be transformed into high-quality supervision for computer-use agents, providing a scalable alternative to expensive manual annotation.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces VideoAgentTrek, a pipeline for mining GUI training data from unlabeled YouTube videos, and Video2Action, an inverse dynamics module that detects actions and extracts parameters like click coordinates. It resides in the 'Inverse Dynamics and Action Parameter Extraction' leaf, which contains only three papers total. This leaf sits within the broader 'Automated Action Extraction and Recognition' branch, indicating a relatively sparse research direction focused specifically on inferring structured action parameters from visual observations without manual annotation.

The taxonomy reveals neighboring leaves addressing related but distinct problems: 'Video-to-Scenario Translation' (two papers) focuses on converting recordings into replayable sequences, while 'GUI Element Detection and Change Recognition' (four papers) emphasizes identifying UI components rather than action semantics. The sibling papers in the same leaf—Learn Automate GUI and GUI Shift—both tackle inverse dynamics but differ in scope: one integrates demonstration learning with automation, the other addresses domain adaptation under layout changes. VideoAgentTrek's emphasis on large-scale, annotation-free extraction from diverse web videos positions it at the intersection of automated extraction and the 'Web Tutorial Mining for Agent Training' leaf (one paper), which also leverages online tutorials but may assume different input modalities or annotation strategies.

Among 22 candidates examined, the contribution-level analysis shows mixed novelty signals. The VideoAgentTrek pipeline (10 candidates examined, 1 refutable) and the Video2Action module (2 candidates examined, 1 refutable) both face at least one overlapping prior work within the limited search scope. The two-stage training methodology (10 candidates examined, 2 refutable) shows the most substantial prior overlap. These statistics suggest that while the specific combination may be novel, individual components—inverse dynamics modeling, video grounding, and pretraining-then-finetuning—have precedents in the examined literature. The relatively small candidate pool (22 papers) and sparse taxonomy leaf (3 papers) indicate this assessment is based on a focused but not exhaustive search.

Given the limited search scope and the sparse taxonomy leaf, the work appears to occupy a less-crowded research direction within GUI action extraction. The contribution-level statistics indicate that some technical components have prior work among the 22 candidates examined, but the integration of large-scale web video mining with inverse dynamics for agent training may represent a novel synthesis. A more exhaustive search across adjacent fields—such as video understanding, robotics imitation learning, or broader GUI automation—would be needed to fully assess originality beyond the top-K semantic matches analyzed here.

Taxonomy

Core-task Taxonomy Papers: 30
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 4

Research Landscape Overview

Core task: Extracting GUI action trajectories from unlabeled screen-recorded videos. This field addresses the challenge of automatically recovering structured interaction sequences—clicks, scrolls, text entries, and navigation steps—from raw video recordings of user sessions. The taxonomy reveals five main branches that reflect different motivations and technical emphases. Automated Action Extraction and Recognition focuses on the core computer vision and inverse dynamics problems: identifying what actions occurred and inferring their parameters from pixel-level observations, as exemplified by VideoAgentTrek[0] and Learn Automate GUI[2]. Large-Scale Data Mining and Agent Training leverages extracted trajectories to build datasets for training autonomous agents or recommendation systems. Workflow and Behavioral Pattern Mining seeks higher-level process models and usage patterns from aggregated interaction logs. Human-Centered Analysis and Evaluation emphasizes usability testing, user experience assessment, and understanding how people actually navigate interfaces. Finally, Programming by Demonstration Systems, rooted in early work like Watch What I Do[5], aims to enable end-users to teach software new behaviors by example, turning recorded actions into reusable scripts or macros.

Across these branches, a recurring tension exists between fully automated vision-based extraction and methods that rely on instrumentation or partial annotations. Within Automated Action Extraction, VideoAgentTrek[0] and its neighbors Learn Automate GUI[2] and GUI Shift[17] all tackle inverse dynamics—recovering action semantics from visual evidence alone—but differ in whether they assume access to underlying UI metadata or must work purely from pixels. VideoAgentTrek[0] emphasizes scalable, annotation-free extraction from diverse screen recordings, positioning itself closer to vision-centric approaches that handle arbitrary applications without prior instrumentation. In contrast, GUI Shift[17] explores domain adaptation when interface layouts change, and Learn Automate GUI[2] integrates demonstration learning with automation goals.

Meanwhile, branches like Workflow Mining (e.g., BPMiner[14]) and Human-Centered Evaluation (e.g., IPTV Eye Gaze[3]) often assume richer input signals—event logs or gaze data—to study aggregate patterns or user experience, rather than solving the low-level action recognition problem. This landscape highlights an ongoing challenge: balancing the generality and scalability of pure video analysis against the precision afforded by instrumented or hybrid approaches.

Claimed Contributions

VideoAgentTrek pipeline for mining training data from unlabeled videos

The authors introduce a scalable pipeline that converts publicly available screen-recorded tutorial videos into structured training data for computer-use agents without requiring manual annotation. This approach addresses the data bottleneck in training GUI agents by leveraging abundant internet videos.

10 retrieved papers
Can Refute
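The pipeline's control flow can be sketched in a few lines. This is a hypothetical illustration, not the authors' implementation: `download_video`, `detect_actions`, and `extract_parameters` are placeholder names for components (video collection and the two Video2Action stages) that the report does not specify in detail.

```python
# Hypothetical sketch of the mining pipeline's control flow.
def mine_trajectories(video_urls, download_video, detect_actions,
                      extract_parameters):
    """Turn a list of screen-recording URLs into structured training steps."""
    dataset = []
    for url in video_urls:
        frames = download_video(url)          # raw screen recording
        events = detect_actions(frames)       # temporal action boundaries
        for event in events:
            step = extract_parameters(frames, event)  # structured parameters
            dataset.append(step)              # one interaction step each
    return dataset
```

With stub components, a single video yielding two detected events produces two interaction steps; at the report's stated scale, 39,000 videos yielded 1.52 million such steps.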
Video2Action inverse dynamics module

The authors develop an inverse dynamics module that recovers action supervision from raw videos through two stages: detecting GUI action events with millisecond-precision temporal localization, and extracting structured action parameters such as click coordinates and typed text to produce complete training trajectories.

2 retrieved papers
Can Refute
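As a concrete illustration of the two-stage output, one mined step might be modeled with a minimal schema like the following. The field names and record format are assumptions for illustration, not the paper's actual data format:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class GUIAction:
    """One mined interaction step.

    Stage 1 (video grounding) supplies the temporal boundaries;
    stage 2 (action-content recognition) fills in the parameters.
    """
    action_type: str                # e.g. "click", "type", "scroll"
    start_ms: int                   # temporal boundary from stage 1
    end_ms: int
    coordinates: Optional[Tuple[int, int]] = None  # (x, y) for clicks (stage 2)
    text: Optional[str] = None      # typed content (stage 2)

def to_training_record(action: GUIAction) -> dict:
    """Serialize one step into a supervision record for agent training."""
    record = {"action": action.action_type,
              "span_ms": [action.start_ms, action.end_ms]}
    if action.coordinates is not None:
        record["coordinates"] = list(action.coordinates)
    if action.text is not None:
        record["text"] = action.text
    return record
```

For example, a detected click at (482, 310) spanning 1.20–1.35 s serializes to `{'action': 'click', 'span_ms': [1200, 1350], 'coordinates': [482, 310]}`.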
Two-stage training methodology combining video pretraining and supervised fine-tuning

The authors propose a training methodology that first performs continued pretraining on large-scale automatically mined video trajectories to learn fundamental GUI interaction patterns, then applies supervised fine-tuning on curated data to sharpen task-specific performance, demonstrating substantial improvements over fine-tuning alone.

10 retrieved papers
Can Refute
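The two-phase schedule can be sketched abstractly as sequential training phases. Here `pretrain_step` and `sft_step` are hypothetical caller-supplied update functions; a real implementation would wrap an optimizer over a vision-language model:

```python
def two_stage_training(model, mined_trajectories, curated_examples,
                       pretrain_step, sft_step):
    """Continued pretraining on mined video data, then supervised fine-tuning.

    Phase 1 exposes the model to broad GUI interaction patterns from
    automatically mined trajectories; phase 2 sharpens task-specific
    behavior on a smaller curated set.
    """
    for batch in mined_trajectories:   # phase 1: continued pretraining
        model = pretrain_step(model, batch)
    for batch in curated_examples:     # phase 2: supervised fine-tuning
        model = sft_step(model, batch)
    return model
```

The design choice the report highlights is the ordering: cheap, noisy, large-scale data first to learn general interaction priors, then expensive curated data last so it dominates final behavior.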

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
