Abstract:

Training computer-use agents requires massive amounts of GUI interaction data, but manually annotating action trajectories at scale is prohibitively expensive. We present VideoAgentTrek, a scalable pipeline that automatically mines training data from publicly available screen-recorded videos, eliminating the need for manual annotation. Our approach addresses a key challenge: raw videos contain implicit demonstrations but lack explicit action labels. To solve this, we develop Video2Action, an inverse dynamics module (IDM) with two components: (1) a video grounding model that detects and localizes GUI actions with precise temporal boundaries, and (2) an action-content recognizer that extracts structured parameters like click coordinates and typed text. Applied to 39,000 YouTube tutorial videos, our pipeline generates 1.52 million interaction steps. We leverage this data through continued pretraining followed by supervised fine-tuning. On OSWorld-Verified, our approach improves task success rates from 9.3% (SFT-only baseline) to 15.8%, a 70% relative improvement. On AgentNetBench, step accuracy increases from 64.1% to 69.3%. Our results demonstrate that passive internet videos can be transformed into high-quality supervision for computer-use agents, providing a scalable alternative to expensive manual annotation.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces VideoAgentTrek, a pipeline for mining GUI training data from unlabeled YouTube videos, and Video2Action, an inverse dynamics module that detects actions and extracts parameters like click coordinates. It resides in the 'Inverse Dynamics and Action Parameter Extraction' leaf, which contains only three papers total. This leaf sits within the broader 'Automated Action Extraction and Recognition' branch, indicating a relatively sparse research direction focused specifically on inferring structured action parameters from visual observations without manual annotation.

The taxonomy reveals neighboring leaves addressing related but distinct problems: 'Video-to-Scenario Translation' (two papers) focuses on converting recordings into replayable sequences, while 'GUI Element Detection and Change Recognition' (four papers) emphasizes identifying UI components rather than action semantics. The sibling papers in the same leaf—Learn Automate GUI and GUI Shift—both tackle inverse dynamics but differ in scope: one integrates demonstration learning with automation, the other addresses domain adaptation under layout changes. VideoAgentTrek's emphasis on large-scale, annotation-free extraction from diverse web videos positions it at the intersection of automated extraction and the 'Web Tutorial Mining for Agent Training' leaf (one paper), which also leverages online tutorials but may assume different input modalities or annotation strategies.

Among 22 candidates examined, the contribution-level analysis shows mixed novelty signals. The VideoAgentTrek pipeline (10 candidates examined, 1 refutable) and the Video2Action module (2 candidates examined, 1 refutable) both face at least one overlapping prior work within the limited search scope. The two-stage training methodology (10 candidates examined, 2 refutable) shows the most substantial prior overlap. These statistics suggest that while the specific combination may be novel, individual components—inverse dynamics modeling, video grounding, and pretraining-then-finetuning—have precedents in the examined literature. The relatively small candidate pool (22 papers) and sparse taxonomy leaf (3 papers) indicate this assessment is based on a focused but not exhaustive search.

Given the limited search scope and the sparse taxonomy leaf, the work appears to occupy a less-crowded research direction within GUI action extraction. The contribution-level statistics indicate that some technical components have prior work among the 22 candidates examined, but the integration of large-scale web video mining with inverse dynamics for agent training may represent a novel synthesis. A more exhaustive search across adjacent fields—such as video understanding, robotics imitation learning, or broader GUI automation—would be needed to fully assess originality beyond the top-K semantic matches analyzed here.

Taxonomy

Core-task Taxonomy Papers: 30
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 4

Research Landscape Overview

Core task: Extracting GUI action trajectories from unlabeled screen-recorded videos. This field addresses the challenge of automatically recovering structured interaction sequences—clicks, scrolls, text entries, and navigation steps—from raw video recordings of user sessions. The taxonomy reveals five main branches that reflect different motivations and technical emphases. Automated Action Extraction and Recognition focuses on the core computer vision and inverse dynamics problems: identifying what actions occurred and inferring their parameters from pixel-level observations, as exemplified by VideoAgentTrek[0] and Learn Automate GUI[2]. Large-Scale Data Mining and Agent Training leverages extracted trajectories to build datasets for training autonomous agents or recommendation systems. Workflow and Behavioral Pattern Mining seeks higher-level process models and usage patterns from aggregated interaction logs. Human-Centered Analysis and Evaluation emphasizes usability testing, user experience assessment, and understanding how people actually navigate interfaces. Finally, Programming by Demonstration Systems, rooted in early work like Watch What I Do[5], aims to enable end-users to teach software new behaviors by example, turning recorded actions into reusable scripts or macros.

Across these branches, a recurring tension exists between fully automated vision-based extraction and methods that rely on instrumentation or partial annotations. Within Automated Action Extraction, VideoAgentTrek[0] and its neighbors Learn Automate GUI[2] and GUI Shift[17] all tackle inverse dynamics—recovering action semantics from visual evidence alone—but differ in whether they assume access to underlying UI metadata or must work purely from pixels. VideoAgentTrek[0] emphasizes scalable, annotation-free extraction from diverse screen recordings, positioning itself closer to vision-centric approaches that handle arbitrary applications without prior instrumentation. In contrast, GUI Shift[17] explores domain adaptation when interface layouts change, and Learn Automate GUI[2] integrates demonstration learning with automation goals.

Meanwhile, branches like Workflow Mining (e.g., BPMiner[14]) and Human-Centered Evaluation (e.g., IPTV Eye Gaze[3]) often assume richer input signals—event logs or gaze data—to study aggregate patterns or user experience, rather than solving the low-level action recognition problem. This landscape highlights an ongoing challenge: balancing the generality and scalability of pure video analysis against the precision afforded by instrumented or hybrid approaches.

Claimed Contributions

VideoAgentTrek pipeline for mining training data from unlabeled videos

The authors introduce a scalable pipeline that converts publicly available screen-recorded tutorial videos into structured training data for computer-use agents without requiring manual annotation. This approach addresses the data bottleneck in training GUI agents by leveraging abundant internet videos.

10 retrieved papers
Can Refute
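The pipeline's control flow can be sketched in a few lines. This is a hypothetical illustration, not the authors' implementation: `download_video`, `detect_actions`, and `extract_parameters` are placeholder names for components (video collection and the two Video2Action stages) that the report does not specify in detail.

```python
# Hypothetical sketch of the mining pipeline's control flow.
def mine_trajectories(video_urls, download_video, detect_actions,
                      extract_parameters):
    """Turn a list of screen-recording URLs into structured training steps."""
    dataset = []
    for url in video_urls:
        frames = download_video(url)          # raw screen recording
        events = detect_actions(frames)       # temporal action boundaries
        for event in events:
            step = extract_parameters(frames, event)  # structured parameters
            dataset.append(step)              # one interaction step each
    return dataset
```

With stub components, a single video yielding two detected events produces two interaction steps; at the report's stated scale, 39,000 videos yielded 1.52 million such steps.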
Video2Action inverse dynamics module

The authors develop an inverse dynamics module that recovers action supervision from raw videos through two stages: detecting GUI action events with millisecond-precision temporal localization, and extracting structured action parameters such as click coordinates and typed text to produce complete training trajectories.

2 retrieved papers
Can Refute
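As a concrete illustration of the two-stage output, one mined step might be modeled with a minimal schema like the following. The field names and record format are assumptions for illustration, not the paper's actual data format:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class GUIAction:
    """One mined interaction step.

    Stage 1 (video grounding) supplies the temporal boundaries;
    stage 2 (action-content recognition) fills in the parameters.
    """
    action_type: str                # e.g. "click", "type", "scroll"
    start_ms: int                   # temporal boundary from stage 1
    end_ms: int
    coordinates: Optional[Tuple[int, int]] = None  # (x, y) for clicks (stage 2)
    text: Optional[str] = None      # typed content (stage 2)

def to_training_record(action: GUIAction) -> dict:
    """Serialize one step into a supervision record for agent training."""
    record = {"action": action.action_type,
              "span_ms": [action.start_ms, action.end_ms]}
    if action.coordinates is not None:
        record["coordinates"] = list(action.coordinates)
    if action.text is not None:
        record["text"] = action.text
    return record
```

For example, a detected click at (482, 310) spanning 1.20–1.35 s serializes to `{'action': 'click', 'span_ms': [1200, 1350], 'coordinates': [482, 310]}`.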
Two-stage training methodology combining video pretraining and supervised fine-tuning

The authors propose a training methodology that first performs continued pretraining on large-scale automatically mined video trajectories to learn fundamental GUI interaction patterns, then applies supervised fine-tuning on curated data to sharpen task-specific performance, demonstrating substantial improvements over fine-tuning alone.

10 retrieved papers
Can Refute
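The two-phase schedule can be sketched abstractly as sequential training phases. Here `pretrain_step` and `sft_step` are hypothetical caller-supplied update functions; a real implementation would wrap an optimizer over a vision-language model:

```python
def two_stage_training(model, mined_trajectories, curated_examples,
                       pretrain_step, sft_step):
    """Continued pretraining on mined video data, then supervised fine-tuning.

    Phase 1 exposes the model to broad GUI interaction patterns from
    automatically mined trajectories; phase 2 sharpens task-specific
    behavior on a smaller curated set.
    """
    for batch in mined_trajectories:   # phase 1: continued pretraining
        model = pretrain_step(model, batch)
    for batch in curated_examples:     # phase 2: supervised fine-tuning
        model = sft_step(model, batch)
    return model
```

The design choice the report highlights is the ordering: cheap, noisy, large-scale data first to learn general interaction priors, then expensive curated data last so it dominates final behavior.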

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
