EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: egocentric video, manipulation, embodied AI, robotics
Abstract:

Imitation learning for manipulation has a well-known data scarcity problem. Unlike natural language and 2D computer vision, there is no Internet-scale corpus of data for dexterous manipulation. One appealing option is egocentric human video, a passively scalable data source. However, existing large-scale datasets such as Ego4D do not have native hand pose annotations and do not focus on object manipulation. To this end, we use Apple Vision Pro to collect EgoDex: the largest and most diverse dataset of dexterous human manipulation to date. EgoDex has 829 hours of egocentric video with paired 3D hand and finger tracking data collected at the time of recording, where multiple calibrated cameras and on-device SLAM can be used to precisely track the pose of every joint of each hand. The dataset covers a wide range of diverse manipulation behaviors with everyday household objects in 194 different tabletop tasks ranging from tying shoelaces to folding laundry. Furthermore, we train and systematically evaluate imitation learning policies for hand trajectory prediction on the dataset, introducing metrics and benchmarks for measuring progress in this increasingly important area. By releasing this large-scale dataset, we hope to push the frontier of robotics, computer vision, and foundation models. EgoDex is publicly available for download.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces EgoDex, a large-scale egocentric dataset with 829 hours of video and synchronized 3D hand tracking across 194 tabletop manipulation tasks. It resides in the 'Large-Scale Multimodal Egocentric Datasets' leaf alongside two sibling papers (OpenEgo and one other work). This leaf represents a relatively sparse research direction within the broader taxonomy of 34 papers, suggesting that comprehensive multimodal egocentric datasets for dexterous manipulation remain an emerging area rather than a saturated one.

The taxonomy reveals neighboring leaves focused on benchmark datasets with ground-truth hand-object poses and specialized annotation tools, indicating that the field distinguishes between large-scale general collections and curated evaluation benchmarks. EgoDex bridges these directions by providing both scale and fine-grained hand tracking, positioning it at the intersection of dataset construction and downstream policy learning branches. The taxonomy's policy learning subtopics (embodiment-aware imitation, hand trajectory retargeting) represent natural consumers of such datasets, highlighting how EgoDex connects data collection to robotic manipulation applications.

Of the 30 candidate papers examined (10 per claimed contribution), none clearly refutes the dataset contribution, and none overlaps with the proposed benchmarks and metrics. The passively scalable data collection approach, however, has one refutable candidate among its 10 matches. Within this limited search scope, the dataset's scale and diversity therefore appear relatively novel, though the data collection methodology may have precedent in at least one prior work. The modest search scale means these findings reflect top semantic matches rather than exhaustive coverage.

Based on the limited literature search, EgoDex appears to occupy a distinctive position combining dataset scale, hand tracking fidelity, and task diversity. However, the analysis covers only top-30 semantic matches and does not capture the full landscape of egocentric manipulation datasets or related data collection methodologies. The single refutable candidate for the data collection approach warrants closer examination to understand the degree of methodological overlap.

Taxonomy

Core-task Taxonomy Papers: 34
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: learning dexterous manipulation from egocentric video with hand pose tracking. The field encompasses several interconnected branches that together enable robots to learn complex manipulation skills from human demonstrations captured in first-person view. Egocentric Dataset Construction and Annotation focuses on building large-scale multimodal collections that pair video with hand pose, object interactions, and task annotations, exemplified by works like H2O[4] and the First-person hand action benchmark[3]. Hand Pose Estimation and Tracking develops methods to accurately localize and track hand configurations in egocentric perspectives, addressing challenges like occlusion and rapid motion. Hand Activity and Interaction Recognition interprets what actions humans perform and how they manipulate objects, bridging perception and task understanding. Policy Learning for Robotic Manipulation translates these human demonstrations into executable robot behaviors, often leveraging imitation learning or reinforcement learning frameworks such as DexVIP[8] and Learning dexterity from human[7]. Finally, Egocentric Interaction Interfaces and Applications explores practical systems that use hand tracking for AR/VR interfaces and interactive tools.

Recent efforts reveal contrasting emphases between scaling dataset diversity versus depth of annotation, and between end-to-end policy learning versus modular perception-action pipelines. Works like Scalable vision-language-action model pretraining[5] pursue broad multimodal pretraining across diverse tasks, while others focus on rich hand-object interaction detail within narrower domains.

EgoDex[0] sits within the Large-Scale Multimodal Egocentric Datasets cluster, neighboring OpenEgo[26] and sharing its emphasis on comprehensive egocentric data collection with detailed hand pose tracking. Compared to OpenEgo[26], which prioritizes open-world diversity, EgoDex[0] appears more focused on dexterous manipulation scenarios, with fine-grained hand annotations suited to downstream robotic learning. This positioning reflects ongoing debates about whether richer task-specific datasets or broader general-purpose collections better support transferable manipulation policies.

Claimed Contributions

EgoDex dataset for dexterous manipulation

The authors introduce EgoDex, a large-scale egocentric dataset containing 829 hours of video with native 3D hand and finger tracking across 194 tabletop manipulation tasks. The dataset is collected using Apple Vision Pro with on-device SLAM and calibrated cameras, providing precise skeletal annotations for dexterous manipulation behaviors.
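To make the annotation structure concrete, the sketch below shows what a single per-frame record from such a dataset might look like. This is an illustrative assumption, not the released schema: the field names, image resolution, and joint count (visionOS hand tracking exposes roughly 27 joints per hand) are all placeholders.

```python
import numpy as np

# Hypothetical per-frame sample from an egocentric manipulation episode.
# Every field name and shape here is an assumption for illustration;
# consult the actual EgoDex release for its real schema.
NUM_JOINTS = 27  # approximate visionOS hand-skeleton size (assumption)

sample = {
    "rgb": np.zeros((1080, 1920, 3), dtype=np.uint8),     # egocentric frame
    "camera_from_world": np.eye(4),                       # SLAM camera pose
    "left_hand": np.tile(np.eye(4), (NUM_JOINTS, 1, 1)),  # per-joint 4x4 transforms
    "right_hand": np.tile(np.eye(4), (NUM_JOINTS, 1, 1)), # in a shared world frame
    "task": "fold_laundry",                               # one of 194 task labels
}

# Example: pull out the 3D world-frame position of each left-hand joint,
# i.e. the translation column of each 4x4 transform.
left_joint_positions = sample["left_hand"][:, :3, 3]  # shape (NUM_JOINTS, 3)
```

A record like this is what makes the dataset directly usable for trajectory prediction: the supervision signal (joint poses over time) ships with the video rather than being estimated after the fact.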

10 retrieved papers

Benchmarks and metrics for hand trajectory prediction

The authors propose two benchmark tasks (dexterous trajectory prediction and inverse dynamics) with a best-of-K evaluation metric that accounts for multimodality in human motion. They systematically evaluate state-of-the-art imitation learning policies to establish baselines for future research.
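A best-of-K metric scores only the closest of K sampled trajectories against the ground truth, so a stochastic policy is not penalized for predicting a plausible motion that simply differs from the one the demonstrator happened to take. The sketch below is a minimal version assuming mean Euclidean error over 3D joint positions; the paper's exact distance function and normalization are not reproduced here.

```python
import numpy as np

def best_of_k_error(samples: np.ndarray, gt: np.ndarray) -> float:
    """Score K candidate trajectories against one ground truth.

    samples: (K, T, J, 3) -- K sampled trajectories over T timesteps
             for J tracked joints in 3D.
    gt:      (T, J, 3)    -- the ground-truth trajectory.

    Returns the mean Euclidean error of the single closest candidate,
    so a multimodal predictor is judged only on its best hypothesis.
    """
    dists = np.linalg.norm(samples - gt[None], axis=-1)  # (K, T, J)
    per_candidate = dists.mean(axis=(1, 2))              # (K,)
    return float(per_candidate.min())

# Usage (hypothetical policy API): draw K rollouts, keep the best score.
# error = best_of_k_error(policy.sample(observation, k=16), gt_trajectory)
```

Note that the metric becomes more forgiving as K grows, so baselines are only comparable when evaluated at the same K.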

10 retrieved papers

Passively scalable data collection approach

The authors propose a data collection paradigm that is passively scalable, unlike robot teleoperation which requires active effort. By using egocentric video with native 3D pose tracking, the approach enables large-scale data collection as a byproduct of human activity rather than deliberate demonstration.

10 retrieved papers (1 can refute)

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

EgoDex dataset for dexterous manipulation

Contribution

Benchmarks and metrics for hand trajectory prediction

Contribution

Passively scalable data collection approach
