EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: egocentric video, manipulation, embodied AI, robotics
Abstract:

Imitation learning for manipulation has a well-known data scarcity problem. Unlike natural language and 2D computer vision, there is no Internet-scale corpus of data for dexterous manipulation. One appealing option is egocentric human video, a passively scalable data source. However, existing large-scale datasets such as Ego4D do not have native hand pose annotations and do not focus on object manipulation. To this end, we use Apple Vision Pro to collect EgoDex: the largest and most diverse dataset of dexterous human manipulation to date. EgoDex has 829 hours of egocentric video with paired 3D hand and finger tracking data collected at the time of recording, where multiple calibrated cameras and on-device SLAM can be used to precisely track the pose of every joint of each hand. The dataset covers a wide range of diverse manipulation behaviors with everyday household objects in 194 different tabletop tasks ranging from tying shoelaces to folding laundry. Furthermore, we train and systematically evaluate imitation learning policies for hand trajectory prediction on the dataset, introducing metrics and benchmarks for measuring progress in this increasingly important area. By releasing this large-scale dataset, we hope to push the frontier of robotics, computer vision, and foundation models. EgoDex is publicly available for download.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces EgoDex, a large-scale egocentric dataset with 829 hours of video and synchronized 3D hand tracking across 194 tabletop manipulation tasks. It resides in the 'Large-Scale Multimodal Egocentric Datasets' leaf alongside two sibling papers (OpenEgo and one other work). This leaf represents a relatively sparse research direction within the broader taxonomy of 34 papers, suggesting that comprehensive multimodal egocentric datasets for dexterous manipulation remain an emerging area rather than a saturated one.

The taxonomy reveals neighboring leaves focused on benchmark datasets with ground-truth hand-object poses and specialized annotation tools, indicating that the field distinguishes between large-scale general collections and curated evaluation benchmarks. EgoDex bridges these directions by providing both scale and fine-grained hand tracking, positioning it at the intersection of dataset construction and downstream policy learning branches. The taxonomy's policy learning subtopics (embodiment-aware imitation, hand trajectory retargeting) represent natural consumers of such datasets, highlighting how EgoDex connects data collection to robotic manipulation applications.

Of the 30 candidate papers examined (10 per claimed contribution), none clearly refutes the dataset contribution, and none overlaps with the proposed benchmarks and metrics. The passively scalable data collection approach, however, has one refutable candidate among its 10 matches. Within this limited search scope, the dataset's scale and diversity therefore appear relatively novel, though the data collection methodology may have precedent in at least one prior work. The modest search scale means these findings reflect top semantic matches rather than exhaustive coverage.

Based on the limited literature search, EgoDex appears to occupy a distinctive position combining dataset scale, hand tracking fidelity, and task diversity. However, the analysis covers only top-30 semantic matches and does not capture the full landscape of egocentric manipulation datasets or related data collection methodologies. The single refutable candidate for the data collection approach warrants closer examination to understand the degree of methodological overlap.

Taxonomy

Core-task Taxonomy Papers: 34
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: learning dexterous manipulation from egocentric video with hand pose tracking. The field encompasses several interconnected branches that together enable robots to learn complex manipulation skills from human demonstrations captured in first-person view. Egocentric Dataset Construction and Annotation focuses on building large-scale multimodal collections that pair video with hand pose, object interactions, and task annotations, exemplified by works like H2O[4] and the First-person hand action benchmark[3]. Hand Pose Estimation and Tracking develops methods to accurately localize and track hand configurations in egocentric perspectives, addressing challenges like occlusion and rapid motion. Hand Activity and Interaction Recognition interprets what actions humans perform and how they manipulate objects, bridging perception and task understanding. Policy Learning for Robotic Manipulation translates these human demonstrations into executable robot behaviors, often leveraging imitation learning or reinforcement learning frameworks such as DexVIP[8] and Learning dexterity from human[7]. Finally, Egocentric Interaction Interfaces and Applications explores practical systems that use hand tracking for AR/VR interfaces and interactive tools.

Recent efforts reveal contrasting emphases between scaling dataset diversity versus depth of annotation, and between end-to-end policy learning versus modular perception-action pipelines. Works like Scalable vision-language-action model pretraining[5] pursue broad multimodal pretraining across diverse tasks, while others focus on rich hand-object interaction detail within narrower domains.

EgoDex[0] sits within the Large-Scale Multimodal Egocentric Datasets cluster, neighboring OpenEgo[26] and sharing its emphasis on comprehensive egocentric data collection with detailed hand pose tracking. Compared to OpenEgo[26], which prioritizes open-world diversity, EgoDex[0] appears more focused on dexterous manipulation scenarios, with fine-grained hand annotations suited to downstream robotic learning. This positioning reflects ongoing debates about whether richer task-specific datasets or broader general-purpose collections better support transferable manipulation policies.

Claimed Contributions

EgoDex dataset for dexterous manipulation

The authors introduce EgoDex, a large-scale egocentric dataset containing 829 hours of video with native 3D hand and finger tracking across 194 tabletop manipulation tasks. The dataset is collected using Apple Vision Pro with on-device SLAM and calibrated cameras, providing precise skeletal annotations for dexterous manipulation behaviors.
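To make the annotation structure concrete, the sketch below shows what a single per-frame record from such a dataset might look like. This is an illustrative assumption, not the released schema: the field names, image resolution, and joint count (visionOS hand tracking exposes roughly 27 joints per hand) are all placeholders.

```python
import numpy as np

# Hypothetical per-frame sample from an egocentric manipulation episode.
# Every field name and shape here is an assumption for illustration;
# consult the actual EgoDex release for its real schema.
NUM_JOINTS = 27  # approximate visionOS hand-skeleton size (assumption)

sample = {
    "rgb": np.zeros((1080, 1920, 3), dtype=np.uint8),     # egocentric frame
    "camera_from_world": np.eye(4),                       # SLAM camera pose
    "left_hand": np.tile(np.eye(4), (NUM_JOINTS, 1, 1)),  # per-joint 4x4 transforms
    "right_hand": np.tile(np.eye(4), (NUM_JOINTS, 1, 1)), # in a shared world frame
    "task": "fold_laundry",                               # one of 194 task labels
}

# Example: pull out the 3D world-frame position of each left-hand joint,
# i.e. the translation column of each 4x4 transform.
left_joint_positions = sample["left_hand"][:, :3, 3]  # shape (NUM_JOINTS, 3)
```

A record like this is what makes the dataset directly usable for trajectory prediction: the supervision signal (joint poses over time) ships with the video rather than being estimated after the fact.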

10 retrieved papers

Benchmarks and metrics for hand trajectory prediction

The authors propose two benchmark tasks (dexterous trajectory prediction and inverse dynamics) with a best-of-K evaluation metric that accounts for multimodality in human motion. They systematically evaluate state-of-the-art imitation learning policies to establish baselines for future research.
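A best-of-K metric scores only the closest of K sampled trajectories against the ground truth, so a stochastic policy is not penalized for predicting a plausible motion that simply differs from the one the demonstrator happened to take. The sketch below is a minimal version assuming mean Euclidean error over 3D joint positions; the paper's exact distance function and normalization are not reproduced here.

```python
import numpy as np

def best_of_k_error(samples: np.ndarray, gt: np.ndarray) -> float:
    """Score K candidate trajectories against one ground truth.

    samples: (K, T, J, 3) -- K sampled trajectories over T timesteps
             for J tracked joints in 3D.
    gt:      (T, J, 3)    -- the ground-truth trajectory.

    Returns the mean Euclidean error of the single closest candidate,
    so a multimodal predictor is judged only on its best hypothesis.
    """
    dists = np.linalg.norm(samples - gt[None], axis=-1)  # (K, T, J)
    per_candidate = dists.mean(axis=(1, 2))              # (K,)
    return float(per_candidate.min())

# Usage (hypothetical policy API): draw K rollouts, keep the best score.
# error = best_of_k_error(policy.sample(observation, k=16), gt_trajectory)
```

Note that the metric becomes more forgiving as K grows, so baselines are only comparable when evaluated at the same K.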

10 retrieved papers

Passively scalable data collection approach

The authors propose a data collection paradigm that is passively scalable, unlike robot teleoperation which requires active effort. By using egocentric video with native 3D pose tracking, the approach enables large-scale data collection as a byproduct of human activity rather than deliberate demonstration.

10 retrieved papers (1 can refute)

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

EgoDex dataset for dexterous manipulation

Contribution

Benchmarks and metrics for hand trajectory prediction

Contribution

Passively scalable data collection approach
