EgoTwin: Dreaming Body and View in First Person

ICLR 2026 Conference Submission
Anonymous Authors
Egocentric Vision
Abstract:

While exocentric video synthesis has achieved great progress, egocentric video generation remains largely underexplored; it requires modeling first-person view content along with the camera motion patterns induced by the wearer's body movements. To bridge this gap, we introduce a novel task of joint egocentric video and human motion generation, characterized by two key challenges: 1) Viewpoint Alignment: the camera trajectory in the generated video must accurately align with the head trajectory derived from human motion; 2) Causal Interplay: the synthesized human motion must causally align with the observed visual dynamics across adjacent video frames. To address these challenges, we propose EgoTwin, a joint video-motion generation framework built on the diffusion transformer architecture. Specifically, EgoTwin introduces a head-centric motion representation that anchors the human motion to the head joint and incorporates a cybernetics-inspired interaction mechanism that explicitly captures the causal interplay between video and motion within attention operations. For comprehensive evaluation, we curate a large-scale real-world dataset of synchronized text-video-motion triplets and design novel metrics to assess video-motion consistency. Extensive experiments demonstrate the effectiveness of the EgoTwin framework. Qualitative results are available on our project page: https://egotwin.pages.dev/.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes a paper's task and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a joint egocentric video and human motion generation task, positioning itself within the 'Joint Video-Motion Generation' leaf of the taxonomy. This leaf contains only two papers, including the original work, indicating a relatively sparse and emerging research direction. The task addresses simultaneous synthesis of first-person visual content and full-body kinematics with explicit viewpoint alignment and causal interplay constraints, distinguishing it from methods that generate motion or video independently.

The taxonomy reveals that neighboring leaves focus on complementary aspects: 'Egocentric Pose Estimation and Forecasting' predicts future poses from visual inputs but excludes simultaneous video synthesis; 'Environment-Aware Motion Generation' conditions motion on 3D scene context without generating corresponding imagery; and 'Egocentric Avatar Animation' produces human animations from top-down views but lacks the bidirectional video-motion coupling emphasized here. The paper's approach bridges perception-driven motion forecasting and video-centric generation, occupying a distinct niche at their intersection.

Among the 21 candidates examined across the three contributions, no clearly refutable prior work was identified. For the joint task formulation, 8 candidates were examined without refutation, suggesting novelty in the problem definition itself. For the head-centric motion representation, 3 candidates were analyzed without overlap, indicating a potentially underexplored anchoring strategy. For the EgoTwin framework, 10 candidates were reviewed with no refutations, though the limited search scope means that comparable diffusion transformer architectures for this dual-modality setting may exist beyond the examined set.

Based on the top-21 semantic matches and taxonomy structure, the work appears to occupy a sparse research area with few direct competitors. The analysis covers contributions at the task, representation, and framework levels but does not exhaustively survey all diffusion-based video generation or motion synthesis methods. The sibling paper in the same leaf and nearby taxonomy branches provide context, yet the limited candidate pool suggests caution in claiming definitively novel territory without broader literature coverage.

Taxonomy

Core-task Taxonomy Papers: 30
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 0

Research Landscape Overview

Core task: joint egocentric video and human motion generation. This field addresses the challenge of synthesizing both first-person visual content and the corresponding full-body motion in a unified manner, bridging perception and embodied simulation.

The taxonomy reveals several complementary directions. Egocentric Motion Synthesis and Forecasting focuses on predicting or generating body movements from egocentric cues, often leveraging head-mounted sensors or sparse observations. Egocentric Human-Object Interaction Modeling emphasizes the interplay between hands, objects, and scene context during manipulation tasks. Egocentric Perception and Representation Learning develops encoders and features tailored to the first-person viewpoint. Egocentric Benchmarks and Data Capture provides datasets and capture protocols (e.g., Interaction Replica[7], Ego Humans[4]) that ground these methods in real recordings. Video-Centric Generation and Simulation explores broader video synthesis techniques applicable to egocentric settings. Finally, Applications and Unified Frameworks integrates these components into end-to-end systems for mixed reality, robotics, and content creation.

Within Motion Synthesis and Forecasting, a particularly active line of work targets joint video-motion generation, where models must produce coherent egocentric imagery alongside plausible body kinematics. EgoTwin[0] exemplifies this direction by jointly modeling the visual and motion streams, ensuring temporal consistency between what the camera sees and how the body moves. Its closest neighbor, Egogen[1], similarly tackles egocentric video synthesis but may place different emphasis on conditioning modalities or diffusion architectures. Nearby efforts such as Rehearsal Reality[5] and Humanoid VLA[2] explore related themes (immersive simulation and embodied control), highlighting trade-offs between realism, interactivity, and computational cost.
A key open question across these works is how to balance high-fidelity rendering with real-time motion forecasting, especially when integrating object interactions (e.g., Egocentric Human Object[3]) or adapting to diverse user behaviors captured in large-scale benchmarks.

Claimed Contributions

Joint egocentric video and human motion generation task

The authors define a new task that requires generating synchronized egocentric video and human motion sequences from text descriptions, an initial pose, and an initial observation. This task explicitly models the tight coupling between camera motion and body movement in first-person perspectives, addressing viewpoint alignment and causal interplay challenges.

Retrieved papers compared: 8
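The input/output contract of this task, as described above, could be sketched as follows. All names and types here are illustrative assumptions for exposition, not the paper's actual interface: the model maps a text prompt, an initial body pose, and an initial egocentric observation to time-aligned video and motion streams.

```python
from dataclasses import dataclass
from typing import List, Tuple

Pose = List[Tuple[float, float, float]]      # per-joint 3D positions (illustrative)
RGBFrame = List[List[Tuple[int, int, int]]]  # H x W grid of RGB pixels (illustrative)

@dataclass
class JointGenerationRequest:
    text: str              # language description of the activity
    init_pose: Pose        # wearer's initial body pose
    init_frame: RGBFrame   # initial egocentric observation

@dataclass
class JointGenerationResult:
    video: List[RGBFrame]  # generated egocentric frames
    motion: List[Pose]     # generated body poses, time-aligned with the video

def is_synchronized(result: JointGenerationResult) -> bool:
    # The task requires one body pose per generated video frame.
    return len(result.video) == len(result.motion)
```

The synchronization check is the crux of the task definition: unlike independent video or motion generation, both streams must share one timeline.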
Head-centric motion representation

The authors propose a novel motion representation that explicitly exposes head joint pose and velocity, replacing the conventional root-centric representation. This reformulation facilitates accurate alignment between egocentric camera trajectories and head motion, which is critical for viewpoint consistency.

Retrieved papers compared: 3
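A minimal sketch (not the authors' code) of what such a head-centric representation could look like: every joint is re-expressed relative to the head joint, and the head's own position and velocity are kept explicit so they can be aligned directly with the egocentric camera trajectory. The joint indexing and finite-difference velocity are assumptions for illustration.

```python
HEAD = 0  # assumed index of the head joint in each frame

def to_head_centric(frames, dt=1.0 / 30.0):
    """frames: list of frames, each a list of (x, y, z) joint positions.
    Returns per-frame dicts with the head position, head velocity (finite
    difference over dt), and the remaining joints relative to the head."""
    out = []
    for t, joints in enumerate(frames):
        head = joints[HEAD]
        prev = frames[t - 1][HEAD] if t > 0 else head
        vel = tuple((c - p) / dt for c, p in zip(head, prev))
        rel = [tuple(c - h for c, h in zip(j, head))
               for i, j in enumerate(joints) if i != HEAD]
        out.append({"head_pos": head, "head_vel": vel, "joints_rel": rel})
    return out
```

Contrast with a root-centric layout, where the head pose would have to be recovered through a kinematic chain before it could be matched against the camera trajectory; anchoring at the head makes that alignment a direct comparison.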
EgoTwin diffusion transformer framework with interaction mechanism

The authors develop a triple-branch diffusion transformer architecture with modality-specific branches for text, video, and motion. The framework employs asynchronous diffusion and a structured attention mask inspired by cybernetic observation-action loops to model the causal interplay between visual observations and human actions.

Retrieved papers compared: 10
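To make the observation-action loop concrete, here is one hypothetical reading of such a structured attention mask, under an assumed token ordering; this is an illustration, not the paper's exact design. At frame t, motion tokens may attend to video tokens up to and including t (actions follow observations), while video tokens may attend only to motion tokens from strictly earlier frames (observations reflect past actions).

```python
def interaction_mask(n_frames):
    """Boolean mask over 2*n_frames tokens, ordered as
    [video_0..video_{n-1}, motion_0..motion_{n-1}]; True = may attend."""
    n = n_frames
    size = 2 * n
    mask = [[False] * size for _ in range(size)]
    for q in range(size):
        for k in range(size):
            q_is_motion, qt = q >= n, q % n
            k_is_motion, kt = k >= n, k % n
            if q_is_motion and not k_is_motion:
                mask[q][k] = kt <= qt  # motion at t sees video up to t
            elif (not q_is_motion) and k_is_motion:
                mask[q][k] = kt < qt   # video at t sees motion before t
            else:
                mask[q][k] = kt <= qt  # within-modality causal attention
    return mask
```

The asymmetry between the two cross-modal rules (<= versus <) is what breaks the symmetry of plain joint attention and encodes a one-step observation-then-action cycle.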

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: Joint egocentric video and human motion generation task

Contribution: Head-centric motion representation

Contribution: EgoTwin diffusion transformer framework with interaction mechanism