EgoTwin: Dreaming Body and View in First Person

ICLR 2026 Conference Submission
Anonymous Authors
Egocentric Vision
Abstract:

While exocentric video synthesis has achieved great progress, egocentric video generation remains largely underexplored; it requires modeling first-person view content along with the camera motion patterns induced by the wearer's body movements. To bridge this gap, we introduce a novel task of joint egocentric video and human motion generation, characterized by two key challenges: 1) Viewpoint Alignment: the camera trajectory in the generated video must accurately align with the head trajectory derived from human motion; 2) Causal Interplay: the synthesized human motion must causally align with the observed visual dynamics across adjacent video frames. To address these challenges, we propose EgoTwin, a joint video-motion generation framework built on the diffusion transformer architecture. Specifically, EgoTwin introduces a head-centric motion representation that anchors the human motion to the head joint and incorporates a cybernetics-inspired interaction mechanism that explicitly captures the causal interplay between video and motion within attention operations. For comprehensive evaluation, we curate a large-scale real-world dataset of synchronized text-video-motion triplets and design novel metrics to assess video-motion consistency. Extensive experiments demonstrate the effectiveness of the EgoTwin framework. Qualitative results are available on our project page: https://egotwin.pages.dev/.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes a paper's task and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a joint egocentric video and human motion generation task, positioning itself within the 'Joint Video-Motion Generation' leaf of the taxonomy. This leaf contains only two papers, including the original work, indicating a relatively sparse and emerging research direction. The task addresses simultaneous synthesis of first-person visual content and full-body kinematics with explicit viewpoint alignment and causal interplay constraints, distinguishing it from methods that generate motion or video independently.

The taxonomy reveals that neighboring leaves focus on complementary aspects: 'Egocentric Pose Estimation and Forecasting' predicts future poses from visual inputs but excludes simultaneous video synthesis; 'Environment-Aware Motion Generation' conditions motion on 3D scene context without generating corresponding imagery; and 'Egocentric Avatar Animation' produces human animations from top-down views but lacks the bidirectional video-motion coupling emphasized here. The paper's approach bridges perception-driven motion forecasting and video-centric generation, occupying a distinct niche at their intersection.

Among the 21 candidates examined across the three contributions, no clearly refutable prior work was identified. For the joint task formulation, 8 candidates were examined without refutation, suggesting novelty in the problem definition itself. For the head-centric motion representation, 3 candidates were analyzed without overlap, indicating a potentially underexplored anchoring strategy. For the EgoTwin framework, 10 candidates were reviewed with no refutations, though the limited search scope means that comparable diffusion transformer architectures for this dual-modality setting may exist beyond the examined set.

Based on the top-21 semantic matches and taxonomy structure, the work appears to occupy a sparse research area with few direct competitors. The analysis covers contributions at the task, representation, and framework levels but does not exhaustively survey all diffusion-based video generation or motion synthesis methods. The sibling paper in the same leaf and nearby taxonomy branches provide context, yet the limited candidate pool suggests caution in claiming definitively novel territory without broader literature coverage.

Taxonomy

Core-task Taxonomy Papers: 30
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 0

Research Landscape Overview

Core task: joint egocentric video and human motion generation. This field addresses the challenge of synthesizing both first-person visual content and the corresponding full-body motion in a unified manner, bridging perception and embodied simulation.

The taxonomy reveals several complementary directions. Egocentric Motion Synthesis and Forecasting focuses on predicting or generating body movements from egocentric cues, often leveraging head-mounted sensors or sparse observations. Egocentric Human-Object Interaction Modeling emphasizes the interplay between hands, objects, and scene context during manipulation tasks. Egocentric Perception and Representation Learning develops encoders and features tailored to the first-person viewpoint. Egocentric Benchmarks and Data Capture provides datasets and capture protocols (e.g., Interaction Replica[7], Ego Humans[4]) that ground these methods in real recordings. Video-Centric Generation and Simulation explores broader video synthesis techniques applicable to egocentric settings. Finally, Applications and Unified Frameworks integrates these components into end-to-end systems for mixed reality, robotics, and content creation.

Within Motion Synthesis and Forecasting, a particularly active line of work targets joint video-motion generation, where models must produce coherent egocentric imagery alongside plausible body kinematics. EgoTwin[0] exemplifies this direction by jointly modeling the visual and motion streams, ensuring temporal consistency between what the camera sees and how the body moves. Its closest neighbor, Egogen[1], similarly tackles egocentric video synthesis but may place different emphasis on conditioning modalities or diffusion architectures. Nearby efforts such as Rehearsal Reality[5] and Humanoid VLA[2] explore related themes (immersive simulation and embodied control), highlighting trade-offs between realism, interactivity, and computational cost.
A key open question across these works is how to balance high-fidelity rendering with real-time motion forecasting, especially when integrating object interactions (e.g., Egocentric Human Object[3]) or adapting to diverse user behaviors captured in large-scale benchmarks.

Claimed Contributions

Joint egocentric video and human motion generation task

The authors define a new task that requires generating synchronized egocentric video and human motion sequences from text descriptions, an initial pose, and an initial observation. This task explicitly models the tight coupling between camera motion and body movement in first-person perspectives, addressing viewpoint alignment and causal interplay challenges.

Retrieved papers compared: 8
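The input/output contract of this task, as described above, could be sketched as follows. All names and types here are illustrative assumptions for exposition, not the paper's actual interface: the model maps a text prompt, an initial body pose, and an initial egocentric observation to time-aligned video and motion streams.

```python
from dataclasses import dataclass
from typing import List, Tuple

Pose = List[Tuple[float, float, float]]      # per-joint 3D positions (illustrative)
RGBFrame = List[List[Tuple[int, int, int]]]  # H x W grid of RGB pixels (illustrative)

@dataclass
class JointGenerationRequest:
    text: str              # language description of the activity
    init_pose: Pose        # wearer's initial body pose
    init_frame: RGBFrame   # initial egocentric observation

@dataclass
class JointGenerationResult:
    video: List[RGBFrame]  # generated egocentric frames
    motion: List[Pose]     # generated body poses, time-aligned with the video

def is_synchronized(result: JointGenerationResult) -> bool:
    # The task requires one body pose per generated video frame.
    return len(result.video) == len(result.motion)
```

The synchronization check is the crux of the task definition: unlike independent video or motion generation, both streams must share one timeline.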
Head-centric motion representation

The authors propose a novel motion representation that explicitly exposes head joint pose and velocity, replacing the conventional root-centric representation. This reformulation facilitates accurate alignment between egocentric camera trajectories and head motion, which is critical for viewpoint consistency.

Retrieved papers compared: 3
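A minimal sketch (not the authors' code) of what such a head-centric representation could look like: every joint is re-expressed relative to the head joint, and the head's own position and velocity are kept explicit so they can be aligned directly with the egocentric camera trajectory. The joint indexing and finite-difference velocity are assumptions for illustration.

```python
HEAD = 0  # assumed index of the head joint in each frame

def to_head_centric(frames, dt=1.0 / 30.0):
    """frames: list of frames, each a list of (x, y, z) joint positions.
    Returns per-frame dicts with the head position, head velocity (finite
    difference over dt), and the remaining joints relative to the head."""
    out = []
    for t, joints in enumerate(frames):
        head = joints[HEAD]
        prev = frames[t - 1][HEAD] if t > 0 else head
        vel = tuple((c - p) / dt for c, p in zip(head, prev))
        rel = [tuple(c - h for c, h in zip(j, head))
               for i, j in enumerate(joints) if i != HEAD]
        out.append({"head_pos": head, "head_vel": vel, "joints_rel": rel})
    return out
```

Contrast with a root-centric layout, where the head pose would have to be recovered through a kinematic chain before it could be matched against the camera trajectory; anchoring at the head makes that alignment a direct comparison.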
EgoTwin diffusion transformer framework with interaction mechanism

The authors develop a triple-branch diffusion transformer architecture with modality-specific branches for text, video, and motion. The framework employs asynchronous diffusion and a structured attention mask inspired by cybernetic observation-action loops to model the causal interplay between visual observations and human actions.

Retrieved papers compared: 10
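To make the observation-action loop concrete, here is one hypothetical reading of such a structured attention mask, under an assumed token ordering; this is an illustration, not the paper's exact design. At frame t, motion tokens may attend to video tokens up to and including t (actions follow observations), while video tokens may attend only to motion tokens from strictly earlier frames (observations reflect past actions).

```python
def interaction_mask(n_frames):
    """Boolean mask over 2*n_frames tokens, ordered as
    [video_0..video_{n-1}, motion_0..motion_{n-1}]; True = may attend."""
    n = n_frames
    size = 2 * n
    mask = [[False] * size for _ in range(size)]
    for q in range(size):
        for k in range(size):
            q_is_motion, qt = q >= n, q % n
            k_is_motion, kt = k >= n, k % n
            if q_is_motion and not k_is_motion:
                mask[q][k] = kt <= qt  # motion at t sees video up to t
            elif (not q_is_motion) and k_is_motion:
                mask[q][k] = kt < qt   # video at t sees motion before t
            else:
                mask[q][k] = kt <= qt  # within-modality causal attention
    return mask
```

The asymmetry between the two cross-modal rules (<= versus <) is what breaks the symmetry of plain joint attention and encodes a one-step observation-then-action cycle.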

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: Joint egocentric video and human motion generation task

Contribution: Head-centric motion representation

Contribution: EgoTwin diffusion transformer framework with interaction mechanism