Animating the Uncaptured: Humanoid Mesh Animation with Video Diffusion Models

ICLR 2026 Conference Submission
Anonymous Authors
Motion generation · Motion Tracking & Transfer
Abstract:

Animation of humanoid characters is essential in many graphics applications, but creating realistic animations requires significant time and cost. We propose an approach to synthesize 4D animated sequences from input static 3D humanoid meshes, leveraging strong generalized motion priors from generative video models -- as such video models contain powerful motion information covering a wide variety of human motions. From an input static 3D humanoid mesh and a text prompt describing the desired animation, we synthesize a corresponding video conditioned on a rendered image of the 3D mesh. We then employ an underlying SMPL representation to animate the 3D mesh according to the video-generated motion, based on our motion optimization. This enables a cost-effective and accessible solution for synthesizing diverse and realistic 4D animations.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 3

Research Landscape Overview

Core task: Humanoid mesh animation from text prompts using video diffusion models. The field encompasses diverse approaches to generating and controlling human motion, organized into several major branches. Motion Representation and Generation Frameworks explore foundational architectures for synthesizing movement, including mesh-based methods that directly deform character geometry and skeletal or latent representations that encode motion abstractly. Motion Control and Conditioning focuses on how external signals—text descriptions, audio cues, or spatial constraints—guide the generation process, enabling fine-grained user control. Specialized Motion Synthesis Tasks address domain-specific challenges such as hand gestures, object interactions, or dance choreography, while Human Image and Video Animation targets the problem of animating static images or driving video sequences with new motion. Meanwhile, 3D Avatar and Character Creation deals with building animatable digital humans from scratch, and Motion Priors and Reconstruction leverages learned distributions or inverse methods to recover plausible motion from partial observations. Data, Evaluation, and Generalization examines benchmarks, metrics, and cross-domain robustness, and Application-Specific Systems integrates these techniques into end-user tools for gaming, virtual reality, or content creation.

Within Motion Representation and Generation Frameworks, a particularly active line of work centers on mesh-based motion synthesis, where methods like Animating the Uncaptured[0] and MotionDreamer[12] leverage video diffusion priors to deform character meshes in a temporally coherent manner. These approaches contrast with skeletal or latent-space methods such as MotionDiffuse[2] and Text-driven human motion generation[3], which operate on abstract pose representations before retargeting to geometry.
Animating the Uncaptured[0] sits within the Video-Guided Mesh Deformation cluster, emphasizing direct mesh manipulation driven by learned video features. It is closely related to Towards motion from video[1] and MotionDreamer[32], which similarly exploit video signals for geometry-level animation. This focus on mesh-level fidelity distinguishes it from works like MagicAnimate[5] or UniAnimate[15], which prioritize image-space rendering and may sacrifice geometric precision for visual plausibility. Key open questions include balancing mesh detail with computational efficiency and generalizing across diverse body shapes and motion styles without extensive per-character tuning.

Claimed Contributions

Text-to-motion generation approach leveraging video diffusion models

The authors propose a novel approach to synthesize 4D animated sequences of static 3D humanoid meshes by leveraging motion priors from generative video models. Given a static mesh and text prompt, they generate a video using a T2V diffusion model and transfer the motion to the mesh.

10 retrieved papers
Can Refute
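The claimed pipeline can be summarized as: render the static mesh to a conditioning image, generate a video from that image plus the text prompt with a T2V diffusion model, then optimize mesh motion against the generated frames. A minimal structural sketch of that flow is below; every function name here is an illustrative placeholder standing in for the paper's actual components (renderer, video diffusion model, SMPL-based motion optimizer), not their API.

```python
# Hedged sketch of the described text-to-4D pipeline. All names are
# placeholders; stubs stand in for the renderer, the T2V diffusion
# model, and the SMPL-based motion-optimization stage.

def render_front_view(mesh):
    """Stub: render the static mesh to a conditioning image."""
    return {"image_of": mesh}

def generate_video(image, prompt, num_frames=16):
    """Stub: image+text-conditioned video diffusion."""
    return [{"frame": t, "cond": image, "prompt": prompt}
            for t in range(num_frames)]

def track_and_optimize(frames, mesh):
    """Stub: per-frame SMPL motion optimization applied to the mesh."""
    return [{"t": f["frame"], "deformed": mesh} for f in frames]

def animate(mesh, prompt):
    """Full pipeline: one deformed mesh per generated video frame."""
    image = render_front_view(mesh)
    frames = generate_video(image, prompt)
    return track_and_optimize(frames, mesh)

sequence = animate("humanoid.obj", "a person waves hello")
```

The point of the sketch is the data flow (mesh → image → video → per-frame deformed meshes), which is what distinguishes this video-prior approach from skeletal text-to-motion methods that never touch the geometry until retargeting.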
Robust motion tracking pipeline combining multiple cues

The authors develop a tracking pipeline that extracts and combines 2D body landmarks, silhouettes, and dense DINOv2 features from generated video frames to accurately reconstruct and transfer motion to the input mesh using SMPL as a deformation proxy.

10 retrieved papers
SMPL-based deformation proxy with barycentric reparameterization

The authors introduce a method to register the SMPL body model to the input mesh and reparameterize mesh vertices using barycentric coordinates relative to SMPL faces, enabling motion transfer through optimization of SMPL parameters while maintaining mesh structure.

2 retrieved papers
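The barycentric reparameterization idea is concrete enough to sketch: each mesh vertex is bound to a face of the registered SMPL proxy and stored as barycentric coordinates, so that posing the proxy reposes the mesh. The sketch below uses a simplified nearest-centroid binding and omits the normal-offset component a full method would keep; both simplifications are assumptions, as is every function name.

```python
import numpy as np

# Hedged sketch of barycentric reparameterization against a proxy mesh
# (SMPL in the paper). Nearest-centroid binding is a simplification; a
# full method would also store a signed offset along the face normal.

def bind_to_proxy(mesh_verts, proxy_verts, proxy_faces):
    """Bind each mesh vertex to its nearest proxy face and compute
    barycentric coordinates within that face."""
    tris = proxy_verts[proxy_faces]                      # (F, 3, 3)
    centroids = tris.mean(axis=1)                        # (F, 3)
    dists = np.linalg.norm(mesh_verts[:, None] - centroids[None], axis=-1)
    face_ids = np.argmin(dists, axis=1)                  # (V,)
    bary = []
    for v, f in zip(mesh_verts, face_ids):
        a, b, c = proxy_verts[proxy_faces[f]]
        # Solve v ~= u*a + w*b + (1 - u - w)*c in least squares.
        A = np.stack([a - c, b - c], axis=1)             # (3, 2)
        uw, *_ = np.linalg.lstsq(A, v - c, rcond=None)
        bary.append([uw[0], uw[1], 1.0 - uw[0] - uw[1]])
    return face_ids, np.asarray(bary)

def reproject(posed_proxy_verts, proxy_faces, face_ids, bary):
    """Recover mesh vertices from the posed proxy via stored
    barycentric coordinates -- motion transfer in one line."""
    tris = posed_proxy_verts[proxy_faces[face_ids]]      # (V, 3, 3)
    return np.einsum("vk,vkd->vd", bary, tris)
```

This is what makes SMPL usable as a deformation proxy: motion is optimized only over SMPL parameters, while the original mesh connectivity and detail ride along through the fixed barycentric binding.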

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Text-to-motion generation approach leveraging video diffusion models

Contribution 2: Robust motion tracking pipeline combining multiple cues

Contribution 3: SMPL-based deformation proxy with barycentric reparameterization