Animating the Uncaptured: Humanoid Mesh Animation with Video Diffusion Models

ICLR 2026 Conference Submission
Anonymous Authors
Motion generation · Motion Tracking & Transfer
Abstract:

Animation of humanoid characters is essential in many graphics applications, but creating realistic animations requires significant time and cost. We propose an approach to synthesize 4D animated sequences from input static 3D humanoid meshes, leveraging strong generalized motion priors from generative video models -- as such video models contain powerful motion information covering a wide variety of human motions. From an input static 3D humanoid mesh and a text prompt describing the desired animation, we synthesize a corresponding video conditioned on a rendered image of the 3D mesh. We then employ an underlying SMPL representation to animate the 3D mesh according to the video-generated motion, based on our motion optimization. This enables a cost-effective and accessible solution for synthesizing diverse and realistic 4D animations.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 3

Research Landscape Overview

Core task: Humanoid mesh animation from text prompts using video diffusion models. The field encompasses diverse approaches to generating and controlling human motion, organized into several major branches. Motion Representation and Generation Frameworks explore foundational architectures for synthesizing movement, including mesh-based methods that directly deform character geometry and skeletal or latent representations that encode motion abstractly. Motion Control and Conditioning focuses on how external signals—text descriptions, audio cues, or spatial constraints—guide the generation process, enabling fine-grained user control. Specialized Motion Synthesis Tasks address domain-specific challenges such as hand gestures, object interactions, or dance choreography, while Human Image and Video Animation targets the problem of animating static images or driving video sequences with new motion. Meanwhile, 3D Avatar and Character Creation deals with building animatable digital humans from scratch, and Motion Priors and Reconstruction leverages learned distributions or inverse methods to recover plausible motion from partial observations. Data, Evaluation, and Generalization examines benchmarks, metrics, and cross-domain robustness, and Application-Specific Systems integrates these techniques into end-user tools for gaming, virtual reality, or content creation.

Within Motion Representation and Generation Frameworks, a particularly active line of work centers on mesh-based motion synthesis, where methods like Animating the Uncaptured[0] and MotionDreamer[12] leverage video diffusion priors to deform character meshes in a temporally coherent manner. These approaches contrast with skeletal or latent-space methods such as MotionDiffuse[2] and Text-driven human motion generation[3], which operate on abstract pose representations before retargeting to geometry.
Animating the Uncaptured[0] sits within the Video-Guided Mesh Deformation cluster, emphasizing direct mesh manipulation driven by learned video features. It is closely related to Towards motion from video[1] and MotionDreamer[32], which similarly exploit video signals for geometry-level animation. This focus on mesh-level fidelity distinguishes it from works like MagicAnimate[5] or UniAnimate[15], which prioritize image-space rendering and may sacrifice geometric precision for visual plausibility. Key open questions include balancing mesh detail with computational efficiency and generalizing across diverse body shapes and motion styles without extensive per-character tuning.

Claimed Contributions

Text-to-motion generation approach leveraging video diffusion models

The authors propose a novel approach to synthesize 4D animated sequences of static 3D humanoid meshes by leveraging motion priors from generative video models. Given a static mesh and text prompt, they generate a video using a T2V diffusion model and transfer the motion to the mesh.

10 retrieved papers
Can Refute
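The claimed pipeline can be summarized as: render the static mesh to a conditioning image, generate a video from that image plus the text prompt with a T2V diffusion model, then optimize mesh motion against the generated frames. A minimal structural sketch of that flow is below; every function name here is an illustrative placeholder standing in for the paper's actual components (renderer, video diffusion model, SMPL-based motion optimizer), not their API.

```python
# Hedged sketch of the described text-to-4D pipeline. All names are
# placeholders; stubs stand in for the renderer, the T2V diffusion
# model, and the SMPL-based motion-optimization stage.

def render_front_view(mesh):
    """Stub: render the static mesh to a conditioning image."""
    return {"image_of": mesh}

def generate_video(image, prompt, num_frames=16):
    """Stub: image+text-conditioned video diffusion."""
    return [{"frame": t, "cond": image, "prompt": prompt}
            for t in range(num_frames)]

def track_and_optimize(frames, mesh):
    """Stub: per-frame SMPL motion optimization applied to the mesh."""
    return [{"t": f["frame"], "deformed": mesh} for f in frames]

def animate(mesh, prompt):
    """Full pipeline: one deformed mesh per generated video frame."""
    image = render_front_view(mesh)
    frames = generate_video(image, prompt)
    return track_and_optimize(frames, mesh)

sequence = animate("humanoid.obj", "a person waves hello")
```

The point of the sketch is the data flow (mesh → image → video → per-frame deformed meshes), which is what distinguishes this video-prior approach from skeletal text-to-motion methods that never touch the geometry until retargeting.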
Robust motion tracking pipeline combining multiple cues

The authors develop a tracking pipeline that extracts and combines 2D body landmarks, silhouettes, and dense DINOv2 features from generated video frames to accurately reconstruct and transfer motion to the input mesh using SMPL as a deformation proxy.

10 retrieved papers
SMPL-based deformation proxy with barycentric reparameterization

The authors introduce a method to register the SMPL body model to the input mesh and reparameterize mesh vertices using barycentric coordinates relative to SMPL faces, enabling motion transfer through optimization of SMPL parameters while maintaining mesh structure.

2 retrieved papers
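The barycentric reparameterization idea is concrete enough to sketch: each mesh vertex is bound to a face of the registered SMPL proxy and stored as barycentric coordinates, so that posing the proxy reposes the mesh. The sketch below uses a simplified nearest-centroid binding and omits the normal-offset component a full method would keep; both simplifications are assumptions, as is every function name.

```python
import numpy as np

# Hedged sketch of barycentric reparameterization against a proxy mesh
# (SMPL in the paper). Nearest-centroid binding is a simplification; a
# full method would also store a signed offset along the face normal.

def bind_to_proxy(mesh_verts, proxy_verts, proxy_faces):
    """Bind each mesh vertex to its nearest proxy face and compute
    barycentric coordinates within that face."""
    tris = proxy_verts[proxy_faces]                      # (F, 3, 3)
    centroids = tris.mean(axis=1)                        # (F, 3)
    dists = np.linalg.norm(mesh_verts[:, None] - centroids[None], axis=-1)
    face_ids = np.argmin(dists, axis=1)                  # (V,)
    bary = []
    for v, f in zip(mesh_verts, face_ids):
        a, b, c = proxy_verts[proxy_faces[f]]
        # Solve v ~= u*a + w*b + (1 - u - w)*c in least squares.
        A = np.stack([a - c, b - c], axis=1)             # (3, 2)
        uw, *_ = np.linalg.lstsq(A, v - c, rcond=None)
        bary.append([uw[0], uw[1], 1.0 - uw[0] - uw[1]])
    return face_ids, np.asarray(bary)

def reproject(posed_proxy_verts, proxy_faces, face_ids, bary):
    """Recover mesh vertices from the posed proxy via stored
    barycentric coordinates -- motion transfer in one line."""
    tris = posed_proxy_verts[proxy_faces[face_ids]]      # (V, 3, 3)
    return np.einsum("vk,vkd->vd", bary, tris)
```

This is what makes SMPL usable as a deformation proxy: motion is optimized only over SMPL parameters, while the original mesh connectivity and detail ride along through the fixed barycentric binding.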

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Text-to-motion generation approach leveraging video diffusion models

Contribution 2: Robust motion tracking pipeline combining multiple cues

Contribution 3: SMPL-based deformation proxy with barycentric reparameterization