MTVCraft: Tokenizing 4D Motion for Arbitrary Character Animation

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Character Animation, Motion Tokenization, Video Generation
Abstract:

Character image animation has advanced rapidly with the rise of digital humans. However, existing methods rely largely on 2D-rendered pose images for motion guidance, which limits generalization and discards the 4D information essential for open-world animation. To address this, we propose MTVCraft (Motion Tokenization Video Crafter), the first framework that directly models raw 3D motion sequences (i.e., 4D motion) for character image animation. Specifically, we introduce 4DMoT (4D motion tokenizer) to quantize 3D motion sequences into 4D motion tokens. Compared to 2D-rendered pose images, 4D motion tokens provide more robust spatio-temporal cues and avoid strict pixel-level alignment between pose images and the character, enabling more flexible and disentangled control. We then introduce MV-DiT (Motion-aware Video DiT). By designing a unique motion attention mechanism with 4D positional encodings, MV-DiT can effectively leverage motion tokens as a compact yet expressive 4D context for character image animation in the complex 4D world. We implement MTVCraft on both CogVideoX-5B (small scale) and Wan-2.1-14B (large scale), demonstrating that the framework scales easily across model sizes. Experiments on the TikTok and Fashion benchmarks demonstrate state-of-the-art performance. Moreover, powered by robust motion tokens, MTVCraft shows strong zero-shot generalization: it can animate arbitrary characters in both single- and multi-character settings, in full-body and half-body forms, and even non-human objects across diverse styles and scenarios. It thus marks a significant step forward in this field and opens a new direction for pose-guided video generation.
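At its core, the tokenizer described in the abstract maps continuous 3D motion to discrete codes, in the spirit of vector quantization. The sketch below is illustrative only, not the authors' implementation: the function name `quantize_motion`, the codebook size, and the use of raw 3-D joint positions as features are all assumptions; a real tokenizer would learn the codebook (e.g., VQ-VAE-style) and quantize encoded features rather than raw coordinates.

```python
import numpy as np

def quantize_motion(motion, codebook):
    """Map each per-frame joint feature to its nearest codebook entry.

    motion:   (T, J, D) array - T frames, J joints, D-dim features
    codebook: (K, D) array    - K learned code vectors
    Returns token indices (T, J) and the quantized motion (T, J, D).
    """
    T, J, D = motion.shape
    flat = motion.reshape(-1, D)                       # (T*J, D)
    # Squared Euclidean distance from every token to every code vector.
    d2 = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    tokens = d2.argmin(axis=1)                         # (T*J,)
    quantized = codebook[tokens].reshape(T, J, D)
    return tokens.reshape(T, J), quantized

# Toy example: 8 frames, 24 joints, 3-D positions, 512-code vocabulary.
rng = np.random.default_rng(0)
motion = rng.normal(size=(8, 24, 3))
codebook = rng.normal(size=(512, 3))
tokens, quantized = quantize_motion(motion, codebook)
print(tokens.shape)  # (8, 24): one discrete token per joint per frame
```

The resulting (T, J) grid of indices is what a downstream generator can attend over, avoiding any pixel-level alignment with the reference image.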

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's claimed tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note also that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: Character image animation using 3D motion sequences. The field organizes itself around several complementary dimensions. Motion Representation and Guidance explores how to encode and condition animation on 3D pose or skeletal data, including discrete tokenization schemes that compress motion into learnable vocabularies. Generative Models and Synthesis Frameworks focuses on diffusion-based and neural rendering pipelines that translate motion signals into realistic character videos. Motion Capture and Data Acquisition addresses the upstream problem of obtaining high-quality 3D motion, whether through marker-based systems or markerless vision approaches. Animation Editing and Control provides tools for temporal refinement and user-driven adjustments, while Data-Driven Motion Synthesis and Retrieval leverages large motion databases for retrieval and blending. Application Domains span talking heads, full-body dance, and game characters, and Technical Foundations covers enabling components like inverse kinematics and skeletal rigging. Surveys and Comprehensive Reviews tie these threads together, offering periodic snapshots of progress across methods such as Champ[1], Animate-X[2], and CharacterShot[4].

Within Motion Representation, a particularly active line of work examines motion tokenization and discrete representation, seeking compact codes that generative models can consume efficiently. MTVCraft[0] sits squarely in this branch, proposing a tokenization strategy that bridges 3D motion sequences and video synthesis. Its close neighbor MTVCrafter[44] shares a similar emphasis on discrete motion vocabularies, suggesting a small cluster dedicated to learned motion codebooks. By contrast, works like Champ[1] and Animate-X[2] often rely on continuous pose embeddings or direct skeletal conditioning, trading off the compactness of tokens for more direct geometric control.
Meanwhile, methods such as VidAnimator[6] and Make-an-animation[7] explore end-to-end diffusion pipelines that may bypass explicit tokenization altogether. The central tension revolves around whether discrete motion codes offer better generalization and editability, or whether continuous representations preserve finer kinematic detail. MTVCraft[0] thus contributes to an emerging conversation on how best to distill motion into a form that balances expressiveness, efficiency, and compatibility with modern generative architectures.

Claimed Contributions

4D Motion Tokens for character animation

The authors introduce a new motion representation called 4D Motion Tokens that discretizes spatial and temporal motion information. This representation aims to enable more effective character animation by capturing motion dynamics in a tokenized format.

10 retrieved papers
MTVCraft framework for arbitrary character animation

The authors develop MTVCraft, a unified framework that uses the proposed 4D Motion Tokens to perform character animation for arbitrary characters. The framework is designed to handle diverse animation scenarios using the tokenized motion representation.

10 retrieved papers
Motion tokenization approach combining spatial and temporal dimensions

The authors present a core methodological insight of tokenizing motion across both spatial and temporal dimensions simultaneously, rather than treating them separately. This approach forms the foundation of their 4D Motion Tokens representation.

10 retrieved papers
Status: Can Refute
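One simple way to realize a joint spatio-temporal encoding, as claimed in the third contribution, is to dedicate half of each token's positional channels to the frame index and half to the joint index, so every (frame, joint) cell of the motion grid receives a single combined embedding rather than two separate ones. The sketch below is a hedged illustration using standard sinusoidal encodings; the function names and the 50/50 channel split are assumptions, not the paper's exact 4D positional encoding.

```python
import numpy as np

def sincos_1d(positions, dim):
    """Standard sinusoidal encoding of a 1-D position sequence."""
    freqs = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))
    angles = positions[:, None] * freqs[None, :]        # (N, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def motion_token_pos_enc(T, J, dim):
    """Combined spatio-temporal encoding for a (T, J) motion-token grid:
    the first dim/2 channels encode the frame index, the last dim/2 the
    joint index, giving each (frame, joint) token a unique embedding."""
    t_enc = sincos_1d(np.arange(T, dtype=float), dim // 2)  # (T, dim/2)
    j_enc = sincos_1d(np.arange(J, dtype=float), dim // 2)  # (J, dim/2)
    # Broadcast both halves over the (T, J) grid and concatenate channels.
    grid = np.concatenate(
        [np.repeat(t_enc[:, None, :], J, axis=1),
         np.repeat(j_enc[None, :, :], T, axis=0)], axis=-1)
    return grid                                             # (T, J, dim)

pe = motion_token_pos_enc(T=8, J=24, dim=64)
print(pe.shape)  # (8, 24, 64)
```

Because time and joint identity live in one embedding per token, attention over these tokens can relate any joint at any frame to any other in a single pass, which is the practical payoff of tokenizing the two dimensions together.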

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: 4D Motion Tokens for character animation
Contribution 2: MTVCraft framework for arbitrary character animation
Contribution 3: Motion tokenization approach combining spatial and temporal dimensions