MotionWeaver: Holistic 4D-Anchored Framework for Multi-Humanoid Image Animation

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Character Animation, Diffusion Model, Video Generation
Abstract:

Character image animation, which synthesizes videos of reference characters driven by pose sequences, has advanced rapidly but remains largely limited to single-human settings. Existing methods struggle to generalize to multi-humanoid scenarios, which involve diverse humanoid forms, complex interactions, and frequent occlusions. We address this gap with two key innovations. First, we introduce unified motion representations that extract identity-agnostic 4D motions and explicitly bind them to corresponding characters, enabling generalization across diverse humanoid forms and seamless extension to multi-humanoid scenarios. Second, we propose a holistic 4D-anchored paradigm that constructs a shared 4D space to fuse motion representations with video latents, and further reinforces this process with hierarchical 4D-level supervision to better handle interactions and occlusions. We instantiate these ideas in MotionWeaver, an end-to-end framework for multi-humanoid image animation. To support this setting, we curate a 46-hour dataset of multi-human videos with rich interactions, and construct a 300-video benchmark featuring paired humanoid characters. Quantitative and qualitative experiments demonstrate that MotionWeaver not only achieves state-of-the-art results on our benchmark but also generalizes effectively across diverse humanoid forms, complex interactions, and challenging multi-humanoid scenarios.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces MotionWeaver, a framework for multi-humanoid image animation driven by pose sequences, emphasizing unified 4D motion representations and hierarchical supervision. It resides in the 'Unified and 4D Motion Representations' leaf, which contains only two papers: this submission and one sibling, MTVCrafter. This leaf sits within the broader 'Motion Representation and Encoding' branch, indicating a relatively sparse research direction focused on spatiotemporal motion encoding. The small sibling count suggests that this specific approach, unifying motion across diverse humanoid forms in 4D space, is not yet densely explored.

The taxonomy reveals neighboring leaves addressing related but distinct challenges: 'Enhanced Motion Representation for Character Animation' tackles generalization across character types, while the 'Pose-Based Animation Control' branches explore alignment-free and zero-shot methods. The 'Motion Synthesis and Generation' subtree includes video-pose diffusion and skeletal synthesis, which share temporal modeling concerns but differ in their generative focus. MotionWeaver's emphasis on multi-humanoid scenarios and on explicitly binding identity-agnostic motions to their corresponding characters distinguishes it from these adjacent directions, which largely target single-character or non-unified representations.

For the single contribution analyzed ('MotionWeaver framework for multi-humanoid image animation'), one candidate paper was examined, and it was not found to clearly refute the contribution. Because only one candidate from semantic retrieval was examined, the analysis cannot comprehensively assess overlap with prior work. The absence of a refuting candidate in this small sample may reflect either genuine novelty or insufficient coverage of the broader literature. The contribution's focus on multi-humanoid settings and 4D-anchored fusion may differentiate it from existing single-character or 2D/3D-only methods.

Based on the limited search (one candidate examined), the framework appears to occupy a relatively underexplored niche within multi-humanoid animation. However, the small sample size precludes strong conclusions about field-wide novelty. The taxonomy structure indicates sparse activity in unified 4D representations, but a more exhaustive search across the 'Pose-Based Animation Control' and 'Motion Synthesis' branches would be necessary to fully contextualize the work's originality.

Taxonomy

Core-task Taxonomy Papers: 34
Claimed Contributions: 1
Contribution Candidate Papers Compared: 1
Refutable Papers: 0

Research Landscape Overview

Core task: Multi-humanoid image animation driven by pose sequences. This field addresses the challenge of animating multiple human figures in images by leveraging pose information as control signals. The taxonomy reveals a rich landscape organized around several key themes. Motion Representation and Encoding explores how pose and motion data are structured, including unified and 4D representations that capture spatial and temporal dynamics. Pose-Based Animation Control focuses on methods that directly use skeletal or keypoint-based guidance to drive character movement, while Motion Synthesis and Generation encompasses generative approaches for creating realistic animations. Motion Retargeting and Adaptation deals with transferring motion across different character morphologies, and Multimodal and Expressive Animation Control integrates diverse input modalities, such as text, audio, or sketches, to enrich expressiveness. Additional branches cover foundational aspects like Pose Estimation and Tracking, classical Skeletal Animation Techniques, and practical concerns such as Animation Standards and Data Compression.

Recent work has seen particularly active development in unified motion representations and multimodal control strategies. For instance, MotionWeaver[0] and its close neighbor MTVCrafter[21] both emphasize holistic 4D motion encoding, enabling coherent multi-character animation from pose sequences. Meanwhile, methods like Animate-X[2] and Animate-X++[3] push the boundaries of expressive control by integrating diverse modalities, and Follow-Your-Pose v2[4] refines pose-driven synthesis with improved temporal consistency. MotionWeaver[0] sits within the Unified and 4D Motion Representations cluster, sharing conceptual ground with MTVCrafter[21] in its emphasis on spatiotemporal coherence, yet it distinguishes itself by targeting multi-humanoid scenarios more explicitly. Compared to works like Animax[5] or Versatile Multimodal Controls[9], which explore broader multimodal inputs, MotionWeaver[0] maintains a tighter focus on pose-sequence-driven animation, balancing representational richness with controllability for complex multi-character scenes.

Claimed Contributions

MotionWeaver framework for multi-humanoid image animation

The authors introduce MotionWeaver, a holistic framework that uses 4D-anchored representations to animate multiple humanoid characters from images. The framework addresses the challenge of animating scenes with multiple human subjects simultaneously.

1 retrieved paper

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
