Instilling an Active Mind in Avatars via Cognitive Simulation

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Video Generation, Human Animation, Avatar, Multimedia
Abstract:

Current video avatar models can generate fluid animations but struggle to capture a character's authentic essence, primarily synchronizing motion with low-level audio cues rather than understanding higher-level semantics such as emotion or intent. To bridge this gap, we propose a novel framework for generating character animations that are not only physically plausible but also semantically rich and expressive. Our model is built on two technical innovations. First, we employ Multimodal Large Language Models to generate a structured textual representation from the input conditions, providing high-level semantic guidance for creating contextually and emotionally resonant actions. Second, to ensure robust fusion of multimodal signals, we introduce a specialized Multimodal Diffusion Transformer architecture featuring a novel Pseudo Last Frame design. This allows our model to accurately interpret the joint semantics of audio, images, and text, generating motions that are deeply coherent with the overall context. Comprehensive experiments validate the superiority of our method, which achieves compelling results in lip-sync accuracy, video quality, motion naturalness, and semantic consistency. The approach also generalizes well to challenging scenarios, including multi-person and non-human subjects. Our video results are available at https://anonymous.4open.science/w/InstillinganActiveMindinAvatars_Anonymous/.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Paper: 1

Research Landscape Overview

Core task: Semantically coherent character animation from multimodal inputs. The field has evolved into a rich ecosystem organized around input modalities and synthesis objectives. Audio-Driven Facial and Co-Speech Animation focuses on generating expressive talking heads and gestures synchronized with speech, often leveraging datasets like BEAT Dataset[14] and methods such as SemTalk[1] and MedTalk[9]. Image-to-Video Character Animation emphasizes animating static portraits or full-body characters from reference images, with works like Animate Anyone[2] and HunyuanVideo HOMA[8] demonstrating strong visual fidelity. Text-to-Motion Synthesis and Control explores language-conditioned body motion generation, balancing semantic understanding with physical plausibility through approaches like MoFusion[7] and Efficient Text Motion[12]. Scene-Aware and Context-Conditioned Motion Synthesis integrates environmental constraints, while Long-Form and Narrative-Driven Animation tackles temporal coherence over extended sequences using methods such as MovieDreamer[5] and Story to Motion[25]. Motion Style Transfer and Cross-Domain Adaptation addresses stylistic variation, and Unified and Multimodal Motion Synthesis Frameworks aim to handle diverse input combinations within single architectures, exemplified by Unified Multimodal Motion[46] and Multimodal Autoregressive Motion[44].

Recent efforts reveal a tension between modality-specific depth and cross-modal generalization. Audio-driven methods achieve fine-grained lip-sync and gesture timing but may struggle with broader semantic grounding, whereas text-driven approaches offer flexible high-level control at the cost of temporal precision.

Active Mind Avatars[0] sits within the Image-to-Video Character Animation branch, specifically in Multimodal-Driven Portrait and Character Animation, where it shares conceptual ground with InterActHuman[13] and Versatile Multimodal Controls[27]. Unlike purely image-based animators such as Animate Anyone 2[39], Active Mind Avatars[0] emphasizes integrating cognitive or semantic signals to drive character behavior, aligning with the broader push toward semantically aware synthesis seen in works like Semantically Consistent Motion[20] and AI Knowledge Motion[35]. This positioning reflects an emerging interest in bridging low-level visual fidelity with higher-level intentionality, a theme that cuts across multiple branches and remains an active area of exploration.

Claimed Contributions

Dual-system cognitive framework for video avatar generation

The authors introduce a novel perspective that frames video avatar generation using dual-process cognitive theory, distinguishing between reactive System 1 processes (low-level audio-to-motion mappings) and deliberative System 2 processes (high-level semantic reasoning). This framework addresses the limitation that existing methods only simulate reactive behavior without contextual reasoning.

1 retrieved paper
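To make the dual-process framing above concrete, the following is a minimal sketch of the control flow it implies: a slow, deliberative planning path (System 2) produces structured semantic guidance, which a fast, reactive audio-to-motion path (System 1) consumes. All names (SemanticPlanner, ReactiveMotionModel, Guidance, animate) are hypothetical illustrations under assumed interfaces, not the authors' implementation.

```python
# Minimal sketch of a dual-process control loop for avatar generation.
# Assumption: the deliberative path is an MLLM call; the reactive path is a
# diffusion-based audio-to-motion model. Names are hypothetical.

from dataclasses import dataclass


@dataclass
class Guidance:
    """High-level semantic guidance produced by the deliberative (System 2) path."""
    emotion: str
    intent: str
    action_plan: list[str]


class SemanticPlanner:
    """System 2: reasons over the transcript and scene context."""
    def plan(self, transcript: str, scene_description: str) -> Guidance:
        # In practice this would call a multimodal LLM; a stub is returned here.
        return Guidance(emotion="calm", intent="explain", action_plan=["gesture toward chart"])


class ReactiveMotionModel:
    """System 1: low-level audio-to-motion synthesis (e.g., lip sync, beat gestures)."""
    def generate(self, audio_features, guidance: Guidance):
        # The backbone consumes raw audio features plus the structured guidance,
        # so reactive motion stays on-beat while overall behavior follows the plan.
        ...


def animate(audio_features, transcript: str, scene_description: str):
    guidance = SemanticPlanner().plan(transcript, scene_description)   # slow, deliberative
    return ReactiveMotionModel().generate(audio_features, guidance)    # fast, reactive
```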
MLLM-based agentic reasoning module with specialized MMDiT architecture

The framework employs Multimodal Large Language Models as agents to generate high-level semantic guidance through multi-step reasoning (Analyzer and Planner). It integrates this with a specialized Multimodal Diffusion Transformer architecture that uses symmetric fusion of text, audio, and video branches, along with a novel pseudo-last-frame conditioning strategy to mitigate modal interference.

10 retrieved papers
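To illustrate the "symmetric fusion" claim above, here is a minimal PyTorch sketch of an MMDiT-style joint-attention block in which text, audio, and video tokens keep modality-specific projections but attend over one concatenated sequence. Dimensions, module names, and the absence of timestep modulation are assumptions for illustration; the Analyzer/Planner prompting and pseudo-last-frame conditioning are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointAttentionBlock(nn.Module):
    """MMDiT-style block: each modality has its own norm/QKV/output projection
    ("symmetric" branches), but attention runs over the concatenated token
    sequence so text, audio, and video tokens can attend to one another."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        mods = ("text", "audio", "video")
        self.norm = nn.ModuleDict({m: nn.LayerNorm(dim) for m in mods})
        self.qkv = nn.ModuleDict({m: nn.Linear(dim, 3 * dim) for m in mods})
        self.proj = nn.ModuleDict({m: nn.Linear(dim, dim) for m in mods})

    def forward(self, tokens: dict) -> dict:
        qs, ks, vs, lengths = [], [], [], {}
        for m, x in tokens.items():
            b, n, _ = x.shape
            lengths[m] = n
            q, k, v = self.qkv[m](self.norm[m](x)).chunk(3, dim=-1)
            for buf, t in ((qs, q), (ks, k), (vs, v)):
                buf.append(t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2))
        # Joint attention over the concatenated text + audio + video sequence.
        q, k, v = (torch.cat(t, dim=2) for t in (qs, ks, vs))
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(out.shape[0], -1, self.num_heads * self.head_dim)
        # Split the joint sequence back into per-modality streams with residuals.
        outputs, start = {}, 0
        for m, x in tokens.items():
            outputs[m] = x + self.proj[m](out[:, start:start + lengths[m]])
            start += lengths[m]
        return outputs


# Example: fuse 32 text, 64 audio, and 256 video tokens of width 512.
block = JointAttentionBlock(dim=512)
fused = block({m: torch.randn(2, n, 512) for m, n in [("text", 32), ("audio", 64), ("video", 256)]})
```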
Pseudo-last-frame conditioning strategy

A novel conditioning mechanism that discards the reference image during training and instead uses a pseudo-last-frame with shifted positional encoding during inference. This approach eliminates training artifacts where models learn spurious correlations between reference images and generated sequences, enabling better motion dynamics while maintaining identity consistency.

10 retrieved papers
Can Refute
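The pseudo-last-frame idea described above can be sketched as a small change to how conditioning frames and their positions are assembled. The snippet below is a hedged illustration under assumed shapes and an assumed position-shift scheme (the gap value and function names are inventions for illustration, not the paper's exact design): at inference the reference latent is appended after the clip with a frame position shifted past the clip, whereas training uses only the clip frames, so the model never ties the reference image to a fixed position in the generated sequence.

```python
import torch


def build_frame_positions(num_video_frames: int, pseudo_frame_gap: int = 4) -> torch.Tensor:
    """Frame-axis position ids for the denoised clip plus one pseudo last frame.

    Training sees only positions 0..T-1, so no spurious correlation with a
    reference frame is learned. At inference the reference latent gets a
    position shifted beyond the clip (here by `pseudo_frame_gap`, an assumed
    value), anchoring identity without acting as a temporal neighbor.
    """
    clip_positions = torch.arange(num_video_frames)
    pseudo_position = torch.tensor([num_video_frames - 1 + pseudo_frame_gap])
    return torch.cat([clip_positions, pseudo_position])


def append_pseudo_last_frame(video_latents: torch.Tensor, ref_latent: torch.Tensor) -> torch.Tensor:
    """Concatenate the reference latent after the clip along the frame axis."""
    # video_latents: (B, T, C, H, W); ref_latent: (B, 1, C, H, W)
    return torch.cat([video_latents, ref_latent], dim=1)


# Example: a 16-frame clip conditioned on one reference image at inference time.
latents = append_pseudo_last_frame(torch.randn(1, 16, 4, 32, 32), torch.randn(1, 1, 4, 32, 32))
positions = build_frame_positions(16)  # tensor([0, 1, ..., 15, 19])
```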

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Dual-system cognitive framework for video avatar generation

The authors introduce a novel perspective that frames video avatar generation using dual-process cognitive theory, distinguishing between reactive System 1 processes (low-level audio-to-motion mappings) and deliberative System 2 processes (high-level semantic reasoning). This framework addresses the limitation that existing methods only simulate reactive behavior without contextual reasoning.

Contribution

MLLM-based agentic reasoning module with specialized MMDiT architecture

The framework employs Multimodal Large Language Models as agents to generate high-level semantic guidance through multi-step reasoning (Analyzer and Planner). It integrates this with a specialized Multimodal Diffusion Transformer architecture that uses symmetric fusion of text, audio, and video branches, along with a novel pseudo-last-frame conditioning strategy to mitigate modal interference.

Contribution

Pseudo-last-frame conditioning strategy

A novel conditioning mechanism that discards the reference image during training and instead uses a pseudo-last-frame with shifted positional encoding during inference. This approach eliminates training artifacts where models learn spurious correlations between reference images and generated sequences, enabling better motion dynamics while maintaining identity consistency.
