MoAlign: Motion-Centric Representation Alignment for Video Diffusion Models

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: video diffusion models, physical plausibility
Abstract:

Text-to-video diffusion models have enabled high-quality video synthesis, yet often fail to generate temporally coherent and physically plausible motion. A key reason is the models' insufficient understanding of complex motions that natural videos often entail. Recent works tackle this problem by aligning diffusion model features with those from pretrained video encoders. However, these encoders mix video appearance and dynamics into entangled features, limiting the benefit of such alignment. In this paper, we propose a motion-centric alignment framework that learns a disentangled motion subspace from a pretrained video encoder. This subspace is optimized to predict ground-truth optical flow, ensuring it captures true motion dynamics. We then align the latent features of a text-to-video diffusion model to this new subspace, enabling the generative model to internalize motion knowledge and generate more plausible videos. Our method improves the physical commonsense in a state-of-the-art video diffusion model, while preserving adherence to textual prompts, as evidenced by empirical evaluations on VideoPhy, VideoPhy2, VBench, and VBench-2.0, along with a user study.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a motion-centric alignment framework that learns a disentangled motion subspace from a pretrained video encoder, optimized to predict optical flow, and aligns text-to-video diffusion features to this subspace. It resides in the 'Optical Flow-Based Motion Modeling' leaf, which contains four papers total including the original work. This leaf sits within the broader 'Motion Representation and Alignment Techniques' branch, indicating a moderately populated research direction focused on explicit motion extraction. The taxonomy shows this is an active but not overcrowded area, with parallel approaches exploring attention-based and latent motion representations in sibling leaves.

The taxonomy reveals neighboring research directions that share conceptual overlap but differ in technical approach. The 'Attention-Based Motion Extraction' leaf (three papers) extracts motion from cross-frame attention maps rather than optical flow, while 'Latent Motion Representations' (two papers) treats motion as a separate generative problem. The 'Decoupled Appearance-Motion Modeling' leaf (three papers) also addresses appearance-motion entanglement but without necessarily using optical flow as the disentanglement signal. These neighboring branches suggest the field is exploring multiple pathways to improve motion modeling, with optical flow being one of several viable strategies.

Among twenty-nine candidates examined, the analysis found limited prior work overlap. Contribution A (motion-specific subspace learning) examined ten candidates with zero refutations, suggesting relative novelty in this specific formulation. Contribution B (soft relational alignment) examined nine candidates and found one refutable match, indicating some precedent for alignment techniques but not necessarily in this exact motion-centric context. Contribution C (motion-centric fine-tuning framework) examined ten candidates with zero refutations. The search scope was constrained to top-K semantic matches plus citation expansion, not an exhaustive survey of all video diffusion literature.

Based on the limited search scope of twenty-nine candidates, the work appears to occupy a distinct position within optical flow-based motion modeling, particularly in its approach to disentangling motion subspaces and aligning diffusion features. The taxonomy context suggests this is a moderately explored direction with clear boundaries from attention-based and latent motion alternatives. However, the analysis does not cover the full breadth of video diffusion research, and additional related work may exist outside the examined candidate set.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 1

Research Landscape Overview

Core task: motion-centric representation alignment for video diffusion models.

The field of video diffusion models has evolved into a rich landscape organized around several complementary themes. Motion Representation and Alignment Techniques focus on how to encode and align motion signals, ranging from optical flow-based methods like MoAlign[0] and Spectral Motion Alignment[17] to trajectory and latent-space approaches exemplified by Align Your Latents[3]. Motion Customization and Transfer explores how to extract and reuse motion patterns from reference videos, as seen in MotionDirector[1] and MotionMatcher[22]. Temporal Modeling Architectures and Training Strategies address the backbone designs and optimization schemes that enable coherent video synthesis, while Conditional Video Generation and Control investigates how text, sketches, or other modalities guide the generation process. Domain-Specific Applications, Video Editing and Motion Manipulation, and Evaluation branches round out the taxonomy by addressing specialized use cases, post-hoc editing workflows, and metrics for assessing motion quality.

Within Motion Representation and Alignment Techniques, a particularly active line of work centers on optical flow-based modeling, where MoAlign[0] sits alongside Spectral Motion Alignment[17] and MoVideo[36]. These approaches share a common emphasis on leveraging dense pixel-level motion cues to improve temporal consistency, yet they differ in how they integrate flow into the diffusion process: some use flow as an explicit conditioning signal, while others align latent features or employ spectral decompositions. MoAlign[0] distinguishes itself by aligning motion representations directly within the diffusion framework, in contrast to Spectral Motion Alignment[17], which emphasizes frequency-domain analysis, and MoVideo[36], which explores motion-guided video synthesis through flow warping. This cluster of optical flow methods reflects ongoing questions about the best granularity and stage at which to inject motion information, balancing computational efficiency against the fidelity of generated dynamics.

Claimed Contributions

Contribution A: Motion-specific subspace learning from pretrained video encoder

The authors propose learning a low-dimensional motion-centric subspace by projecting features from a frozen pretrained video encoder (VideoMAEv2) and supervising them with optical flow prediction. This disentangles motion dynamics from static appearance, creating a motion-only representation space.
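The projection-plus-flow-supervision step described above can be sketched in a few lines. The snippet below is an illustrative stand-in, not the authors' implementation: the shapes, the plain linear projection, the per-token two-channel flow target, and the mean-squared-error objective are all assumptions for the sake of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: T frames, N tokens per frame,
# D encoder dims, d motion-subspace dims.
T, N, D, d = 4, 16, 64, 8

def project_to_motion_subspace(feats, W):
    """Linearly project frozen encoder features (T, N, D)
    into a low-dimensional motion subspace (T, N, d)."""
    return feats @ W

def flow_prediction_loss(motion_feats, flow_targets, W_head):
    """Supervise the subspace by regressing a 2-channel optical
    flow target per token; MSE stands in for the paper's loss."""
    pred = motion_feats @ W_head          # (T, N, 2)
    return float(np.mean((pred - flow_targets) ** 2))

feats = rng.standard_normal((T, N, D))    # frozen VideoMAEv2-style features
W = 0.1 * rng.standard_normal((D, d))     # learnable projection
W_head = 0.1 * rng.standard_normal((d, 2))  # learnable flow head
flow = rng.standard_normal((T, N, 2))     # ground-truth flow per token

motion = project_to_motion_subspace(feats, W)
loss = flow_prediction_loss(motion, flow, W_head)
```

Only `W` and `W_head` would be trained; the encoder stays frozen, so gradients from the flow loss shape the subspace without touching the backbone.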

Retrieved papers compared: 10
Contribution B: Soft relational alignment of diffusion features to motion subspace

The authors introduce a soft relational alignment mechanism that matches the pairwise similarity structure of diffusion model latent features to the learned motion subspace. This alignment uses token relation distillation with temporal weighting to internalize motion understanding directly into the generative model without requiring inference-time conditioning.
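A minimal sketch of what such a relational objective could look like: each frame's teacher and student pairwise similarity matrices are turned into row-wise distributions and matched, with a per-frame weight. The function names, the temperature `tau`, and the KL form of the distillation term are assumptions, not the paper's actual formulation.

```python
import numpy as np

def cosine_sim_matrix(x):
    """Pairwise cosine similarities between tokens; x is (N, D)."""
    x = x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)
    return x @ x.T

def soft_relational_loss(diff_tokens, motion_tokens, temporal_weights, tau=0.1):
    """Match row-wise softmax distributions over token relations between
    diffusion (student) and motion-subspace (teacher) features, with a
    hypothetical per-frame weight; inputs are (T, N, D) arrays."""
    def softmax(s):
        s = s / tau
        s = s - s.max(axis=-1, keepdims=True)
        e = np.exp(s)
        return e / e.sum(axis=-1, keepdims=True)

    total = 0.0
    for t, w in enumerate(temporal_weights):
        p = softmax(cosine_sim_matrix(motion_tokens[t]))  # teacher relations
        q = softmax(cosine_sim_matrix(diff_tokens[t]))    # student relations
        kl = np.sum(p * (np.log(p + 1e-8) - np.log(q + 1e-8)), axis=-1).mean()
        total += w * kl
    return total

rng = np.random.default_rng(0)
tokens = rng.standard_normal((2, 6, 4))   # (frames, tokens, dims)
weights = np.array([0.4, 0.6])            # hypothetical temporal weights
self_loss = soft_relational_loss(tokens, tokens, weights)
```

Because only the similarity *structure* is matched, the diffusion features are free to differ from the teacher in absolute values, which is what makes the alignment "soft" rather than a direct feature regression.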

Retrieved papers compared: 9 (1 refutable match)
Contribution C: Motion-centric fine-tuning framework for video diffusion models

The authors present MoAlign, a two-stage fine-tuning framework that first learns motion-specific features supervised by optical flow, then aligns video diffusion model representations to this motion subspace. This approach improves physical plausibility and temporal coherence in text-to-video generation without external simulators or conditioning inputs.
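The two-stage recipe can be outlined as follows. This is a schematic sketch under stated assumptions: closed-form least squares stands in for Stage 1's gradient-based training of the flow-supervised projection, and a weighted sum with a hypothetical coefficient `lam` stands in for however Stage 2 combines the denoising and alignment objectives.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_motion_projection(feats, flow):
    """Stage 1 stand-in: fit a linear map from frozen encoder tokens
    (M, D) to flow targets (M, 2); its column space plays the role of
    the motion subspace. Least squares replaces gradient training."""
    W, *_ = np.linalg.lstsq(feats, flow, rcond=None)
    return W

def total_finetune_loss(denoise_loss, align_loss, lam=0.5):
    """Stage 2 stand-in: combine the diffusion denoising objective with
    the motion-alignment penalty; `lam` is a hypothetical weight."""
    return denoise_loss + lam * align_loss

feats = rng.standard_normal((128, 32))   # pooled frozen encoder tokens
flow = rng.standard_normal((128, 2))     # matching flow targets
W = fit_motion_projection(feats, flow)
stage2 = total_finetune_loss(denoise_loss=1.0, align_loss=0.2)
```

The key property the sketch preserves is that nothing from Stage 1 is needed at inference time: the flow supervision and alignment act only as training signals, so sampling from the fine-tuned model requires no conditioning inputs.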

Retrieved papers compared: 10

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution A: Motion-specific subspace learning from pretrained video encoder

The authors propose learning a low-dimensional motion-centric subspace by projecting features from a frozen pretrained video encoder (VideoMAEv2) and supervising them with optical flow prediction. This disentangles motion dynamics from static appearance, creating a motion-only representation space.

Contribution B: Soft relational alignment of diffusion features to motion subspace

The authors introduce a soft relational alignment mechanism that matches the pairwise similarity structure of diffusion model latent features to the learned motion subspace. This alignment uses token relation distillation with temporal weighting to internalize motion understanding directly into the generative model without requiring inference-time conditioning.

Contribution C: Motion-centric fine-tuning framework for video diffusion models

The authors present MoAlign, a two-stage fine-tuning framework that first learns motion-specific features supervised by optical flow, then aligns video diffusion model representations to this motion subspace. This approach improves physical plausibility and temporal coherence in text-to-video generation without external simulators or conditioning inputs.