MoAlign: Motion-Centric Representation Alignment for Video Diffusion Models

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: video diffusion models, physical plausibility
Abstract:

Text-to-video diffusion models have enabled high-quality video synthesis, yet often fail to generate temporally coherent and physically plausible motion. A key reason is the models' insufficient understanding of complex motions that natural videos often entail. Recent works tackle this problem by aligning diffusion model features with those from pretrained video encoders. However, these encoders mix video appearance and dynamics into entangled features, limiting the benefit of such alignment. In this paper, we propose a motion-centric alignment framework that learns a disentangled motion subspace from a pretrained video encoder. This subspace is optimized to predict ground-truth optical flow, ensuring it captures true motion dynamics. We then align the latent features of a text-to-video diffusion model to this new subspace, enabling the generative model to internalize motion knowledge and generate more plausible videos. Our method improves the physical commonsense in a state-of-the-art video diffusion model, while preserving adherence to textual prompts, as evidenced by empirical evaluations on VideoPhy, VideoPhy2, VBench, and VBench-2.0, along with a user study.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a motion-centric alignment framework that learns a disentangled motion subspace from a pretrained video encoder, optimized to predict optical flow, and aligns text-to-video diffusion features to this subspace. It resides in the 'Optical Flow-Based Motion Modeling' leaf, which contains four papers total including the original work. This leaf sits within the broader 'Motion Representation and Alignment Techniques' branch, indicating a moderately populated research direction focused on explicit motion extraction. The taxonomy shows this is an active but not overcrowded area, with parallel approaches exploring attention-based and latent motion representations in sibling leaves.

The taxonomy reveals neighboring research directions that share conceptual overlap but differ in technical approach. The 'Attention-Based Motion Extraction' leaf (three papers) extracts motion from cross-frame attention maps rather than optical flow, while 'Latent Motion Representations' (two papers) treats motion as a separate generative problem. The 'Decoupled Appearance-Motion Modeling' leaf (three papers) also addresses appearance-motion entanglement but without necessarily using optical flow as the disentanglement signal. These neighboring branches suggest the field is exploring multiple pathways to improve motion modeling, with optical flow being one of several viable strategies.

Among twenty-nine candidates examined, the analysis found limited prior work overlap. Contribution A (motion-specific subspace learning) examined ten candidates with zero refutations, suggesting relative novelty in this specific formulation. Contribution B (soft relational alignment) examined nine candidates and found one refutable match, indicating some precedent for alignment techniques but not necessarily in this exact motion-centric context. Contribution C (motion-centric fine-tuning framework) examined ten candidates with zero refutations. The search scope was constrained to top-K semantic matches plus citation expansion, not an exhaustive survey of all video diffusion literature.

Based on the limited search scope of twenty-nine candidates, the work appears to occupy a distinct position within optical flow-based motion modeling, particularly in its approach to disentangling motion subspaces and aligning diffusion features. The taxonomy context suggests this is a moderately explored direction with clear boundaries from attention-based and latent motion alternatives. However, the analysis does not cover the full breadth of video diffusion research, and additional related work may exist outside the examined candidate set.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 1

Research Landscape Overview

Core task: motion-centric representation alignment for video diffusion models.

The field of video diffusion models has evolved into a rich landscape organized around several complementary themes. Motion Representation and Alignment Techniques focus on how to encode and align motion signals, ranging from optical flow-based methods like MoAlign[0] and Spectral Motion Alignment[17] to trajectory and latent-space approaches exemplified by Align Your Latents[3]. Motion Customization and Transfer explores how to extract and reuse motion patterns from reference videos, as seen in MotionDirector[1] and MotionMatcher[22]. Temporal Modeling Architectures and Training Strategies address the backbone designs and optimization schemes that enable coherent video synthesis, while Conditional Video Generation and Control investigates how text, sketches, or other modalities guide the generation process. Domain-Specific Applications, Video Editing and Motion Manipulation, and Evaluation branches round out the taxonomy by addressing specialized use cases, post-hoc editing workflows, and metrics for assessing motion quality.

Within Motion Representation and Alignment Techniques, a particularly active line of work centers on optical flow-based modeling, where MoAlign[0] sits alongside Spectral Motion Alignment[17] and MoVideo[36]. These approaches share a common emphasis on leveraging dense pixel-level motion cues to improve temporal consistency, yet they differ in how they integrate flow into the diffusion process: some use flow as an explicit conditioning signal, while others align latent features or employ spectral decompositions. MoAlign[0] distinguishes itself by aligning motion representations directly within the diffusion framework, in contrast to Spectral Motion Alignment[17], which emphasizes frequency-domain analysis, and MoVideo[36], which explores motion-guided video synthesis through flow warping. This cluster of optical flow methods reflects ongoing questions about the best granularity and stage at which to inject motion information, balancing computational efficiency against the fidelity of generated dynamics.

Claimed Contributions

Contribution A: Motion-specific subspace learning from pretrained video encoder

The authors propose learning a low-dimensional motion-centric subspace by projecting features from a frozen pretrained video encoder (VideoMAEv2) and supervising them with optical flow prediction. This disentangles motion dynamics from static appearance, creating a motion-only representation space.
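The projection-plus-flow-supervision step described above can be sketched in a few lines. The snippet below is an illustrative stand-in, not the authors' implementation: the shapes, the plain linear projection, the per-token two-channel flow target, and the mean-squared-error objective are all assumptions for the sake of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: T frames, N tokens per frame,
# D encoder dims, d motion-subspace dims.
T, N, D, d = 4, 16, 64, 8

def project_to_motion_subspace(feats, W):
    """Linearly project frozen encoder features (T, N, D)
    into a low-dimensional motion subspace (T, N, d)."""
    return feats @ W

def flow_prediction_loss(motion_feats, flow_targets, W_head):
    """Supervise the subspace by regressing a 2-channel optical
    flow target per token; MSE stands in for the paper's loss."""
    pred = motion_feats @ W_head          # (T, N, 2)
    return float(np.mean((pred - flow_targets) ** 2))

feats = rng.standard_normal((T, N, D))    # frozen VideoMAEv2-style features
W = 0.1 * rng.standard_normal((D, d))     # learnable projection
W_head = 0.1 * rng.standard_normal((d, 2))  # learnable flow head
flow = rng.standard_normal((T, N, 2))     # ground-truth flow per token

motion = project_to_motion_subspace(feats, W)
loss = flow_prediction_loss(motion, flow, W_head)
```

Only `W` and `W_head` would be trained; the encoder stays frozen, so gradients from the flow loss shape the subspace without touching the backbone.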

Retrieved papers compared: 10
Contribution B: Soft relational alignment of diffusion features to motion subspace

The authors introduce a soft relational alignment mechanism that matches the pairwise similarity structure of diffusion model latent features to the learned motion subspace. This alignment uses token relation distillation with temporal weighting to internalize motion understanding directly into the generative model without requiring inference-time conditioning.
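A minimal sketch of what such a relational objective could look like: each frame's teacher and student pairwise similarity matrices are turned into row-wise distributions and matched, with a per-frame weight. The function names, the temperature `tau`, and the KL form of the distillation term are assumptions, not the paper's actual formulation.

```python
import numpy as np

def cosine_sim_matrix(x):
    """Pairwise cosine similarities between tokens; x is (N, D)."""
    x = x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)
    return x @ x.T

def soft_relational_loss(diff_tokens, motion_tokens, temporal_weights, tau=0.1):
    """Match row-wise softmax distributions over token relations between
    diffusion (student) and motion-subspace (teacher) features, with a
    hypothetical per-frame weight; inputs are (T, N, D) arrays."""
    def softmax(s):
        s = s / tau
        s = s - s.max(axis=-1, keepdims=True)
        e = np.exp(s)
        return e / e.sum(axis=-1, keepdims=True)

    total = 0.0
    for t, w in enumerate(temporal_weights):
        p = softmax(cosine_sim_matrix(motion_tokens[t]))  # teacher relations
        q = softmax(cosine_sim_matrix(diff_tokens[t]))    # student relations
        kl = np.sum(p * (np.log(p + 1e-8) - np.log(q + 1e-8)), axis=-1).mean()
        total += w * kl
    return total

rng = np.random.default_rng(0)
tokens = rng.standard_normal((2, 6, 4))   # (frames, tokens, dims)
weights = np.array([0.4, 0.6])            # hypothetical temporal weights
self_loss = soft_relational_loss(tokens, tokens, weights)
```

Because only the similarity *structure* is matched, the diffusion features are free to differ from the teacher in absolute values, which is what makes the alignment "soft" rather than a direct feature regression.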

Retrieved papers compared: 9 (1 refutable match)
Contribution C: Motion-centric fine-tuning framework for video diffusion models

The authors present MoAlign, a two-stage fine-tuning framework that first learns motion-specific features supervised by optical flow, then aligns video diffusion model representations to this motion subspace. This approach improves physical plausibility and temporal coherence in text-to-video generation without external simulators or conditioning inputs.
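The two-stage recipe can be outlined as follows. This is a schematic sketch under stated assumptions: closed-form least squares stands in for Stage 1's gradient-based training of the flow-supervised projection, and a weighted sum with a hypothetical coefficient `lam` stands in for however Stage 2 combines the denoising and alignment objectives.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_motion_projection(feats, flow):
    """Stage 1 stand-in: fit a linear map from frozen encoder tokens
    (M, D) to flow targets (M, 2); its column space plays the role of
    the motion subspace. Least squares replaces gradient training."""
    W, *_ = np.linalg.lstsq(feats, flow, rcond=None)
    return W

def total_finetune_loss(denoise_loss, align_loss, lam=0.5):
    """Stage 2 stand-in: combine the diffusion denoising objective with
    the motion-alignment penalty; `lam` is a hypothetical weight."""
    return denoise_loss + lam * align_loss

feats = rng.standard_normal((128, 32))   # pooled frozen encoder tokens
flow = rng.standard_normal((128, 2))     # matching flow targets
W = fit_motion_projection(feats, flow)
stage2 = total_finetune_loss(denoise_loss=1.0, align_loss=0.2)
```

The key property the sketch preserves is that nothing from Stage 1 is needed at inference time: the flow supervision and alignment act only as training signals, so sampling from the fine-tuned model requires no conditioning inputs.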

Retrieved papers compared: 10

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution A: Motion-specific subspace learning from pretrained video encoder

The authors propose learning a low-dimensional motion-centric subspace by projecting features from a frozen pretrained video encoder (VideoMAEv2) and supervising them with optical flow prediction. This disentangles motion dynamics from static appearance, creating a motion-only representation space.

Contribution B: Soft relational alignment of diffusion features to motion subspace

The authors introduce a soft relational alignment mechanism that matches the pairwise similarity structure of diffusion model latent features to the learned motion subspace. This alignment uses token relation distillation with temporal weighting to internalize motion understanding directly into the generative model without requiring inference-time conditioning.

Contribution C: Motion-centric fine-tuning framework for video diffusion models

The authors present MoAlign, a two-stage fine-tuning framework that first learns motion-specific features supervised by optical flow, then aligns video diffusion model representations to this motion subspace. This approach improves physical plausibility and temporal coherence in text-to-video generation without external simulators or conditioning inputs.