Abstract:

Video DiTs have advanced video generation, yet they still struggle to model multi-instance or subject-object interactions. This raises a key question: how do these models internally represent interactions? To answer it, we curate MATRIX-11K, a video dataset with interaction-aware captions and multi-instance mask tracks. Using this dataset, we conduct a systematic analysis that formalizes two perspectives on video DiTs: semantic grounding, via video-to-text attention, which evaluates whether noun and verb tokens capture instances and their relations; and semantic propagation, via video-to-video attention, which assesses whether instance bindings persist across frames. We find that both effects concentrate in a small subset of interaction-dominant layers. Motivated by this, we introduce MATRIX, a simple and effective regularization that aligns attention in specific layers of video DiTs with the multi-instance mask tracks of MATRIX-11K, enhancing both grounding and propagation. We further propose InterGenEval, an evaluation protocol for interaction-aware video generation. In experiments, MATRIX improves both interaction fidelity and semantic alignment while reducing drift and hallucination, and extensive ablations validate our design choices. Code and weights will be released.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces MATRIX, a regularization framework that aligns video diffusion transformer attention with multi-instance mask tracks to improve interaction modeling. It resides in the 'Attention-based Interaction Alignment' leaf, which contains only three papers total, indicating a relatively sparse research direction within the broader interaction modeling branch. The taxonomy shows that while multi-instance interaction modeling has several subcategories (attention-based, multi-subject customization, human-object synthesis), the attention-alignment approach remains less crowded compared to other video generation domains like human-centric synthesis or motion control.

The taxonomy reveals that neighboring leaves focus on multi-subject customization and human-object interaction synthesis, which share the goal of modeling entity relationships but differ in methodology. The broader 'Multi-Instance and Subject-Object Interaction Modeling' branch sits alongside motion control and multimodal generation branches, suggesting that interaction-aware generation represents one of several parallel research thrusts. The scope note clarifies that this category excludes single-subject generation and non-attention-based interaction methods, positioning MATRIX within a specific methodological niche that emphasizes attention mechanisms for propagating semantic bindings across frames.

Among the thirty candidates examined, none clearly refuted any of the three core contributions: the MATRIX-11K dataset with interaction-aware annotations, the attention regularization framework, and the InterGenEval protocol. Each contribution was assessed against ten candidates with zero refutable overlaps identified. This suggests that within the limited search scope, the combination of a specialized interaction dataset, layer-specific attention alignment, and a dedicated evaluation protocol appears distinct from prior work. The dataset contribution in particular addresses a gap in interaction-annotated video data, while the regularization approach targets specific 'interaction-dominant layers' rather than applying uniform constraints.

Based on the limited literature search covering thirty semantically related papers, the work appears to occupy a relatively novel position within attention-based interaction modeling. The analysis does not claim exhaustive coverage of all possible prior work, and the sparse population of the taxonomy leaf suggests this research direction remains underexplored. The absence of refutable candidates indicates that the specific combination of contributions has not been directly addressed in the examined literature, though the search scope remains constrained to top-K semantic matches and their citations.

Taxonomy

Core-task Taxonomy Papers: 47
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: interaction-aware video generation using diffusion transformers.

The field has evolved into several major branches that address distinct challenges in generating coherent, controllable video content. Multi-Instance and Subject-Object Interaction Modeling focuses on ensuring that multiple entities or objects interact plausibly within a scene, often leveraging attention mechanisms to align relationships. Motion and Trajectory Control emphasizes precise spatial-temporal guidance, enabling users to specify how objects move through space. Multimodal Audio-Visual Generation tackles the joint synthesis of sound and vision, ensuring temporal synchronization and semantic consistency. Long-Form and Multi-Scene Video Generation addresses scalability challenges for extended narratives, while Human-Centric Video Generation specializes in realistic human motion, faces, and gestures. Specialized Application Domains target niche use cases such as gaming or medical imaging, and Architectural Innovations and Training Strategies explore novel transformer designs and optimization techniques to improve efficiency and quality.

Within the interaction modeling branch, a handful of works concentrate on attention-based mechanisms that bind entities together during generation. MATRIX[0] exemplifies this direction by using cross-attention to align subject-object relationships, ensuring that interactions remain semantically coherent across frames. Nearby approaches such as Target-aware[9] and BindWeave[43] similarly emphasize structured attention or binding strategies to maintain consistency among multiple instances. These methods contrast with trajectory-focused techniques like Tora[8] and other motion-centric frameworks that prioritize explicit spatial control over relational reasoning. The main trade-off revolves around balancing fine-grained interaction fidelity with computational overhead, as attention-based alignment can be resource-intensive. MATRIX[0] sits squarely in this attention-driven cluster, sharing conceptual ground with BindWeave[43] in its emphasis on binding mechanisms, while differing from trajectory-oriented works that rely on explicit path specifications rather than learned relational attention.

Claimed Contributions

MATRIX-11K dataset with interaction-aware captions and multi-instance mask tracks

The authors introduce a new video dataset called MATRIX-11K that contains videos annotated with interaction-aware captions and multi-instance mask tracks, designed to support analysis and training of interaction-aware video generation models.

10 retrieved papers
MATRIX regularization framework for aligning attention with mask tracks

The authors propose MATRIX, a regularization method that aligns attention mechanisms in interaction-dominant layers of video diffusion transformers with multi-instance mask tracks, using Semantic Grounding Alignment and Semantic Propagation Alignment losses to enhance interaction modeling.

10 retrieved papers
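The Semantic Grounding and Semantic Propagation Alignment losses are not spelled out in this report, but the idea of pulling attention mass onto instance masks can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the function names, the negative-log-mass form of the losses, and the toy attention values are hypothetical, not the paper's actual formulation.

```python
import math

def grounding_alignment_loss(txt2vid_attn, mask, eps=1e-8):
    """Hypothetical grounding loss: encourage a noun/verb token's
    attention over a frame's video tokens to land inside that
    instance's mask. txt2vid_attn is non-negative and sums to 1;
    mask is binary over the same tokens."""
    inside = sum(a for a, m in zip(txt2vid_attn, mask) if m)
    return -math.log(inside + eps)  # zero when all mass is inside

def propagation_alignment_loss(vid2vid_attn, mask_prev, eps=1e-8):
    """Hypothetical propagation loss: a video token inside an instance
    at frame t should attend to tokens inside the same instance's
    mask at frame t-1, so the binding persists across frames."""
    inside = sum(a for a, m in zip(vid2vid_attn, mask_prev) if m)
    return -math.log(inside + eps)

# toy example: 4 video tokens per frame, instance occupies tokens 0-1
attn_txt = [0.5, 0.3, 0.1, 0.1]      # noun token -> frame-t tokens
attn_vid = [0.45, 0.45, 0.05, 0.05]  # frame-t token -> frame t-1 tokens
mask = [1, 1, 0, 0]

g = grounding_alignment_loss(attn_txt, mask)    # -log(0.8) ~ 0.223
p = propagation_alignment_loss(attn_vid, mask)  # -log(0.9) ~ 0.105
total = g + p
```

Per the report, such a combined term would be applied only in the interaction-dominant layers rather than uniformly across the network.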
InterGenEval evaluation protocol for interaction-aware video generation

The authors develop InterGenEval, a new evaluation protocol specifically designed to assess the quality of interaction-aware video generation, addressing the need for systematic evaluation of multi-instance and subject-object interactions in generated videos.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: MATRIX-11K dataset with interaction-aware captions and multi-instance mask tracks

Contribution 2: MATRIX regularization framework for aligning attention with mask tracks

Contribution 3: InterGenEval evaluation protocol for interaction-aware video generation

Each contribution is described in the Claimed Contributions section above; none of the examined candidates yielded a refutable overlap with any of them.