Abstract:

Video DiTs have advanced video generation, yet they still struggle to model multi-instance or subject-object interactions. This raises a key question: how do these models internally represent interactions? To answer it, we curate MATRIX-11K, a video dataset with interaction-aware captions and multi-instance mask tracks. Using this dataset, we conduct a systematic analysis that formalizes two perspectives on video DiTs: semantic grounding, via video-to-text attention, which evaluates whether noun and verb tokens capture instances and their relations; and semantic propagation, via video-to-video attention, which assesses whether instance bindings persist across frames. We find that both effects concentrate in a small subset of interaction-dominant layers. Motivated by this, we introduce MATRIX, a simple and effective regularization that aligns attention in specific layers of video DiTs with the multi-instance mask tracks of MATRIX-11K, enhancing both grounding and propagation. We further propose InterGenEval, an evaluation protocol for interaction-aware video generation. In experiments, MATRIX improves both interaction fidelity and semantic alignment while reducing drift and hallucination, and extensive ablations validate our design choices. Code and weights will be released.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces MATRIX, a regularization framework that aligns video diffusion transformer attention with multi-instance mask tracks to improve interaction modeling. It resides in the 'Attention-based Interaction Alignment' leaf, which contains only three papers total, indicating a relatively sparse research direction within the broader interaction modeling branch. The taxonomy shows that while multi-instance interaction modeling has several subcategories (attention-based, multi-subject customization, human-object synthesis), the attention-alignment approach remains less crowded compared to other video generation domains like human-centric synthesis or motion control.

The taxonomy reveals that neighboring leaves focus on multi-subject customization and human-object interaction synthesis, which share the goal of modeling entity relationships but differ in methodology. The broader 'Multi-Instance and Subject-Object Interaction Modeling' branch sits alongside motion control and multimodal generation branches, suggesting that interaction-aware generation represents one of several parallel research thrusts. The scope note clarifies that this category excludes single-subject generation and non-attention-based interaction methods, positioning MATRIX within a specific methodological niche that emphasizes attention mechanisms for propagating semantic bindings across frames.

Among the thirty candidates examined, none clearly refuted any of the three core contributions: the MATRIX-11K dataset with interaction-aware annotations, the attention regularization framework, and the InterGenEval protocol. Each contribution was assessed against ten candidates with zero refutable overlaps identified. This suggests that within the limited search scope, the combination of a specialized interaction dataset, layer-specific attention alignment, and a dedicated evaluation protocol appears distinct from prior work. The dataset contribution in particular addresses a gap in interaction-annotated video data, while the regularization approach targets specific 'interaction-dominant layers' rather than applying uniform constraints.

Based on the limited literature search covering thirty semantically related papers, the work appears to occupy a relatively novel position within attention-based interaction modeling. The analysis does not claim exhaustive coverage of all possible prior work, and the sparse population of the taxonomy leaf suggests this research direction remains underexplored. The absence of refutable candidates indicates that the specific combination of contributions has not been directly addressed in the examined literature, though the search scope remains constrained to top-K semantic matches and their citations.

Taxonomy

Core-task Taxonomy Papers: 47
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: interaction-aware video generation using diffusion transformers.

The field has evolved into several major branches that address distinct challenges in generating coherent, controllable video content. Multi-Instance and Subject-Object Interaction Modeling focuses on ensuring that multiple entities or objects interact plausibly within a scene, often leveraging attention mechanisms to align relationships. Motion and Trajectory Control emphasizes precise spatial-temporal guidance, enabling users to specify how objects move through space. Multimodal Audio-Visual Generation tackles the joint synthesis of sound and vision, ensuring temporal synchronization and semantic consistency. Long-Form and Multi-Scene Video Generation addresses scalability challenges for extended narratives, while Human-Centric Video Generation specializes in realistic human motion, faces, and gestures. Specialized Application Domains target niche use cases such as gaming or medical imaging, and Architectural Innovations and Training Strategies explore novel transformer designs and optimization techniques to improve efficiency and quality.

Within the interaction modeling branch, a handful of works concentrate on attention-based mechanisms that bind entities together during generation. MATRIX[0] exemplifies this direction by using cross-attention to align subject-object relationships, ensuring that interactions remain semantically coherent across frames. Nearby approaches such as Target-aware[9] and BindWeave[43] similarly emphasize structured attention or binding strategies to maintain consistency among multiple instances. These methods contrast with trajectory-focused techniques like Tora[8] and other motion-centric frameworks that prioritize explicit spatial control over relational reasoning. The main trade-off revolves around balancing fine-grained interaction fidelity with computational overhead, as attention-based alignment can be resource-intensive. MATRIX[0] sits squarely in this attention-driven cluster, sharing conceptual ground with BindWeave[43] in its emphasis on binding mechanisms, while differing from trajectory-oriented works that rely on explicit path specifications rather than learned relational attention.

Claimed Contributions

MATRIX-11K dataset with interaction-aware captions and multi-instance mask tracks

The authors introduce a new video dataset called MATRIX-11K that contains videos annotated with interaction-aware captions and multi-instance mask tracks, designed to support analysis and training of interaction-aware video generation models.

10 retrieved papers
MATRIX regularization framework for aligning attention with mask tracks

The authors propose MATRIX, a regularization method that aligns attention mechanisms in interaction-dominant layers of video diffusion transformers with multi-instance mask tracks, using Semantic Grounding Alignment and Semantic Propagation Alignment losses to enhance interaction modeling.

10 retrieved papers
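The Semantic Grounding and Semantic Propagation Alignment losses are not spelled out in this report, but the idea of pulling attention mass onto instance masks can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the function names, the negative-log-mass form of the losses, and the toy attention values are hypothetical, not the paper's actual formulation.

```python
import math

def grounding_alignment_loss(txt2vid_attn, mask, eps=1e-8):
    """Hypothetical grounding loss: encourage a noun/verb token's
    attention over a frame's video tokens to land inside that
    instance's mask. txt2vid_attn is non-negative and sums to 1;
    mask is binary over the same tokens."""
    inside = sum(a for a, m in zip(txt2vid_attn, mask) if m)
    return -math.log(inside + eps)  # zero when all mass is inside

def propagation_alignment_loss(vid2vid_attn, mask_prev, eps=1e-8):
    """Hypothetical propagation loss: a video token inside an instance
    at frame t should attend to tokens inside the same instance's
    mask at frame t-1, so the binding persists across frames."""
    inside = sum(a for a, m in zip(vid2vid_attn, mask_prev) if m)
    return -math.log(inside + eps)

# toy example: 4 video tokens per frame, instance occupies tokens 0-1
attn_txt = [0.5, 0.3, 0.1, 0.1]      # noun token -> frame-t tokens
attn_vid = [0.45, 0.45, 0.05, 0.05]  # frame-t token -> frame t-1 tokens
mask = [1, 1, 0, 0]

g = grounding_alignment_loss(attn_txt, mask)    # -log(0.8) ~ 0.223
p = propagation_alignment_loss(attn_vid, mask)  # -log(0.9) ~ 0.105
total = g + p
```

Per the report, such a combined term would be applied only in the interaction-dominant layers rather than uniformly across the network.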
InterGenEval evaluation protocol for interaction-aware video generation

The authors develop InterGenEval, a new evaluation protocol specifically designed to assess the quality of interaction-aware video generation, addressing the need for systematic evaluation of multi-instance and subject-object interactions in generated videos.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: MATRIX-11K dataset with interaction-aware captions and multi-instance mask tracks

Contribution 2: MATRIX regularization framework for aligning attention with mask tracks

Contribution 3: InterGenEval evaluation protocol for interaction-aware video generation

Each contribution is described in the Claimed Contributions section above; none of the examined candidates yielded a refutable overlap with any of them.