Target-Aware Video Diffusion Models

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: controllable video diffusion models, human-scene interaction, robotics planning
Abstract:

We present a target-aware video diffusion model that generates videos from an input image, in which an actor interacts with a specified target while performing a desired action. The target is defined by a segmentation mask, and the action is described through a text prompt. Our key motivation is to incorporate target awareness into video generation, enabling actors to perform directed actions on designated objects. This enables video diffusion models to act as motion planners, producing plausible predictions of human–object interactions by leveraging the priors of large-scale video generative models. We build our target-aware model by extending a baseline model to incorporate the target mask as an additional input. To enforce target awareness, we introduce a special token that encodes the target's spatial information within the text prompt. We then fine-tune the model with our curated dataset using an additional cross-attention loss that aligns the cross-attention maps associated with this token with the input target mask. To further improve performance, we selectively apply this loss to the most semantically relevant attention regions and transformer blocks. Experimental results show that our target-aware model outperforms existing solutions in generating videos where actors interact accurately with the specified targets. We further demonstrate its efficacy in two downstream applications: zero-shot 3D HOI motion synthesis with physical plausibility and long-term video content creation.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work, and the current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a target-aware video diffusion model that generates videos where actors interact with specific objects defined by segmentation masks and text prompts. Within the taxonomy, it resides in the 'Spatial Control via Masks and Trajectories' leaf, which contains five papers total including this work. This leaf sits under 'Controllable Video Generation with Interaction Modeling', indicating a moderately populated research direction focused on explicit spatial guidance mechanisms. The sibling papers (Boximator, Mask2IV, VHOI, MATRIX) suggest an active but not overcrowded subfield exploring mask-based and trajectory-based control for interaction synthesis.

The taxonomy reveals neighboring research directions that contextualize this work's positioning. Adjacent leaves include 'Pose-Guided Human-Object Interaction Synthesis' (3 papers) emphasizing skeletal guidance, 'Character Animation with Scene Interaction' (2 papers) focusing on character-specific modeling, and 'Multi-Concept and Multi-Identity Interaction Generation' (2 papers) handling multiple subjects. The parent branch 'Controllable Video Generation with Interaction Modeling' excludes methods focused solely on 3D reconstruction or text-only control, clarifying that this work's mask-based spatial guidance distinguishes it from purely language-driven approaches in the sibling 'Text-Driven Human-Object Interaction Generation' branch.

Among the 25 candidates examined across the three contributions, no clearly refutable prior work was identified: 10 candidates were examined for the target-aware diffusion model with mask-based specification, 5 for the cross-attention loss with its special token, and 10 for the curated dataset, with zero refutations in each case. This suggests that, within the limited search scope, the specific combination of mask-based target specification, cross-attention alignment via special tokens, and selective loss application appears relatively unexplored. The sibling papers in the same taxonomy leaf employ related spatial control mechanisms but may differ in their attention-based grounding strategies or dataset curation approaches.

Based on the top-25 semantic matches examined, the work appears to occupy a distinct position within spatial control methods for interaction generation. The taxonomy structure indicates this is an active but not saturated research area, with the paper's specific technical approach—combining mask inputs, special token encoding, and selective cross-attention loss—not directly overlapped by the examined candidates. However, the limited search scope means potentially relevant work outside the top-25 matches or in adjacent subfields may exist but was not captured in this analysis.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 25
Refutable papers: 0

Research Landscape Overview

Core task: target-aware video generation with human-object interactions. This field addresses the challenge of synthesizing realistic videos in which humans interact with specific objects in controllable, semantically meaningful ways.

The taxonomy reveals several complementary research directions. Controllable Video Generation with Interaction Modeling focuses on spatial and temporal control mechanisms, such as masks, trajectories, and bounding boxes, that guide where and how interactions unfold (e.g., Boximator[24], Mask2IV[16]). Text-Driven Human-Object Interaction Generation emphasizes language-based conditioning to specify interaction semantics, while 3D Human-Object Interaction Reconstruction and Tracking tackles the geometric and pose-estimation aspects needed for physically plausible contact. Motion Synthesis and Diffusion-Based HOI Generation explores generative models that produce diverse, natural human motions conditioned on object affordances. Meanwhile, Datasets, Benchmarks, and Evaluation provide the empirical infrastructure, Specialized Applications target downstream tasks such as robotics or virtual anchors, and Foundational Techniques supply core representations and architectures.

Recent work has intensified around fine-grained spatial control and physically grounded interaction modeling. A dense cluster of methods leverages mask-based or trajectory-based guidance to steer diffusion models, balancing flexibility with precise object targeting (Target-Aware Video Diffusion[0], VHOI[27], MATRIX[50]). These approaches often grapple with trade-offs between open-ended creativity and strict adherence to user-specified constraints. Target-Aware Video Diffusion[0] sits squarely within the Spatial Control via Masks and Trajectories branch, emphasizing how explicit spatial cues can anchor object interactions during generation.
Compared to neighbors like Boximator[24], which uses bounding-box annotations for layout control, or Mask2IV[16], which conditions on segmentation masks, the original paper appears to integrate target-specific priors more tightly into the diffusion process. This focus on target awareness distinguishes it from more general controllable generation schemes, positioning it as a specialized solution for scenarios demanding precise human-object coordination.

Claimed Contributions

Target-aware video diffusion model with mask-based target specification

The authors introduce a video diffusion model that accepts a segmentation mask to specify a target object and generates videos showing an actor performing text-prompted actions directed at that target. This enables explicit control over actor-target interactions without requiring dense motion annotations.

Retrieved papers: 10

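The paper's architecture is not reproduced in this report, but mask conditioning of this kind is commonly implemented by downsampling the binary target mask to the latent resolution and concatenating it to the noisy video latents as an extra channel. The following NumPy sketch illustrates that pattern; all shapes and function names are hypothetical, not the authors' actual implementation.

```python
import numpy as np

def downsample_mask(mask, h, w):
    """Nearest-neighbour downsample of a binary mask to the latent resolution."""
    ys = np.arange(h) * mask.shape[0] // h
    xs = np.arange(w) * mask.shape[1] // w
    return mask[np.ix_(ys, xs)]

def condition_on_mask(latents, mask):
    """Append the target mask as an extra channel to every frame's latent.

    latents: (T, C, H, W) noisy video latents
    mask:    (H0, W0) binary target mask at image resolution
    returns: (T, C + 1, H, W) mask-conditioned latents
    """
    t, c, h, w = latents.shape
    m = downsample_mask(mask, h, w).astype(latents.dtype)
    m = np.broadcast_to(m, (t, 1, h, w))  # same target mask for all frames
    return np.concatenate([latents, m], axis=1)

# Toy example: 8 frames of 4-channel 16x16 latents, 128x128 target mask.
lat = np.random.randn(8, 4, 16, 16)
msk = np.zeros((128, 128))
msk[32:96, 32:96] = 1.0
out = condition_on_mask(lat, msk)
print(out.shape)  # (8, 5, 16, 16)
```

In practice the conditioning branch of a pretrained backbone would be extended with zero-initialized weights for the new channel, so fine-tuning starts from the baseline model's behavior.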
Cross-attention loss with special [TGT] token for spatial grounding

The authors propose a training method that introduces a special [TGT] token in text prompts and applies a cross-attention loss to align the token's attention maps with the input target mask. This loss is selectively applied to semantically relevant attention regions and transformer blocks to enforce target awareness.

Retrieved papers: 5

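As a rough illustration of the attention-alignment idea, the sketch below computes a cross-attention map between image-patch queries and text-token keys, extracts the column belonging to the special [TGT] token, and penalizes its deviation from the (flattened) target mask. The MSE formulation and normalization are assumptions for illustration; the paper's exact loss and its selective application to specific attention regions and transformer blocks are not reproduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def tgt_attention_loss(q, k, tgt_index, mask):
    """MSE between the [TGT] token's cross-attention map and the target mask.

    q: (N, d) image-patch queries (N = H*W latent patches)
    k: (L, d) text-token keys, one row per prompt token
    tgt_index: position of the special [TGT] token in the prompt
    mask: (N,) flattened binary target mask at latent resolution
    """
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]), axis=-1)  # (N, L)
    a = attn[:, tgt_index]              # attention each patch pays to [TGT]
    a = a / (a.max() + 1e-8)            # normalise into [0, 1]
    return float(np.mean((a - mask) ** 2))

# Toy example: 8x8 latent grid (64 patches), 5 prompt tokens, [TGT] at index 2.
rng = np.random.default_rng(0)
loss = tgt_attention_loss(rng.normal(size=(64, 8)), rng.normal(size=(5, 8)),
                          tgt_index=2, mask=(np.arange(64) < 16).astype(float))
```

During fine-tuning, a term like this would be added to the standard diffusion denoising loss so that gradients pull the [TGT] token's attention toward the masked region.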
Curated dataset for target-aware video generation

The authors construct a dataset of 1290 video clips from BEHAVE and Ego-Exo4D, where each clip shows an actor initially not interacting with a target and then engaging with it. Each video is annotated with target masks and text prompts describing the action.

Retrieved papers: 10
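A per-clip annotation record along these lines could be represented as below; the field names are hypothetical, since the paper's actual annotation schema is not given in this report. Only the annotation contents (source dataset, target mask, action prompt) come from the contribution description.

```python
from dataclasses import dataclass

@dataclass
class ClipAnnotation:
    """One curated training sample (field names are illustrative)."""
    video_path: str        # clip drawn from BEHAVE or Ego-Exo4D
    target_mask_path: str  # segmentation mask of the target object
    prompt: str            # action description containing the [TGT] token
    source: str            # "BEHAVE" or "Ego-Exo4D"

sample = ClipAnnotation(
    video_path="clips/0001.mp4",
    target_mask_path="masks/0001.png",
    prompt="the person walks over and picks up the [TGT] box",
    source="BEHAVE",
)
```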

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Target-aware video diffusion model with mask-based target specification

Contribution 2: Cross-attention loss with special [TGT] token for spatial grounding

Contribution 3: Curated dataset for target-aware video generation