Target-Aware Video Diffusion Models

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: controllable video diffusion models, human-scene interaction, robotics planning
Abstract:

We present a target-aware video diffusion model that generates videos from an input image, in which an actor interacts with a specified target while performing a desired action. The target is defined by a segmentation mask, and the action is described through a text prompt. Our key motivation is to incorporate target awareness into video generation, enabling actors to perform directed actions on designated objects. This enables video diffusion models to act as motion planners, producing plausible predictions of human–object interactions by leveraging the priors of large-scale video generative models. We build our target-aware model by extending a baseline model to incorporate the target mask as an additional input. To enforce target awareness, we introduce a special token that encodes the target's spatial information within the text prompt. We then fine-tune the model with our curated dataset using an additional cross-attention loss that aligns the cross-attention maps associated with this token with the input target mask. To further improve performance, we selectively apply this loss to the most semantically relevant attention regions and transformer blocks. Experimental results show that our target-aware model outperforms existing solutions in generating videos where actors interact accurately with the specified targets. We further demonstrate its efficacy in two downstream applications: zero-shot 3D HOI motion synthesis with physical plausibility and long-term video content creation.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work, and the current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a target-aware video diffusion model that generates videos where actors interact with specific objects defined by segmentation masks and text prompts. Within the taxonomy, it resides in the 'Spatial Control via Masks and Trajectories' leaf, which contains five papers total including this work. This leaf sits under 'Controllable Video Generation with Interaction Modeling', indicating a moderately populated research direction focused on explicit spatial guidance mechanisms. The sibling papers (Boximator, Mask2IV, VHOI, MATRIX) suggest an active but not overcrowded subfield exploring mask-based and trajectory-based control for interaction synthesis.

The taxonomy reveals neighboring research directions that contextualize this work's positioning. Adjacent leaves include 'Pose-Guided Human-Object Interaction Synthesis' (3 papers) emphasizing skeletal guidance, 'Character Animation with Scene Interaction' (2 papers) focusing on character-specific modeling, and 'Multi-Concept and Multi-Identity Interaction Generation' (2 papers) handling multiple subjects. The parent branch 'Controllable Video Generation with Interaction Modeling' excludes methods focused solely on 3D reconstruction or text-only control, clarifying that this work's mask-based spatial guidance distinguishes it from purely language-driven approaches in the sibling 'Text-Driven Human-Object Interaction Generation' branch.

Among the 25 candidates examined across the three contributions, no clearly refutable prior work was identified: 10 candidates were examined for the target-aware diffusion model with mask-based specification, 5 for the cross-attention loss with its special token, and 10 for the curated dataset, with zero refutations in each case. This suggests that, within the limited search scope, the specific combination of mask-based target specification, cross-attention alignment via special tokens, and selective loss application appears relatively unexplored. The sibling papers in the same taxonomy leaf employ related spatial control mechanisms but may differ in their attention-based grounding strategies or dataset curation approaches.

Based on the top-25 semantic matches examined, the work appears to occupy a distinct position within spatial control methods for interaction generation. The taxonomy structure indicates this is an active but not saturated research area, with the paper's specific technical approach—combining mask inputs, special token encoding, and selective cross-attention loss—not directly overlapped by the examined candidates. However, the limited search scope means potentially relevant work outside the top-25 matches or in adjacent subfields may exist but was not captured in this analysis.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 25
Refutable papers: 0

Research Landscape Overview

Core task: target-aware video generation with human-object interactions. This field addresses the challenge of synthesizing realistic videos in which humans interact with specific objects in controllable, semantically meaningful ways.

The taxonomy reveals several complementary research directions. Controllable Video Generation with Interaction Modeling focuses on spatial and temporal control mechanisms, such as masks, trajectories, and bounding boxes, that guide where and how interactions unfold (e.g., Boximator[24], Mask2IV[16]). Text-Driven Human-Object Interaction Generation emphasizes language-based conditioning to specify interaction semantics, while 3D Human-Object Interaction Reconstruction and Tracking tackles the geometric and pose-estimation aspects needed for physically plausible contact. Motion Synthesis and Diffusion-Based HOI Generation explores generative models that produce diverse, natural human motions conditioned on object affordances. Meanwhile, Datasets, Benchmarks, and Evaluation provide the empirical infrastructure, Specialized Applications target downstream tasks such as robotics or virtual anchors, and Foundational Techniques supply core representations and architectures.

Recent work has intensified around fine-grained spatial control and physically grounded interaction modeling. A dense cluster of methods leverages mask-based or trajectory-based guidance to steer diffusion models, balancing flexibility with precise object targeting (Target-Aware Video Diffusion[0], VHOI[27], MATRIX[50]). These approaches often grapple with trade-offs between open-ended creativity and strict adherence to user-specified constraints. Target-Aware Video Diffusion[0] sits squarely within the Spatial Control via Masks and Trajectories branch, emphasizing how explicit spatial cues can anchor object interactions during generation.
Compared to neighbors like Boximator[24], which uses bounding-box annotations for layout control, or Mask2IV[16], which conditions on segmentation masks, the original paper appears to integrate target-specific priors more tightly into the diffusion process. This focus on target awareness distinguishes it from more general controllable generation schemes, positioning it as a specialized solution for scenarios demanding precise human-object coordination.

Claimed Contributions

Target-aware video diffusion model with mask-based target specification

The authors introduce a video diffusion model that accepts a segmentation mask to specify a target object and generates videos showing an actor performing text-prompted actions directed at that target. This enables explicit control over actor-target interactions without requiring dense motion annotations.

Retrieved papers: 10

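The paper's architecture is not reproduced in this report, but mask conditioning of this kind is commonly implemented by downsampling the binary target mask to the latent resolution and concatenating it to the noisy video latents as an extra channel. The following NumPy sketch illustrates that pattern; all shapes and function names are hypothetical, not the authors' actual implementation.

```python
import numpy as np

def downsample_mask(mask, h, w):
    """Nearest-neighbour downsample of a binary mask to the latent resolution."""
    ys = np.arange(h) * mask.shape[0] // h
    xs = np.arange(w) * mask.shape[1] // w
    return mask[np.ix_(ys, xs)]

def condition_on_mask(latents, mask):
    """Append the target mask as an extra channel to every frame's latent.

    latents: (T, C, H, W) noisy video latents
    mask:    (H0, W0) binary target mask at image resolution
    returns: (T, C + 1, H, W) mask-conditioned latents
    """
    t, c, h, w = latents.shape
    m = downsample_mask(mask, h, w).astype(latents.dtype)
    m = np.broadcast_to(m, (t, 1, h, w))  # same target mask for all frames
    return np.concatenate([latents, m], axis=1)

# Toy example: 8 frames of 4-channel 16x16 latents, 128x128 target mask.
lat = np.random.randn(8, 4, 16, 16)
msk = np.zeros((128, 128))
msk[32:96, 32:96] = 1.0
out = condition_on_mask(lat, msk)
print(out.shape)  # (8, 5, 16, 16)
```

In practice the conditioning branch of a pretrained backbone would be extended with zero-initialized weights for the new channel, so fine-tuning starts from the baseline model's behavior.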
Cross-attention loss with special [TGT] token for spatial grounding

The authors propose a training method that introduces a special [TGT] token in text prompts and applies a cross-attention loss to align the token's attention maps with the input target mask. This loss is selectively applied to semantically relevant attention regions and transformer blocks to enforce target awareness.

Retrieved papers: 5

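As a rough illustration of the attention-alignment idea, the sketch below computes a cross-attention map between image-patch queries and text-token keys, extracts the column belonging to the special [TGT] token, and penalizes its deviation from the (flattened) target mask. The MSE formulation and normalization are assumptions for illustration; the paper's exact loss and its selective application to specific attention regions and transformer blocks are not reproduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def tgt_attention_loss(q, k, tgt_index, mask):
    """MSE between the [TGT] token's cross-attention map and the target mask.

    q: (N, d) image-patch queries (N = H*W latent patches)
    k: (L, d) text-token keys, one row per prompt token
    tgt_index: position of the special [TGT] token in the prompt
    mask: (N,) flattened binary target mask at latent resolution
    """
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]), axis=-1)  # (N, L)
    a = attn[:, tgt_index]              # attention each patch pays to [TGT]
    a = a / (a.max() + 1e-8)            # normalise into [0, 1]
    return float(np.mean((a - mask) ** 2))

# Toy example: 8x8 latent grid (64 patches), 5 prompt tokens, [TGT] at index 2.
rng = np.random.default_rng(0)
loss = tgt_attention_loss(rng.normal(size=(64, 8)), rng.normal(size=(5, 8)),
                          tgt_index=2, mask=(np.arange(64) < 16).astype(float))
```

During fine-tuning, a term like this would be added to the standard diffusion denoising loss so that gradients pull the [TGT] token's attention toward the masked region.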
Curated dataset for target-aware video generation

The authors construct a dataset of 1290 video clips from BEHAVE and Ego-Exo4D, where each clip shows an actor initially not interacting with a target and then engaging with it. Each video is annotated with target masks and text prompts describing the action.

Retrieved papers: 10
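A per-clip annotation record along these lines could be represented as below; the field names are hypothetical, since the paper's actual annotation schema is not given in this report. Only the annotation contents (source dataset, target mask, action prompt) come from the contribution description.

```python
from dataclasses import dataclass

@dataclass
class ClipAnnotation:
    """One curated training sample (field names are illustrative)."""
    video_path: str        # clip drawn from BEHAVE or Ego-Exo4D
    target_mask_path: str  # segmentation mask of the target object
    prompt: str            # action description containing the [TGT] token
    source: str            # "BEHAVE" or "Ego-Exo4D"

sample = ClipAnnotation(
    video_path="clips/0001.mp4",
    target_mask_path="masks/0001.png",
    prompt="the person walks over and picks up the [TGT] box",
    source="BEHAVE",
)
```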

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Target-aware video diffusion model with mask-based target specification

Contribution 2: Cross-attention loss with special [TGT] token for spatial grounding

Contribution 3: Curated dataset for target-aware video generation