Video-As-Prompt: Unified Semantic Control for Video Generation

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Video Generation · Controllable Video Generation · Video Dataset
Abstract:

Unified, generalizable semantic control in video generation remains a critical open challenge. Existing methods either introduce artifacts by enforcing inappropriate pixel-wise priors from structure-based controls, or rely on non-generalizable, condition-specific finetuning or task-specific architectures. We introduce Video-As-Prompt (VAP), a new paradigm that reframes this problem as in-context generation. VAP leverages a reference video as a direct semantic prompt, guiding a frozen Video Diffusion Transformer (DiT) via a plug-and-play Mixture-of-Transformers (MoT) expert. This architecture prevents catastrophic forgetting and is guided by a temporally biased position embedding that eliminates spurious mapping priors for robust context retrieval. To power this approach and catalyze future research, we built VAP-Data, the largest dataset for this task with over 100K paired videos across 100 semantic conditions. As a single unified model, VAP sets a new state-of-the-art for open-source methods, achieving a 38.7% user preference rate that rivals leading condition-specific commercial models. VAP's strong zero-shot generalization and support for various applications mark a significant advance toward general-purpose, controllable video generation.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Video-As-Prompt (VAP), a unified semantic control paradigm using reference videos as in-context prompts to guide video generation through a plug-and-play Mixture-of-Transformers architecture. Within the taxonomy, VAP occupies the 'Reference-Based and In-Context Generation' leaf, which currently contains only this paper. This positioning reflects a relatively sparse research direction compared to densely populated branches like Motion and Trajectory Control or Multimodal and Multi-Condition Control, suggesting the reference-based paradigm represents an emerging rather than crowded area of investigation.

The taxonomy reveals VAP's relationship to neighboring approaches: it diverges from explicit multi-condition frameworks (Unified Multi-Condition Frameworks) that combine text, audio, and layout signals, and from training-free methods that manipulate attention without architectural changes. The reference-based paradigm sits between caption-based semantic generation, which relies on textual descriptions, and motion customization techniques that adapt motion representations. VAP's in-context learning approach offers an alternative to heavily parameterized multi-condition systems, emphasizing demonstration-driven guidance over explicit control signal specification, thereby occupying a distinct methodological niche within the broader semantic control landscape.

Among the eighteen candidates examined, the core VAP paradigm (Contribution A) has one potentially refutable prior work among its eight reviewed candidates, the plug-and-play architecture (Contribution B) has no retrieved candidates, and the VAP-Data dataset (Contribution C) shows no refutations across its ten candidates. The limited search scope (eighteen papers rather than hundreds) means these statistics reflect top-K semantic matches and citation expansion, not exhaustive coverage. The dataset contribution appears more novel within this sample, while the unified paradigm faces at least one overlapping prior work, though the architectural instantiation remains less explored in the examined literature.

Given the constrained search scope and sparse taxonomy leaf, VAP appears to explore a less-traveled methodological path within semantic video control. The analysis captures top semantic neighbors but cannot rule out relevant work outside this sample. The reference-based framing and in-context generation angle differentiate VAP from dominant multi-condition and motion-centric approaches, though the single refutable candidate for the core paradigm warrants careful examination of how VAP's formulation advances beyond that prior work.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 18
Refutable Papers: 1

Research Landscape Overview

Core task: unified semantic control for video generation. The field has evolved into a rich taxonomy spanning diverse control modalities and architectural strategies. Major branches include Motion and Trajectory Control, which governs camera and object movement (e.g., MotionCtrl[1], FlowDirector[12]); Multimodal and Multi-Condition Control, integrating text, audio, and other signals (Uniadapter[2], UniAVGen[38]); Layout and Spatial Control for scene composition (MagicDrive[30]); and Pose and Gesture Control for human-centric generation (PoseCrafter[18], Language Gesture Control[6]). Training-Free and Plug-and-Play Control methods offer flexibility without retraining (Training Free Guidance[27]), while Cascaded and Hierarchical Architectures address temporal coherence and multi-scale synthesis. Domain-Specific branches target applications like autonomous driving or sign language, and Survey and Benchmark Studies (Controllable Video Survey[35], Interactive Generative Survey[36]) provide overarching perspectives on progress and open challenges.

Recent work highlights tensions between flexibility and precision: some methods pursue unified frameworks handling multiple conditions simultaneously (Unictrl[33], Univideo[21]), while others specialize in fine-grained control of specific attributes (DiTCtrl[10], COTA Motion[14]).

Reference-Based and In-Context Generation has emerged as a distinct paradigm, where example videos guide synthesis rather than explicit control signals. Video As Prompt[0] exemplifies this direction, leveraging video exemplars to steer generation in a more intuitive, demonstration-driven manner. This approach contrasts with heavily parameterized multi-condition systems like Uni3c[4] or sound-guided methods (Sound Guided Semantic[5]), offering a complementary path that emphasizes learning from visual context. The interplay between explicit control mechanisms and implicit reference-based guidance remains an active area, with ongoing exploration of how to balance user intent specification and generative flexibility.

Claimed Contributions

Video-As-Prompt (VAP) unified semantic-controlled video generation paradigm

VAP introduces a new paradigm that reframes semantic-controlled video generation as in-context generation by using reference videos with desired semantics as direct prompts. This approach unifies diverse semantic controls (concept, style, motion, camera) in a single model without requiring per-condition finetuning or task-specific architectures.

8 retrieved papers · Can Refute
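To make the in-context reframing concrete, below is a minimal sketch (not the authors' implementation) of the data flow this contribution describes: the reference clip's latent tokens are prepended to the noisy target tokens, so a single backbone is steered by whatever semantic the reference demonstrates. The function name, tensor shapes, and the stand-in dit callable are all illustrative assumptions.

import torch

def vap_denoise_step(dit, ref_tokens, noisy_target, timestep, text_emb):
    """One denoising step in which the reference acts purely as a prompt (illustrative)."""
    tokens = torch.cat([ref_tokens, noisy_target], dim=1)   # in-context sequence: [reference | target]
    pred = dit(tokens, timestep, text_emb)                   # joint attention over both parts
    return pred[:, ref_tokens.shape[1]:]                     # only the target portion is denoised

# Dummy shapes and a stand-in DiT, just to show the data flow end to end.
B, n_ref, n_tgt, d = 1, 256, 1024, 64
dit = lambda x, t, c: x                                      # placeholder for a video diffusion transformer
out = vap_denoise_step(dit,
                       ref_tokens=torch.randn(B, n_ref, d),
                       noisy_target=torch.randn(B, n_tgt, d),
                       timestep=torch.tensor([500]),
                       text_emb=torch.zeros(B, 1, d))
assert out.shape == (B, n_tgt, d)

The point is only that the control signal is carried by data (the reference tokens) rather than by a condition-specific module, which is what allows one model to cover concept, style, motion, and camera controls without per-condition finetuning.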
Plug-and-play in-context video generation framework with mixture-of-transformers architecture

The framework augments frozen Video Diffusion Transformers with a trainable parallel expert using a Mixture-of-Transformers design. It incorporates temporally biased position embedding to eliminate spurious pixel-mapping priors and enable robust context retrieval while preventing catastrophic forgetting.

0 retrieved papers
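Under simplifying assumptions, the block below sketches what a Mixture-of-Transformers style layer of this kind could look like: target tokens keep the frozen pretrained projections, reference tokens pass through a trainable copy of them, and a single joint self-attention mixes the two streams. Class, argument, and module names are hypothetical, and the temporally biased position embedding is only noted as a comment rather than modeled.

import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoTBlockSketch(nn.Module):
    """Illustrative MoT-style block: per-stream weights, shared self-attention."""

    def __init__(self, dim, num_heads, frozen_qkv, frozen_out, frozen_mlp):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        # Trainable expert branch for reference tokens, initialized from the pretrained weights.
        self.qkv_e = copy.deepcopy(frozen_qkv)
        self.out_e = copy.deepcopy(frozen_out)
        self.mlp_e = copy.deepcopy(frozen_mlp)
        # Frozen branch for target tokens: never updated, which is what guards
        # the pretrained backbone against catastrophic forgetting.
        self.qkv_f, self.out_f, self.mlp_f = frozen_qkv, frozen_out, frozen_mlp
        for m in (self.qkv_f, self.out_f, self.mlp_f):
            for p in m.parameters():
                p.requires_grad = False

    def forward(self, tgt, ref):
        # (Per the paper's description, reference tokens would also carry a temporally
        # biased position embedding so they are not pixel-aligned with the target
        # frames; that detail is omitted in this sketch.)
        B, n_tgt, _ = tgt.shape
        # Each stream is projected by its own weights...
        qkv = torch.cat([self.qkv_e(ref), self.qkv_f(tgt)], dim=1)   # (B, n_ref + n_tgt, 3 * dim)
        q, k, v = qkv.chunk(3, dim=-1)
        q, k, v = (t.view(B, -1, self.num_heads, self.head_dim).transpose(1, 2) for t in (q, k, v))
        # ...but attention runs jointly over the concatenated sequence, so the
        # frozen target stream is steered by the reference without weight updates.
        x = F.scaled_dot_product_attention(q, k, v)
        x = x.transpose(1, 2).reshape(B, -1, self.num_heads * self.head_dim)
        ref_x, tgt_x = x[:, :-n_tgt], x[:, -n_tgt:]
        tgt_x = tgt + self.out_f(tgt_x)
        tgt_x = tgt_x + self.mlp_f(tgt_x)
        ref_x = ref + self.out_e(ref_x)
        ref_x = ref_x + self.mlp_e(ref_x)
        return tgt_x, ref_x

# Minimal usage with toy modules standing in for one pretrained DiT block's weights.
dim, heads = 64, 4
block = MoTBlockSketch(dim, heads,
                       frozen_qkv=nn.Linear(dim, 3 * dim),
                       frozen_out=nn.Linear(dim, dim),
                       frozen_mlp=nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)))
tgt_out, ref_out = block(torch.randn(2, 16, dim), torch.randn(2, 8, dim))

The design intent reflected here is that only the reference path learns, so the pretrained backbone's behavior on ordinary generation is preserved; a real DiT block would also carry norms, timestep modulation, and text cross-attention, which this sketch omits.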
VAP-Data dataset for semantic-controlled video generation

VAP-Data is the largest dataset specifically designed for semantic-controlled video generation, containing over 100,000 curated paired video samples spanning 100 semantic conditions across four categories (concept, style, motion, camera). This dataset provides a robust foundation for training unified semantic-controlled video generation models.

10 retrieved papers
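Purely as an illustration of the described structure (paired clips, roughly 100 conditions, four categories), the snippet below indexes such samples with a hypothetical JSONL layout; the field names and loader are assumptions, not the released VAP-Data format.

import json
from dataclasses import dataclass

CATEGORIES = {"concept", "style", "motion", "camera"}   # the four reported categories

@dataclass
class PairedSample:
    reference_video: str    # clip demonstrating the semantic condition
    target_video: str       # clip sharing that condition but with different content
    condition: str          # one of ~100 semantic conditions (e.g. a particular camera move)
    category: str           # which of the four categories the condition falls under
    caption: str            # text description of the target content

def load_index(path: str) -> list[PairedSample]:
    """Read a JSONL index where each line is one paired sample (hypothetical layout)."""
    samples = []
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            assert record["category"] in CATEGORIES, record
            samples.append(PairedSample(**record))
    return samples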

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution A

Video-As-Prompt (VAP) unified semantic-controlled video generation paradigm

VAP introduces a new paradigm that reframes semantic-controlled video generation as in-context generation by using reference videos with desired semantics as direct prompts. This approach unifies diverse semantic controls (concept, style, motion, camera) in a single model without requiring per-condition finetuning or task-specific architectures.

Contribution B

Plug-and-play in-context video generation framework with mixture-of-transformers architecture

The framework augments frozen Video Diffusion Transformers with a trainable parallel expert using a Mixture-of-Transformers design. It incorporates temporally biased position embedding to eliminate spurious pixel-mapping priors and enable robust context retrieval while preventing catastrophic forgetting.

Contribution C

VAP-Data dataset for semantic-controlled video generation

VAP-Data is the largest dataset specifically designed for semantic-controlled video generation, containing over 100,000 curated paired video samples spanning 100 semantic conditions across four categories (concept, style, motion, camera). This dataset provides a robust foundation for training unified semantic-controlled video generation models.