Video-As-Prompt: Unified Semantic Control for Video Generation

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Video Generation · Controllable Video Generation · Video Dataset
Abstract:

Unified, generalizable semantic control in video generation remains a critical open challenge. Existing methods either introduce artifacts by enforcing inappropriate pixel-wise priors from structure-based controls, or rely on non-generalizable, condition-specific finetuning or task-specific architectures. We introduce Video-As-Prompt (VAP), a new paradigm that reframes this problem as in-context generation. VAP leverages a reference video as a direct semantic prompt, guiding a frozen Video Diffusion Transformer (DiT) via a plug-and-play Mixture-of-Transformers (MoT) expert. This architecture prevents catastrophic forgetting and is guided by a temporally biased position embedding that eliminates spurious mapping priors for robust context retrieval. To power this approach and catalyze future research, we built VAP-Data, the largest dataset for this task with over 100K paired videos across 100 semantic conditions. As a single unified model, VAP sets a new state-of-the-art for open-source methods, achieving a 38.7% user preference rate that rivals leading condition-specific commercial models. VAP's strong zero-shot generalization and support for various applications mark a significant advance toward general-purpose, controllable video generation.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Video-As-Prompt (VAP), a unified semantic control paradigm using reference videos as in-context prompts to guide video generation through a plug-and-play Mixture-of-Transformers architecture. Within the taxonomy, VAP occupies the 'Reference-Based and In-Context Generation' leaf, which currently contains only this paper. This positioning reflects a relatively sparse research direction compared to densely populated branches like Motion and Trajectory Control or Multimodal and Multi-Condition Control, suggesting the reference-based paradigm represents an emerging rather than crowded area of investigation.

The taxonomy reveals VAP's relationship to neighboring approaches: it diverges from explicit multi-condition frameworks (Unified Multi-Condition Frameworks) that combine text, audio, and layout signals, and from training-free methods that manipulate attention without architectural changes. The reference-based paradigm sits between caption-based semantic generation, which relies on textual descriptions, and motion customization techniques that adapt motion representations. VAP's in-context learning approach offers an alternative to heavily parameterized multi-condition systems, emphasizing demonstration-driven guidance over explicit control signal specification, thereby occupying a distinct methodological niche within the broader semantic control landscape.

Among the eighteen candidates examined, the core VAP paradigm (Contribution A) has one potentially refutable prior work among its eight reviewed candidates, the plug-and-play architecture (Contribution B) has no retrieved candidates, and the VAP-Data dataset (Contribution C) shows no refutations across its ten candidates. The limited search scope (eighteen papers rather than hundreds) means these statistics reflect top-K semantic matches and citation expansion, not exhaustive coverage. The dataset contribution appears more novel within this sample, while the unified paradigm faces at least one overlapping prior work, though the architectural instantiation remains less explored in the examined literature.

Given the constrained search scope and sparse taxonomy leaf, VAP appears to explore a less-traveled methodological path within semantic video control. The analysis captures top semantic neighbors but cannot rule out relevant work outside this sample. The reference-based framing and in-context generation angle differentiate VAP from dominant multi-condition and motion-centric approaches, though the single refutable candidate for the core paradigm warrants careful examination of how VAP's formulation advances beyond that prior work.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 18
Refutable Papers: 1

Research Landscape Overview

Core task: unified semantic control for video generation. The field has evolved into a rich taxonomy spanning diverse control modalities and architectural strategies. Major branches include Motion and Trajectory Control, which governs camera and object movement (e.g., MotionCtrl[1], FlowDirector[12]); Multimodal and Multi-Condition Control, integrating text, audio, and other signals (Uniadapter[2], UniAVGen[38]); Layout and Spatial Control for scene composition (MagicDrive[30]); and Pose and Gesture Control for human-centric generation (PoseCrafter[18], Language Gesture Control[6]). Training-Free and Plug-and-Play Control methods offer flexibility without retraining (Training Free Guidance[27]), while Cascaded and Hierarchical Architectures address temporal coherence and multi-scale synthesis. Domain-Specific branches target applications like autonomous driving or sign language, and Survey and Benchmark Studies (Controllable Video Survey[35], Interactive Generative Survey[36]) provide overarching perspectives on progress and open challenges.

Recent work highlights tensions between flexibility and precision: some methods pursue unified frameworks handling multiple conditions simultaneously (Unictrl[33], Univideo[21]), while others specialize in fine-grained control of specific attributes (DiTCtrl[10], COTA Motion[14]).

Reference-Based and In-Context Generation has emerged as a distinct paradigm, where example videos guide synthesis rather than explicit control signals. Video As Prompt[0] exemplifies this direction, leveraging video exemplars to steer generation in a more intuitive, demonstration-driven manner. This approach contrasts with heavily parameterized multi-condition systems like Uni3c[4] or sound-guided methods (Sound Guided Semantic[5]), offering a complementary path that emphasizes learning from visual context. The interplay between explicit control mechanisms and implicit reference-based guidance remains an active area, with ongoing exploration of how to balance user intent specification and generative flexibility.

Claimed Contributions

Video-As-Prompt (VAP) unified semantic-controlled video generation paradigm

VAP introduces a new paradigm that reframes semantic-controlled video generation as in-context generation by using reference videos with desired semantics as direct prompts. This approach unifies diverse semantic controls (concept, style, motion, camera) in a single model without requiring per-condition finetuning or task-specific architectures.

8 retrieved papers · Can Refute
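To make the in-context reframing concrete, below is a minimal sketch (not the authors' implementation) of the data flow this contribution describes: the reference clip's latent tokens are prepended to the noisy target tokens, so a single backbone is steered by whatever semantic the reference demonstrates. The function name, tensor shapes, and the stand-in dit callable are all illustrative assumptions.

import torch

def vap_denoise_step(dit, ref_tokens, noisy_target, timestep, text_emb):
    """One denoising step in which the reference acts purely as a prompt (illustrative)."""
    tokens = torch.cat([ref_tokens, noisy_target], dim=1)   # in-context sequence: [reference | target]
    pred = dit(tokens, timestep, text_emb)                   # joint attention over both parts
    return pred[:, ref_tokens.shape[1]:]                     # only the target portion is denoised

# Dummy shapes and a stand-in DiT, just to show the data flow end to end.
B, n_ref, n_tgt, d = 1, 256, 1024, 64
dit = lambda x, t, c: x                                      # placeholder for a video diffusion transformer
out = vap_denoise_step(dit,
                       ref_tokens=torch.randn(B, n_ref, d),
                       noisy_target=torch.randn(B, n_tgt, d),
                       timestep=torch.tensor([500]),
                       text_emb=torch.zeros(B, 1, d))
assert out.shape == (B, n_tgt, d)

The point is only that the control signal is carried by data (the reference tokens) rather than by a condition-specific module, which is what allows one model to cover concept, style, motion, and camera controls without per-condition finetuning.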
Plug-and-play in-context video generation framework with mixture-of-transformers architecture

The framework augments frozen Video Diffusion Transformers with a trainable parallel expert using a Mixture-of-Transformers design. It incorporates temporally biased position embedding to eliminate spurious pixel-mapping priors and enable robust context retrieval while preventing catastrophic forgetting.

0 retrieved papers
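Under simplifying assumptions, the block below sketches what a Mixture-of-Transformers style layer of this kind could look like: target tokens keep the frozen pretrained projections, reference tokens pass through a trainable copy of them, and a single joint self-attention mixes the two streams. Class, argument, and module names are hypothetical, and the temporally biased position embedding is only noted as a comment rather than modeled.

import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoTBlockSketch(nn.Module):
    """Illustrative MoT-style block: per-stream weights, shared self-attention."""

    def __init__(self, dim, num_heads, frozen_qkv, frozen_out, frozen_mlp):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        # Trainable expert branch for reference tokens, initialized from the pretrained weights.
        self.qkv_e = copy.deepcopy(frozen_qkv)
        self.out_e = copy.deepcopy(frozen_out)
        self.mlp_e = copy.deepcopy(frozen_mlp)
        # Frozen branch for target tokens: never updated, which is what guards
        # the pretrained backbone against catastrophic forgetting.
        self.qkv_f, self.out_f, self.mlp_f = frozen_qkv, frozen_out, frozen_mlp
        for m in (self.qkv_f, self.out_f, self.mlp_f):
            for p in m.parameters():
                p.requires_grad = False

    def forward(self, tgt, ref):
        # (Per the paper's description, reference tokens would also carry a temporally
        # biased position embedding so they are not pixel-aligned with the target
        # frames; that detail is omitted in this sketch.)
        B, n_tgt, _ = tgt.shape
        # Each stream is projected by its own weights...
        qkv = torch.cat([self.qkv_e(ref), self.qkv_f(tgt)], dim=1)   # (B, n_ref + n_tgt, 3 * dim)
        q, k, v = qkv.chunk(3, dim=-1)
        q, k, v = (t.view(B, -1, self.num_heads, self.head_dim).transpose(1, 2) for t in (q, k, v))
        # ...but attention runs jointly over the concatenated sequence, so the
        # frozen target stream is steered by the reference without weight updates.
        x = F.scaled_dot_product_attention(q, k, v)
        x = x.transpose(1, 2).reshape(B, -1, self.num_heads * self.head_dim)
        ref_x, tgt_x = x[:, :-n_tgt], x[:, -n_tgt:]
        tgt_x = tgt + self.out_f(tgt_x)
        tgt_x = tgt_x + self.mlp_f(tgt_x)
        ref_x = ref + self.out_e(ref_x)
        ref_x = ref_x + self.mlp_e(ref_x)
        return tgt_x, ref_x

# Minimal usage with toy modules standing in for one pretrained DiT block's weights.
dim, heads = 64, 4
block = MoTBlockSketch(dim, heads,
                       frozen_qkv=nn.Linear(dim, 3 * dim),
                       frozen_out=nn.Linear(dim, dim),
                       frozen_mlp=nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)))
tgt_out, ref_out = block(torch.randn(2, 16, dim), torch.randn(2, 8, dim))

The design intent reflected here is that only the reference path learns, so the pretrained backbone's behavior on ordinary generation is preserved; a real DiT block would also carry norms, timestep modulation, and text cross-attention, which this sketch omits.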
VAP-Data dataset for semantic-controlled video generation

VAP-Data is the largest dataset specifically designed for semantic-controlled video generation, containing over 100,000 curated paired video samples spanning 100 semantic conditions across four categories (concept, style, motion, camera). This dataset provides a robust foundation for training unified semantic-controlled video generation models.

10 retrieved papers
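Purely as an illustration of the described structure (paired clips, roughly 100 conditions, four categories), the snippet below indexes such samples with a hypothetical JSONL layout; the field names and loader are assumptions, not the released VAP-Data format.

import json
from dataclasses import dataclass

CATEGORIES = {"concept", "style", "motion", "camera"}   # the four reported categories

@dataclass
class PairedSample:
    reference_video: str    # clip demonstrating the semantic condition
    target_video: str       # clip sharing that condition but with different content
    condition: str          # one of ~100 semantic conditions (e.g. a particular camera move)
    category: str           # which of the four categories the condition falls under
    caption: str            # text description of the target content

def load_index(path: str) -> list[PairedSample]:
    """Read a JSONL index where each line is one paired sample (hypothetical layout)."""
    samples = []
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            assert record["category"] in CATEGORIES, record
            samples.append(PairedSample(**record))
    return samples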

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution A

Video-As-Prompt (VAP) unified semantic-controlled video generation paradigm

VAP introduces a new paradigm that reframes semantic-controlled video generation as in-context generation by using reference videos with desired semantics as direct prompts. This approach unifies diverse semantic controls (concept, style, motion, camera) in a single model without requiring per-condition finetuning or task-specific architectures.

Contribution B

Plug-and-play in-context video generation framework with mixture-of-transformers architecture

The framework augments frozen Video Diffusion Transformers with a trainable parallel expert using a Mixture-of-Transformers design. It incorporates temporally biased position embedding to eliminate spurious pixel-mapping priors and enable robust context retrieval while preventing catastrophic forgetting.

Contribution C

VAP-Data dataset for semantic-controlled video generation

VAP-Data is the largest dataset specifically designed for semantic-controlled video generation, containing over 100,000 curated paired video samples spanning 100 semantic conditions across four categories (concept, style, motion, camera). This dataset provides a robust foundation for training unified semantic-controlled video generation models.