Phantom-Data: Towards a General Subject-Consistent Video Generation Dataset

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Video Generation, Multimodal Generation
Abstract:

Subject-to-video generation has witnessed substantial progress in recent years. However, existing models still struggle to faithfully follow textual instructions. This limitation, commonly known as the copy-paste problem, arises from the widely used in-pair training paradigm, which samples reference images from the same scene as the target video and thereby entangles subject identity with background and contextual attributes. To address this issue, we introduce Phantom-Data, the first general-purpose cross-pair subject-to-video consistency dataset, containing approximately one million identity-consistent pairs across diverse categories. Our dataset is constructed via a three-stage pipeline: (1) a general and input-aligned subject detection module, (2) large-scale cross-context subject retrieval from more than 53 million videos and 3 billion images, and (3) prior-guided identity verification to ensure visual consistency under contextual variation. Comprehensive experiments show that training with Phantom-Data significantly improves prompt alignment and visual quality while preserving identity consistency on par with in-pair baselines.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes the tasks and contributions of academic papers against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Phantom-Data, a large-scale cross-pair dataset designed to address the copy-paste problem in subject-to-video generation. It resides in the Cross-Modal Alignment Approaches leaf, which contains three papers total, including the original work. This leaf sits within Single-Subject Identity Preservation, a moderately populated branch with six sub-categories. The taxonomy reveals that cross-modal alignment is one of several competing strategies for identity preservation, alongside motion-explicit methods, training-free enhancement, and optimization-based approaches, suggesting a relatively active but not overcrowded research direction.

The taxonomy shows that neighboring leaves include Motion-Explicit Identity Methods (two papers) and Feature-Guided Consistency (two papers), both addressing identity preservation through different technical mechanisms. The broader Identity-Preserving Generation Methods branch encompasses pose-driven animation (five papers) and spatial-temporal decoupling (two papers), indicating that the field has diversified into multiple architectural paradigms. Phantom-Data's focus on dataset construction distinguishes it from sibling papers that primarily propose model architectures or alignment mechanisms, positioning it at the intersection of data-centric and model-centric approaches within the cross-modal alignment paradigm.

Among the twenty candidates examined, both contributions show evidence of prior work. For the dataset contribution, ten candidates were examined and one was judged a refutable match; for the three-stage pipeline, ten candidates were likewise examined, also with one refutable match. The limited search scope (twenty candidates in total, drawn from semantic search) means these statistics reflect a focused sample rather than exhaustive coverage. The dataset contribution appears to face more substantial prior work in the form of existing identity-consistent datasets, while the pipeline's novelty hinges on how its specific combination of detection, retrieval, and verification stages compares with existing data-construction methods.

Based on the top-twenty semantic matches examined, the work appears to occupy a relatively sparse niche within cross-modal alignment approaches, specifically targeting dataset construction rather than model architecture. The taxonomy context suggests that while identity preservation is an active area, data-centric solutions remain less explored than model-centric ones. However, the limited search scope and presence of refutable candidates for both contributions indicate that the novelty assessment should be tempered by acknowledgment of existing work in large-scale identity-consistent data curation.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 2
Contribution Candidate Papers Compared: 20
Refutable Papers: 2

Research Landscape Overview

Core task: Subject-consistent video generation from reference images.

The field has organized itself around several major branches that reflect different technical emphases and application contexts. Identity-Preserving Generation Methods focus on maintaining the appearance of specific subjects across frames, often through cross-modal alignment or tuning strategies that bind reference images to generative models. Multi-Subject and Multi-Scene Generation tackles the challenge of coordinating multiple entities or environments within a single video, while Specialized Application Domains address use cases such as fashion try-on, human animation, and domain-specific content creation. Video Editing and Manipulation methods enable post-hoc modifications to existing footage, and Controllable Generation with Structural Guidance incorporates spatial or temporal constraints like pose sequences or depth maps. Foundational Models and Frameworks provide the underlying architectures and training paradigms, and Evaluation and Benchmarking establishes metrics and datasets to assess quality and consistency.

Representative works such as Animate Anyone[2] and Storydiffusion[3] illustrate how identity preservation and narrative coherence can be achieved through different architectural choices. Within the Identity-Preserving branch, a particularly active line of work explores cross-modal alignment approaches that leverage reference images to guide video synthesis without extensive fine-tuning. Phantom-Data[0] sits squarely in this cluster, emphasizing data-driven strategies for aligning subject features across modalities. It shares conceptual ground with Phantom[1], which also prioritizes identity consistency through alignment mechanisms, and with Subject-driven Video Generation via[8], which investigates how reference conditioning can be integrated into diffusion-based pipelines.
These methods contrast with tuning-heavy approaches like VideoBooth[12] or CustomCrafter[41], which adapt model weights to specific subjects but require longer optimization. The central trade-off revolves around generalization versus personalization: alignment-based techniques aim for broader applicability with minimal per-subject overhead, while tuning methods offer finer control at the cost of computational expense. Phantom-Data[0] contributes to this ongoing conversation by exploring how large-scale data can improve cross-modal feature correspondence, positioning itself among works that seek scalable identity preservation without sacrificing temporal coherence.

Claimed Contributions

Phantom-Data dataset for subject-to-video generation

The authors present Phantom-Data, a novel dataset designed for subject-to-video generation that contains around one million identity-consistent pairs spanning multiple categories. This is claimed as the first general-purpose cross-pair dataset addressing subject consistency in video generation.

10 retrieved papers · Can Refute
Three-stage pipeline for dataset construction

The authors develop a three-stage construction pipeline that includes subject detection, large-scale cross-context retrieval from massive video and image collections, and identity verification to ensure visual consistency across different contexts.

10 retrieved papers · Can Refute
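The detect → retrieve → verify structure of the claimed pipeline can be sketched as a minimal toy example. Everything below is a hypothetical stand-in, assuming only the stage ordering described in this report: the keyword-based detection, the `Candidate` record, the similarity scores, and the 0.8 threshold are all illustrative, not the paper's actual components.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    subject_id: str
    context: str       # scene descriptor for the candidate image
    similarity: float  # hypothetical appearance similarity to the reference

def detect_subjects(prompt: str, known_subjects: list[str]) -> list[str]:
    # Stage 1: input-aligned subject detection. A keyword match stands in
    # for the paper's (unspecified) detection module.
    return [s for s in known_subjects if s in prompt]

def retrieve_cross_context(subject: str, pool: list[Candidate],
                           target_context: str) -> list[Candidate]:
    # Stage 2: cross-context retrieval. Keep candidates showing the same
    # subject in a *different* scene, so identity is decoupled from context.
    return [c for c in pool
            if c.subject_id == subject and c.context != target_context]

def verify_identity(cands: list[Candidate],
                    thresh: float = 0.8) -> list[Candidate]:
    # Stage 3: prior-guided identity verification, approximated here by
    # thresholding the hypothetical similarity score.
    return [c for c in cands if c.similarity >= thresh]

pool = [
    Candidate("dog", "park", 0.92),
    Candidate("dog", "beach", 0.65),   # same subject, low similarity: rejected
    Candidate("dog", "kitchen", 0.88), # same context as target: excluded
    Candidate("cat", "park", 0.95),    # different subject: excluded
]
subjects = detect_subjects("a dog running", ["dog", "cat"])
pairs = verify_identity(retrieve_cross_context(subjects[0], pool, "kitchen"))
```

Excluding same-context candidates in stage 2 is the step that breaks the identity/background entanglement which the report attributes to in-pair training; stage 3 then guards against retrieving a visually different instance of the same category.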

Core Task Comparisons

Comparisons with papers in the same taxonomy category
