Phantom-Data: Towards a General Subject-Consistent Video Generation Dataset

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Video Generation, Multimodal Generation
Abstract:

Subject-to-video generation has witnessed substantial progress in recent years. However, existing models still struggle to faithfully follow textual instructions. This limitation, commonly known as the copy-paste problem, arises from the widely used in-pair training paradigm, which samples reference images from the same scene as the target video and thereby entangles subject identity with background and contextual attributes. To address this issue, we introduce Phantom-Data, the first general-purpose cross-pair subject-to-video consistency dataset, containing approximately one million identity-consistent pairs across diverse categories. Our dataset is constructed via a three-stage pipeline: (1) a general and input-aligned subject detection module, (2) large-scale cross-context subject retrieval from more than 53 million videos and 3 billion images, and (3) prior-guided identity verification to ensure visual consistency under contextual variation. Comprehensive experiments show that training with Phantom-Data significantly improves prompt alignment and visual quality while preserving identity consistency on par with in-pair baselines.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes the tasks and contributions of academic papers against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Phantom-Data, a large-scale cross-pair dataset designed to address the copy-paste problem in subject-to-video generation. It resides in the Cross-Modal Alignment Approaches leaf, which contains three papers total, including the original work. This leaf sits within Single-Subject Identity Preservation, a moderately populated branch with six sub-categories. The taxonomy reveals that cross-modal alignment is one of several competing strategies for identity preservation, alongside motion-explicit methods, training-free enhancement, and optimization-based approaches, suggesting a relatively active but not overcrowded research direction.

The taxonomy shows that neighboring leaves include Motion-Explicit Identity Methods (two papers) and Feature-Guided Consistency (two papers), both addressing identity preservation through different technical mechanisms. The broader Identity-Preserving Generation Methods branch encompasses pose-driven animation (five papers) and spatial-temporal decoupling (two papers), indicating that the field has diversified into multiple architectural paradigms. Phantom-Data's focus on dataset construction distinguishes it from sibling papers that primarily propose model architectures or alignment mechanisms, positioning it at the intersection of data-centric and model-centric approaches within the cross-modal alignment paradigm.

Among the twenty candidates examined, both contributions show evidence of prior work. For the dataset contribution, ten candidates were examined and one was judged a refutable match; for the three-stage pipeline, ten candidates were likewise examined, also with one refutable match. The limited search scope (twenty candidates in total, drawn from semantic search) means these statistics reflect a focused sample rather than exhaustive coverage. The dataset contribution appears to face more substantial prior work in the form of existing identity-consistent datasets, while the pipeline's novelty hinges on how its specific combination of detection, retrieval, and verification stages compares with existing data-construction methods.

Based on the top-twenty semantic matches examined, the work appears to occupy a relatively sparse niche within cross-modal alignment approaches, specifically targeting dataset construction rather than model architecture. The taxonomy context suggests that while identity preservation is an active area, data-centric solutions remain less explored than model-centric ones. However, the limited search scope and presence of refutable candidates for both contributions indicate that the novelty assessment should be tempered by acknowledgment of existing work in large-scale identity-consistent data curation.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 2
Contribution Candidate Papers Compared: 20
Refutable Papers: 2

Research Landscape Overview

Core task: Subject-consistent video generation from reference images.

The field has organized itself around several major branches that reflect different technical emphases and application contexts. Identity-Preserving Generation Methods focus on maintaining the appearance of specific subjects across frames, often through cross-modal alignment or tuning strategies that bind reference images to generative models. Multi-Subject and Multi-Scene Generation tackles the challenge of coordinating multiple entities or environments within a single video, while Specialized Application Domains address use cases such as fashion try-on, human animation, and domain-specific content creation. Video Editing and Manipulation methods enable post-hoc modifications to existing footage, and Controllable Generation with Structural Guidance incorporates spatial or temporal constraints like pose sequences or depth maps. Foundational Models and Frameworks provide the underlying architectures and training paradigms, and Evaluation and Benchmarking establishes metrics and datasets to assess quality and consistency.

Representative works such as Animate Anyone[2] and Storydiffusion[3] illustrate how identity preservation and narrative coherence can be achieved through different architectural choices. Within the Identity-Preserving branch, a particularly active line of work explores cross-modal alignment approaches that leverage reference images to guide video synthesis without extensive fine-tuning. Phantom-Data[0] sits squarely in this cluster, emphasizing data-driven strategies for aligning subject features across modalities. It shares conceptual ground with Phantom[1], which also prioritizes identity consistency through alignment mechanisms, and with Subject-driven Video Generation via[8], which investigates how reference conditioning can be integrated into diffusion-based pipelines.
These methods contrast with tuning-heavy approaches like VideoBooth[12] or CustomCrafter[41], which adapt model weights to specific subjects but require longer optimization. The central trade-off revolves around generalization versus personalization: alignment-based techniques aim for broader applicability with minimal per-subject overhead, while tuning methods offer finer control at the cost of computational expense. Phantom-Data[0] contributes to this ongoing conversation by exploring how large-scale data can improve cross-modal feature correspondence, positioning itself among works that seek scalable identity preservation without sacrificing temporal coherence.

Claimed Contributions

Phantom-Data dataset for subject-to-video generation

The authors present Phantom-Data, a novel dataset designed for subject-to-video generation that contains around one million identity-consistent pairs spanning multiple categories. This is claimed as the first general-purpose cross-pair dataset addressing subject consistency in video generation.

10 retrieved papers · Can Refute
Three-stage pipeline for dataset construction

The authors develop a three-stage construction pipeline that includes subject detection, large-scale cross-context retrieval from massive video and image collections, and identity verification to ensure visual consistency across different contexts.

10 retrieved papers · Can Refute
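The detect → retrieve → verify structure of the claimed pipeline can be sketched as a minimal toy example. Everything below is a hypothetical stand-in, assuming only the stage ordering described in this report: the keyword-based detection, the `Candidate` record, the similarity scores, and the 0.8 threshold are all illustrative, not the paper's actual components.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    subject_id: str
    context: str       # scene descriptor for the candidate image
    similarity: float  # hypothetical appearance similarity to the reference

def detect_subjects(prompt: str, known_subjects: list[str]) -> list[str]:
    # Stage 1: input-aligned subject detection. A keyword match stands in
    # for the paper's (unspecified) detection module.
    return [s for s in known_subjects if s in prompt]

def retrieve_cross_context(subject: str, pool: list[Candidate],
                           target_context: str) -> list[Candidate]:
    # Stage 2: cross-context retrieval. Keep candidates showing the same
    # subject in a *different* scene, so identity is decoupled from context.
    return [c for c in pool
            if c.subject_id == subject and c.context != target_context]

def verify_identity(cands: list[Candidate],
                    thresh: float = 0.8) -> list[Candidate]:
    # Stage 3: prior-guided identity verification, approximated here by
    # thresholding the hypothetical similarity score.
    return [c for c in cands if c.similarity >= thresh]

pool = [
    Candidate("dog", "park", 0.92),
    Candidate("dog", "beach", 0.65),   # same subject, low similarity: rejected
    Candidate("dog", "kitchen", 0.88), # same context as target: excluded
    Candidate("cat", "park", 0.95),    # different subject: excluded
]
subjects = detect_subjects("a dog running", ["dog", "cat"])
pairs = verify_identity(retrieve_cross_context(subjects[0], pool, "kitchen"))
```

Excluding same-context candidates in stage 2 is the step that breaks the identity/background entanglement which the report attributes to in-pair training; stage 3 then guards against retrieving a visually different instance of the same category.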

Core Task Comparisons

Comparisons with papers in the same taxonomy category
