MAGREF: Masked Guidance for Any-Reference Video Generation with Subject Disentanglement
Overview
Overall Novelty Assessment
The paper introduces MAGREF, a framework for any-reference video generation that synthesizes videos conditioned on multiple reference subjects and textual prompts. It resides in the 'Attention-Based Subject Disentanglement' leaf, which contains six papers addressing identity consistency and subject confusion through attention mechanisms. This leaf is one of three under 'Multi-Subject Identity Preservation and Disentanglement', indicating a moderately populated research direction focused on preventing feature entanglement when multiple subjects appear in generated videos.
The taxonomy reveals neighboring leaves employing alternative strategies: 'Embedding and Spatial Control for Multi-Subject Synthesis' (five papers) uses subject embeddings and LoRAs, while 'Hierarchical and Cross-Modal Identity Grounding' (five papers) leverages hierarchical structures to link subjects with references. MAGREF's attention-based approach contrasts with these embedding-centric methods, positioning it within a cluster that prioritizes dynamic feature separation over static identity encodings. The broader 'Subject-Consistent Video Synthesis from Reference Images' category (eight papers across three leaves) addresses related appearance preservation challenges but without the multi-subject disentanglement focus that defines MAGREF's core problem space.
Of the thirty candidates examined (ten per contribution), the framework-level contribution yielded one refutable candidate, suggesting some overlap with prior unified approaches to multi-subject video generation. The masked guidance mechanism and the four-stage data pipeline each yielded zero refutable candidates and therefore appear more distinctive within this limited search scope. These statistics indicate that while the overall framework concept has precedent, the specific technical mechanisms for subject disentanglement and training-data construction may offer incremental novelty relative to the examined literature.
Based on top-thirty semantic matches, MAGREF appears to build on established attention-based disentanglement paradigms while introducing specific masking and channel concatenation strategies. The analysis covers a focused subset of the field; broader searches or domain-specific venues might reveal additional overlapping work in multi-subject video synthesis or reference-conditioned generation pipelines not captured here.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose MAGREF, a framework that enables video synthesis conditioned on arbitrary types and combinations of reference subjects along with textual prompts. It addresses challenges of identity inconsistency, subject entanglement, and copy-paste artifacts in multi-subject video generation.
The method uses region-aware masking with pixel-wise channel concatenation to preserve appearance features, and injects semantic values from text conditions into corresponding visual regions to prevent subject confusion across multiple references.
The authors develop a four-stage data construction pipeline that creates diverse training pairs to reduce copy-paste artifacts in the generated videos.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[1] Multi-Subject Open-Set Personalization in Video Generation
[2] FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention
[8] RefDrop: Controllable Consistency in Image or Video Generation via Reference Feature Guidance
[20] VideoAlchemy: Open-Set Personalization in Video Generation
[22] MAGREF: Masked Guidance for Any-Reference Video Generation
Contribution Analysis
Detailed comparisons for each claimed contribution
MAGREF framework for any-reference video generation
The authors propose MAGREF, a framework that enables video synthesis conditioned on arbitrary types and combinations of reference subjects along with textual prompts. It addresses challenges of identity inconsistency, subject entanglement, and copy-paste artifacts in multi-subject video generation.
[3] Phantom: Subject-Consistent Video Generation via Cross-Modal Alignment
[6] VideoDreamer: Customized Multi-Subject Text-to-Video Generation with Disen-Mix Finetuning on Language-Video Foundation Models
[11] CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects
[15] CINEMA: Coherent Multi-Subject Video Generation via MLLM-Based Guidance
[22] MAGREF: Masked Guidance for Any-Reference Video Generation
[25] BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration
[54] Text-Guided Synthesis of Crowd Animation
[55] Seedance 1.0: Exploring the Boundaries of Video Generation Models
[56] Identity-Preserving Text-to-Video Generation via Training-Free Prompt, Image, and Guidance Enhancement
[57] MEVG: Multi-Event Video Generation with Text-to-Video Models
Masked guidance mechanism with subject disentanglement
The method uses region-aware masking with pixel-wise channel concatenation to preserve appearance features, and injects semantic values from text conditions into corresponding visual regions to prevent subject confusion across multiple references.
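To make the mechanism concrete, below is a minimal PyTorch sketch of what region-aware masking with pixel-wise channel concatenation could look like. Everything here is an illustrative assumption rather than MAGREF's actual implementation: the function name, tensor shapes, and the region-compositing scheme are all hypothetical.

```python
import torch

def masked_reference_conditioning(video_latents, ref_latents, ref_masks):
    """Hypothetical sketch: composite each reference into its assigned
    spatial region, then concatenate the composite (and its mask) to the
    video latents along the channel axis, pixel-wise.

    video_latents: (B, C, T, H, W) noisy video latents
    ref_latents:   list of (B, C, H, W) encoded reference images
    ref_masks:     list of (B, 1, H, W) binary region masks, one per reference
    """
    B, C, T, H, W = video_latents.shape
    cond = torch.zeros(B, C, H, W, device=video_latents.device,
                       dtype=video_latents.dtype)
    mask_plane = torch.zeros(B, 1, H, W, device=video_latents.device,
                             dtype=video_latents.dtype)
    for ref, mask in zip(ref_latents, ref_masks):
        # Each reference contributes features only inside its own region,
        # keeping subject appearances spatially separated from one another.
        cond = cond + ref * mask
        mask_plane = torch.clamp(mask_plane + mask, max=1.0)
    # Broadcast the conditioning plane across time and concatenate along
    # channels, so appearance features stay aligned with the pixels (and
    # hence the subjects) they are meant to govern.
    cond = cond.unsqueeze(2).expand(B, C, T, H, W)
    mask_plane = mask_plane.unsqueeze(2).expand(B, 1, T, H, W)
    return torch.cat([video_latents, cond, mask_plane], dim=1)  # (B, 2C+1, T, H, W)
```

Because each reference occupies a disjoint part of the conditioning plane, features from different subjects never mix in the model's input, which is one plausible way such a masking scheme can discourage subject entanglement; the text-to-region semantic injection described above would then act on top of this inside the attention layers.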
[7] DisenStudio: Customized Multi-Subject Text-to-Video Generation with Disentangled Spatial Control
[58] DisCo: Disentangled Control for Realistic Human Dance Generation
[59] Structure and Content-Guided Video Synthesis with Diffusion Models
[60] MoA: Mixture-of-Attention for Subject-Context Disentanglement in Personalized Image Generation
[61] Training-Free Color-Style Disentanglement for Constrained Text-to-Image Synthesis
[62] Non-Generative Generalized Zero-Shot Learning via Task-Correlated Disentanglement and Controllable Samples Synthesis
[63] FreeFuse: Multi-Subject LoRA Fusion via Auto Masking at Test Time
[64] TripleFDS: Triple Feature Disentanglement and Synthesis for Scene Text Editing
[65] FreeCustom: Training-Free Multi-Concept Customization for Image and Video Generation
[66] Disentangled Motion Latent Flow Matching for Fine-Grained Controllable Talking Portrait Synthesis
Four-stage data pipeline for training pairs
The authors develop a four-stage data construction pipeline that creates diverse training pairs to reduce copy-paste artifacts in the generated videos.
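Since the report does not spell out the four stages, the skeleton below is purely hypothetical: the segment/filter/pair/assemble split, every function name, and the data layout are assumptions for illustration, not the authors' pipeline. The one substantive idea it encodes, drawing the reference crop from a different frame than the target clip, is a common way to discourage copy-paste artifacts, since a reference that never matches the target pixel-for-pixel cannot simply be copied verbatim.

```python
import random

# Hypothetical four-stage skeleton; stage boundaries, function names, and
# data layout are illustrative assumptions, not the paper's actual pipeline.

def segment_subjects(video):
    # Stage 1 (assumed): obtain per-subject region masks, e.g. from an
    # off-the-shelf detector/segmenter; here we read precomputed annotations.
    return video.get("subjects", [])

def passes_quality_filter(subject):
    # Stage 2 (assumed): discard tiny, blurry, or ambiguous subject crops.
    return subject["area_frac"] >= 0.02 and subject["sharpness"] >= 0.5

def sample_nonmatching_frame(video, subject):
    # Stage 3 (assumed): take the reference from a frame other than the one
    # the subject mask was extracted from, so reference and target never
    # match pixel-for-pixel.
    candidates = [i for i in range(len(video["frames"])) if i != subject["frame_idx"]]
    return video["frames"][random.choice(candidates)]

def build_training_pairs(videos):
    # Stage 4 (assumed): assemble (reference, target clip, prompt) samples.
    pairs = []
    for video in videos:
        subjects = [s for s in segment_subjects(video) if passes_quality_filter(s)]
        for subject in subjects:
            ref_frame = sample_nonmatching_frame(video, subject)
            pairs.append({
                "reference": (ref_frame, subject["mask"]),
                "target_clip": video["frames"],
                "prompt": video.get("caption", ""),
            })
    return pairs
```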