LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: Video Generation, Video Customization, Diffusion Models, Multi-Subject Generation, Face-Attribute Alignment
Abstract:

Recent advances in diffusion models have significantly improved text-to-video generation, enabling personalized content creation with fine-grained control over both foreground and background elements. However, precise face–attribute alignment across subjects remains challenging, as existing methods lack explicit mechanisms to ensure intra-group consistency. Addressing this gap requires both explicit modeling strategies and face-attribute-aware data resources. We therefore propose LumosX, a framework that advances both data and model design. On the data side, a tailored collection pipeline orchestrates captions and visual cues from independent videos, while multimodal large language models (MLLMs) infer and assign subject-specific dependencies. These extracted relational priors impose a finer-grained structure that strengthens the expressive control of personalized video generation and enables the construction of a comprehensive benchmark. On the modeling side, Relational Self-Attention and Relational Cross-Attention combine position-aware embeddings with structured attention to encode explicit subject–attribute dependencies, enforcing intra-group cohesion and sharpening the separation between distinct subject groups. Comprehensive evaluations on our benchmark demonstrate that LumosX achieves state-of-the-art performance in fine-grained, identity-consistent, and semantically aligned personalized multi-subject video generation.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

LumosX proposes a framework for personalized multi-subject video generation with explicit face–attribute alignment, combining a tailored data collection pipeline that extracts subject-specific dependencies via multimodal large language models with novel Relational Self-Attention and Relational Cross-Attention modules. The paper resides in the 'Relational Modeling for Subject-Attribute Alignment' leaf, which contains only two papers in total. This leaf sits within the broader 'Multi-Subject Identity Preservation and Disentanglement' branch, indicating a relatively sparse research direction focused on explicit relational priors for intra-group consistency.

The taxonomy reveals that neighboring work divides into several directions: sibling approaches like masked guidance for subject disentanglement, temporal control methods enabling timestamp-based appearance scheduling, and feedforward multimodal architectures supporting diverse control signals. The single-subject personalization branch—covering cross-modal alignment, identity-preserving adapters, pose-driven synthesis, and motion intensity control—represents a more mature area with four distinct leaves. LumosX diverges from these by targeting multi-subject scenarios with explicit relational modeling, whereas most prior work either handles single subjects or addresses compositional layout without fine-grained attribute-level alignment.

Among fifteen candidates examined, none clearly refute the three core contributions. The data collection pipeline with face-attribute dependencies was assessed against four candidates with zero refutations; the relational attention modules against one candidate with zero refutations; and the overall LumosX framework against ten candidates with zero refutations. This suggests that within the limited search scope—primarily top-K semantic matches and citation expansion—the explicit relational modeling strategy and dependency-aware data construction appear relatively novel. The absence of refutable overlaps may reflect both the sparse leaf population and the specific focus on subject-attribute binding rather than broader compositional control.

Based on the limited literature search, LumosX appears to occupy a distinct position emphasizing explicit relational priors for face-attribute alignment in multi-subject video generation. The analysis covers top-fifteen semantic matches and does not constitute an exhaustive survey of all personalized video synthesis methods. The sparse leaf structure and zero refutations among examined candidates suggest potential novelty, though a broader search might reveal additional overlapping work in adjacent areas like compositional synthesis or identity-preserving adapters.

Taxonomy

Core-task Taxonomy Papers: 10
Claimed Contributions: 3
Contribution Candidate Papers Compared: 15
Refutable Papers: 0

Research Landscape Overview

Core task: Personalized multi-subject video generation with face-attribute alignment. The field structure reflects a progression from single-subject personalization toward multi-subject scenarios with increasingly sophisticated control mechanisms. At the top level, one branch focuses on Multi-Subject Identity Preservation and Disentanglement, addressing the challenge of maintaining distinct identities and correctly associating attributes when multiple subjects appear together. A second branch emphasizes Temporal and Compositional Control in Subject-Driven Generation, exploring how to orchestrate motion, pose, and layout over time. A third branch covers Single-Subject Personalized Video Synthesis, where methods like CustomVideo[1] and PoseCrafter[4] refine identity consistency and motion guidance for individual characters. Additional branches address Identity Insertion and Attribute Editing in Image Generation—often leveraging face encoders or adapter modules as in StableIdentity[3] and Phantom[2]—and Masked Face Recognition Methods, which provide foundational techniques for robust feature extraction under occlusion.

Within the multi-subject identity preservation branch, a particularly active line of work centers on relational modeling for subject-attribute alignment, where the key challenge is ensuring that generated videos correctly bind each subject's identity to the intended attributes (e.g., clothing, accessories, or actions). LumosX[0] sits squarely in this cluster, emphasizing mechanisms that disentangle and align face features with corresponding attributes across multiple subjects. This contrasts with neighboring approaches like CustomVideo[1], which primarily targets single-subject fidelity and temporal consistency, or broader multi-subject frameworks such as OmniVCus[5] and MAGREF[6], which may prioritize compositional layout or cross-modal grounding over fine-grained face-attribute correspondence.
The central trade-off remains between scaling to many subjects while preserving precise attribute alignment versus achieving high visual quality and motion coherence in simpler, single-subject settings.

Claimed Contributions

Tailored data collection pipeline with face–attribute dependencies

The authors build a data collection pipeline that extracts captions and visual conditions from independent videos and uses MLLMs to infer explicit face–attribute correspondences. This pipeline produces finer-grained relational priors that enhance personalized video customization and enable the construction of a comprehensive benchmark.

4 retrieved papers
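The report does not specify how the pipeline represents the MLLM-inferred dependencies. As a hedged sketch under assumed conventions, the parsing step might turn an MLLM response into structured subject–attribute priors as follows; the line-oriented `subject | key=value` response format and the `SubjectDependency` container are illustrative assumptions, not the paper's actual design:

```python
from dataclasses import dataclass, field

@dataclass
class SubjectDependency:
    """One subject and the attributes bound to it (e.g., clothing, accessories, actions)."""
    subject_id: str
    attributes: dict = field(default_factory=dict)

def parse_dependency_response(response: str) -> list[SubjectDependency]:
    """Parse a line-oriented MLLM response of the assumed form
    'subject_id | key=value; key=value' into structured relational priors."""
    deps = []
    for line in response.strip().splitlines():
        subject, _, attr_part = line.partition("|")
        attrs = {}
        for pair in attr_part.split(";"):
            key, _, value = pair.partition("=")
            if key.strip():
                attrs[key.strip()] = value.strip()
        deps.append(SubjectDependency(subject.strip(), attrs))
    return deps

# Example MLLM output (format is an assumption; the report does not specify one):
raw = """person_1 | clothing=red coat; action=walking
person_2 | clothing=blue suit; accessory=glasses"""
deps = parse_dependency_response(raw)
```

Structured priors of this kind could then be consumed both for conditioning the generator and for constructing benchmark annotations, as the contribution describes.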
Relational Self-Attention and Relational Cross-Attention modules

The authors introduce two dedicated modules that integrate relational positional encodings (R2PE and CSAM) and structured attention masks (MCAM) to explicitly encode face–attribute bindings. These modules reinforce intra-group coherence, suppress cross-group interference, and ensure semantically consistent multi-subject video generation.

1 retrieved paper
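The internals of R2PE, CSAM, and MCAM are not detailed in this report. The general mechanism of a structured attention mask that permits attention within a subject–attribute group while blocking it across groups can be sketched as below; the group-assignment scheme and mask semantics are generic assumptions, not the paper's exact formulation:

```python
import numpy as np

def group_attention_mask(group_ids: list) -> np.ndarray:
    """Binary mask: token i may attend to token j only if both share a
    subject-attribute group. 1 = allowed, 0 = blocked."""
    ids = np.asarray(group_ids)
    return (ids[:, None] == ids[None, :]).astype(np.float32)

def masked_softmax(scores: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Apply the mask as a large negative additive bias before softmax,
    so blocked cross-group positions receive (near-)zero attention weight."""
    biased = np.where(mask > 0, scores, -1e9)
    e = np.exp(biased - biased.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Two subjects: tokens 0-2 form subject A's face-attribute group, tokens 3-4 subject B's.
mask = group_attention_mask([0, 0, 0, 1, 1])
attn = masked_softmax(np.zeros((5, 5)), mask)
```

Under such a mask, attention mass for each token is confined to its own group, which matches the stated goal of reinforcing intra-group coherence while suppressing cross-group interference.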
LumosX framework for personalized multi-subject video generation

The authors propose LumosX, a comprehensive framework that combines the tailored data pipeline with the relational attention modules to achieve state-of-the-art performance in fine-grained, identity-consistent, and semantically aligned personalized multi-subject video generation.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Tailored data collection pipeline with face–attribute dependencies

The authors build a data collection pipeline that extracts captions and visual conditions from independent videos and uses MLLMs to infer explicit face–attribute correspondences. This pipeline produces finer-grained relational priors that enhance personalized video customization and enable the construction of a comprehensive benchmark.

Contribution

Relational Self-Attention and Relational Cross-Attention modules

The authors introduce two dedicated modules that integrate relational positional encodings (R2PE and CSAM) and structured attention masks (MCAM) to explicitly encode face–attribute bindings. These modules reinforce intra-group coherence, suppress cross-group interference, and ensure semantically consistent multi-subject video generation.

Contribution

LumosX framework for personalized multi-subject video generation

The authors propose LumosX, a comprehensive framework that combines the tailored data pipeline with the relational attention modules to achieve state-of-the-art performance in fine-grained, identity-consistent, and semantically aligned personalized multi-subject video generation.