LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: Video Generation, Video Customization, Diffusion Models, Multi-Subject Generation, Face-Attribute Alignment
Abstract:

Recent advances in diffusion models have significantly improved text-to-video generation, enabling personalized content creation with fine-grained control over both foreground and background elements. However, precise face–attribute alignment across subjects remains challenging, as existing methods lack explicit mechanisms to ensure intra-group consistency. Addressing this gap requires both explicit modeling strategies and face-attribute-aware data resources. We therefore propose LumosX, a framework that advances both data and model design. On the data side, a tailored collection pipeline orchestrates captions and visual cues from independent videos, while multimodal large language models (MLLMs) infer and assign subject-specific dependencies. These extracted relational priors impose a finer-grained structure that strengthens the expressive control of personalized video generation and enables the construction of a comprehensive benchmark. On the modeling side, Relational Self-Attention and Relational Cross-Attention combine position-aware embeddings with structured attention to encode explicit subject–attribute dependencies, enforcing intra-group cohesion and sharpening the separation between distinct subject groups. Comprehensive evaluations on our benchmark demonstrate that LumosX achieves state-of-the-art performance in fine-grained, identity-consistent, and semantically aligned personalized multi-subject video generation.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

LumosX proposes a framework for personalized multi-subject video generation with explicit face–attribute alignment, combining a tailored data collection pipeline that extracts subject-specific dependencies via multimodal large language models with novel Relational Self-Attention and Relational Cross-Attention modules. The paper resides in the 'Relational Modeling for Subject-Attribute Alignment' leaf, which contains only two papers in total. This leaf sits within the broader 'Multi-Subject Identity Preservation and Disentanglement' branch, indicating a relatively sparse research direction focused on explicit relational priors for intra-group consistency.

The taxonomy reveals that neighboring work divides into several directions: sibling approaches like masked guidance for subject disentanglement, temporal control methods enabling timestamp-based appearance scheduling, and feedforward multimodal architectures supporting diverse control signals. The single-subject personalization branch—covering cross-modal alignment, identity-preserving adapters, pose-driven synthesis, and motion intensity control—represents a more mature area with four distinct leaves. LumosX diverges from these by targeting multi-subject scenarios with explicit relational modeling, whereas most prior work either handles single subjects or addresses compositional layout without fine-grained attribute-level alignment.

Among fifteen candidates examined, none clearly refute the three core contributions. The data collection pipeline with face-attribute dependencies was assessed against four candidates with zero refutations; the relational attention modules against one candidate with zero refutations; and the overall LumosX framework against ten candidates with zero refutations. This suggests that within the limited search scope—primarily top-K semantic matches and citation expansion—the explicit relational modeling strategy and dependency-aware data construction appear relatively novel. The absence of refutable overlaps may reflect both the sparse leaf population and the specific focus on subject-attribute binding rather than broader compositional control.

Based on the limited literature search, LumosX appears to occupy a distinct position emphasizing explicit relational priors for face-attribute alignment in multi-subject video generation. The analysis covers top-fifteen semantic matches and does not constitute an exhaustive survey of all personalized video synthesis methods. The sparse leaf structure and zero refutations among examined candidates suggest potential novelty, though a broader search might reveal additional overlapping work in adjacent areas like compositional synthesis or identity-preserving adapters.

Taxonomy

Core-task Taxonomy Papers: 10
Claimed Contributions: 3
Contribution Candidate Papers Compared: 15
Refutable Papers: 0

Research Landscape Overview

Core task: Personalized multi-subject video generation with face-attribute alignment. The field structure reflects a progression from single-subject personalization toward multi-subject scenarios with increasingly sophisticated control mechanisms. At the top level, one branch focuses on Multi-Subject Identity Preservation and Disentanglement, addressing the challenge of maintaining distinct identities and correctly associating attributes when multiple subjects appear together. A second branch emphasizes Temporal and Compositional Control in Subject-Driven Generation, exploring how to orchestrate motion, pose, and layout over time. A third branch covers Single-Subject Personalized Video Synthesis, where methods like CustomVideo[1] and PoseCrafter[4] refine identity consistency and motion guidance for individual characters. Additional branches address Identity Insertion and Attribute Editing in Image Generation—often leveraging face encoders or adapter modules as in StableIdentity[3] and Phantom[2]—and Masked Face Recognition Methods, which provide foundational techniques for robust feature extraction under occlusion.

Within the multi-subject identity preservation branch, a particularly active line of work centers on relational modeling for subject-attribute alignment, where the key challenge is ensuring that generated videos correctly bind each subject's identity to the intended attributes (e.g., clothing, accessories, or actions). LumosX[0] sits squarely in this cluster, emphasizing mechanisms that disentangle and align face features with corresponding attributes across multiple subjects. This contrasts with neighboring approaches like CustomVideo[1], which primarily targets single-subject fidelity and temporal consistency, or broader multi-subject frameworks such as OmniVCus[5] and MAGREF[6], which may prioritize compositional layout or cross-modal grounding over fine-grained face-attribute correspondence.
The central trade-off remains between scaling to many subjects while preserving precise attribute alignment versus achieving high visual quality and motion coherence in simpler, single-subject settings.

Claimed Contributions

Tailored data collection pipeline with face–attribute dependencies

The authors build a data collection pipeline that extracts captions and visual conditions from independent videos and uses MLLMs to infer explicit face–attribute correspondences. This pipeline produces finer-grained relational priors that enhance personalized video customization and enable the construction of a comprehensive benchmark.

4 retrieved papers
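The report does not specify how the pipeline represents the MLLM-inferred dependencies. As a hedged sketch under assumed conventions, the parsing step might turn an MLLM response into structured subject–attribute priors as follows; the line-oriented `subject | key=value` response format and the `SubjectDependency` container are illustrative assumptions, not the paper's actual design:

```python
from dataclasses import dataclass, field

@dataclass
class SubjectDependency:
    """One subject and the attributes bound to it (e.g., clothing, accessories, actions)."""
    subject_id: str
    attributes: dict = field(default_factory=dict)

def parse_dependency_response(response: str) -> list[SubjectDependency]:
    """Parse a line-oriented MLLM response of the assumed form
    'subject_id | key=value; key=value' into structured relational priors."""
    deps = []
    for line in response.strip().splitlines():
        subject, _, attr_part = line.partition("|")
        attrs = {}
        for pair in attr_part.split(";"):
            key, _, value = pair.partition("=")
            if key.strip():
                attrs[key.strip()] = value.strip()
        deps.append(SubjectDependency(subject.strip(), attrs))
    return deps

# Example MLLM output (format is an assumption; the report does not specify one):
raw = """person_1 | clothing=red coat; action=walking
person_2 | clothing=blue suit; accessory=glasses"""
deps = parse_dependency_response(raw)
```

Structured priors of this kind could then be consumed both for conditioning the generator and for constructing benchmark annotations, as the contribution describes.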
Relational Self-Attention and Relational Cross-Attention modules

The authors introduce two dedicated modules that integrate relational positional encodings (R2PE and CSAM) and structured attention masks (MCAM) to explicitly encode face–attribute bindings. These modules reinforce intra-group coherence, suppress cross-group interference, and ensure semantically consistent multi-subject video generation.

1 retrieved paper
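The internals of R2PE, CSAM, and MCAM are not detailed in this report. The general mechanism of a structured attention mask that permits attention within a subject–attribute group while blocking it across groups can be sketched as below; the group-assignment scheme and mask semantics are generic assumptions, not the paper's exact formulation:

```python
import numpy as np

def group_attention_mask(group_ids: list) -> np.ndarray:
    """Binary mask: token i may attend to token j only if both share a
    subject-attribute group. 1 = allowed, 0 = blocked."""
    ids = np.asarray(group_ids)
    return (ids[:, None] == ids[None, :]).astype(np.float32)

def masked_softmax(scores: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Apply the mask as a large negative additive bias before softmax,
    so blocked cross-group positions receive (near-)zero attention weight."""
    biased = np.where(mask > 0, scores, -1e9)
    e = np.exp(biased - biased.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Two subjects: tokens 0-2 form subject A's face-attribute group, tokens 3-4 subject B's.
mask = group_attention_mask([0, 0, 0, 1, 1])
attn = masked_softmax(np.zeros((5, 5)), mask)
```

Under such a mask, attention mass for each token is confined to its own group, which matches the stated goal of reinforcing intra-group coherence while suppressing cross-group interference.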
LumosX framework for personalized multi-subject video generation

The authors propose LumosX, a comprehensive framework that combines the tailored data pipeline with the relational attention modules to achieve state-of-the-art performance in fine-grained, identity-consistent, and semantically aligned personalized multi-subject video generation.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Tailored data collection pipeline with face–attribute dependencies

The authors build a data collection pipeline that extracts captions and visual conditions from independent videos and uses MLLMs to infer explicit face–attribute correspondences. This pipeline produces finer-grained relational priors that enhance personalized video customization and enable the construction of a comprehensive benchmark.

Contribution

Relational Self-Attention and Relational Cross-Attention modules

The authors introduce two dedicated modules that integrate relational positional encodings (R2PE and CSAM) and structured attention masks (MCAM) to explicitly encode face–attribute bindings. These modules reinforce intra-group coherence, suppress cross-group interference, and ensure semantically consistent multi-subject video generation.

Contribution

LumosX framework for personalized multi-subject video generation

The authors propose LumosX, a comprehensive framework that combines the tailored data pipeline with the relational attention modules to achieve state-of-the-art performance in fine-grained, identity-consistent, and semantically aligned personalized multi-subject video generation.