WithAnyone: Toward Controllable and ID Consistent Image Generation
Overview
Overall Novelty Assessment
The paper introduces WithAnyone, a diffusion model targeting identity-consistent generation with controllable variation, and contributes the MultiID-2M dataset plus a contrastive identity loss. It resides in the Identity-Controllability Trade-off Optimization leaf, which contains only two papers in the entire 50-paper taxonomy. This sparse positioning suggests that explicitly framing the identity-fidelity versus variation balance as a core optimization problem is relatively underexplored. Most prior work either prioritizes strong identity preservation (e.g., tuning-free or fine-tuning-based methods) or emphasizes attribute control (pose, garment) without systematically addressing the trade-off itself.
The taxonomy reveals neighboring directions: Identity-Preserving Text-to-Image Generation (tuning-free and fine-tuning branches with seven papers) focuses on embedding identities into diffusion models, while Controllable Human Image Synthesis (ten papers across pose, garment, and attribute control) emphasizes manipulation without explicit trade-off optimization. Disentangled Representation Learning (six papers) separates identity from attributes in latent space but does not directly tackle the copy-paste failure mode described here. WithAnyone's contrastive loss and paired-data paradigm diverge from these by explicitly penalizing over-similarity, a mechanism not prominent in sibling or adjacent categories.
Among the 30 candidates examined, none clearly refutes the three contributions. The MultiID-2M dataset (10 candidates, 0 refutable) appears novel in its multi-person paired structure tailored for identity diversity. The ID-contrastive training approach (10 candidates, 0 refutable) leverages paired data in a way not documented among the examined papers, though the limited search scope means that exhaustive surveys of datasets or loss functions were not conducted. The WithAnyone model (10 candidates, 0 refutable) integrates these elements into a unified framework; no examined work presents an equivalent combination of dataset, loss, and benchmark for the copy-paste problem.
Based on top-30 semantic matches and the sparse taxonomy leaf, the work appears to occupy a distinct niche. The explicit focus on the identity-controllability trade-off, paired-data training, and copy-paste quantification are not prominently addressed in the examined literature. However, the limited search scope and the small number of papers in the target leaf mean a broader survey could reveal additional relevant methods or datasets. The analysis covers the immediate semantic neighborhood but does not claim exhaustive coverage of all identity-consistent generation research.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors build a large-scale paired dataset called MultiID-2M specifically designed for multi-person scenarios. This dataset addresses the scarcity of paired data containing multiple images of the same individual, enabling better training for identity-consistent generation.
The authors introduce an ID-contrastive training method that helps the model preserve identity across natural variations in pose, expression, and lighting, while avoiding the copy-paste failure mode where models directly replicate reference faces.
The authors develop WithAnyone, a model that generates high-quality images with controllable attributes while maintaining identity consistency. The model addresses limitations of reconstruction-based training that lead to over-similarity and reduced controllability.
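The mechanics of an ID-contrastive objective can be sketched briefly. The paper's exact formulation is not reproduced in this summary; the following is a minimal InfoNCE-style illustration (the function name, temperature, and batch-negative scheme are assumptions) in which each generated face is pulled toward a paired reference photo of the same person and pushed away from the other identities in the batch:

```python
import torch
import torch.nn.functional as F

def id_contrastive_loss(gen_emb: torch.Tensor,
                        ref_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """Hypothetical InfoNCE-style identity-contrastive loss (an
    illustration, not the paper's formulation).

    gen_emb: (B, D) face embeddings of generated faces.
    ref_emb: (B, D) embeddings of paired reference photos, where row i
             is a different photo of the same person as row i of gen_emb.

    Each generated face is pulled toward its paired reference and pushed
    away from the other identities in the batch. Because the positive is
    a different photo of the person, matching it rewards identity
    fidelity without rewarding pixel-level copy-paste.
    """
    gen = F.normalize(gen_emb, dim=-1)
    ref = F.normalize(ref_emb, dim=-1)
    logits = gen @ ref.t() / temperature          # (B, B) scaled cosine similarities
    targets = torch.arange(gen.size(0), device=gen.device)  # positives on the diagonal
    return F.cross_entropy(logits, targets)
```

Paired data such as MultiID-2M is what makes the positive pair two distinct photos of one person under different pose, expression, or lighting, which is precisely what lets such a loss separate identity from appearance.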
Core Task Comparisons
Comparisons with papers in the same taxonomy category
Contribution Analysis
Detailed comparisons for each claimed contribution
MultiID-2M dataset for multi-person ID-consistent generation
The authors build a large-scale paired dataset called MultiID-2M specifically designed for multi-person scenarios. This dataset addresses the scarcity of paired data containing multiple images of the same individual, enabling better training for identity-consistent generation.
[29] Concat-ID: Towards Universal Identity-Preserving Video Synthesis
[38] PortraitBooth: A Versatile Portrait Model for Fast Identity-Preserved Personalization
[50] ContextGen: Contextual Layout Anchoring for Identity-Consistent Multi-Instance Generation
[51] DanceTogether! Identity-Preserving Multi-Person Interactive Video Generation
[52] TrackFormer: Multi-Object Tracking with Transformers
[53] MultiHuman-Testbench: Benchmarking Image Generation for Multiple Humans
[54] CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects
[55] Diffusion Self-Distillation for Zero-Shot Customized Image Generation
[56] ID-Patch: Robust ID Association for Group Photo Personalization
ID-contrastive training approach
The authors introduce an ID-contrastive training method that helps the model preserve identity across natural variations in pose, expression, and lighting, while avoiding the copy-paste failure mode where models directly replicate reference faces.
[11] Disentangled and Controllable Face Image Generation via 3D Imitative-Contrastive Learning
[57] Pose-Disentangled Contrastive Learning for Self-Supervised Facial Representation
[58] Sample-Cohesive Pose-Aware Contrastive Facial Representation Learning
[59] Cross-Age Contrastive Learning for Age-Invariant Face Recognition
[60] Audio-Driven Emotion-Aware 3D Talking Face Generation from Single Image
[61] Real-Time, High-Fidelity Face Identity Swapping with a Vision Foundation Model
[62] Contrastive Viewpoint-Aware Shape Learning for Long-Term Person Re-Identification
[63] Face Swapping via Reverse Contrastive Learning and Explicit Identity-Attribute Disentanglement
[64] Identity-Aware Convolutional Neural Network for Facial Expression Recognition
[65] Supervised Contrastive Learning with Identity-Label Embeddings for Facial Action Unit Recognition
WithAnyone model for controllable ID-consistent generation
The authors develop WithAnyone, a model that generates high-quality images with controllable attributes while maintaining identity consistency. The model addresses limitations of reconstruction-based training that lead to over-similarity and reduced controllability.
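The copy-paste failure mode that this contribution targets can be quantified in several ways; the paper's benchmark metric is not detailed in this summary. One simple illustrative proxy (the function name, metric, and 0.9 threshold are assumptions, not the paper's benchmark) flags generations whose face embedding is nearly identical to the embedding of the exact reference photo:

```python
import torch
import torch.nn.functional as F

def copy_paste_flag(gen_emb: torch.Tensor,
                    ref_emb: torch.Tensor,
                    paste_threshold: float = 0.9):
    """Hypothetical proxy for detecting the copy-paste failure mode
    (the cosine metric and 0.9 threshold are assumptions). Compares
    the generated face's embedding against the embedding of the exact
    reference photo: faithful re-rendering of an identity under a new
    pose, expression, or lighting typically scores below same-photo
    similarity, so values near 1 suggest the reference face was
    replicated rather than regenerated."""
    sim = F.cosine_similarity(gen_emb, ref_emb, dim=-1)
    return sim, sim > paste_threshold
```

A metric of this shape is complementary to a plain identity-similarity score: the latter should be high, while over-similarity to the specific reference photo should not be, which is exactly the trade-off the paper frames as its optimization target.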