WithAnyone: Toward Controllable and ID Consistent Image Generation

ICLR 2026 Conference Submission · Anonymous Authors
AIGC · ID-Consistent Generation
Abstract:

Identity-consistent (ID-consistent) generation has become an important focus in text-to-image research, with recent models achieving notable success in producing images aligned with a reference identity. Yet, the scarcity of large-scale paired datasets—containing multiple images of the same individual—forces most approaches to adopt reconstruction-based training. This reliance often leads to a failure mode we term copy-paste, where the model directly replicates the reference face rather than preserving identity across natural variations in pose, expression, or lighting. Such over-similarity undermines controllability and limits the expressive power of generation. To address these limitations, we (1) construct a large-scale paired dataset, MultiID-2M, tailored for multi-person scenarios, providing diverse references for each identity; (2) introduce a benchmark that quantifies both copy-paste artifacts and the trade-off between identity fidelity and variation; and (3) propose a novel training paradigm with a contrastive identity loss that leverages paired data to balance fidelity with diversity. These contributions culminate in WithAnyone, a diffusion-based model that effectively mitigates copy-paste while preserving high identity similarity. Extensive experiments—both qualitative and quantitative—demonstrate that WithAnyone substantially reduces copy-paste artifacts, improves controllability over pose and expression, and maintains strong perceptual quality. User studies further validate that our method achieves high identity fidelity while enabling expressive, controllable generation.
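The abstract does not specify how the benchmark quantifies copy-paste artifacts. As an illustration only, one plausible diagnostic compares how similar a generated face is to the exact reference image versus to other photos of the same identity: if the generation matches the reference far more closely than it matches the identity's other photos, the model is likely replicating rather than preserving identity. The function names and the assumption of pre-computed face embeddings (e.g., from an off-the-shelf recognition model) are hypothetical, not taken from the paper.

```python
# Hypothetical copy-paste diagnostic, in the spirit of the benchmark
# described above. Embeddings are assumed to come from a face
# recognition model; all names here are illustrative.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def copy_paste_score(gen_emb, ref_emb, same_id_embs):
    """Similarity to the exact reference minus mean similarity to other
    photos of the same identity. A strongly positive score suggests the
    model copied the reference face; a score near zero (or below) is
    consistent with identity preservation under natural variation."""
    sim_to_ref = cosine(gen_emb, ref_emb)
    sim_to_id = float(np.mean([cosine(gen_emb, e) for e in same_id_embs]))
    return sim_to_ref - sim_to_id
```

A generation identical to the reference scores positive; a generation that resembles the identity's other photos more than the specific reference scores at or below zero.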

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's claimed tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work; the current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces WithAnyone, a diffusion model targeting identity-consistent generation with controllable variations, and contributes the MultiID-2M dataset plus a contrastive identity loss. It resides in the Identity-Controllability Trade-off Optimization leaf, which contains only two papers in the entire 50-paper taxonomy. This sparse positioning suggests the explicit framing of the identity-fidelity versus variation balance as a core optimization problem is relatively underexplored. Most prior work either prioritizes strong identity preservation (e.g., tuning-free or fine-tuning-based methods) or emphasizes attribute control (pose, garment) without systematically addressing the trade-off itself.

The taxonomy reveals neighboring directions: Identity-Preserving Text-to-Image Generation (tuning-free and fine-tuning branches with seven papers) focuses on embedding identities into diffusion models, while Controllable Human Image Synthesis (ten papers across pose, garment, and attribute control) emphasizes manipulation without explicit trade-off optimization. Disentangled Representation Learning (six papers) separates identity from attributes in latent space but does not directly tackle the copy-paste failure mode described here. WithAnyone's contrastive loss and paired-data paradigm diverge from these by explicitly penalizing over-similarity, a mechanism not prominent in sibling or adjacent categories.

Among the 28 candidate papers examined, none clearly refutes the three contributions. The MultiID-2M dataset (9 candidates, 0 refutable) appears novel in its multi-person paired structure tailored for identity diversity. The ID-contrastive training approach (10 candidates, 0 refutable) leverages paired data in a way not documented among the examined papers, though the limited search scope means exhaustive dataset or loss-function surveys were not conducted. The WithAnyone model (9 candidates, 0 refutable) integrates these elements into a unified framework; no examined work presents an equivalent combination of dataset, loss, and benchmark for the copy-paste problem.

Based on top-30 semantic matches and the sparse taxonomy leaf, the work appears to occupy a distinct niche. The explicit focus on the identity-controllability trade-off, paired-data training, and copy-paste quantification are not prominently addressed in the examined literature. However, the limited search scope and the small number of papers in the target leaf mean a broader survey could reveal additional relevant methods or datasets. The analysis covers the immediate semantic neighborhood but does not claim exhaustive coverage of all identity-consistent generation research.

Taxonomy

- Core-task Taxonomy Papers: 49
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 28
- Refutable Papers: 0

Research Landscape Overview

Core task: Identity-consistent image generation with controllable variations. The field addresses the challenge of synthesizing images that preserve a subject's identity while allowing flexible control over pose, expression, scene composition, and other attributes. The taxonomy reveals a rich landscape organized around several complementary directions.

Identity-Preserving Text-to-Image Generation focuses on injecting reference identities into diffusion models via text prompts, often using adapter modules or embedding techniques (e.g., Blip-diffusion[6], PortraitBooth[38]). Identity-Consistent Video Generation extends these ideas to temporal sequences, ensuring frame-to-frame coherence (e.g., Animate Anyone[5], Motion-I2V[4]). Controllable Human Image Synthesis emphasizes pose and garment transfer, while Cross-Modal Identity Transfer tackles domain shifts such as photo-to-sketch or infrared modalities (Infrared Person Generation[2]). Disentangled Representation Learning seeks to separate identity from other factors in latent space (Tedigan[15]), and Spatial and Compositional Control provides fine-grained layout or multi-subject orchestration (MasaCtrl[12], LayerCraft[9]). Specialized applications target niche domains like cartoon faces or digital apparel, and Synthetic Identity Generation supports recognition tasks by creating diverse training data.

A central tension across these branches is the identity-controllability trade-off: stronger identity preservation can limit flexibility in pose, lighting, or scene composition, while aggressive control may dilute recognizable features. Recent works explore adaptive weighting schemes, frequency-domain decomposition (Frequency Decomposition[1]), and multi-stage pipelines to balance these competing goals.

WithAnyone[0] sits squarely in the Identity-Controllability Trade-off Optimization branch, addressing how to maintain fidelity across varied conditions without sacrificing creative freedom. It shares thematic concerns with ID-Booth[3], which also targets robust identity encoding under diverse prompts, and contrasts with more application-specific methods like Multi-Garments[7] or Blendshape-Guided[8] that prioritize domain constraints over general trade-off tuning. This positioning highlights an ongoing effort to develop principled mechanisms, whether through loss formulations, architectural choices, or training strategies, that let practitioners dial identity strength and controllability according to downstream needs.

Claimed Contributions

MultiID-2M dataset for multi-person ID-consistent generation

The authors build a large-scale paired dataset called MultiID-2M specifically designed for multi-person scenarios. This dataset addresses the scarcity of paired data containing multiple images of the same individual, enabling better training for identity-consistent generation.

9 retrieved papers

ID-contrastive training approach

The authors introduce an ID-contrastive training method that helps the model preserve identity across natural variations in pose, expression, and lighting, while avoiding the copy-paste failure mode where models directly replicate reference faces.
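The exact loss formulation is not reproduced in this report. As an illustrative sketch only, a generic InfoNCE-style objective captures the mechanism described above: the generated face embedding is pulled toward a different photo of the same identity (made possible by the paired MultiID-2M data) and pushed away from other identities. All names and the temperature value are hypothetical, not the paper's actual loss.

```python
# Minimal InfoNCE-style sketch of an ID-contrastive objective,
# assuming paired data: `positive` is a *different* photo of the same
# identity as the generated face, `negatives` are other identities.
# Because the positive is not the reference image itself, the loss
# rewards identity similarity without rewarding pixel-level copying.
import numpy as np

def id_contrastive_loss(gen, positive, negatives, tau=0.1):
    """Negative log-probability of the positive among all candidates,
    computed over temperature-scaled cosine similarities."""
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    sims = [cos(gen, positive)] + [cos(gen, n) for n in negatives]
    logits = np.array(sims) / tau
    # log-sum-exp over all candidates minus the positive's logit
    return float(np.log(np.exp(logits).sum()) - logits[0])
```

A generation aligned with the same-identity positive incurs a near-zero loss; one that drifts toward a negative identity incurs a large loss, which is the balance between fidelity and variation the contribution describes.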

10 retrieved papers

WithAnyone model for controllable ID-consistent generation

The authors develop WithAnyone, a model that generates high-quality images with controllable attributes while maintaining identity consistency. The model addresses limitations of reconstruction-based training that lead to over-similarity and reduced controllability.

9 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, a partial signal of novelty that remains constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: MultiID-2M dataset for multi-person ID-consistent generation (described under Claimed Contributions above; 9 candidates retrieved, none refutable)

Contribution: ID-contrastive training approach (described under Claimed Contributions above; 10 candidates retrieved, none refutable)

Contribution: WithAnyone model for controllable ID-consistent generation (described under Claimed Contributions above; 9 candidates retrieved, none refutable)
