IC-Custom: Diverse Image Customization via In-Context Learning

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: image customization, image generation, image editing, diffusion model, diffusion transformer
Abstract:

Image customization, a crucial technique for industrial media production, aims to generate content that is consistent with reference images. However, current approaches conventionally separate image customization into position-aware and position-free paradigms and lack a universal framework for diverse customization, limiting their applicability across scenarios. To overcome these limitations, we propose IC-Custom, a unified framework that seamlessly integrates position-aware and position-free image customization through in-context learning. IC-Custom concatenates reference and target images into a polyptych, leveraging DiT's multi-modal attention mechanism for fine-grained token-level interactions. We propose the In-context Multi-Modal Attention (ICMA) mechanism, which employs learnable task-oriented register tokens and boundary-aware positional embeddings to enable the model to handle diverse tasks and distinguish between inputs in polyptych configurations. To address the data gap, we curate a 12K identity-consistent dataset with 8K real-world and 4K high-quality synthetic samples, avoiding the overly glossy, oversaturated look typical of synthetic data. IC-Custom supports various industrial applications, including try-on, image insertion, and creative IP customization. Extensive evaluations on our proposed ProductBench and the publicly available DreamBench demonstrate that IC-Custom significantly outperforms community workflows, closed-source models, and state-of-the-art open-source approaches. IC-Custom achieves about 73% higher human preference across identity consistency, harmony, and text alignment metrics, while training only 0.4% of the original model parameters.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

IC-Custom proposes a unified framework for image customization that integrates position-aware and position-free paradigms through in-context learning, using a DiT-based architecture with multi-modal attention. The paper resides in the 'Unified Multi-Task Visual Generation' leaf, which contains six papers total, including the original work. This leaf sits within the broader 'In-Context Learning Frameworks for Image Customization' branch, indicating a moderately populated research direction focused on consolidating diverse visual generation tasks under shared architectures. The presence of five sibling papers suggests active exploration of unified approaches, though the area is not yet saturated.

The taxonomy reveals closely related directions in adjacent leaves: 'Identity-Preserving Human Customization' (five papers) addresses human-specific consistency, while 'Object and Scene Insertion' (two papers) handles spatial placement tasks. The 'Image Editing via In-Context Learning' branch explores instruction-driven and transformation-based editing, with methods like InstructPix2Pix and Edit Transfer offering complementary perspectives. IC-Custom's scope note emphasizes unifying diverse tasks through shared architectures, distinguishing it from task-specific models in 'Specialized Customization Tasks' or editing-only approaches. The framework's polyptych concatenation and multi-modal attention mechanism position it at the intersection of generation and editing paradigms.

Among the ten candidates examined for the dataset contribution, none clearly refute the 12K identity-consistent dataset claim, though verdicts for all ten remain non-refuting or unclear given the limited search scope. The IC-Custom framework and ICMA mechanism contributions were not examined against specific candidates in this analysis. The statistics indicate a narrow literature search (ten candidates in total), meaning the assessment captures only a small slice of potentially relevant prior work. The dataset contribution appears distinctive within this limited sample, particularly its combination of real-world and synthetic samples designed to avoid typical synthetic artifacts, though broader searches might reveal similar data curation efforts.

Based on the top-ten semantic matches examined, IC-Custom's contributions appear novel within the constrained search scope, particularly the unified framework bridging position-aware and position-free customization. However, the analysis does not cover the full landscape of multi-task visual generation or in-context learning methods, and the absence of refutable candidates for the framework and ICMA mechanism reflects limited candidate examination rather than definitive novelty. A more exhaustive search across the broader taxonomy—especially within the six-paper 'Unified Multi-Task Visual Generation' cluster—would provide stronger evidence for assessing incremental versus substantial contributions.

Taxonomy

Core-task Taxonomy Papers: 29
Claimed Contributions: 3
Contribution Candidate Papers Compared: 10
Refutable Papers: 0

Research Landscape Overview

Core task: Unified image customization via in-context learning. The field has coalesced around leveraging in-context learning—where models adapt to new tasks by conditioning on example demonstrations—to handle diverse image customization objectives within a single framework. The taxonomy reveals four main branches: In-Context Learning Frameworks for Image Customization, which encompasses unified multi-task visual generation systems like Generalist Painter[2] and Realgeneral[3] that aim to consolidate multiple editing and generation capabilities; Image Editing via In-Context Learning, focusing on instruction-driven or example-based editing methods such as InstructPix2Pix[8] and Edit Transfer[5]; Specialized Customization Tasks, addressing domain-specific challenges like matting (In-context Matting[7]) or personalized content generation (Personalized Visual Content[13]); and Cross-Domain and Federated Learning Contexts, exploring how in-context mechanisms generalize across cultural or distributed settings (Cross-Cultural Learning[21], M3T Federated[12]). These branches collectively illustrate a shift from narrowly scoped editing tools toward flexible, example-driven architectures that unify disparate customization operations.

A particularly active line of work centers on unified multi-task frameworks that balance generality with task-specific fidelity. IC-Custom[0] sits squarely within this cluster, emphasizing a holistic approach to customization by integrating multiple visual tasks under a single in-context learning paradigm. Compared to Generalist Painter[2], which pioneered multi-task visual generation, IC-Custom[0] extends the scope to a broader set of customization scenarios, while Realgeneral[3] focuses more on photorealistic synthesis quality. Nearby efforts like EditVerse[15] and VisualCloze[11] explore complementary angles—EditVerse[15] targeting compositional editing workflows and VisualCloze[11] framing customization as a visual completion problem.

The central tension across these works involves trading off architectural simplicity and task coverage: some methods prioritize a lean, shared backbone, while others incorporate task-specific modules to preserve fine-grained control. IC-Custom[0] navigates this trade-off by leveraging in-context demonstrations to guide a unified model, positioning itself as a flexible yet cohesive solution among emerging multi-task customization systems.

Claimed Contributions

IC-Custom unified framework for diverse image customization

The authors introduce IC-Custom, a framework that unifies two previously separate customization paradigms (position-aware and position-free) using in-context learning. The method concatenates reference and target images into a polyptych and leverages DiT's multi-modal attention for token-level interactions.

0 retrieved papers
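The polyptych conditioning described here reduces to a simple operation: patchified reference and target latents are joined along the sequence axis, so a single attention pass mixes tokens from both panels. A minimal NumPy sketch follows; the token counts, dimensions, and function names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def build_polyptych(ref_tokens: np.ndarray, tgt_tokens: np.ndarray) -> np.ndarray:
    """Concatenate reference and target token sequences along the
    sequence axis so one attention pass mixes them at token level.
    Shapes (seq_len, dim) are illustrative."""
    return np.concatenate([ref_tokens, tgt_tokens], axis=0)

def attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    # Plain scaled dot-product attention over the joint sequence.
    scale = 1.0 / np.sqrt(q.shape[-1])
    logits = q @ k.T * scale
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

ref = np.random.randn(16, 64)    # 16 reference-panel tokens
tgt = np.random.randn(16, 64)    # 16 target-panel tokens
seq = build_polyptych(ref, tgt)  # (32, 64) joint sequence
out = attention(seq, seq, seq)   # every target token can attend to every reference token
```

Because the joint sequence is attended to as one, every target token can draw on every reference token, which is the fine-grained token-level interaction this contribution describes.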
In-context Multi-Modal Attention (ICMA) mechanism

The authors develop ICMA, a novel attention mechanism that uses learnable task-oriented register tokens and boundary-aware positional embeddings to enable the model to handle diverse tasks and distinguish between inputs in polyptych configurations.

0 retrieved papers
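The ICMA description suggests two ingredients: learnable register tokens selected per task, and positional embeddings that separate the panels of the polyptych. The paper's implementation is not reproduced here; the sketch below is a guess at the mechanics, where `registers`, `boundary_aware_positions`, the panel offset of 1000, and the two-task setup are all assumptions:

```python
import numpy as np

NUM_TASKS, NUM_REG, DIM = 2, 4, 64  # e.g. position-aware vs. position-free (illustrative)
# Learnable task-oriented register tokens, one small set per task.
registers = np.random.randn(NUM_TASKS, NUM_REG, DIM)

def boundary_aware_positions(ref_len: int, tgt_len: int, offset: int = 1000) -> np.ndarray:
    """Shift the reference panel's positions by a large offset so the
    positional embedding distinguishes the two panels of the polyptych.
    Register tokens would get their own fixed positions (omitted here)."""
    return np.concatenate([np.arange(ref_len) + offset, np.arange(tgt_len)])

def icma_input(task_id: int, ref_tokens: np.ndarray, tgt_tokens: np.ndarray):
    # Prepend the selected task's register tokens to the joint sequence.
    seq = np.concatenate([registers[task_id], ref_tokens, tgt_tokens], axis=0)
    pos = boundary_aware_positions(len(ref_tokens), len(tgt_tokens))
    return seq, pos

ref = np.random.randn(16, DIM)
tgt = np.random.randn(16, DIM)
seq, pos = icma_input(0, ref, tgt)  # seq: (4 + 16 + 16, 64)
```

Offsetting the reference panel's positions gives the attention layers an unambiguous signal for which panel a token belongs to, while the prepended registers carry the task identity through the sequence.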
12K identity-consistent dataset with real and synthetic samples

The authors curate a new dataset containing 12K identity-consistent samples, combining 8K real-world images with 4K high-quality synthetic samples designed to avoid common artifacts of synthetic data.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: IC-Custom unified framework for diverse image customization (described above). No candidate papers were retrieved for comparison against this contribution.

Contribution 2: In-context Multi-Modal Attention (ICMA) mechanism (described above). No candidate papers were retrieved for comparison against this contribution.

Contribution 3: 12K identity-consistent dataset with real and synthetic samples (described above). Ten candidate papers were retrieved; within the limited search scope, none refute the claim.