IC-Custom: Diverse Image Customization via In-Context Learning

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: image customization, image generation, image editing, diffusion model, diffusion transformer
Abstract:

Image customization, a crucial technique for industrial media production, aims to generate content that is consistent with reference images. However, current approaches conventionally separate image customization into position-aware and position-free paradigms and lack a universal framework for diverse customization, limiting their applicability across scenarios. To overcome these limitations, we propose IC-Custom, a unified framework that seamlessly integrates position-aware and position-free image customization through in-context learning. IC-Custom concatenates reference and target images into a polyptych, leveraging DiT's multi-modal attention mechanism for fine-grained token-level interactions. We propose the In-context Multi-Modal Attention (ICMA) mechanism, which employs learnable task-oriented register tokens and boundary-aware positional embeddings to enable the model to handle diverse tasks and distinguish between inputs in polyptych configurations. To address the data gap, we curate a 12K identity-consistent dataset with 8K real-world and 4K high-quality synthetic samples, avoiding the overly glossy, oversaturated look typical of synthetic data. IC-Custom supports various industrial applications, including try-on, image insertion, and creative IP customization. Extensive evaluations on our proposed ProductBench and the publicly available DreamBench demonstrate that IC-Custom significantly outperforms community workflows, closed-source models, and state-of-the-art open-source approaches. IC-Custom achieves about 73% higher human preference across identity consistency, harmony, and text alignment metrics, while training only 0.4% of the original model parameters.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

IC-Custom proposes a unified framework for image customization that integrates position-aware and position-free paradigms through in-context learning, using a DiT-based architecture with multi-modal attention. The paper resides in the 'Unified Multi-Task Visual Generation' leaf, which contains six papers total, including the original work. This leaf sits within the broader 'In-Context Learning Frameworks for Image Customization' branch, indicating a moderately populated research direction focused on consolidating diverse visual generation tasks under shared architectures. The presence of five sibling papers suggests active exploration of unified approaches, though the area is not yet saturated.

The taxonomy reveals closely related directions in adjacent leaves: 'Identity-Preserving Human Customization' (five papers) addresses human-specific consistency, while 'Object and Scene Insertion' (two papers) handles spatial placement tasks. The 'Image Editing via In-Context Learning' branch explores instruction-driven and transformation-based editing, with methods like InstructPix2Pix and Edit Transfer offering complementary perspectives. IC-Custom's scope note emphasizes unifying diverse tasks through shared architectures, distinguishing it from task-specific models in 'Specialized Customization Tasks' or editing-only approaches. The framework's polyptych concatenation and multi-modal attention mechanism position it at the intersection of generation and editing paradigms.

Among the ten candidates examined for the dataset contribution, none clearly refute the 12K identity-consistent dataset claim, though verdicts for all ten remain non-refuting or unclear given the limited search scope. The IC-Custom framework and ICMA mechanism contributions were not examined against specific candidates in this analysis. The statistics indicate a narrow literature search (ten candidates in total), meaning the assessment captures only a small slice of potentially relevant prior work. The dataset contribution appears distinctive within this limited sample, particularly its combination of real-world and synthetic samples designed to avoid typical synthetic artifacts, though broader searches might reveal similar data curation efforts.

Based on the top-ten semantic matches examined, IC-Custom's contributions appear novel within the constrained search scope, particularly the unified framework bridging position-aware and position-free customization. However, the analysis does not cover the full landscape of multi-task visual generation or in-context learning methods, and the absence of refutable candidates for the framework and ICMA mechanism reflects limited candidate examination rather than definitive novelty. A more exhaustive search across the broader taxonomy—especially within the six-paper 'Unified Multi-Task Visual Generation' cluster—would provide stronger evidence for assessing incremental versus substantial contributions.

Taxonomy

Core-task Taxonomy Papers: 29
Claimed Contributions: 3
Contribution Candidate Papers Compared: 10
Refutable Papers: 0

Research Landscape Overview

Core task: Unified image customization via in-context learning. The field has coalesced around leveraging in-context learning—where models adapt to new tasks by conditioning on example demonstrations—to handle diverse image customization objectives within a single framework. The taxonomy reveals four main branches: In-Context Learning Frameworks for Image Customization, which encompasses unified multi-task visual generation systems like Generalist Painter[2] and Realgeneral[3] that aim to consolidate multiple editing and generation capabilities; Image Editing via In-Context Learning, focusing on instruction-driven or example-based editing methods such as InstructPix2Pix[8] and Edit Transfer[5]; Specialized Customization Tasks, addressing domain-specific challenges like matting (In-context Matting[7]) or personalized content generation (Personalized Visual Content[13]); and Cross-Domain and Federated Learning Contexts, exploring how in-context mechanisms generalize across cultural or distributed settings (Cross-Cultural Learning[21], M3T Federated[12]). These branches collectively illustrate a shift from narrowly scoped editing tools toward flexible, example-driven architectures that unify disparate customization operations.

A particularly active line of work centers on unified multi-task frameworks that balance generality with task-specific fidelity. IC-Custom[0] sits squarely within this cluster, emphasizing a holistic approach to customization by integrating multiple visual tasks under a single in-context learning paradigm. Compared to Generalist Painter[2], which pioneered multi-task visual generation, IC-Custom[0] extends the scope to a broader set of customization scenarios, while Realgeneral[3] focuses more on photorealistic synthesis quality. Nearby efforts like EditVerse[15] and VisualCloze[11] explore complementary angles—EditVerse[15] targeting compositional editing workflows and VisualCloze[11] framing customization as a visual completion problem.

The central tension across these works involves trading off architectural simplicity and task coverage: some methods prioritize a lean, shared backbone, while others incorporate task-specific modules to preserve fine-grained control. IC-Custom[0] navigates this trade-off by leveraging in-context demonstrations to guide a unified model, positioning itself as a flexible yet cohesive solution among emerging multi-task customization systems.

Claimed Contributions

IC-Custom unified framework for diverse image customization

The authors introduce IC-Custom, a framework that unifies two previously separate customization paradigms (position-aware and position-free) using in-context learning. The method concatenates reference and target images into a polyptych and leverages DiT's multi-modal attention for token-level interactions.

0 retrieved papers
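The polyptych conditioning described here reduces to a simple operation: patchified reference and target latents are joined along the sequence axis, so a single attention pass mixes tokens from both panels. A minimal NumPy sketch follows; the token counts, dimensions, and function names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def build_polyptych(ref_tokens: np.ndarray, tgt_tokens: np.ndarray) -> np.ndarray:
    """Concatenate reference and target token sequences along the
    sequence axis so one attention pass mixes them at token level.
    Shapes (seq_len, dim) are illustrative."""
    return np.concatenate([ref_tokens, tgt_tokens], axis=0)

def attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    # Plain scaled dot-product attention over the joint sequence.
    scale = 1.0 / np.sqrt(q.shape[-1])
    logits = q @ k.T * scale
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

ref = np.random.randn(16, 64)    # 16 reference-panel tokens
tgt = np.random.randn(16, 64)    # 16 target-panel tokens
seq = build_polyptych(ref, tgt)  # (32, 64) joint sequence
out = attention(seq, seq, seq)   # every target token can attend to every reference token
```

Because the joint sequence is attended to as one, every target token can draw on every reference token, which is the fine-grained token-level interaction this contribution describes.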
In-context Multi-Modal Attention (ICMA) mechanism

The authors develop ICMA, a novel attention mechanism that uses learnable task-oriented register tokens and boundary-aware positional embeddings to enable the model to handle diverse tasks and distinguish between inputs in polyptych configurations.

0 retrieved papers
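The ICMA description suggests two ingredients: learnable register tokens selected per task, and positional embeddings that separate the panels of the polyptych. The paper's implementation is not reproduced here; the sketch below is a guess at the mechanics, where `registers`, `boundary_aware_positions`, the panel offset of 1000, and the two-task setup are all assumptions:

```python
import numpy as np

NUM_TASKS, NUM_REG, DIM = 2, 4, 64  # e.g. position-aware vs. position-free (illustrative)
# Learnable task-oriented register tokens, one small set per task.
registers = np.random.randn(NUM_TASKS, NUM_REG, DIM)

def boundary_aware_positions(ref_len: int, tgt_len: int, offset: int = 1000) -> np.ndarray:
    """Shift the reference panel's positions by a large offset so the
    positional embedding distinguishes the two panels of the polyptych.
    Register tokens would get their own fixed positions (omitted here)."""
    return np.concatenate([np.arange(ref_len) + offset, np.arange(tgt_len)])

def icma_input(task_id: int, ref_tokens: np.ndarray, tgt_tokens: np.ndarray):
    # Prepend the selected task's register tokens to the joint sequence.
    seq = np.concatenate([registers[task_id], ref_tokens, tgt_tokens], axis=0)
    pos = boundary_aware_positions(len(ref_tokens), len(tgt_tokens))
    return seq, pos

ref = np.random.randn(16, DIM)
tgt = np.random.randn(16, DIM)
seq, pos = icma_input(0, ref, tgt)  # seq: (4 + 16 + 16, 64)
```

Offsetting the reference panel's positions gives the attention layers an unambiguous signal for which panel a token belongs to, while the prepended registers carry the task identity through the sequence.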
12K identity-consistent dataset with real and synthetic samples

The authors curate a new dataset containing 12K identity-consistent samples, combining 8K real-world images with 4K high-quality synthetic samples designed to avoid common artifacts of synthetic data.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: IC-Custom unified framework for diverse image customization (described above). No candidate papers were retrieved for comparison against this contribution.

Contribution 2: In-context Multi-Modal Attention (ICMA) mechanism (described above). No candidate papers were retrieved for comparison against this contribution.

Contribution 3: 12K identity-consistent dataset with real and synthetic samples (described above). Ten candidate papers were retrieved; within the limited search scope, none refute the claim.