Directional Textual Inversion for Personalized Text-to-Image Generation

ICLR 2026 Conference Submission
Anonymous Authors

Keywords: personalized generation, text-to-image models, textual inversion
Abstract:

Textual Inversion (TI) is an efficient approach to text‑to‑image personalization but often fails on complex prompts. We trace these failures to embedding norm inflation: learned tokens drift to out‑of‑distribution magnitudes, degrading prompt conditioning in pre‑norm Transformers. Empirically, we show semantics are primarily encoded by direction in CLIP token space, while inflated norms harm contextualization; theoretically, we analyze how large magnitudes attenuate positional information and hinder residual updates in pre‑norm blocks. We propose Directional Textual Inversion (DTI), which fixes the embedding magnitude to an in‑distribution scale and optimizes only direction on the unit hypersphere via Riemannian SGD. We cast direction learning as MAP with a von Mises–Fisher prior, yielding a constant‑direction prior gradient that is simple and efficient to incorporate. Across personalization tasks, DTI improves text fidelity over TI and TI‑variants while maintaining subject similarity. Crucially, DTI’s hyperspherical parameterization enables smooth, semantically coherent interpolation between learned concepts (slerp), a capability that is absent in standard TI. Our findings suggest that direction‑only optimization is a robust and scalable path for prompt‑faithful personalization.
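The slerp capability highlighted in the abstract follows directly from the hyperspherical parameterization. As a minimal NumPy sketch (not the authors' implementation; the 768-dimensional random vectors are stand-ins for learned concept directions), spherical linear interpolation between two unit-norm directions keeps every intermediate point on the unit hypersphere, which naive linear interpolation of raw TI embeddings does not:

```python
import numpy as np

def slerp(u, v, t):
    """Spherical linear interpolation between unit vectors u and v.

    Follows the great-circle arc, so every interpolate stays on the
    unit hypersphere. Falls back to normalized lerp when u and v
    are nearly parallel (the arc formula degenerates there).
    """
    dot = np.clip(np.dot(u, v), -1.0, 1.0)
    theta = np.arccos(dot)            # angle between the two directions
    if theta < 1e-6:                  # nearly identical: lerp is fine
        w = (1 - t) * u + t * v
        return w / np.linalg.norm(w)
    return (np.sin((1 - t) * theta) * u + np.sin(t * theta) * v) / np.sin(theta)

# Two stand-in concept directions (random, for illustration only):
rng = np.random.default_rng(0)
a = rng.normal(size=768); a /= np.linalg.norm(a)
b = rng.normal(size=768); b /= np.linalg.norm(b)
mid = slerp(a, b, 0.5)   # unit-norm by construction
```

In contrast, averaging two raw TI embeddings with different (possibly inflated) norms lands off the sphere at an arbitrary magnitude, which is one way to see why standard TI lacks this interpolation property.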

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes the paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate; the results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Directional Textual Inversion (DTI), which constrains embedding optimization to the unit hypersphere via Riemannian SGD and incorporates a von Mises–Fisher prior. It resides in the 'Constrained Embedding Optimization' leaf, which contains only three papers total. This leaf sits within the broader 'Textual Embedding Optimization' branch, indicating a moderately sparse research direction focused on geometric or semantic constraints during embedding learning. The small sibling set suggests this specific angle—directional constraints with hyperspherical parameterization—is relatively underexplored compared to unconstrained textual inversion methods.

The taxonomy tree shows that neighboring leaves include 'Single-Concept Textual Inversion' (three papers on unconstrained optimization) and 'Disentangled Embedding Learning' (four papers on identity-context separation). The 'Constrained Embedding Optimization' leaf explicitly excludes unconstrained methods and disentanglement-focused approaches, positioning DTI as a middle ground: it imposes geometric constraints without explicit disentanglement objectives. Nearby branches like 'Encoder-Based Personalization' and 'Model Fine-Tuning Approaches' represent alternative paradigms (feed-forward encoders vs. parameter updates), highlighting that DTI's iterative embedding refinement occupies a distinct methodological niche within the field.

Of the twelve candidate papers examined in total, ten were compared against the MAP formulation with von Mises–Fisher prior (Contribution B); two of these were judged refutable, indicating some prior work on directional priors or hyperspherical embeddings. The DTI framework itself (Contribution A) was compared against two candidates with zero refutations, suggesting the specific combination of norm fixing and Riemannian optimization may be novel. The theoretical analysis of norm inflation (Contribution C) was not tested against any candidates. The limited search scope of twelve papers means these findings reflect top semantic matches rather than exhaustive coverage, and the refutable pairs likely represent overlapping methodological components rather than complete anticipation of the full DTI approach.

Given the sparse taxonomy leaf and the limited refutation rate across contributions, DTI appears to introduce a relatively fresh angle on constrained embedding optimization. However, the presence of two refutable candidates for the von Mises–Fisher prior suggests that directional priors are not entirely unprecedented. The analysis is bounded by the top-12 semantic search scope and does not cover the full landscape of hyperspherical learning or Riemannian optimization in adjacent fields.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 12
Refutable Papers: 2

Research Landscape Overview

Core task: personalized text-to-image generation with embedding optimization. The field organizes around several complementary strategies for adapting diffusion models to user-specific concepts. Embedding Learning and Optimization Strategies focus on refining textual or latent representations without modifying model weights, encompassing methods like textual inversion (Image Worth One Word[1]) and constrained optimization approaches that guide embeddings toward desired attributes. Encoder-Based Personalization leverages dedicated networks to map reference images into embedding spaces (e.g., Blip-diffusion[7], ELITE[8]), enabling rapid adaptation. Model Fine-Tuning Approaches adjust diffusion model parameters directly, exemplified by DreamBooth[10] and its variants, trading inference speed for fidelity. Compositional and Controllable Generation addresses multi-concept scenarios and attribute disentanglement (Multi-Concept Customization[12], Attribute Disentanglement[2]), while Prompt Engineering and Refinement explores how textual guidance can be iteratively improved. Specialized Generation Tasks extend these techniques to domains like typography or video.

Within Embedding Learning, a particularly active line explores constrained optimization to balance identity preservation and editability. Directional Textual Inversion[0] sits in this cluster, emphasizing directional constraints during embedding refinement to maintain semantic coherence while personalizing concepts. Nearby works like Cross Initialization[33] propose alternative initialization strategies to accelerate convergence, and Core[25] investigates core embedding structures for robust personalization. In contrast, encoder-based methods (Taming Encoder[15], ECLIPSE[26]) prioritize zero-shot generalization by learning mappings from large datasets, sacrificing per-concept fine-grained control for speed.
The tension between optimization depth and generalization breadth remains central: embedding optimization methods like Directional Textual Inversion[0] offer precise control over individual concepts but require iterative refinement, whereas encoder approaches achieve faster adaptation at the cost of potentially reduced fidelity for out-of-distribution subjects.

Claimed Contributions

Directional Textual Inversion (DTI) framework

DTI is a novel personalization framework that decouples token embeddings into magnitude and direction components. It maintains embedding magnitude at in-distribution scale while optimizing only the directional component on the unit hypersphere using Riemannian SGD, improving text fidelity while preserving subject similarity.

2 retrieved papers
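The direction-only update described above amounts to projected gradient descent on the unit sphere. A hedged NumPy sketch, not the authors' code (the dimensionality, learning rate, and gradient are illustrative placeholders): the Euclidean gradient is projected onto the tangent space at the current direction, a step is taken, and the result is retracted back to the sphere by renormalization.

```python
import numpy as np

def riemannian_sgd_step(v, euclidean_grad, lr):
    """One Riemannian SGD step on the unit hypersphere.

    Projects the Euclidean gradient onto the tangent space at v
    (removing the radial component, which would only change the norm),
    takes a gradient step, then retracts to the sphere by renormalizing.
    The fixed in-distribution magnitude would be applied at embedding
    lookup time as scale * v (scale is a hyperparameter, not shown).
    """
    tangent_grad = euclidean_grad - np.dot(euclidean_grad, v) * v  # tangent projection
    v_new = v - lr * tangent_grad
    return v_new / np.linalg.norm(v_new)                           # retraction

# Illustrative usage with a random stand-in for the loss gradient:
rng = np.random.default_rng(0)
v = rng.normal(size=768); v /= np.linalg.norm(v)
g = rng.normal(size=768)
v = riemannian_sgd_step(v, g, lr=0.01)   # v stays unit-norm
```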
MAP formulation with von Mises–Fisher prior for direction learning

The authors formulate directional optimization as Maximum a Posteriori estimation with a von Mises–Fisher distribution as a directional prior. This yields a constant-direction prior gradient that regularizes embeddings towards semantically meaningful directions in hyperspherical latent space.

10 retrieved papers (2 refutable)
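The "constant-direction prior gradient" claim follows from the vMF log-density: log p(v; μ, κ) = κ μᵀv + const on the unit sphere, so its gradient with respect to v is the constant vector κμ, which can be added to any data-term gradient at negligible cost. A sketch under assumed values (μ, κ, and the dimensionality are hypothetical, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 768
mu = rng.normal(size=d); mu /= np.linalg.norm(mu)   # vMF mean direction (assumed)
kappa = 5.0                                          # vMF concentration (assumed)

def map_grad(v, data_grad):
    """Gradient of the negative log-posterior w.r.t. direction v.

    log vMF(v; mu, kappa) = kappa * mu.v + const, so the prior
    contributes the constant gradient -kappa * mu regardless of v.
    The combined gradient is then projected to the tangent space at v,
    ready for a Riemannian update on the sphere.
    """
    g = data_grad - kappa * mu          # data term + constant prior term
    return g - np.dot(g, v) * v         # tangent projection at v

# Illustrative call with a random stand-in for the data gradient:
v = rng.normal(size=d); v /= np.linalg.norm(v)
g = map_grad(v, rng.normal(size=d))     # g lies in the tangent space at v
```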
Theoretical and empirical analysis of embedding norm inflation

The authors provide both theoretical analysis and empirical evidence showing that excessive embedding norms in standard Textual Inversion attenuate positional information and cause residual update stagnation in pre-norm Transformers, degrading text-prompt alignment.

0 retrieved papers
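The claimed attenuation mechanism can be checked numerically. In a pre-norm residual block x + f(LayerNorm(x)), LayerNorm's output is invariant to the scale of x, so the residual update has roughly fixed magnitude while ||x|| grows, and the relative change per block shrinks. A toy demonstration (a random linear map stands in for a Transformer sublayer; it is not the paper's analysis, just an illustration of the scale-invariance argument):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Standard LayerNorm without learned affine parameters; its output
    # is (up to eps) invariant to rescaling of x.
    return (x - x.mean()) / np.sqrt(x.var() + eps)

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64)) * 0.1     # stand-in for a pre-norm sublayer f

def prenorm_block(x):
    # Pre-norm residual update: x + f(LayerNorm(x)). Because LN strips
    # the scale of x, the update's magnitude is roughly constant.
    return x + W @ layer_norm(x)

x = rng.normal(size=64)
rel = []
for scale in (1.0, 10.0, 100.0):
    xs = scale * x
    rel.append(np.linalg.norm(prenorm_block(xs) - xs) / np.linalg.norm(xs))
# rel shrinks roughly as 1/scale: an inflated-norm token is barely
# updated by the block, consistent with the stagnation claim.
```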

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Directional Textual Inversion (DTI) framework

Contribution

MAP formulation with von Mises–Fisher prior for direction learning

Contribution

Theoretical and empirical analysis of embedding norm inflation
