Directional Textual Inversion for Personalized Text-to-Image Generation

ICLR 2026 Conference Submission
Anonymous Authors

Keywords: personalized generation, text-to-image models, textual inversion
Abstract:

Textual Inversion (TI) is an efficient approach to text‑to‑image personalization but often fails on complex prompts. We trace these failures to embedding norm inflation: learned tokens drift to out‑of‑distribution magnitudes, degrading prompt conditioning in pre‑norm Transformers. Empirically, we show semantics are primarily encoded by direction in CLIP token space, while inflated norms harm contextualization; theoretically, we analyze how large magnitudes attenuate positional information and hinder residual updates in pre‑norm blocks. We propose Directional Textual Inversion (DTI), which fixes the embedding magnitude to an in‑distribution scale and optimizes only direction on the unit hypersphere via Riemannian SGD. We cast direction learning as MAP with a von Mises–Fisher prior, yielding a constant‑direction prior gradient that is simple and efficient to incorporate. Across personalization tasks, DTI improves text fidelity over TI and TI‑variants while maintaining subject similarity. Crucially, DTI’s hyperspherical parameterization enables smooth, semantically coherent interpolation between learned concepts (slerp), a capability that is absent in standard TI. Our findings suggest that direction‑only optimization is a robust and scalable path for prompt‑faithful personalization.
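The slerp capability highlighted in the abstract follows directly from the hyperspherical parameterization. As a minimal NumPy sketch (not the authors' implementation; the 768-dimensional random vectors are stand-ins for learned concept directions), spherical linear interpolation between two unit-norm directions keeps every intermediate point on the unit hypersphere, which naive linear interpolation of raw TI embeddings does not:

```python
import numpy as np

def slerp(u, v, t):
    """Spherical linear interpolation between unit vectors u and v.

    Follows the great-circle arc, so every interpolate stays on the
    unit hypersphere. Falls back to normalized lerp when u and v
    are nearly parallel (the arc formula degenerates there).
    """
    dot = np.clip(np.dot(u, v), -1.0, 1.0)
    theta = np.arccos(dot)            # angle between the two directions
    if theta < 1e-6:                  # nearly identical: lerp is fine
        w = (1 - t) * u + t * v
        return w / np.linalg.norm(w)
    return (np.sin((1 - t) * theta) * u + np.sin(t * theta) * v) / np.sin(theta)

# Two stand-in concept directions (random, for illustration only):
rng = np.random.default_rng(0)
a = rng.normal(size=768); a /= np.linalg.norm(a)
b = rng.normal(size=768); b /= np.linalg.norm(b)
mid = slerp(a, b, 0.5)   # unit-norm by construction
```

In contrast, averaging two raw TI embeddings with different (possibly inflated) norms lands off the sphere at an arbitrary magnitude, which is one way to see why standard TI lacks this interpolation property.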

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes the paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate; the results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Directional Textual Inversion (DTI), which constrains embedding optimization to the unit hypersphere via Riemannian SGD and incorporates a von Mises–Fisher prior. It resides in the 'Constrained Embedding Optimization' leaf, which contains only three papers total. This leaf sits within the broader 'Textual Embedding Optimization' branch, indicating a moderately sparse research direction focused on geometric or semantic constraints during embedding learning. The small sibling set suggests this specific angle—directional constraints with hyperspherical parameterization—is relatively underexplored compared to unconstrained textual inversion methods.

The taxonomy tree shows that neighboring leaves include 'Single-Concept Textual Inversion' (three papers on unconstrained optimization) and 'Disentangled Embedding Learning' (four papers on identity-context separation). The 'Constrained Embedding Optimization' leaf explicitly excludes unconstrained methods and disentanglement-focused approaches, positioning DTI as a middle ground: it imposes geometric constraints without explicit disentanglement objectives. Nearby branches like 'Encoder-Based Personalization' and 'Model Fine-Tuning Approaches' represent alternative paradigms (feed-forward encoders vs. parameter updates), highlighting that DTI's iterative embedding refinement occupies a distinct methodological niche within the field.

Of the twelve candidate papers examined in total, ten were compared against the MAP formulation with von Mises–Fisher prior (Contribution B); two of these were judged refutable, indicating some prior work on directional priors or hyperspherical embeddings. The DTI framework itself (Contribution A) was compared against two candidates with zero refutations, suggesting the specific combination of norm fixing and Riemannian optimization may be novel. The theoretical analysis of norm inflation (Contribution C) was not tested against any candidates. The limited search scope of twelve papers means these findings reflect top semantic matches rather than exhaustive coverage, and the refutable pairs likely represent overlapping methodological components rather than complete anticipation of the full DTI approach.

Given the sparse taxonomy leaf and the limited refutation rate across contributions, DTI appears to introduce a relatively fresh angle on constrained embedding optimization. However, the presence of two refutable candidates for the von Mises–Fisher prior suggests that directional priors are not entirely unprecedented. The analysis is bounded by the top-12 semantic search scope and does not cover the full landscape of hyperspherical learning or Riemannian optimization in adjacent fields.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 12
Refutable Papers: 2

Research Landscape Overview

Core task: personalized text-to-image generation with embedding optimization. The field organizes around several complementary strategies for adapting diffusion models to user-specific concepts. Embedding Learning and Optimization Strategies focus on refining textual or latent representations without modifying model weights, encompassing methods like textual inversion (Image Worth One Word[1]) and constrained optimization approaches that guide embeddings toward desired attributes. Encoder-Based Personalization leverages dedicated networks to map reference images into embedding spaces (e.g., Blip-diffusion[7], ELITE[8]), enabling rapid adaptation. Model Fine-Tuning Approaches adjust diffusion model parameters directly, exemplified by DreamBooth[10] and its variants, trading inference speed for fidelity. Compositional and Controllable Generation addresses multi-concept scenarios and attribute disentanglement (Multi-Concept Customization[12], Attribute Disentanglement[2]), while Prompt Engineering and Refinement explores how textual guidance can be iteratively improved. Specialized Generation Tasks extend these techniques to domains like typography or video.

Within Embedding Learning, a particularly active line explores constrained optimization to balance identity preservation and editability. Directional Textual Inversion[0] sits in this cluster, emphasizing directional constraints during embedding refinement to maintain semantic coherence while personalizing concepts. Nearby works like Cross Initialization[33] propose alternative initialization strategies to accelerate convergence, and Core[25] investigates core embedding structures for robust personalization. In contrast, encoder-based methods (Taming Encoder[15], ECLIPSE[26]) prioritize zero-shot generalization by learning mappings from large datasets, sacrificing per-concept fine-grained control for speed.
The tension between optimization depth and generalization breadth remains central: embedding optimization methods like Directional Textual Inversion[0] offer precise control over individual concepts but require iterative refinement, whereas encoder approaches achieve faster adaptation at the cost of potentially reduced fidelity for out-of-distribution subjects.

Claimed Contributions

Directional Textual Inversion (DTI) framework

DTI is a novel personalization framework that decouples token embeddings into magnitude and direction components. It maintains embedding magnitude at in-distribution scale while optimizing only the directional component on the unit hypersphere using Riemannian SGD, improving text fidelity while preserving subject similarity.

2 retrieved papers
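The direction-only update described above amounts to projected gradient descent on the unit sphere. A hedged NumPy sketch, not the authors' code (the dimensionality, learning rate, and gradient are illustrative placeholders): the Euclidean gradient is projected onto the tangent space at the current direction, a step is taken, and the result is retracted back to the sphere by renormalization.

```python
import numpy as np

def riemannian_sgd_step(v, euclidean_grad, lr):
    """One Riemannian SGD step on the unit hypersphere.

    Projects the Euclidean gradient onto the tangent space at v
    (removing the radial component, which would only change the norm),
    takes a gradient step, then retracts to the sphere by renormalizing.
    The fixed in-distribution magnitude would be applied at embedding
    lookup time as scale * v (scale is a hyperparameter, not shown).
    """
    tangent_grad = euclidean_grad - np.dot(euclidean_grad, v) * v  # tangent projection
    v_new = v - lr * tangent_grad
    return v_new / np.linalg.norm(v_new)                           # retraction

# Illustrative usage with a random stand-in for the loss gradient:
rng = np.random.default_rng(0)
v = rng.normal(size=768); v /= np.linalg.norm(v)
g = rng.normal(size=768)
v = riemannian_sgd_step(v, g, lr=0.01)   # v stays unit-norm
```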
MAP formulation with von Mises–Fisher prior for direction learning

The authors formulate directional optimization as Maximum a Posteriori estimation with a von Mises–Fisher distribution as a directional prior. This yields a constant-direction prior gradient that regularizes embeddings towards semantically meaningful directions in hyperspherical latent space.

10 retrieved papers (2 refutable)
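The "constant-direction prior gradient" claim follows from the vMF log-density: log p(v; μ, κ) = κ μᵀv + const on the unit sphere, so its gradient with respect to v is the constant vector κμ, which can be added to any data-term gradient at negligible cost. A sketch under assumed values (μ, κ, and the dimensionality are hypothetical, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 768
mu = rng.normal(size=d); mu /= np.linalg.norm(mu)   # vMF mean direction (assumed)
kappa = 5.0                                          # vMF concentration (assumed)

def map_grad(v, data_grad):
    """Gradient of the negative log-posterior w.r.t. direction v.

    log vMF(v; mu, kappa) = kappa * mu.v + const, so the prior
    contributes the constant gradient -kappa * mu regardless of v.
    The combined gradient is then projected to the tangent space at v,
    ready for a Riemannian update on the sphere.
    """
    g = data_grad - kappa * mu          # data term + constant prior term
    return g - np.dot(g, v) * v         # tangent projection at v

# Illustrative call with a random stand-in for the data gradient:
v = rng.normal(size=d); v /= np.linalg.norm(v)
g = map_grad(v, rng.normal(size=d))     # g lies in the tangent space at v
```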
Theoretical and empirical analysis of embedding norm inflation

The authors provide both theoretical analysis and empirical evidence showing that excessive embedding norms in standard Textual Inversion attenuate positional information and cause residual update stagnation in pre-norm Transformers, degrading text-prompt alignment.

0 retrieved papers
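The claimed attenuation mechanism can be checked numerically. In a pre-norm residual block x + f(LayerNorm(x)), LayerNorm's output is invariant to the scale of x, so the residual update has roughly fixed magnitude while ||x|| grows, and the relative change per block shrinks. A toy demonstration (a random linear map stands in for a Transformer sublayer; it is not the paper's analysis, just an illustration of the scale-invariance argument):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Standard LayerNorm without learned affine parameters; its output
    # is (up to eps) invariant to rescaling of x.
    return (x - x.mean()) / np.sqrt(x.var() + eps)

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64)) * 0.1     # stand-in for a pre-norm sublayer f

def prenorm_block(x):
    # Pre-norm residual update: x + f(LayerNorm(x)). Because LN strips
    # the scale of x, the update's magnitude is roughly constant.
    return x + W @ layer_norm(x)

x = rng.normal(size=64)
rel = []
for scale in (1.0, 10.0, 100.0):
    xs = scale * x
    rel.append(np.linalg.norm(prenorm_block(xs) - xs) / np.linalg.norm(xs))
# rel shrinks roughly as 1/scale: an inflated-norm token is barely
# updated by the block, consistent with the stagnation claim.
```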

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Directional Textual Inversion (DTI) framework

Contribution

MAP formulation with von Mises–Fisher prior for direction learning

Contribution

Theoretical and empirical analysis of embedding norm inflation
