OmniText: A Training-Free Generalist for Controllable Text-Image Manipulation

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: diffusion model, text image manipulation, scene text editing
Abstract:

Recent advancements in diffusion-based text synthesis have demonstrated strong performance in inserting and editing text within images via inpainting. However, despite the potential of text inpainting methods, three key limitations hinder their applicability to broader Text Image Manipulation (TIM) tasks: (i) the inability to remove text, (ii) the lack of control over the style of rendered text, and (iii) a tendency to generate duplicated letters. To address these challenges, we propose OmniText, a training-free generalist capable of performing a wide range of TIM tasks. Specifically, we investigate two key properties of cross- and self-attention mechanisms to enable text removal and to provide control over both text styles and content. Our findings reveal that text removal can be achieved by applying self-attention inversion, which mitigates the model's tendency to focus on surrounding text, thus reducing text hallucinations. Additionally, we redistribute cross-attention: increasing the probability of certain text tokens further reduces text hallucination. For controllable inpainting, we introduce novel loss functions in a latent optimization framework: a cross-attention content loss to improve text rendering accuracy and a self-attention style loss to facilitate style customization. Furthermore, we present OmniText-Bench, a benchmark dataset for evaluating diverse TIM tasks. It includes input images, target text with masks, and style references, covering diverse applications such as text removal, rescaling, repositioning, and insertion and editing with various styles. Our OmniText framework is the first generalist method capable of performing diverse TIM tasks. It achieves state-of-the-art performance across multiple tasks and metrics compared to other text inpainting methods and is comparable to specialist methods.
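The abstract's two attention interventions can be illustrated with a minimal sketch on plain attention maps. All names here (`invert_self_attention`, `redistribute_cross_attention`, the mask semantics, and the `boost` factor) are hypothetical illustrations for clarity, not the paper's actual implementation.

```python
import numpy as np

def invert_self_attention(attn, region_mask, eps=1e-8):
    """Invert self-attention weights over a masked (text) region so that
    queries attend *away* from surrounding text, then renormalize rows.

    attn: (num_queries, num_keys) with rows summing to 1.
    region_mask: boolean (num_keys,) marking the text region's keys.
    """
    inverted = attn.copy()
    # For keys inside the region, replace each weight w with (1 - w).
    inverted[:, region_mask] = 1.0 - inverted[:, region_mask]
    return inverted / (inverted.sum(axis=-1, keepdims=True) + eps)

def redistribute_cross_attention(attn, token_ids, boost=2.0, eps=1e-8):
    """Upweight selected text tokens in a cross-attention map and
    renormalize, raising their probability mass relative to other tokens."""
    boosted = attn.copy()
    boosted[:, token_ids] *= boost
    return boosted / (boosted.sum(axis=-1, keepdims=True) + eps)
```

Both functions return valid attention distributions (rows still sum to 1), which is what lets such edits be dropped into a frozen diffusion model's attention layers without retraining.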

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

OmniText proposes a training-free generalist for text image manipulation, addressing text removal, style control, and letter duplication through attention mechanism interventions. The paper resides in the 'Text Rendering and Typography Control' leaf, which contains only two papers total (including OmniText itself). This represents a relatively sparse research direction within the broader taxonomy of fifty papers across thirty-six topics, suggesting that controllable text rendering remains an underexplored niche compared to general text-to-image synthesis or multimodal editing frameworks.

The taxonomy reveals that OmniText's parent branch, 'Specialized Text-Image Manipulation Tasks', sits alongside more populated areas such as 'Text-Conditioned Image Generation and Editing' (which includes diffusion-based synthesis and latent manipulation methods) and 'Multimodal-Conditioned Generation and Editing' (covering text-visual joint conditioning and domain-specific applications). While neighboring leaves address image quality enhancement and user-specified content generation, OmniText's focus on typography control and text removal diverges from these directions by targeting fine-grained textual fidelity rather than global aesthetic refinement or semantic editing.

Among the twenty-nine candidates examined, the contribution-level analysis shows varied novelty signals. For the core OmniText framework (Contribution A), ten candidates were examined with zero refutations, suggesting limited direct overlap in training-free text manipulation approaches. The attention mechanism techniques (Contribution B) were likewise compared against nine candidates without refutation. For the OmniText-Bench benchmark (Contribution C), however, one of the ten examined candidates was a refutable match, indicating that, within the limited search scope, at least one prior benchmark addresses overlapping evaluation needs for text image manipulation tasks.

Based on the top-twenty-nine semantic matches examined, OmniText appears to occupy a relatively novel position within its immediate research area, particularly regarding training-free attention-based text removal and style control. The sparse population of its taxonomy leaf and the limited refutations across most contributions suggest that the work addresses gaps not extensively covered by the examined prior art. However, the analysis does not claim exhaustive coverage of all relevant literature, and the single refutation for the benchmark component indicates that evaluation infrastructure for text manipulation tasks has received some prior attention.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 1

Research Landscape Overview

Core task: Controllable text-image manipulation. The field encompasses a broad spectrum of methods for generating and editing images under textual or multimodal guidance. At the highest level, the taxonomy divides into six major branches: Text-Conditioned Image Generation and Editing focuses on purely language-driven synthesis and modification; Multimodal-Conditioned Generation and Editing extends control by incorporating additional modalities such as sketches, layouts, or reference images; Visual-Guided Manipulation and Retrieval emphasizes retrieval-based or vision-centric approaches; Specialized Text-Image Manipulation Tasks targets domain-specific challenges like text rendering, typography, and scene composition; Data Augmentation and Representation Learning explores how controllable generation can improve downstream tasks; and Evaluation, Benchmarking, and Supporting Infrastructure provides the metrics and datasets necessary for rigorous assessment. Representative works such as Visual Autoregressive Modeling[3] and Video Diffusion Models[6] illustrate the diversity of generative paradigms, while methods like Conditional Prompt Learning[2] and Extended Textual Conditioning[7] highlight advances in conditioning mechanisms.

Within this landscape, a particularly active line of work addresses the challenge of precise text rendering and layout control in generated images. OmniText[0] sits squarely in the Specialized Text-Image Manipulation Tasks branch, specifically under Text Rendering and Typography Control, where it tackles the notoriously difficult problem of embedding legible, stylistically coherent text into visual scenes. This contrasts with broader generation frameworks that prioritize photorealism or semantic fidelity but often struggle with fine-grained typographic details.
A closely related effort, DensityLayout[26], also explores layout-aware generation, yet OmniText[0] distinguishes itself by emphasizing end-to-end controllability over both textual content and visual appearance. The central tension in this subfield revolves around balancing expressive flexibility with rendering accuracy, a challenge that remains open as models scale and as applications demand higher fidelity in domains such as advertising, document synthesis, and multimodal content creation.

Claimed Contributions

OmniText: a training-free generalist for text image manipulation

The authors introduce OmniText, a training-free method that can perform diverse text image manipulation tasks including text removal, insertion, editing, and style control, addressing limitations of existing text inpainting methods.

10 retrieved papers
Attention mechanism techniques for text removal and controllable inpainting

The authors propose using self-attention inversion for text removal to reduce text hallucinations, redistributing cross-attention to improve text rendering, and introducing novel loss functions (cross-attention content loss and self-attention style loss) for controllable text inpainting.

9 retrieved papers
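The two losses claimed above can be sketched as plain array computations, assuming precomputed attention maps; the function names, mask semantics, and reduction choices are hypothetical and stand in for whatever the paper actually optimizes.

```python
import numpy as np

def cross_attention_content_loss(cross_attn, target_token_ids, text_mask, eps=1e-8):
    """Encourage pixels inside the text mask to attend to the target glyph
    tokens: negative log of their aggregate cross-attention mass.

    cross_attn: (num_pixels, num_tokens) with rows summing to 1.
    text_mask: boolean (num_pixels,) marking the region to be rendered.
    """
    mass = cross_attn[text_mask][:, target_token_ids].sum(axis=-1)
    return float(-np.log(mass + eps).mean())

def self_attention_style_loss(self_attn, ref_self_attn, text_mask):
    """Match self-attention statistics inside the mask to those of a style
    reference: mean squared error over the masked rows."""
    diff = self_attn[text_mask] - ref_self_attn[text_mask]
    return float((diff ** 2).mean())
```

In a latent-optimization loop, a weighted sum of the two losses would be backpropagated to the latent at selected denoising steps; they are shown here as forward-only numpy computations purely to make the quantities concrete.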
OmniText-Bench: a benchmark dataset for text image manipulation

The authors create OmniText-Bench, a new benchmark dataset designed to evaluate various text image manipulation tasks, including input images, target text with masks, and style specifications.

10 retrieved papers (1 can refute)

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
