OmniText: A Training-Free Generalist for Controllable Text-Image Manipulation

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: diffusion model, text image manipulation, scene text editing
Abstract:

Recent advancements in diffusion-based text synthesis have demonstrated strong performance in inserting and editing text within images via inpainting. However, despite the potential of text inpainting methods, three key limitations hinder their applicability to broader Text Image Manipulation (TIM) tasks: (i) the inability to remove text, (ii) the lack of control over the style of rendered text, and (iii) a tendency to generate duplicated letters. To address these challenges, we propose OmniText, a training-free generalist capable of performing a wide range of TIM tasks. Specifically, we investigate two key properties of cross- and self-attention mechanisms to enable text removal and to provide control over both text styles and content. Our findings reveal that text removal can be achieved by applying self-attention inversion, which mitigates the model's tendency to focus on surrounding text, thus reducing text hallucinations. Additionally, we redistribute cross-attention: increasing the probability of certain text tokens further reduces text hallucination. For controllable inpainting, we introduce novel loss functions in a latent optimization framework: a cross-attention content loss to improve text rendering accuracy and a self-attention style loss to facilitate style customization. Furthermore, we present OmniText-Bench, a benchmark dataset for evaluating diverse TIM tasks. It includes input images, target text with masks, and style references, covering diverse applications such as text removal, rescaling, repositioning, and insertion and editing with various styles. Our OmniText framework is the first generalist method capable of performing diverse TIM tasks. It achieves state-of-the-art performance across multiple tasks and metrics compared to other text inpainting methods and is comparable to specialist methods.
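The abstract's two attention interventions can be illustrated with a minimal sketch on plain attention maps. All names here (`invert_self_attention`, `redistribute_cross_attention`, the mask semantics, and the `boost` factor) are hypothetical illustrations for clarity, not the paper's actual implementation.

```python
import numpy as np

def invert_self_attention(attn, region_mask, eps=1e-8):
    """Invert self-attention weights over a masked (text) region so that
    queries attend *away* from surrounding text, then renormalize rows.

    attn: (num_queries, num_keys) with rows summing to 1.
    region_mask: boolean (num_keys,) marking the text region's keys.
    """
    inverted = attn.copy()
    # For keys inside the region, replace each weight w with (1 - w).
    inverted[:, region_mask] = 1.0 - inverted[:, region_mask]
    return inverted / (inverted.sum(axis=-1, keepdims=True) + eps)

def redistribute_cross_attention(attn, token_ids, boost=2.0, eps=1e-8):
    """Upweight selected text tokens in a cross-attention map and
    renormalize, raising their probability mass relative to other tokens."""
    boosted = attn.copy()
    boosted[:, token_ids] *= boost
    return boosted / (boosted.sum(axis=-1, keepdims=True) + eps)
```

Both functions return valid attention distributions (rows still sum to 1), which is what lets such edits be dropped into a frozen diffusion model's attention layers without retraining.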

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

OmniText proposes a training-free generalist for text image manipulation, addressing text removal, style control, and letter duplication through attention mechanism interventions. The paper resides in the 'Text Rendering and Typography Control' leaf, which contains only two papers total (including OmniText itself). This represents a relatively sparse research direction within the broader taxonomy of fifty papers across thirty-six topics, suggesting that controllable text rendering remains an underexplored niche compared to general text-to-image synthesis or multimodal editing frameworks.

The taxonomy reveals that OmniText's parent branch, 'Specialized Text-Image Manipulation Tasks', sits alongside more populated areas such as 'Text-Conditioned Image Generation and Editing' (which includes diffusion-based synthesis and latent manipulation methods) and 'Multimodal-Conditioned Generation and Editing' (covering text-visual joint conditioning and domain-specific applications). While neighboring leaves address image quality enhancement and user-specified content generation, OmniText's focus on typography control and text removal diverges from these directions by targeting fine-grained textual fidelity rather than global aesthetic refinement or semantic editing.

Among the twenty-nine candidates examined, the contribution-level analysis shows varied novelty signals. For the core OmniText framework (Contribution A), ten candidates were examined with zero refutations, suggesting limited direct overlap in training-free text manipulation approaches. The attention mechanism techniques (Contribution B) were likewise compared against nine candidates without refutation. For the OmniText-Bench benchmark (Contribution C), however, one of the ten examined candidates was a refutable match, indicating that, within the limited search scope, at least one prior benchmark addresses overlapping evaluation needs for text image manipulation tasks.

Based on the top-twenty-nine semantic matches examined, OmniText appears to occupy a relatively novel position within its immediate research area, particularly regarding training-free attention-based text removal and style control. The sparse population of its taxonomy leaf and the limited refutations across most contributions suggest that the work addresses gaps not extensively covered by the examined prior art. However, the analysis does not claim exhaustive coverage of all relevant literature, and the single refutation for the benchmark component indicates that evaluation infrastructure for text manipulation tasks has received some prior attention.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 1

Research Landscape Overview

Core task: Controllable text-image manipulation. The field encompasses a broad spectrum of methods for generating and editing images under textual or multimodal guidance. At the highest level, the taxonomy divides into six major branches: Text-Conditioned Image Generation and Editing focuses on purely language-driven synthesis and modification; Multimodal-Conditioned Generation and Editing extends control by incorporating additional modalities such as sketches, layouts, or reference images; Visual-Guided Manipulation and Retrieval emphasizes retrieval-based or vision-centric approaches; Specialized Text-Image Manipulation Tasks targets domain-specific challenges like text rendering, typography, and scene composition; Data Augmentation and Representation Learning explores how controllable generation can improve downstream tasks; and Evaluation, Benchmarking, and Supporting Infrastructure provides the metrics and datasets necessary for rigorous assessment. Representative works such as Visual Autoregressive Modeling[3] and Video Diffusion Models[6] illustrate the diversity of generative paradigms, while methods like Conditional Prompt Learning[2] and Extended Textual Conditioning[7] highlight advances in conditioning mechanisms.

Within this landscape, a particularly active line of work addresses the challenge of precise text rendering and layout control in generated images. OmniText[0] sits squarely in the Specialized Text-Image Manipulation Tasks branch, specifically under Text Rendering and Typography Control, where it tackles the notoriously difficult problem of embedding legible, stylistically coherent text into visual scenes. This contrasts with broader generation frameworks that prioritize photorealism or semantic fidelity but often struggle with fine-grained typographic details.
A closely related effort, DensityLayout[26], also explores layout-aware generation, yet OmniText[0] distinguishes itself by emphasizing end-to-end controllability over both textual content and visual appearance. The central tension in this subfield revolves around balancing expressive flexibility with rendering accuracy, a challenge that remains open as models scale and as applications demand higher fidelity in domains such as advertising, document synthesis, and multimodal content creation.

Claimed Contributions

OmniText: a training-free generalist for text image manipulation

The authors introduce OmniText, a training-free method that can perform diverse text image manipulation tasks including text removal, insertion, editing, and style control, addressing limitations of existing text inpainting methods.

10 retrieved papers
Attention mechanism techniques for text removal and controllable inpainting

The authors propose using self-attention inversion for text removal to reduce text hallucinations, redistributing cross-attention to improve text rendering, and introducing novel loss functions (cross-attention content loss and self-attention style loss) for controllable text inpainting.

9 retrieved papers
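The two losses claimed above can be sketched as plain array computations, assuming precomputed attention maps; the function names, mask semantics, and reduction choices are hypothetical and stand in for whatever the paper actually optimizes.

```python
import numpy as np

def cross_attention_content_loss(cross_attn, target_token_ids, text_mask, eps=1e-8):
    """Encourage pixels inside the text mask to attend to the target glyph
    tokens: negative log of their aggregate cross-attention mass.

    cross_attn: (num_pixels, num_tokens) with rows summing to 1.
    text_mask: boolean (num_pixels,) marking the region to be rendered.
    """
    mass = cross_attn[text_mask][:, target_token_ids].sum(axis=-1)
    return float(-np.log(mass + eps).mean())

def self_attention_style_loss(self_attn, ref_self_attn, text_mask):
    """Match self-attention statistics inside the mask to those of a style
    reference: mean squared error over the masked rows."""
    diff = self_attn[text_mask] - ref_self_attn[text_mask]
    return float((diff ** 2).mean())
```

In a latent-optimization loop, a weighted sum of the two losses would be backpropagated to the latent at selected denoising steps; they are shown here as forward-only numpy computations purely to make the quantities concrete.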
OmniText-Bench: a benchmark dataset for text image manipulation

The authors create OmniText-Bench, a new benchmark dataset designed to evaluate various text image manipulation tasks, including input images, target text with masks, and style specifications.

10 retrieved papers (1 can refute)

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
