Omni-IML: Towards Unified Interpretable Image Manipulation Localization

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: document analysis, tampered text detection, vision foundation model
Abstract:

Existing Image Manipulation Localization (IML) methods rely heavily on task-specific designs, so they perform well only on their target IML task, while joint training on multiple IML tasks causes significant performance degradation, hindering real-world applications. To address this, we propose Omni-IML, the first generalist model designed to unify IML across diverse tasks. Specifically, Omni-IML achieves generalization through three key components: (1) a Modal Gate Encoder, which adaptively selects the optimal encoding modality for each sample; (2) a Dynamic Weight Decoder, which dynamically adapts the decoder filters to the task at hand; and (3) an Anomaly Enhancement module, which leverages box supervision to highlight tampered regions and facilitate the learning of task-agnostic features. Beyond localization, to support interpretation of tampered images, we construct Omni-273k, a large, high-quality dataset of natural language descriptions of tampering artifacts, annotated with our automatic chain-of-thoughts annotation technique. We also design a simple yet effective interpretation module to better exploit these descriptive annotations. Extensive experiments show that a single Omni-IML model achieves state-of-the-art performance on all four major IML tasks, providing a valuable solution for practical deployment and a promising direction for generalist models in image forensics. We will release our code and dataset.
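The abstract describes the Modal Gate Encoder only at a high level ("adaptively selects the optimal encoding modality per sample"). As a rough illustration of what per-sample modality gating could look like, the sketch below softly weights two hypothetical encoder streams; the two-stream (RGB plus noise-residual) setup, the names, and the shapes are assumptions for illustration, not the authors' implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def modal_gate(rgb_feat, noise_feat, w_rgb, w_noise):
    """Soft per-sample selection between two hypothetical encoder streams.

    rgb_feat / noise_feat: pooled feature vectors from an RGB stream and a
    noise-residual stream (the two-stream setup is an assumption).
    w_rgb / w_noise: learned vectors scoring each modality for this sample.
    """
    logits = [
        sum(w * f for w, f in zip(w_rgb, rgb_feat)),
        sum(w * f for w, f in zip(w_noise, noise_feat)),
    ]
    gate = softmax(logits)  # modality weights, sum to 1
    fused = [gate[0] * r + gate[1] * n for r, n in zip(rgb_feat, noise_feat)]
    return fused, gate

# A gate strongly favoring the noise stream passes it through almost unchanged.
fused, gate = modal_gate([1.0, 2.0], [3.0, 4.0], [-5.0, -5.0], [5.0, 5.0])
```

In a real model the gate scores would come from a small learned sub-network over encoder activations rather than fixed weight vectors; only the soft-selection mechanics are shown here.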

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Omni-IML, a generalist model for unified image manipulation localization across diverse tasks (splicing, copy-move, inpainting, removal). It resides in the Cross-Task Generalist Models leaf, which contains only three papers total, including this work and two siblings. This sparse population suggests the research direction—building truly unified architectures that handle multiple IML tasks without task-specific heads—remains relatively underexplored compared to the broader field of fifty papers spanning specialized detectors, multi-modal fusion, and domain-specific methods.

The taxonomy tree reveals that Omni-IML's parent branch, Unified and Multi-Task Frameworks, also includes Multi-Type Forgery Detection (three papers targeting integrated forgery frameworks) and Multi-Modal and Multi-Stream Architectures (five papers fusing multiple input modalities). Neighboring branches focus on Feature Representation and Extraction (twenty papers across transformer-based, multi-scale, noise-domain, and convolutional approaches) and Attention and Refinement Mechanisms (five papers on attention-guided localization and boundary enhancement). The Cross-Task Generalist Models leaf explicitly excludes task-specific or single-modality approaches, positioning Omni-IML as a departure from the more crowded specialized detection categories.

Among the thirty candidates examined (ten per contribution), the first contribution, which claims Omni-IML as the first generalist model for unified interpretable IML, has one refutable candidate among its ten, indicating that at least one prior work may overlap with the generalist framing. For the second contribution, the novel modules (Modal Gate Encoder, Dynamic Weight Decoder, Anomaly Enhancement), none of the ten candidates were refuting, suggesting these architectural components are more distinctive within the limited search scope. The third contribution, the chain-of-thoughts annotation technique and the Omni-273k dataset, likewise drew no refutations from its ten candidates, implying the dataset construction and annotation methodology are less directly anticipated by prior work.

Based on the limited top-thirty semantic search, the architectural modules and dataset contributions appear more novel than the overarching generalist claim, which faces at least one overlapping prior work. The sparse Cross-Task Generalist Models leaf (three papers) and the broader taxonomy structure (fifty papers, thirty-six topics) suggest the field is still consolidating unified approaches, though the analysis does not cover exhaustive citation networks or domain-specific venues beyond the examined candidates.

Taxonomy

50 Core-task Taxonomy Papers
3 Claimed Contributions
30 Contribution Candidate Papers Compared
1 Refutable Paper

Research Landscape Overview

Core task: unified image manipulation localization across diverse tasks. The field has evolved from specialized detectors targeting single forgery types toward comprehensive frameworks that handle multiple manipulation categories simultaneously. The taxonomy reflects this progression through six main branches: Unified and Multi-Task Frameworks addresses cross-task generalization and joint learning strategies; Feature Representation and Extraction explores multi-scale, frequency-domain, and hierarchical encodings; Attention and Refinement Mechanisms develops spatial and channel-wise modules to highlight suspicious regions; Domain-Specific and Application-Oriented Methods tailors solutions to face manipulation, text tampering, or social-media contexts; Robustness and Post-Processing Resilience investigates defenses against compression and adversarial perturbations; and Benchmarking and Comprehensive Analysis establishes evaluation protocols and surveys the landscape.

Representative works such as TruFor[4] and ManTra-Net[6] illustrate early multi-clue integration, while recent efforts like ObjectFormer[12] and UnionFormer[15] demonstrate transformer-based unification. A central tension emerges between task-specific depth and cross-task breadth: some lines prioritize robustness within narrow domains (e.g., Digital Face Manipulation[1]), whereas others pursue generalist architectures that trade per-task optimality for versatility. Omni-IML[0] sits squarely within the Cross-Task Generalist Models cluster, emphasizing a unified pipeline that localizes splicing, copy-move, inpainting, and removal manipulations without task-specific heads. This design contrasts with neighbors such as Omni-IML Unified[8], which also targets multi-task scenarios but may differ in architectural choices or training strategies, and UMIF-Net[49], which explores alternative fusion mechanisms for integrating diverse forensic cues.

By consolidating multiple manipulation types under a single framework, Omni-IML[0] addresses the practical need for deployable systems that handle real-world image forgeries without prior knowledge of the specific tampering method, a direction that grows more vital as generative models proliferate and manipulation techniques diversify.

Claimed Contributions

Omni-IML: first generalist model for unified interpretable IML

The authors introduce Omni-IML as the first generalist model capable of performing image manipulation localization across multiple major tasks (natural images, documents, faces, and scene text) simultaneously, while also providing interpretable artifact descriptions in natural language.

10 retrieved papers (1 can refute)
Novel modules for unified IML modeling

The authors develop three key architectural components: a Modal Gate Encoder that adaptively selects optimal encoding modality per sample, a Dynamic Weight Decoder that adjusts decoder filters dynamically, and an Anomaly Enhancement module using box supervision to highlight tampered regions and learn task-agnostic features.
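The Dynamic Weight Decoder is likewise described only by name and a one-line summary. One common way to realize "dynamically adjusted decoder filters" is to generate filter taps from a per-sample or per-task embedding; the plain-Python sketch below illustrates that idea with hypothetical shapes, and is not the authors' implementation.

```python
def make_taps(embedding, generator):
    """Generate 1-D filter taps as a linear function of an embedding.

    generator: one weight row per tap (hypothetical; in practice this would
    be a small learned network producing convolution kernels).
    """
    return [sum(w * e for w, e in zip(row, embedding)) for row in generator]

def dynamic_conv1d(feat, taps):
    """Apply the generated taps to a 1-D feature sequence (zero padding)."""
    pad = len(taps) // 2
    padded = [0.0] * pad + feat + [0.0] * pad
    return [
        sum(t * padded[i + j] for j, t in enumerate(taps))
        for i in range(len(feat))
    ]

# With an embedding that yields identity taps [0, 1, 0], features pass through.
taps = make_taps([1.0], [[0.0], [1.0], [0.0]])
out = dynamic_conv1d([0.5, -1.0, 2.0], taps)
```

The point of such a design is that a different input (or task) produces a different embedding and therefore different decoder filters, without duplicating task-specific decoder heads.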

10 retrieved papers
Chain-of-thoughts annotation technique and Omni-273k dataset

The authors propose an automatic chain-of-thoughts annotation pipeline to generate high-quality natural language descriptions of tampered artifacts, and use it to construct Omni-273k, a large-scale dataset with artifact descriptions across natural, document, face, and scene text domains.
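The report does not detail the annotation pipeline itself. A chain-of-thoughts annotator is typically built by decomposing the task into ordered prompts fed to a vision-language model; the sketch below illustrates that pattern, with the step wording, the bbox-based conditioning, and the `vlm` callable all being illustrative assumptions rather than the paper's actual pipeline.

```python
def cot_prompts(image_id, bbox, tamper_type):
    """Staged prompts decomposing annotation into ordered reasoning steps."""
    x0, y0, x1, y1 = bbox
    return [
        f"Step 1: In image {image_id}, the region ({x0},{y0})-({x1},{y1}) is "
        f"labeled '{tamper_type}'. Describe what the region depicts.",
        "Step 2: Compare the region with its surroundings and note mismatches "
        "in lighting, texture, edges, noise, or font and stroke consistency.",
        "Step 3: Condense the observations into a short natural-language "
        "artifact description suitable as a dataset annotation.",
    ]

def annotate(image_id, bbox, tamper_type, vlm):
    """Feed the staged prompts to a caller-supplied VLM; return the final answer.

    vlm: hypothetical callable taking (prompt, prior_notes) and returning text.
    """
    notes = []
    for prompt in cot_prompts(image_id, bbox, tamper_type):
        notes.append(vlm(prompt, notes))
    return notes[-1]  # the final step yields the annotation

# Stub "VLM" for illustration: echoes the instruction it was asked to perform.
description = annotate("img_001", (10, 20, 110, 80), "splicing",
                       lambda prompt, notes: prompt.split(":", 1)[1].strip())
```

Conditioning each step on the earlier notes is what makes the pipeline "chain-of-thoughts": the final description is grounded in explicit intermediate observations rather than produced in one shot.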

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
