Omni-IML: Towards Unified Interpretable Image Manipulation Localization
Overview
Overall Novelty Assessment
The paper proposes Omni-IML, a generalist model for unified image manipulation localization across diverse tasks (splicing, copy-move, inpainting, removal). It resides in the Cross-Task Generalist Models leaf, which contains only three papers: this work and two siblings. This sparse population suggests that the research direction—building truly unified architectures that handle multiple IML tasks without task-specific heads—remains relatively underexplored compared to the broader field of fifty papers spanning specialized detectors, multi-modal fusion, and domain-specific methods.
The taxonomy tree reveals that Omni-IML's parent branch, Unified and Multi-Task Frameworks, also includes Multi-Type Forgery Detection (three papers targeting integrated forgery frameworks) and Multi-Modal and Multi-Stream Architectures (five papers fusing multiple input modalities). Neighboring branches focus on Feature Representation and Extraction (twenty papers across transformer-based, multi-scale, noise-domain, and convolutional approaches) and Attention and Refinement Mechanisms (five papers on attention-guided localization and boundary enhancement). The Cross-Task Generalist Models leaf explicitly excludes task-specific or single-modality approaches, positioning Omni-IML as a departure from the more crowded specialized detection categories.
Thirty candidate papers were examined in total, ten per contribution. For the first contribution—claiming Omni-IML as the first generalist model for unified interpretable IML—one of the ten candidates was judged refutable, indicating at least one prior work may overlap with the generalist framing. For the second contribution, the novel modules (Modal Gate Encoder, Dynamic Weight Decoder, Anomaly Enhancement), none of the ten candidates was refutable, suggesting these architectural components are more distinctive within the limited search scope. The third contribution, the chain-of-thought annotation technique and Omni-273k dataset, likewise drew no refutations among its ten candidates, implying the dataset construction and annotation methodology is less directly anticipated by prior work.
Based on the limited top-thirty semantic search, the architectural modules and dataset contributions appear more novel than the overarching generalist claim, which faces at least one overlapping prior work. The sparse Cross-Task Generalist Models leaf (three papers) and the broader taxonomy structure (fifty papers, thirty-six topics) suggest the field is still consolidating unified approaches, though the analysis does not cover exhaustive citation networks or domain-specific venues beyond the examined candidates.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce Omni-IML as the first generalist model capable of performing image manipulation localization across multiple major tasks (natural images, documents, faces, and scene text) simultaneously, while also providing interpretable artifact descriptions in natural language.
The authors develop three key architectural components: a Modal Gate Encoder that adaptively selects the optimal encoding modality for each sample, a Dynamic Weight Decoder that adapts its filter weights to each input, and an Anomaly Enhancement module that uses box supervision to highlight tampered regions and learn task-agnostic features.
The authors propose an automatic chain-of-thought annotation pipeline to generate high-quality natural language descriptions of tampered artifacts, and use it to construct Omni-273k, a large-scale dataset with artifact descriptions across natural, document, face, and scene text domains.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[8] Omni-IML: Towards Unified Image Manipulation Localization
[49] UMIF-Net: unified framework for multi-type image forgery detection and localization
Contribution Analysis
Detailed comparisons for each claimed contribution
Omni-IML: first generalist model for unified interpretable IML
The authors introduce Omni-IML as the first generalist model capable of performing image manipulation localization across multiple major tasks (natural images, documents, faces, and scene text) simultaneously, while also providing interpretable artifact descriptions in natural language.
[8] Omni-IML: Towards Unified Image Manipulation Localization
[4] Trufor: Leveraging all-round clues for trustworthy image forgery detection and localization
[15] UnionFormer: Unified-Learning Transformer with Multi-View Representation for Image Manipulation Detection and Localization
[60] Emerging properties in unified multimodal pretraining
[61] DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation
[62] Towards Universal Fake Image Detectors that Generalize Across Generative Models
[63] GCA-Net: Utilizing gated context attention for improving image forgery localization and detection
[64] TBFormer: Two-Branch Transformer for Image Forgery Localization
[65] ME: Multi-Task Edge-Enhanced for Image Forgery Localization
[66] NVMS-Net: A Novel Constrained Noise-View Multiscale Network for Detecting General Image Processing Based Manipulations
Novel modules for unified IML modeling
The authors develop three key architectural components: a Modal Gate Encoder that adaptively selects the optimal encoding modality for each sample, a Dynamic Weight Decoder that adapts its filter weights to each input, and an Anomaly Enhancement module that uses box supervision to highlight tampered regions and learn task-agnostic features.
[9] Hierarchical Fine-Grained Image Forgery Detection and Localization
[67] Rethinking Image Forgery Detection and Localization via Regression Perspective
[68] Dual-stream enhancement encoder and attention optimization decoder for image manipulation localization
[69] Adaptive representation disentanglement network for change captioning
[70] Deep learning-based forgery detection and localization for compressed images using a hybrid optimization model
[71] MSHRT-Net: Multi-Scale Hierarchical Residual Transfer Network for Image Manipulation Detection and Localization
[72] Encoder-decoder based convolutional neural networks for image forgery detection
[73] Image forgery classification and localization through vision transformers
[74] Multi-scale and deeply supervised network for image splicing localization
[75] TALIU: A Novel Decoder and Augmentation Strategy for Boosting Tampered Document Image Detection
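The Modal Gate Encoder is described here only at a high level. As a rough, hypothetical illustration of per-sample modality gating—not the authors' implementation—the sketch below scores each encoding modality's pooled features and hard-selects one per sample; all names (`modal_gate`, `gate_weights`) are assumptions for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def modal_gate(features_by_modality, gate_weights):
    """Toy per-sample modality gate.

    features_by_modality: modality name -> feature vector (list of floats)
    gate_weights: modality name -> learned scalar weighting that modality's score

    Pools each modality's features by their mean, weights the result, converts
    the scores into a gate distribution, and returns the argmax modality
    (hard gating) together with the distribution.
    """
    names = list(features_by_modality)
    scores = [
        gate_weights[n] * sum(features_by_modality[n]) / len(features_by_modality[n])
        for n in names
    ]
    probs = softmax(scores)
    best = names[max(range(len(names)), key=lambda i: probs[i])]
    return best, dict(zip(names, probs))

# Example: a sample whose noise-domain features dominate selects the noise branch.
best, probs = modal_gate(
    {"rgb": [0.2, 0.4, 0.6], "noise": [0.9, 0.8, 1.0]},
    {"rgb": 1.0, "noise": 1.0},
)
```

A soft variant would instead return the probability-weighted sum of modality features, which keeps the gate differentiable during training.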
Chain-of-thought annotation technique and Omni-273k dataset
The authors propose an automatic chain-of-thought annotation pipeline to generate high-quality natural language descriptions of tampered artifacts, and use it to construct Omni-273k, a large-scale dataset with artifact descriptions across natural, document, face, and scene text domains.
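The paper's annotation pipeline is not reproduced in this report. As a hypothetical sketch of the general chain-of-thought pattern—chaining model queries so each step conditions on earlier answers—the code below runs a three-step annotation loop for one tampered image. The function names and prompt wording (`annotate_artifacts`, `ask`) are illustrative assumptions, not the authors' prompts.

```python
def annotate_artifacts(image_id, mask_region, ask):
    """Toy three-step chain-of-thought annotation for one tampered image.

    ask(prompt) stands in for a vision-language model query and returns text.
    Each later prompt conditions on the answers from earlier steps.
    """
    # Step 1: ground the tampered region using the pixel-level mask.
    where = ask(f"[{image_id}] Which region does the mask {mask_region} cover?")
    # Step 2: describe the visible artifacts inside that region.
    what = ask(f"[{image_id}] What tampering artifacts appear in: {where}?")
    # Step 3: compose the final natural-language description from both steps.
    description = ask(f"[{image_id}] Summarize: {where}; {what}")
    return {"image_id": image_id, "steps": [where, what], "description": description}

# Usage with a stub model, recording the prompts issued in order:
calls = []
def stub_model(prompt):
    calls.append(prompt)
    return f"answer-{len(calls)}"

record = annotate_artifacts("img001", (10, 20, 50, 60), stub_model)
```

Run at scale over a dataset, each returned record would pair one image with its intermediate reasoning steps and final artifact description—the kind of structure a corpus like Omni-273k plausibly stores.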