Omni-IML: Towards Unified Interpretable Image Manipulation Localization

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: document analysis, tampered text detection, vision foundation model
Abstract:

Existing Image Manipulation Localization (IML) methods rely heavily on task-specific designs, so they perform well only on their target IML task, while joint training on multiple IML tasks causes significant performance degradation, hindering real-world applications. To address this, we propose Omni-IML, the first generalist model designed to unify IML across diverse tasks. Specifically, Omni-IML achieves generalization through three key components: (1) a Modal Gate Encoder, which adaptively selects the optimal encoding modality for each sample; (2) a Dynamic Weight Decoder, which dynamically adapts the decoder filters to the task at hand; and (3) an Anomaly Enhancement module, which leverages box supervision to highlight tampered regions and facilitate the learning of task-agnostic features. Beyond localization, to support interpretation of tampered images, we construct Omni-273k, a large, high-quality dataset of natural language descriptions of tampering artifacts, annotated with our automatic chain-of-thoughts annotation technique. We also design a simple yet effective interpretation module to better exploit these descriptive annotations. Extensive experiments show that a single Omni-IML model achieves state-of-the-art performance on all four major IML tasks, providing a valuable solution for practical deployment and a promising direction for generalist models in image forensics. We will release our code and dataset.
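The abstract describes the Modal Gate Encoder only at a high level ("adaptively selects the optimal encoding modality per sample"). As a rough illustration of what per-sample modality gating could look like, the sketch below softly weights two hypothetical encoder streams; the two-stream (RGB plus noise-residual) setup, the names, and the shapes are assumptions for illustration, not the authors' implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def modal_gate(rgb_feat, noise_feat, w_rgb, w_noise):
    """Soft per-sample selection between two hypothetical encoder streams.

    rgb_feat / noise_feat: pooled feature vectors from an RGB stream and a
    noise-residual stream (the two-stream setup is an assumption).
    w_rgb / w_noise: learned vectors scoring each modality for this sample.
    """
    logits = [
        sum(w * f for w, f in zip(w_rgb, rgb_feat)),
        sum(w * f for w, f in zip(w_noise, noise_feat)),
    ]
    gate = softmax(logits)  # modality weights, sum to 1
    fused = [gate[0] * r + gate[1] * n for r, n in zip(rgb_feat, noise_feat)]
    return fused, gate

# A gate strongly favoring the noise stream passes it through almost unchanged.
fused, gate = modal_gate([1.0, 2.0], [3.0, 4.0], [-5.0, -5.0], [5.0, 5.0])
```

In a real model the gate scores would come from a small learned sub-network over encoder activations rather than fixed weight vectors; only the soft-selection mechanics are shown here.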

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Omni-IML, a generalist model for unified image manipulation localization across diverse tasks (splicing, copy-move, inpainting, removal). It resides in the Cross-Task Generalist Models leaf, which contains only three papers total, including this work and two siblings. This sparse population suggests the research direction—building truly unified architectures that handle multiple IML tasks without task-specific heads—remains relatively underexplored compared to the broader field of fifty papers spanning specialized detectors, multi-modal fusion, and domain-specific methods.

The taxonomy tree reveals that Omni-IML's parent branch, Unified and Multi-Task Frameworks, also includes Multi-Type Forgery Detection (three papers targeting integrated forgery frameworks) and Multi-Modal and Multi-Stream Architectures (five papers fusing multiple input modalities). Neighboring branches focus on Feature Representation and Extraction (twenty papers across transformer-based, multi-scale, noise-domain, and convolutional approaches) and Attention and Refinement Mechanisms (five papers on attention-guided localization and boundary enhancement). The Cross-Task Generalist Models leaf explicitly excludes task-specific or single-modality approaches, positioning Omni-IML as a departure from the more crowded specialized detection categories.

Among the thirty candidates examined (ten per contribution), the first contribution, which claims Omni-IML as the first generalist model for unified interpretable IML, has one refutable candidate among its ten, indicating that at least one prior work may overlap with the generalist framing. For the second contribution, the novel modules (Modal Gate Encoder, Dynamic Weight Decoder, Anomaly Enhancement), none of the ten candidates were refuting, suggesting these architectural components are more distinctive within the limited search scope. The third contribution, the chain-of-thoughts annotation technique and the Omni-273k dataset, likewise drew no refutations from its ten candidates, implying the dataset construction and annotation methodology are less directly anticipated by prior work.

Based on the limited top-thirty semantic search, the architectural modules and dataset contributions appear more novel than the overarching generalist claim, which faces at least one overlapping prior work. The sparse Cross-Task Generalist Models leaf (three papers) and the broader taxonomy structure (fifty papers, thirty-six topics) suggest the field is still consolidating unified approaches, though the analysis does not cover exhaustive citation networks or domain-specific venues beyond the examined candidates.

Taxonomy

50 Core-task Taxonomy Papers
3 Claimed Contributions
30 Contribution Candidate Papers Compared
1 Refutable Paper

Research Landscape Overview

Core task: unified image manipulation localization across diverse tasks. The field has evolved from specialized detectors targeting single forgery types toward comprehensive frameworks that handle multiple manipulation categories simultaneously. The taxonomy reflects this progression through six main branches: Unified and Multi-Task Frameworks addresses cross-task generalization and joint learning strategies; Feature Representation and Extraction explores multi-scale, frequency-domain, and hierarchical encodings; Attention and Refinement Mechanisms develops spatial and channel-wise modules to highlight suspicious regions; Domain-Specific and Application-Oriented Methods tailors solutions to face manipulation, text tampering, or social-media contexts; Robustness and Post-Processing Resilience investigates defenses against compression and adversarial perturbations; and Benchmarking and Comprehensive Analysis establishes evaluation protocols and surveys the landscape.

Representative works such as TruFor[4] and ManTra-Net[6] illustrate early multi-clue integration, while recent efforts like ObjectFormer[12] and UnionFormer[15] demonstrate transformer-based unification. A central tension emerges between task-specific depth and cross-task breadth: some lines prioritize robustness within narrow domains (e.g., Digital Face Manipulation[1]), whereas others pursue generalist architectures that trade per-task optimality for versatility. Omni-IML[0] sits squarely within the Cross-Task Generalist Models cluster, emphasizing a unified pipeline that localizes splicing, copy-move, inpainting, and removal manipulations without task-specific heads. This design contrasts with neighbors such as Omni-IML Unified[8], which also targets multi-task scenarios but may differ in architectural choices or training strategies, and UMIF-Net[49], which explores alternative fusion mechanisms for integrating diverse forensic cues.

By consolidating multiple manipulation types under a single framework, Omni-IML[0] addresses the practical need for deployable systems that handle real-world image forgeries without prior knowledge of the specific tampering method, a direction that grows more vital as generative models proliferate and manipulation techniques diversify.

Claimed Contributions

Omni-IML: first generalist model for unified interpretable IML

The authors introduce Omni-IML as the first generalist model capable of performing image manipulation localization across multiple major tasks (natural images, documents, faces, and scene text) simultaneously, while also providing interpretable artifact descriptions in natural language.

10 retrieved papers (1 can refute)
Novel modules for unified IML modeling

The authors develop three key architectural components: a Modal Gate Encoder that adaptively selects optimal encoding modality per sample, a Dynamic Weight Decoder that adjusts decoder filters dynamically, and an Anomaly Enhancement module using box supervision to highlight tampered regions and learn task-agnostic features.
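The Dynamic Weight Decoder is likewise described only by name and a one-line summary. One common way to realize "dynamically adjusted decoder filters" is to generate filter taps from a per-sample or per-task embedding; the plain-Python sketch below illustrates that idea with hypothetical shapes, and is not the authors' implementation.

```python
def make_taps(embedding, generator):
    """Generate 1-D filter taps as a linear function of an embedding.

    generator: one weight row per tap (hypothetical; in practice this would
    be a small learned network producing convolution kernels).
    """
    return [sum(w * e for w, e in zip(row, embedding)) for row in generator]

def dynamic_conv1d(feat, taps):
    """Apply the generated taps to a 1-D feature sequence (zero padding)."""
    pad = len(taps) // 2
    padded = [0.0] * pad + feat + [0.0] * pad
    return [
        sum(t * padded[i + j] for j, t in enumerate(taps))
        for i in range(len(feat))
    ]

# With an embedding that yields identity taps [0, 1, 0], features pass through.
taps = make_taps([1.0], [[0.0], [1.0], [0.0]])
out = dynamic_conv1d([0.5, -1.0, 2.0], taps)
```

The point of such a design is that a different input (or task) produces a different embedding and therefore different decoder filters, without duplicating task-specific decoder heads.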

10 retrieved papers
Chain-of-thoughts annotation technique and Omni-273k dataset

The authors propose an automatic chain-of-thoughts annotation pipeline to generate high-quality natural language descriptions of tampered artifacts, and use it to construct Omni-273k, a large-scale dataset with artifact descriptions across natural, document, face, and scene text domains.
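The report does not detail the annotation pipeline itself. A chain-of-thoughts annotator is typically built by decomposing the task into ordered prompts fed to a vision-language model; the sketch below illustrates that pattern, with the step wording, the bbox-based conditioning, and the `vlm` callable all being illustrative assumptions rather than the paper's actual pipeline.

```python
def cot_prompts(image_id, bbox, tamper_type):
    """Staged prompts decomposing annotation into ordered reasoning steps."""
    x0, y0, x1, y1 = bbox
    return [
        f"Step 1: In image {image_id}, the region ({x0},{y0})-({x1},{y1}) is "
        f"labeled '{tamper_type}'. Describe what the region depicts.",
        "Step 2: Compare the region with its surroundings and note mismatches "
        "in lighting, texture, edges, noise, or font and stroke consistency.",
        "Step 3: Condense the observations into a short natural-language "
        "artifact description suitable as a dataset annotation.",
    ]

def annotate(image_id, bbox, tamper_type, vlm):
    """Feed the staged prompts to a caller-supplied VLM; return the final answer.

    vlm: hypothetical callable taking (prompt, prior_notes) and returning text.
    """
    notes = []
    for prompt in cot_prompts(image_id, bbox, tamper_type):
        notes.append(vlm(prompt, notes))
    return notes[-1]  # the final step yields the annotation

# Stub "VLM" for illustration: echoes the instruction it was asked to perform.
description = annotate("img_001", (10, 20, 110, 80), "splicing",
                       lambda prompt, notes: prompt.split(":", 1)[1].strip())
```

Conditioning each step on the earlier notes is what makes the pipeline "chain-of-thoughts": the final description is grounded in explicit intermediate observations rather than produced in one shot.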

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
