UDIS: A User-query Driven Framework for Image Forgery Localization

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: Image Forgery Localization, User-query Driven
Abstract:

The rapid advancement of image editing technologies has amplified the urgency of developing reliable Image Forgery Localization (IFL) methods. Recent approaches based on Multimodal Large Language Models (MLLMs) have shown promise but suffer from weak visual-text alignment: they fail to steer visual attention toward the specific regions mentioned in user queries, leading to irrelevant responses. We argue that this limitation originates from a global, outcome-driven paradigm that directs interpretability toward forgery localization results and spreads visual attention over the entire image. To address this issue, we propose a paradigm shift: interpretability in IFL ought to be driven by regional user queries. Building on this principle, and supported by a dataset containing queries about the authenticity of specific regions, we present the User-query Driven Image Shield (UDIS), a novel framework incorporating two key modules. The Query-Guided Module (QGM) introduces a [QUERY] token and a query-based visual feature filtering process to strengthen input-level alignment (connecting the query with the MLLM's visual attention). The Evidence-Aware Module (EAM) introduces an [EVI] token and an auxiliary authenticity evidence classification task to enhance output-level alignment (associating explanatory textual knowledge with forgery localization capability). By learning these two special tokens, the MLLM's alignment ability is enhanced, and the modality-consistent knowledge embedded in the tokens further supports the forgery localization process.
Extensive experiments demonstrate that the proposed approach provides query-focused authenticity explanations, underscoring its practical value, and achieves state-of-the-art IFL performance.
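The QGM's query-based visual feature filtering can be illustrated with a minimal toy sketch. Everything here is an assumption for illustration, not the paper's actual implementation: the function names, the cosine-similarity scoring, and the fixed top-k keep ratio are all hypothetical. The idea shown is only that patch features are scored against a query embedding and the most query-relevant patches are retained.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def query_guided_filter(patch_feats, query_emb, keep_ratio=0.5):
    """Score each visual patch against the query embedding and keep
    only the top fraction -- a toy stand-in for QGM-style filtering."""
    scores = [cosine(f, query_emb) for f in patch_feats]
    k = max(1, int(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])  # preserve the spatial order of kept patches
    return [patch_feats[i] for i in keep], keep

# Toy example: four 2-dim "patch features"; the query is aligned with patch 0.
feats = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
kept, idx = query_guided_filter(feats, [1.0, 0.0], keep_ratio=0.5)
print(idx)  # -> [0, 2]: the two patches most similar to the query
```

In a real MLLM pipeline the scoring would operate on projected embeddings and the filtering would likely be soft (attention re-weighting) rather than a hard top-k cut; the hard selection here is only for readability.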

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Paper: 0

Research Landscape Overview

Core task: image forgery localization with multimodal large language models. The field has evolved into a rich ecosystem organized around several complementary directions. At the highest level, one branch focuses on MLLM-based frameworks that directly integrate vision and language for detection and localization, while another addresses domain-specific applications such as face forgery, text manipulation, and fake news. A third branch emphasizes explainability and interpretability, ensuring that models not only identify manipulations but also provide human-understandable reasoning. Additional branches cover vision-language alignment and knowledge integration, evaluation benchmarks and datasets, generalization and robustness enhancement, and auxiliary detection technologies. Works like ForgeryGPT[23] and FakeBench[26] illustrate how these branches intersect, combining robust detection with interpretable outputs and standardized evaluation protocols.

Within the explainability branch, a particularly active line of research explores user-query driven and interactive detection, where systems respond to natural language queries about potential forgeries. UDIS[0] exemplifies this direction by enabling users to interactively probe images for manipulations through conversational interfaces. This contrasts with more automated approaches such as FakeShield[1], which emphasizes end-to-end detection without requiring user input, and Forgerysleuth[5], which balances interpretability with broader forensic reasoning. Meanwhile, works like Sida[3] and Mvtamperbench[4] push toward richer multimodal benchmarks that test both localization accuracy and the quality of explanations.

The central tension across these efforts lies in balancing automation with user control: fully automated systems offer efficiency but may lack transparency, while interactive frameworks like UDIS[0] prioritize user engagement and interpretability at the cost of requiring more human involvement. This trade-off remains an open question as the field seeks to deploy MLLMs in real-world forensic scenarios.

Claimed Contributions

User-query driven paradigm for IFL interpretability

The authors propose a new conceptual framework where interpretability in image forgery localization should be driven by regional user queries instead of global outcome-based explanations. This paradigm shift establishes a foundation for addressing weak visual-text alignment in existing methods.

10 retrieved papers
UDIS framework with QGM and EAM modules

The authors develop a novel framework called UDIS that implements the user-query driven principle through two specialized modules: QGM aligns user queries with visual attention at the input level, while EAM aligns explanatory textual knowledge with forgery localization capability at the output level.

8 retrieved papers
Dataset with region-specific queries and authenticity evidence

The authors curate a training dataset that includes both generic forensic questions and content-aware queries tailored to specific image regions, along with corresponding authenticity evidence annotations. This dataset enables training models under the user-query driven paradigm.

10 retrieved papers
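The EAM contribution pairs forgery localization with an auxiliary authenticity evidence classification task. One plausible way to combine such objectives is a weighted multi-task loss; the sketch below is entirely hypothetical (the function names, the loss weights, and the softmax cross-entropy formulation are assumptions, not the paper's actual training objective), and serves only to make the "auxiliary task" idea concrete.

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy for a single example (pure-Python toy),
    computed stably via the log-sum-exp trick."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def udis_style_loss(loc_loss, text_loss, evi_logits, evi_target,
                    w_text=1.0, w_evi=0.5):
    """Hypothetical multi-task objective: localization loss + explanation
    text loss + auxiliary evidence-classification loss (weights assumed)."""
    evi_loss = cross_entropy(evi_logits, evi_target)
    return loc_loss + w_text * text_loss + w_evi * evi_loss

# Toy values: the evidence head is fairly confident in the correct class 0,
# so the auxiliary term adds only a small penalty to the total.
total = udis_style_loss(0.8, 1.2, [2.0, 0.5, -1.0], 0)
print(round(total, 4))
```

The auxiliary term gives the [EVI] token a supervised signal tied to explanatory evidence, which is the mechanism the contribution description attributes to output-level alignment; the relative weighting between terms would be a tuning choice.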

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though the conclusion remains constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

User-query driven paradigm for IFL interpretability

The authors propose a new conceptual framework where interpretability in image forgery localization should be driven by regional user queries instead of global outcome-based explanations. This paradigm shift establishes a foundation for addressing weak visual-text alignment in existing methods.

Contribution

UDIS framework with QGM and EAM modules

The authors develop a novel framework called UDIS that implements the user-query driven principle through two specialized modules: QGM aligns user queries with visual attention at the input level, while EAM aligns explanatory textual knowledge with forgery localization capability at the output level.

Contribution

Dataset with region-specific queries and authenticity evidence

The authors curate a training dataset that includes both generic forensic questions and content-aware queries tailored to specific image regions, along with corresponding authenticity evidence annotations. This dataset enables training models under the user-query driven paradigm.