UDIS: A User-query Driven Framework for Image Forgery Localization

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: Image Forgery Localization, User-query Driven
Abstract:

The rapid advancement of image editing technologies has amplified the urgency of developing reliable Image Forgery Localization (IFL) methods. Recent approaches based on Multimodal Large Language Models (MLLMs) have shown promise but suffer from weak visual-text alignment: they fail to steer visual attention toward the specific regions mentioned in user queries, leading to irrelevant responses. We argue that this limitation originates from a global, outcome-driven paradigm that directs interpretability toward forgery localization results and spreads visual attention over the entire image. To address this issue, we propose a paradigm shift: interpretability in IFL ought to be driven by regional user queries. Building on this principle, and supported by a dataset containing queries about the authenticity of specific regions, we present the User-query Driven Image Shield (UDIS), a novel framework incorporating two key modules. The Query-Guided Module (QGM) introduces a [QUERY] token and a query-based visual feature filtering process to strengthen input-level alignment (connecting the query with the MLLM's visual attention). The Evidence-Aware Module (EAM) introduces an [EVI] token and an auxiliary authenticity evidence classification task to enhance output-level alignment (associating explanatory textual knowledge with forgery localization capability). By learning these two special tokens, the MLLM's alignment ability is enhanced, and the modality-consistent knowledge embedded in the tokens further supports the forgery localization process.
Extensive experiments demonstrate that the proposed approach provides query-focused authenticity explanations, underscoring its practical value, and achieves state-of-the-art IFL performance.
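The QGM's query-based visual feature filtering can be illustrated with a minimal toy sketch. Everything here is an assumption for illustration, not the paper's actual implementation: the function names, the cosine-similarity scoring, and the fixed top-k keep ratio are all hypothetical. The idea shown is only that patch features are scored against a query embedding and the most query-relevant patches are retained.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def query_guided_filter(patch_feats, query_emb, keep_ratio=0.5):
    """Score each visual patch against the query embedding and keep
    only the top fraction -- a toy stand-in for QGM-style filtering."""
    scores = [cosine(f, query_emb) for f in patch_feats]
    k = max(1, int(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])  # preserve the spatial order of kept patches
    return [patch_feats[i] for i in keep], keep

# Toy example: four 2-dim "patch features"; the query is aligned with patch 0.
feats = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
kept, idx = query_guided_filter(feats, [1.0, 0.0], keep_ratio=0.5)
print(idx)  # -> [0, 2]: the two patches most similar to the query
```

In a real MLLM pipeline the scoring would operate on projected embeddings and the filtering would likely be soft (attention re-weighting) rather than a hard top-k cut; the hard selection here is only for readability.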

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Paper: 0

Research Landscape Overview

Core task: image forgery localization with multimodal large language models. The field has evolved into a rich ecosystem organized around several complementary directions. At the highest level, one branch focuses on MLLM-based frameworks that directly integrate vision and language for detection and localization, while another addresses domain-specific applications such as face forgery, text manipulation, and fake news. A third branch emphasizes explainability and interpretability, ensuring that models not only identify manipulations but also provide human-understandable reasoning. Additional branches cover vision-language alignment and knowledge integration, evaluation benchmarks and datasets, generalization and robustness enhancement, and auxiliary detection technologies. Works like ForgeryGPT[23] and FakeBench[26] illustrate how these branches intersect, combining robust detection with interpretable outputs and standardized evaluation protocols.

Within the explainability branch, a particularly active line of research explores user-query driven and interactive detection, where systems respond to natural language queries about potential forgeries. UDIS[0] exemplifies this direction by enabling users to interactively probe images for manipulations through conversational interfaces. This contrasts with more automated approaches such as FakeShield[1], which emphasizes end-to-end detection without requiring user input, and Forgerysleuth[5], which balances interpretability with broader forensic reasoning. Meanwhile, works like Sida[3] and Mvtamperbench[4] push toward richer multimodal benchmarks that test both localization accuracy and the quality of explanations.

The central tension across these efforts lies in balancing automation with user control: fully automated systems offer efficiency but may lack transparency, while interactive frameworks like UDIS[0] prioritize user engagement and interpretability at the cost of requiring more human involvement. This trade-off remains an open question as the field seeks to deploy MLLMs in real-world forensic scenarios.

Claimed Contributions

User-query driven paradigm for IFL interpretability

The authors propose a new conceptual framework where interpretability in image forgery localization should be driven by regional user queries instead of global outcome-based explanations. This paradigm shift establishes a foundation for addressing weak visual-text alignment in existing methods.

10 retrieved papers
UDIS framework with QGM and EAM modules

The authors develop a novel framework called UDIS that implements the user-query driven principle through two specialized modules: QGM aligns user queries with visual attention at the input level, while EAM aligns explanatory textual knowledge with forgery localization capability at the output level.

8 retrieved papers
Dataset with region-specific queries and authenticity evidence

The authors curate a training dataset that includes both generic forensic questions and content-aware queries tailored to specific image regions, along with corresponding authenticity evidence annotations. This dataset enables training models under the user-query driven paradigm.

10 retrieved papers
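The EAM contribution pairs forgery localization with an auxiliary authenticity evidence classification task. One plausible way to combine such objectives is a weighted multi-task loss; the sketch below is entirely hypothetical (the function names, the loss weights, and the softmax cross-entropy formulation are assumptions, not the paper's actual training objective), and serves only to make the "auxiliary task" idea concrete.

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy for a single example (pure-Python toy),
    computed stably via the log-sum-exp trick."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def udis_style_loss(loc_loss, text_loss, evi_logits, evi_target,
                    w_text=1.0, w_evi=0.5):
    """Hypothetical multi-task objective: localization loss + explanation
    text loss + auxiliary evidence-classification loss (weights assumed)."""
    evi_loss = cross_entropy(evi_logits, evi_target)
    return loc_loss + w_text * text_loss + w_evi * evi_loss

# Toy values: the evidence head is fairly confident in the correct class 0,
# so the auxiliary term adds only a small penalty to the total.
total = udis_style_loss(0.8, 1.2, [2.0, 0.5, -1.0], 0)
print(round(total, 4))
```

The auxiliary term gives the [EVI] token a supervised signal tied to explanatory evidence, which is the mechanism the contribution description attributes to output-level alignment; the relative weighting between terms would be a tuning choice.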

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though the conclusion remains constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

User-query driven paradigm for IFL interpretability

The authors propose a new conceptual framework where interpretability in image forgery localization should be driven by regional user queries instead of global outcome-based explanations. This paradigm shift establishes a foundation for addressing weak visual-text alignment in existing methods.

Contribution

UDIS framework with QGM and EAM modules

The authors develop a novel framework called UDIS that implements the user-query driven principle through two specialized modules: QGM aligns user queries with visual attention at the input level, while EAM aligns explanatory textual knowledge with forgery localization capability at the output level.

Contribution

Dataset with region-specific queries and authenticity evidence

The authors curate a training dataset that includes both generic forensic questions and content-aware queries tailored to specific image regions, along with corresponding authenticity evidence annotations. This dataset enables training models under the user-query driven paradigm.