RelayFormer: A Unified Local-Global Attention Framework for Scalable Image and Video Manipulation Localization

ICLR 2026 Conference Submission

Anonymous Authors

OpenReview Score: 6.5

Keywords: Image Manipulation Localization; Video Manipulation Localization

Abstract:

Visual manipulation localization (VML) aims to identify tampered regions in images and videos, a task that has become increasingly challenging with the rise of advanced editing tools. Existing methods face two main issues: resolution diversity, where resizing or padding distorts forensic traces and reduces efficiency, and the modality gap, as images and videos often require separate models. To address these challenges, we propose RelayFormer, a unified framework that adapts to varying resolutions and modalities. RelayFormer partitions inputs into fixed-size sub-images and introduces Global-Local Relay (GLR) tokens, which propagate structured context through a global-local relay attention (GLRA) mechanism. This enables efficient exchange of global cues, such as semantic or temporal consistency, while preserving fine-grained manipulation artifacts. Unlike prior methods that rely on uniform resizing or sparse attention, RelayFormer naturally scales to arbitrary resolutions and video sequences without excessive overhead. Experiments across diverse benchmarks demonstrate that RelayFormer achieves state-of-the-art performance with notable efficiency, combining resolution adaptivity without interpolation or excessive padding, unified modeling for both images and videos, and a strong balance between accuracy and computational cost.

Disclaimer

This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.

NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.

If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

RelayFormer proposes a unified framework for visual manipulation localization that adapts to varying resolutions and modalities through fixed-size sub-image partitioning and Global-Local Relay tokens. The paper resides in the 'Unified and Scalable Manipulation Localization Frameworks' leaf, which contains only this single work among the 50 papers surveyed. This isolation suggests the research direction—addressing resolution diversity and modality gaps simultaneously within one architecture—is relatively unexplored in the current taxonomy, positioning the work in a sparse rather than crowded area of the field.

The taxonomy reveals that neighboring leaves focus on specialized detection strategies: 'Deep Learning-Based Image Tampering Detection' contains six papers emphasizing spatial or frequency features for still images, 'Passive Video Tampering Detection' includes three papers analyzing compression artifacts and motion residuals, and 'Spatial-Temporal Deepfake Detection' groups three papers combining CNN-LSTM architectures for face forgeries. RelayFormer diverges by targeting cross-modality generalization rather than optimizing for a single forgery type or medium, bridging gaps that prior work addresses through separate models or fixed-resolution preprocessing.

Among the 25 candidates examined, neither the unified framework contribution (10 candidates, 0 refutable) nor the GLR token mechanism (5 candidates, 0 refutable) shows clear prior overlap within the limited search scope. For the query-based mask decoder (10 candidates, 1 refutable), one candidate suggests overlapping prior work, indicating that this component may have precedent in the segmentation or detection literature. The statistics reflect a modest search scale (top-K semantic matches plus citation expansion), so the absence of refutation for two contributions does not establish novelty; it only suggests limited direct precedent among closely related papers.

Based on the limited search scope, RelayFormer appears to occupy a relatively novel position by unifying resolution adaptivity and modality handling in a single framework, though the query-based decoder component may have more substantial prior work. The taxonomy structure and contribution-level statistics together suggest the core relay-token mechanism and unified architecture are less explored, while acknowledging that the 25-candidate search cannot rule out relevant work outside the examined set.

Taxonomy

[Figure: taxonomy diagram. Legend: Core-task Taxonomy Papers; Claimed Contributions; Contribution Candidate Papers Compared; Refutable Paper]

Research Landscape Overview

Core task: visual manipulation localization in images and videos. The field has evolved into several distinct branches that reflect both the nature of the forgery and the detection strategy. Deepfake and Face Forgery Detection focuses on synthetic face generation and identity swaps, often leveraging temporal cues and multi-feature fusion as seen in Face Forgery Multi-Feature[3] and Deepfake CNN LSTM[4]. Image Tampering Detection and Localization addresses copy-move, splicing, and inpainting operations through methods that exploit noise residuals, frequency artifacts, or learned segmentation models such as Tampering Segmentation Model[30] and Multiscale Fusion Detection[2]. Video Tampering Detection and Localization extends these ideas to temporal domains, examining frame insertion, deletion, and inter-frame inconsistencies with approaches like Noise Residuals Temporal[16] and Frame-rate Forgery Detection[17]. Watermarking-Based Authentication embeds fragile or semi-fragile signals into content for tamper localization, exemplified by Multiple Median Watermarking[5] and Chaotic Watermarking[6]. Unified and Scalable Manipulation Localization Frameworks aim to handle diverse forgery types within a single architecture, while Surveys, Reviews, and Application-Specific Studies provide overviews and domain-specific analyses, including Deepfake Analysis Review[7] and Fighting Fake Media[8].

Recent work has increasingly emphasized cross-domain generalization and the integration of spatial and frequency features to improve robustness against unseen manipulations. A handful of studies explore hybrid strategies that combine passive forensic cues with active watermarking signals, as in Content Tampering Watermarking[1] and Dual-Embedded Framework[37].

RelayFormer[0] sits within the Unified and Scalable Manipulation Localization Frameworks branch, proposing a transformer-based architecture designed to localize manipulations across multiple forgery types without retraining for each specific attack. This contrasts with more specialized detectors like Efficient Deepfake Detection[12], which targets face forgeries exclusively, and with watermarking methods such as Chaotic Watermarking[6], which require embedding at capture time. By aiming for a single model that generalizes across image and video tampering scenarios, RelayFormer[0] addresses a key open question: how to scale forensic systems to the growing diversity of generative and editing tools without sacrificing localization precision.

Claimed Contributions

RelayFormer unified framework for resolution-adaptive manipulation localization

10 retrieved papers

The authors introduce RelayFormer, a framework that processes images and videos of arbitrary resolutions without interpolation or padding by partitioning inputs into fixed-size sub-images. This unified architecture handles both image and video manipulation localization tasks within a single model.
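
As a rough illustration of the partitioning idea described above, the following minimal sketch splits a 2-D array into fixed-size sub-images without any resizing. It is not the authors' implementation: the function name, the default tile size, and the choice to zero-fill border tiles that extend past the input are all assumptions, since the report does not say how partial border regions are handled.

```python
def partition_into_subimages(image, tile=4, fill=0):
    """Split a 2-D image (list of rows) into non-overlapping tile x tile blocks.

    No pixel is resampled. Positions of a border tile that fall outside the
    image are set to `fill` (an assumption made for this sketch).
    Returns (tiles, grid) where grid = (rows, cols) of the tile layout and
    tiles is ordered row by row.
    """
    height, width = len(image), len(image[0])
    rows = -(-height // tile)  # ceiling division
    cols = -(-width // tile)
    tiles = []
    for i in range(rows):
        for j in range(cols):
            block = [
                [
                    image[y][x] if y < height and x < width else fill
                    for x in range(j * tile, (j + 1) * tile)
                ]
                for y in range(i * tile, (i + 1) * tile)
            ]
            tiles.append(block)
    return tiles, (rows, cols)


# Hypothetical 6 x 10 input: each pixel stores 100 * row + column.
image = [[100 * y + x for x in range(10)] for y in range(6)]
tiles, grid = partition_into_subimages(image, tile=4)
```

Under the same assumption, a video could be handled by applying the function to every frame and concatenating the resulting tile lists, so that images and videos both become a sequence of equally sized sub-images.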

Global-Local Relay (GLR) tokens with relay-based attention mechanism

5 retrieved papers

The authors propose GLR tokens that act as information bottlenecks to efficiently exchange global scene-level cues (such as semantic or temporal consistency) across sub-images while preserving local manipulation artifacts, avoiding the computational cost of dense full-resolution attention.
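
One way to picture a relay-style exchange is a two-stage attention round: each sub-image first attends over its own patch tokens plus a few relay tokens, and then only the relay tokens attend to each other across sub-images. The sketch below is a simplified reading of that idea, not the paper's GLRA formulation. It omits learned projections, multiple heads, and normalization, and all function names are hypothetical.

```python
import math


def attention(queries, keys, values):
    """Scaled dot-product attention on lists of equal-length float vectors."""
    scale = math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))
        ])
    return outputs


def relay_step(local, relay):
    """One relay round (assumed structure).

    local: S sub-images, each a list of N patch-token vectors.
    relay: S sub-images, each a list of R relay-token vectors.
    """
    new_local, new_relay = [], []
    # Local stage: tokens only see their own sub-image and its relay tokens.
    for patches, relays in zip(local, relay):
        joint = patches + relays
        mixed = attention(joint, joint, joint)
        new_local.append(mixed[:len(patches)])
        new_relay.append(mixed[len(patches):])
    # Global stage: only relay tokens exchange information across sub-images.
    flat = [token for relays in new_relay for token in relays]
    flat = attention(flat, flat, flat)
    r = len(relay[0])
    new_relay = [flat[i * r:(i + 1) * r] for i in range(len(relay))]
    return new_local, new_relay


def attention_pairs(s, n, r):
    """Query-key pairs for dense attention versus the two-stage scheme."""
    dense = (s * n) ** 2
    relayed = s * (n + r) ** 2 + (s * r) ** 2
    return dense, relayed
```

In this sketch, patch tokens pick up information from other sub-images only in the following round, when they attend to relay tokens that have already been mixed globally. With 6 sub-images of 16 patch tokens and 2 relay tokens each, `attention_pairs(6, 16, 2)` counts 9216 query-key pairs for dense attention against 2088 for the two-stage scheme, which is the kind of saving the bottleneck description points at.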

Query-based mask decoder for efficient localization

Can Refute

10 retrieved papers

The authors design a lightweight query-based Transformer decoder that avoids computational bottlenecks by using learnable queries to interact with projected feature maps, enabling efficient mask prediction without excessive overhead.
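
The general pattern of a query-based mask decoder can be sketched as follows: a small set of learnable queries cross-attends to the projected pixel features, and each refined query is then compared with every pixel embedding to produce a mask. This follows the common design of query-based segmentation decoders rather than the specific decoder in the paper. The function names, the single attention layer, and the residual update are assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def query_mask_decoder(pixels, queries):
    """Predict one soft mask per query.

    pixels:  H*W feature vectors, assumed already projected to dimension D.
    queries: Q learnable query vectors of dimension D.
    Returns Q lists of H*W probabilities in [0, 1].
    """
    scale = math.sqrt(len(queries[0]))
    masks = []
    for q in queries:
        # Cross-attention: the query gathers evidence from every pixel feature.
        scores = [dot(q, p) / scale for p in pixels]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        context = [
            sum(w * p[d] for w, p in zip(weights, pixels)) / total
            for d in range(len(q))
        ]
        refined = [a + b for a, b in zip(q, context)]  # residual update
        # Mask logits are dot products with each pixel embedding;
        # 0.5 * (1 + tanh(x / 2)) is a numerically stable sigmoid.
        masks.append([0.5 * (1.0 + math.tanh(0.5 * dot(refined, p))) for p in pixels])
    return masks


# Toy input: three "tampered-like" pixels and three "pristine-like" pixels.
pixels = [[3.0, 0.0, 0.0]] * 3 + [[-3.0, 0.0, 0.0]] * 3
queries = [[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]]
masks = query_mask_decoder(pixels, queries)
```

The cost in this sketch grows with the number of queries times the number of pixels, not with the square of the number of pixels, which is one plausible reading of "without excessive overhead".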

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

RelayFormer unified framework for resolution-adaptive manipulation localization

[66] An Experimental Network Analysis-based Approach for Detection of Jamming Attacks in Wireless Sensor Networks

Cannot Refute

[67] Spatial and frequency feature fusion using multi-scale cross attention for enhancing deepfake face detection: M. Uddin et al.

Cannot Refute

[68] Bznet: Unsupervised multi-scale branch zooming network for detecting low-quality deepfake videos

Cannot Refute

[69] Refining localized attention features with multi-scale relationships for enhanced deepfake detection in spatial-frequency domain

Cannot Refute

[70] TinyDF: Tiny and Effective Model for Deepfake Detection

Cannot Refute

[71] MSER-Net: Multi-stage edge refinement network for deepfake detection

Cannot Refute

[72] Deepfake Detection via Spatial-Frequency Attention Network

Cannot Refute

[73] CCM-Net: image splicing localization network based on context-aware and cross-domain multi-scale fusion

Cannot Refute

[74] The detection optimization of low-quality fake face images: feature enhancement and noise suppression strategies

Cannot Refute

[75] High-resolution network-based multi-feature fusion for generalized forgery detection

Cannot Refute

Contribution

Global-Local Relay (GLR) tokens with relay-based attention mechanism

[61] Conditional diffusion to enhance performance of object detection in unbalanced data engineering drawings

Cannot Refute

[62] End-to-end object detection with neural networks

Cannot Refute

[63] Inceptive Visual Representation Learning With Diverse Multi-Head Sparse Attention

Cannot Refute

[64] Multi-Disease Detection in Retinal Imaging Using Patch-Based Attention Mechanism

Cannot Refute

[65] Multi-Relation Attention Network for Image Patch Matching

Cannot Refute

Contribution

Query-based mask decoder for efficient localization

[52] Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation

Can Refute

[51] Rethinking Query-Based Transformer for Continual Image Segmentation

Cannot Refute

[53] Query refinement transformer for 3d instance segmentation

Cannot Refute

[54] MGQFormer: Mask-Guided Query-Based Transformer for Image Manipulation Localization

Cannot Refute

[55] Mask transfiner for high-quality instance segmentation

Cannot Refute

[56] An Effective Masked Transformer Model for Automatic Modulation Recognition

Cannot Refute

[57] Multi-scale query-based transformer for image forgery localization

Cannot Refute

[58] FastInst: A Simple Query-Based Model for Real-Time Instance Segmentation

Cannot Refute

[59] An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding

Cannot Refute

[60] Mp-former: Mask-piloted transformer for image segmentation

Cannot Refute
