RelayFormer: A Unified Local-Global Attention Framework for Scalable Image and Video Manipulation Localization

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Image Manipulation Localization; Video Manipulation Localization
Abstract:

Visual manipulation localization (VML) aims to identify tampered regions in images and videos, a task that has become increasingly challenging with the rise of advanced editing tools. Existing methods face two main issues: resolution diversity, where resizing or padding distorts forensic traces and reduces efficiency, and the modality gap, as images and videos often require separate models. To address these challenges, we propose RelayFormer, a unified framework that adapts to varying resolutions and modalities. RelayFormer partitions inputs into fixed-size sub-images and introduces Global-Local Relay (GLR) tokens, which propagate structured context through a global-local relay attention (GLRA) mechanism. This enables efficient exchange of global cues, such as semantic or temporal consistency, while preserving fine-grained manipulation artifacts. Unlike prior methods that rely on uniform resizing or sparse attention, RelayFormer naturally scales to arbitrary resolutions and video sequences without excessive overhead. Experiments across diverse benchmarks demonstrate that RelayFormer achieves state-of-the-art performance with notable efficiency, combining resolution adaptivity without interpolation or excessive padding, unified modeling for both images and videos, and a strong balance between accuracy and computational cost.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

RelayFormer proposes a unified framework for visual manipulation localization that adapts to varying resolutions and modalities through fixed-size sub-image partitioning and Global-Local Relay tokens. The paper resides in the 'Unified and Scalable Manipulation Localization Frameworks' leaf, which contains only this single work among the 50 papers surveyed. This isolation suggests that the research direction (addressing resolution diversity and the modality gap simultaneously within one architecture) is relatively unexplored in the current taxonomy, placing the work in a sparse rather than crowded area of the field.

The taxonomy reveals that neighboring leaves focus on specialized detection strategies: 'Deep Learning-Based Image Tampering Detection' contains six papers emphasizing spatial or frequency features for still images, 'Passive Video Tampering Detection' includes three papers analyzing compression artifacts and motion residuals, and 'Spatial-Temporal Deepfake Detection' groups three papers combining CNN-LSTM architectures for face forgeries. RelayFormer diverges by targeting cross-modality generalization rather than optimizing for a single forgery type or medium, bridging gaps that prior work addresses through separate models or fixed-resolution preprocessing.

Among the 25 candidates examined, the unified framework contribution (10 candidates, 0 refutable) and the GLR token mechanism (5 candidates, 0 refutable) show no clear prior overlap within the limited search scope. The query-based mask decoder (10 candidates, 1 refutable) has one candidate suggesting overlapping prior work, indicating that this component may have precedent in the segmentation or detection literature. The statistics reflect a modest search scale (top-K semantic matches plus citation expansion), so the absence of refutation for two contributions does not establish novelty; it only suggests limited direct precedent among closely related papers.

Based on the limited search scope, RelayFormer appears to occupy a relatively novel position by unifying resolution adaptivity and modality handling in a single framework, though the query-based decoder component may have more substantial prior work. The taxonomy structure and contribution-level statistics together suggest the core relay-token mechanism and unified architecture are less explored, while acknowledging that the 25-candidate search cannot rule out relevant work outside the examined set.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Paper: 1

Research Landscape Overview

Core task: visual manipulation localization in images and videos. The field has evolved into several distinct branches that reflect both the nature of the forgery and the detection strategy. Deepfake and Face Forgery Detection focuses on synthetic face generation and identity swaps, often leveraging temporal cues and multi-feature fusion as seen in Face Forgery Multi-Feature[3] and Deepfake CNN LSTM[4]. Image Tampering Detection and Localization addresses copy-move, splicing, and inpainting operations through methods that exploit noise residuals, frequency artifacts, or learned segmentation models such as Tampering Segmentation Model[30] and Multiscale Fusion Detection[2]. Video Tampering Detection and Localization extends these ideas to temporal domains, examining frame insertion, deletion, and inter-frame inconsistencies with approaches like Noise Residuals Temporal[16] and Frame-rate Forgery Detection[17]. Watermarking-Based Authentication embeds fragile or semi-fragile signals into content for tamper localization, exemplified by Multiple Median Watermarking[5] and Chaotic Watermarking[6]. Unified and Scalable Manipulation Localization Frameworks aim to handle diverse forgery types within a single architecture, while Surveys, Reviews, and Application-Specific Studies provide overviews and domain-specific analyses, including Deepfake Analysis Review[7] and Fighting Fake Media[8].

Recent work has increasingly emphasized cross-domain generalization and the integration of spatial and frequency features to improve robustness against unseen manipulations. A handful of studies explore hybrid strategies that combine passive forensic cues with active watermarking signals, as in Content Tampering Watermarking[1] and Dual-Embedded Framework[37].

RelayFormer[0] sits within the Unified and Scalable Manipulation Localization Frameworks branch, proposing a transformer-based architecture designed to localize manipulations across multiple forgery types without retraining for each specific attack. This contrasts with more specialized detectors like Efficient Deepfake Detection[12], which targets face forgeries exclusively, and with watermarking methods such as Chaotic Watermarking[6], which require embedding at capture time. By aiming for a single model that generalizes across image and video tampering scenarios, RelayFormer[0] addresses a key open question: how to scale forensic systems to the growing diversity of generative and editing tools without sacrificing localization precision.

Claimed Contributions

RelayFormer unified framework for resolution-adaptive manipulation localization

The authors introduce RelayFormer, a framework that processes images and videos of arbitrary resolutions without interpolation or excessive padding by partitioning inputs into fixed-size sub-images. This unified architecture handles both image and video manipulation localization tasks within a single model.
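
To make the partitioning step concrete, the following minimal sketch enumerates fixed-size sub-image boxes for an arbitrary resolution. It is an illustration under stated assumptions, not the authors' implementation: the tile size of 512, the function names `tile_boxes` and `padding_needed`, and the choice to clip edge tiles to the image bounds are all hypothetical, since the summary does not say how inputs that are not multiples of the sub-image size are handled.

```python
from math import ceil


def tile_boxes(height, width, tile=512):
    """Return (top, left, bottom, right) boxes that cover a height x width input.

    Assumption: edge tiles are clipped to the image bounds, so they may be
    smaller than tile x tile. The tile size of 512 is a placeholder.
    """
    rows = ceil(height / tile)
    cols = ceil(width / tile)
    boxes = []
    for r in range(rows):
        for c in range(cols):
            top, left = r * tile, c * tile
            bottom = min(top + tile, height)
            right = min(left + tile, width)
            boxes.append((top, left, bottom, right))
    return boxes


def padding_needed(height, width, tile=512):
    """Pixels of padding required to bring every edge tile up to tile x tile."""
    rows = ceil(height / tile)
    cols = ceil(width / tile)
    return rows * cols * tile * tile - height * width
```

For a video, the same grid would presumably be applied to each frame, so the number of sub-images would grow linearly with the number of frames.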

10 retrieved papers

Global-Local Relay (GLR) tokens with relay-based attention mechanism

The authors propose GLR tokens that act as information bottlenecks to efficiently exchange global scene-level cues (such as semantic or temporal consistency) across sub-images while preserving local manipulation artifacts, avoiding the computational cost of dense full-resolution attention.
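
A rough count of attention pairs illustrates why routing global context through a few relay tokens can be cheaper than dense attention. The sketch assumes one plausible reading of the mechanism: local attention inside each sub-image over its patch tokens plus its GLR tokens, followed by global attention among all GLR tokens. The function names and the example sizes are hypothetical, and the paper's actual cost model may differ.

```python
def dense_pairs(num_tiles, tokens_per_tile):
    """Token pairs scored by full attention over every token of every sub-image."""
    total_tokens = num_tiles * tokens_per_tile
    return total_tokens * total_tokens


def relay_pairs(num_tiles, tokens_per_tile, relay_per_tile):
    """Token pairs scored under the assumed two-level relay scheme."""
    # Local: each sub-image attends over its own patch tokens plus relay tokens.
    local = num_tiles * (tokens_per_tile + relay_per_tile) ** 2
    # Global: only the relay tokens of all sub-images attend to one another.
    global_ = (num_tiles * relay_per_tile) ** 2
    return local + global_
```

With 16 sub-images of 1,024 tokens each and 4 relay tokens per sub-image, dense attention scores 268,435,456 token pairs, while the relay scheme scores 16,912,640, roughly a 16-fold reduction under these assumptions.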

5 retrieved papers

Query-based mask decoder for efficient localization

The authors design a lightweight query-based Transformer decoder that avoids computational bottlenecks by using learnable queries to interact with projected feature maps, enabling efficient mask prediction without excessive overhead.
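
The general idea of a query-based mask decoder can be sketched as follows. This is a generic illustration in dependency-free Python, not the authors' decoder: the single cross-attention step, the residual update, the absence of learned projections, and the names `decode_masks`, `softmax`, and `dot` are all assumptions made for brevity.

```python
from math import exp, sqrt


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    weights = [exp(s - peak) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def decode_masks(queries, features):
    """queries: Q vectors of length C; features: H x W grid of C-vectors.

    Returns Q masks, each an H x W grid of probabilities.
    """
    pixels = [p for row in features for p in row]
    dim = len(queries[0])
    masks = []
    for q in queries:
        # Cross-attention: the query gathers evidence from every pixel feature.
        weights = softmax([dot(q, p) / sqrt(dim) for p in pixels])
        updated = [
            q[d] + sum(w * p[d] for w, p in zip(weights, pixels))
            for d in range(dim)
        ]
        # Mask prediction: sigmoid of the similarity between query and pixel.
        masks.append(
            [[1.0 / (1.0 + exp(-dot(updated, p))) for p in row] for row in features]
        )
    return masks
```

In this sketch each query touches every pixel feature once, so the cost grows with Q x H x W rather than quadratically in the number of pixels, which is one way a small set of queries can keep mask prediction lightweight.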

10 retrieved papers (1 marked Can Refute)

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

RelayFormer unified framework for resolution-adaptive manipulation localization

The authors introduce RelayFormer, a framework that processes images and videos of arbitrary resolutions without interpolation or excessive padding by partitioning inputs into fixed-size sub-images. This unified architecture handles both image and video manipulation localization tasks within a single model.

Contribution

Global-Local Relay (GLR) tokens with relay-based attention mechanism

The authors propose GLR tokens that act as information bottlenecks to efficiently exchange global scene-level cues (such as semantic or temporal consistency) across sub-images while preserving local manipulation artifacts, avoiding the computational cost of dense full-resolution attention.
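
How relay tokens might carry context between sub-images can be sketched in a few lines. This is a toy illustration, not the paper's GLRA layer: the ordering of the two phases, the identity query/key/value projections, and the names `attend` and `relay_layer` are assumptions introduced here.

```python
from math import exp, sqrt


def attend(tokens):
    """Plain self-attention with identity projections (a deliberate simplification)."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / sqrt(dim) for k in tokens]
        peak = max(scores)
        weights = [exp(s - peak) for s in scores]
        total = sum(weights)
        out.append(
            [sum(w * v[d] for w, v in zip(weights, tokens)) / total for d in range(dim)]
        )
    return out


def relay_layer(tiles, relays):
    """tiles: per-sub-image lists of patch tokens; relays: per-sub-image relay tokens."""
    # Global phase: relay tokens from all sub-images exchange context.
    counts = [len(r) for r in relays]
    flat = attend([tok for r in relays for tok in r])
    mixed, start = [], 0
    for n in counts:
        mixed.append(flat[start:start + n])
        start += n
    # Local phase: each sub-image attends over its relay tokens plus its patches.
    new_tiles, new_relays = [], []
    for patches, relay in zip(tiles, mixed):
        out = attend(relay + patches)
        new_relays.append(out[:len(relay)])
        new_tiles.append(out[len(relay):])
    return new_tiles, new_relays
```

Patch tokens never attend directly to patches in other sub-images in this sketch; any cross-sub-image influence has to pass through the relay tokens, which is the bottleneck role the contribution describes.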

Contribution

Query-based mask decoder for efficient localization

The authors design a lightweight query-based Transformer decoder that avoids computational bottlenecks by using learnable queries to interact with projected feature maps, enabling efficient mask prediction without excessive overhead.