RelayFormer: A Unified Local-Global Attention Framework for Scalable Image and Video Manipulation Localization
Overview
Overall Novelty Assessment
RelayFormer proposes a unified framework for visual manipulation localization that adapts to varying resolutions and modalities through fixed-size sub-image partitioning and Global-Local Relay (GLR) tokens. The paper sits in the 'Unified and Scalable Manipulation Localization Frameworks' leaf, where it is the only work among the 50 papers surveyed. This isolation suggests that the research direction, handling resolution diversity and modality gaps within a single architecture, is relatively unexplored in the current taxonomy and places the work in a sparse rather than crowded area of the field.
The taxonomy shows that neighboring leaves focus on specialized detection strategies: 'Deep Learning-Based Image Tampering Detection' contains six papers that emphasize spatial or frequency features for still images, 'Passive Video Tampering Detection' includes three papers that analyze compression artifacts and motion residuals, and 'Spatial-Temporal Deepfake Detection' groups three papers that use CNN-LSTM architectures to detect face forgeries. RelayFormer diverges by targeting cross-modality generalization rather than optimizing for a single forgery type or medium, bridging gaps that prior work addresses through separate models or fixed-resolution preprocessing.
Among the 25 candidates examined, the unified framework contribution (10 candidates, 0 refutable) and the GLR token mechanism (5 candidates, 0 refutable) show no clear prior overlap within the limited search scope. For the query-based mask decoder (10 candidates, 1 refutable), one candidate suggests overlapping prior work, indicating that this component may have precedent in the segmentation or detection literature. These statistics reflect a modest search scale (top-K semantic matches plus citation expansion), so the absence of refutation for two contributions does not establish novelty; it indicates only limited direct precedent among closely related papers.
Within this limited search scope, RelayFormer appears to occupy a relatively novel position by unifying resolution adaptivity and modality handling in a single framework, though the query-based decoder component may have more substantial prior work. Together, the taxonomy structure and contribution-level statistics suggest that the core relay-token mechanism and unified architecture are less explored. The 25-candidate search, however, cannot rule out relevant work outside the examined set.
Claimed Contributions
The authors introduce RelayFormer, a framework that processes images and videos of arbitrary resolutions without interpolation or padding by partitioning inputs into fixed-size sub-images. This unified architecture handles both image and video manipulation localization tasks within a single model.
The authors propose GLR tokens that act as information bottlenecks to efficiently exchange global scene-level cues (such as semantic or temporal consistency) across sub-images while preserving local manipulation artifacts, avoiding the computational cost of dense full-resolution attention.
The authors design a lightweight query-based Transformer decoder that avoids computational bottlenecks by using learnable queries to interact with projected feature maps, enabling efficient mask prediction without excessive overhead.
Contribution Analysis
Detailed comparisons for each claimed contribution
RelayFormer unified framework for resolution-adaptive manipulation localization
The authors introduce RelayFormer, a framework that processes images and videos of arbitrary resolutions without interpolation or padding by partitioning inputs into fixed-size sub-images. This unified architecture handles both image and video manipulation localization tasks within a single model.
[66] An Experimental Network Analysis-based Approach for Detection of Jamming Attacks in Wireless Sensor Networks
[67] Spatial and frequency feature fusion using multi-scale cross attention for enhancing deepfake face detection: M. Uddin et al.
[68] Bznet: Unsupervised multi-scale branch zooming network for detecting low-quality deepfake videos
[69] Refining localized attention features with multi-scale relationships for enhanced deepfake detection in spatial-frequency domain
[70] TinyDF: Tiny and Effective Model for Deepfake Detection
[71] MSER-Net: Multi-stage edge refinement network for deepfake detection
[72] Deepfake Detection via Spatial-Frequency Attention Network
[73] CCM-Net: image splicing localization network based on context-aware and cross-domain multi-scale fusion
[74] The detection optimization of low-quality fake face images: feature enhancement and noise suppression strategies
[75] High-resolution network-based multi-feature fusion for generalized forgery detection
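The partitioning idea in this contribution can be illustrated with a minimal sketch. The assessment does not say how RelayFormer treats inputs whose sides are not multiples of the sub-image size, so the sketch assumes one possible strategy: the last window in each dimension is shifted back so that it ends exactly at the border, which avoids both padding and interpolation at the cost of some overlap. The names tile_starts and partition are hypothetical and are not taken from the paper.

```python
def tile_starts(length, tile):
    """Start offsets of fixed-size windows that cover [0, length) without padding.

    Assumption (not from the paper): when `length` is not a multiple of `tile`,
    the last window is shifted back so it ends at the border and overlaps
    its neighbour instead of running past the edge.
    """
    if tile <= 0:
        raise ValueError("tile size must be positive")
    if length < tile:
        raise ValueError("input smaller than one tile is not handled in this sketch")
    starts = list(range(0, length - tile + 1, tile))
    if starts[-1] + tile < length:
        starts.append(length - tile)  # overlapping final window, still in bounds
    return starts


def partition(height, width, tile):
    """Return (top, left, bottom, right) boxes of size tile x tile covering the input."""
    return [
        (top, left, top + tile, left + tile)
        for top in tile_starts(height, tile)
        for left in tile_starts(width, tile)
    ]


# A 10 x 8 input with 4 x 4 sub-images: rows start at 0, 4, 6 and columns at 0, 4.
boxes = partition(10, 8, 4)
```

Every box has the same size regardless of the input resolution, which is what would let a single backbone process each sub-image unchanged. A video could be handled the same way by partitioning each frame, although that too is an assumption of this sketch.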
Global-Local Relay (GLR) tokens with relay-based attention mechanism
The authors propose GLR tokens that act as information bottlenecks to efficiently exchange global scene-level cues (such as semantic or temporal consistency) across sub-images while preserving local manipulation artifacts, avoiding the computational cost of dense full-resolution attention.
[61] Conditional diffusion to enhance performance of object detection in unbalanced data engineering drawings
[62] End-to-end object detection with neural networks
[63] Inceptive Visual Representation Learning With Diverse Multi-Head Sparse Attention
[64] Multi-Disease Detection in Retinal Imaging Using Patch-Based Attention Mechanism
[65] Multi-Relation Attention Network for Image Patch Matching
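To make the bottleneck role of relay tokens concrete, here is a rough sketch of one local-then-global attention step. It is an illustration under stated assumptions, not the authors' implementation: it uses one relay token per sub-image, single-head attention with identity projections (no learned weights), and hypothetical function names (attend, relay_block, pair_counts). The pair_counts helper compares the number of query-key pairs in dense attention over all tokens with the number in the relay scheme.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(tokens):
    """Single-head self-attention with identity Q/K/V projections (simplification)."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for query in tokens:
        weights = softmax([scale * sum(a * b for a, b in zip(query, key)) for key in tokens])
        out.append([sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)])
    return out


def relay_block(sub_images, relays):
    """sub_images: list of token lists; relays: one relay vector per sub-image."""
    new_subs, new_relays = [], []
    # Local step: each relay token attends jointly with its own sub-image's tokens.
    for tokens, relay in zip(sub_images, relays):
        mixed = attend([relay] + tokens)
        new_relays.append(mixed[0])
        new_subs.append(mixed[1:])
    # Global step: only the relay tokens attend to one another across sub-images.
    new_relays = attend(new_relays)
    return new_subs, new_relays


def pair_counts(num_subs, tokens_per_sub, relays_per_sub=1):
    """Query-key pairs for dense attention versus the local-plus-relay scheme."""
    dense = (num_subs * tokens_per_sub) ** 2
    relay = num_subs * (tokens_per_sub + relays_per_sub) ** 2 + (num_subs * relays_per_sub) ** 2
    return dense, relay


# pair_counts(4, 16) returns (4096, 1172)
```

In this sketch the local tokens only see the globally mixed relay token in the next block's local step, so cross-sub-image information reaches them after stacking at least two blocks.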
Query-based mask decoder for efficient localization
The authors design a lightweight query-based Transformer decoder that avoids computational bottlenecks by using learnable queries to interact with projected feature maps, enabling efficient mask prediction without excessive overhead.
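A query-based decoder of this general kind can be sketched as follows. This is a generic illustration in the spirit of query-based segmentation decoders, not the paper's architecture: the identity projections, the residual update, the dot-product mask head, and the names cross_attend and predict_masks are all assumptions made for the example. Each query first gathers context from the flattened feature map by cross-attention, and the refined query is then compared with every pixel feature to produce a mask probability.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, features):
    """Each query gathers a weighted sum of pixel features (identity projections)."""
    dim = len(features[0])
    scale = 1.0 / math.sqrt(dim)
    refined = []
    for query in queries:
        weights = softmax([scale * sum(a * b for a, b in zip(query, f)) for f in features])
        context = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
        refined.append([q + c for q, c in zip(query, context)])  # residual update
    return refined


def predict_masks(queries, features, height, width):
    """features: height * width pixel vectors in row-major order.

    Returns one height x width grid of probabilities per query.
    """
    masks = []
    for query in cross_attend(queries, features):
        probs = [
            1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(query, f))))
            for f in features
        ]
        masks.append([probs[r * width:(r + 1) * width] for r in range(height)])
    return masks


# Toy 2 x 2 feature map: the top row and bottom row carry different features.
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
masks = predict_masks([[2.0, -2.0]], features, height=2, width=2)
```

With Q queries and P pixels, the cross-attention and the mask head each cost on the order of Q times P dot products rather than P squared, which is one plausible reading of how such a decoder avoids excessive overhead.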