Continuous Space-Time Video Super-Resolution with 3D Fourier Fields

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: continuous space-time video super-resolution, arbitrary-scale super-resolution, low-level vision
Abstract:

We introduce a novel formulation for continuous space-time video super-resolution. Instead of decoupling the representation of a video sequence into separate spatial and temporal components and relying on brittle, explicit frame warping for motion compensation, we encode video as a continuous, spatio-temporally coherent 3D Video Fourier Field (VFF). This representation offers three key advantages: (1) it enables cheap, flexible sampling at arbitrary locations in space and time; (2) it simultaneously captures fine spatial detail and smooth temporal dynamics; and (3) it allows an analytical Gaussian point spread function to be included in the sampling, ensuring aliasing-free reconstruction at arbitrary scale. The coefficients of the proposed Fourier-like sinusoidal basis are predicted by a neural encoder with a large spatio-temporal receptive field, conditioned on the low-resolution input video. Through extensive experiments, we show that our joint modeling substantially improves both spatial and temporal super-resolution and sets a new state of the art on multiple benchmarks: across a wide range of upscaling factors, it delivers sharper and temporally more consistent reconstructions than existing baselines, while being computationally more efficient. Code will be published upon acceptance.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a Video Fourier Field (VFF) representation that encodes video as a continuous 3D spatio-temporal function in the frequency domain, enabling arbitrary-scale sampling in both space and time. According to the taxonomy, this work resides in the 'Fourier and Frequency Domain INR' leaf under 'Implicit Neural Representation Approaches', which contains only two papers total. This indicates a relatively sparse research direction within the broader field of continuous space-time video super-resolution, suggesting the frequency-domain formulation for implicit neural video representations remains underexplored compared to motion-based or transformer-based approaches.

The taxonomy reveals that neighboring leaves include 'Local Implicit Neural Functions' (focusing on motion trajectory modeling) and 'Arbitrary-Scale Alignment Networks' (emphasizing neural alignment modules). The broader 'Implicit Neural Representation Approaches' branch contrasts sharply with the heavily populated 'Motion Estimation and Compensation Frameworks' branch, which contains multiple subcategories addressing optical flow, deformable convolutions, and bidirectional propagation. The paper's frequency-domain approach diverges from these motion-centric methods by avoiding explicit frame warping, instead relying on learned Fourier coefficients to capture spatio-temporal coherence—a fundamentally different modeling philosophy that positions it at the intersection of signal processing and neural representation learning.

Among the three contributions analyzed, the V3 end-to-end framework has one refutable candidate among the ten examined, suggesting some overlap with existing architectures within the limited search scope. For the Video Fourier Field representation itself, four candidates were examined with zero refutations, indicating potential novelty within the analyzed subset. For the analytical Gaussian point spread function for anti-aliasing, ten candidates were examined without clear refutation. Given that only twenty-four candidates were examined in total across all contributions, these statistics reflect a focused but not exhaustive literature comparison, primarily capturing semantically similar works rather than the entire field of implicit neural video representations.

Based on the limited search scope of twenty-four candidates, the work appears to occupy a distinctive position within the sparse Fourier-domain INR cluster. The taxonomy structure suggests this frequency-based formulation represents a less-traveled path compared to motion-compensation or transformer-based alternatives. However, the analysis does not cover the full landscape of implicit neural representations or signal processing techniques for video, leaving open questions about connections to broader frequency-domain methods outside the examined candidate set.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 1

Research Landscape Overview

Core task: continuous space-time video super-resolution aims to enhance video quality by jointly upsampling both spatial resolution and temporal frame rate to arbitrary scales. The field has evolved into several major branches that reflect different modeling philosophies and technical emphases. Implicit Neural Representation Approaches leverage coordinate-based networks to represent video content continuously, enabling flexible querying at any spatial or temporal location; within this branch, some works explore Fourier and frequency domain formulations while others focus on hierarchical or alignment-driven strategies. Motion Estimation and Compensation Frameworks build on classical optical flow techniques, refining alignment and warping mechanisms to propagate information across frames, as seen in methods like BasicVSR++[11] and Multi-Stage Motion[6]. Transformer-Based Architectures exploit self-attention to capture long-range dependencies, with works such as RSTT[13] and Trajectory-Aware Transformer[15] demonstrating the power of global context. Event Camera Enhanced Methods incorporate asynchronous event data to improve temporal fidelity, exemplified by EvSTVSR[1] and EvEnhancer[16]. Additional branches address Specialized Applications (e.g., omnidirectional or satellite video), Auxiliary Task and Multi-Task Learning, and Multimodal Understanding, reflecting the diversity of problem settings and data modalities.

Recent research reveals contrasting trade-offs between representation flexibility and computational efficiency. Implicit neural methods offer continuous querying and compact parameterizations, yet often require careful design to handle high-frequency details and temporal consistency. Motion-based frameworks excel at leveraging inter-frame correlations but can struggle with occlusions and complex motion patterns, prompting hybrid designs that combine flow estimation with learned refinement modules.
Fourier Fields[0] sits within the Fourier and Frequency Domain INR cluster, emphasizing spectral representations to capture fine-grained spatiotemporal patterns more efficiently than purely spatial coordinate mappings. This approach contrasts with neighboring works like MeshfreeFlowNet[8], which adopts a meshfree interpolation perspective, and HR-INR[2], which focuses on hierarchical implicit structures. By operating in the frequency domain, Fourier Fields[0] aims to balance expressive power with computational tractability, addressing a key challenge in continuous space-time super-resolution where both spatial sharpness and temporal smoothness must be preserved across arbitrary scales.

Claimed Contributions

Video Fourier Field (VFF) representation

A unified continuous video representation based on a 3D trigonometric expansion over joint (x, y, t) space. Unlike prior methods that decouple spatial and temporal components, VFF jointly models space and time using sinusoidal basis functions, enabling flexible sampling at arbitrary spatio-temporal resolutions without explicit warping.
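A minimal sketch of what such a representation looks like, under assumptions: a cosine basis with per-term amplitude, phase, and 3D frequency. The function name `sample_vff` and its parameterization are illustrative, not taken from the paper.

```python
import numpy as np

def sample_vff(coeffs, phases, freqs, x, y, t):
    """Evaluate a toy 3D Video Fourier Field at a continuous point (x, y, t).

    coeffs: (K,) amplitudes; phases: (K,) phase offsets; freqs: (K, 3)
    per-term frequencies (fx, fy, ft). Because the basis is defined over
    joint (x, y, t) space, any real-valued coordinate can be queried,
    with no separate spatial/temporal pathway and no frame warping.
    """
    arg = 2.0 * np.pi * (freqs[:, 0] * x + freqs[:, 1] * y + freqs[:, 2] * t)
    return float(np.sum(coeffs * np.cos(arg + phases)))
```

Querying between input frames (fractional `t`) or between input pixels (fractional `x`, `y`) is the same operation as querying on the input grid, which is what enables arbitrary-scale sampling.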

4 retrieved papers

V3 end-to-end framework

An end-to-end trainable system that uses a neural encoder with large spatio-temporal receptive field to predict VFF coefficients from low-resolution input videos. The framework enables continuous space-time video super-resolution by sampling the learned representation at arbitrary scales.
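The arbitrary-scale querying step of such a pipeline can be sketched as follows. The encoder is replaced by random placeholder coefficients (a real system would predict them from the low-resolution video); grid sizes, variable names, and the single-channel field are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the neural encoder's output: K sinusoidal terms with random
# amplitudes, phases, and integer frequencies. These are placeholders only.
K = 8
coeffs = rng.normal(size=K)
phases = rng.uniform(0.0, 2.0 * np.pi, size=K)
freqs = rng.integers(0, 4, size=(K, 3)).astype(float)

def render(scale_xy, scale_t, H=4, W=4, T=2):
    """Sample the field on a grid upscaled by arbitrary (even fractional) factors."""
    xs = np.linspace(0.0, 1.0, int(round(W * scale_xy)), endpoint=False)
    ys = np.linspace(0.0, 1.0, int(round(H * scale_xy)), endpoint=False)
    ts = np.linspace(0.0, 1.0, int(round(T * scale_t)), endpoint=False)
    grid = np.stack(np.meshgrid(xs, ys, ts, indexing="ij"), axis=-1)  # (W', H', T', 3)
    arg = 2.0 * np.pi * grid @ freqs.T                                # (W', H', T', K)
    return np.sum(coeffs * np.cos(arg + phases), axis=-1)             # (W', H', T')
```

Because the representation is continuous, `render(2.0, 3.0)` and `render(1.5, 1.0)` reuse the same coefficients; only the query grid changes.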

10 retrieved papers, 1 refutable candidate

Analytical Gaussian PSF for anti-aliasing

A closed-form mechanism for anti-aliasing that integrates a Gaussian point spread function directly into the VFF sampling process. This enables theoretically correct frequency suppression when super-resolving at different scales, without requiring learned adaptive filtering.
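Why a sinusoidal basis makes this closed-form: convolving a sinusoid of frequency f with a Gaussian of standard deviation σ simply scales its amplitude by the Gaussian's Fourier transform, exp(-2π²σ²|f|²). The sketch below assumes an isotropic unit-mass Gaussian PSF; names are illustrative, not from the paper.

```python
import numpy as np

def gaussian_attenuation(freqs, sigma):
    """Closed-form amplitude attenuation from an isotropic Gaussian PSF.

    Convolving cos(2*pi*f.p + phase) with a Gaussian of std sigma multiplies
    its amplitude by exp(-2*pi^2*sigma^2*|f|^2), the Fourier transform of the
    unit-mass Gaussian. High frequencies are suppressed; f = 0 is untouched.
    """
    f2 = np.sum(np.asarray(freqs) ** 2, axis=-1)
    return np.exp(-2.0 * np.pi**2 * sigma**2 * f2)

def sample_vff_antialiased(coeffs, phases, freqs, x, y, t, sigma):
    # Anti-aliased sampling: attenuate each basis term analytically, then sum.
    # No learned filtering is needed; sigma can be chosen per output scale.
    arg = 2.0 * np.pi * (freqs[:, 0] * x + freqs[:, 1] * y + freqs[:, 2] * t)
    att = gaussian_attenuation(freqs, sigma)
    return float(np.sum(coeffs * att * np.cos(arg + phases)))
```

Setting `sigma = 0` recovers plain sampling of the field; larger `sigma` (appropriate for coarser output grids) suppresses frequencies that would otherwise alias.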

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: Video Fourier Field (VFF) representation

Contribution: V3 end-to-end framework

Contribution: Analytical Gaussian PSF for anti-aliasing