Abstract:

Diffusion models have recently emerged as powerful tools for camera simulation, enabling both geometric transformations and realistic optical effects. Among these, image-based bokeh rendering has shown promising results, but diffusion-based video bokeh remains unexplored. Existing image-based methods suffer from temporal flickering and inconsistent blur transitions when applied to video, while current video editing methods lack explicit control over the focus plane and bokeh intensity. These issues limit their applicability to controllable video bokeh. In this work, we propose a one-step diffusion framework for generating temporally coherent, depth-aware video bokeh. The framework conditions the video diffusion model on a multi-plane image (MPI) representation adapted to the focal plane, enabling it to exploit the strong 3D priors of pretrained backbones. To further enhance temporal stability, depth robustness, and detail preservation, we introduce a progressive training strategy. Experiments on synthetic and real-world benchmarks show that our method outperforms prior baselines in temporal coherence, spatial accuracy, and controllability. This work represents the first dedicated diffusion framework for video bokeh generation, establishing a new baseline for temporally coherent and controllable depth-of-field effects. Code will be made publicly available.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a one-step diffusion framework for video bokeh rendering that uses multi-plane image (MPI) representations adapted to the focal plane. According to the taxonomy, this work sits in the 'Multi-Plane Image Guided Video Diffusion' leaf, which contains only two papers total (including this one). The sibling paper in this leaf is Any-to-Bokeh One-Step, indicating a sparse research direction. The taxonomy shows the broader field divides into video bokeh generation with temporal coherence versus depth estimation for spatial effects, with this work clearly positioned in the former category.

This paper's leaf sits within the temporal coherence branch, which emphasizes maintaining flicker-free bokeh across frames. The neighboring depth estimation branch (containing Spatial Images Monocular) focuses on robust monocular depth prediction as a foundation for blur rendering. The scope note for this leaf explicitly excludes methods without MPI-based conditioning and those focusing on single images, clarifying that the MPI-guided approach distinguishes this work from purely depth-driven, frame-independent methods.

Among the sixteen candidates examined across three contributions, the analysis reveals mixed novelty signals. The first contribution (one-step diffusion with MPI conditioning) was compared against one candidate, with no refutations. The second contribution (progressive training strategy) was compared against five candidates, again with no refutations. The third contribution (arbitrary focal plane and bokeh intensity control), however, was compared against ten candidates, four of which appear to provide overlapping prior work. This suggests the control mechanism has more substantial precedent, at least within the limited search scope, while the one-step MPI framework and the progressive training strategy appear less anticipated by the examined candidates.

Based on the limited search of sixteen candidates, the work appears to occupy a sparse research direction (only two papers in its taxonomy leaf) with mixed novelty across contributions. The MPI-guided one-step framework and progressive training show fewer overlaps in the examined set, while the controllability aspect encounters more prior work. The analysis covers top-K semantic matches and does not represent an exhaustive literature review of all video bokeh or diffusion-based rendering methods.

Taxonomy

Core-task Taxonomy Papers: 2
Claimed Contributions: 3
Contribution Candidate Papers Compared: 16
Refutable Papers: 4

Research Landscape Overview

Core task: Video bokeh rendering with depth-aware diffusion models. The field centers on generating realistic depth-of-field effects in video, where the taxonomy divides into two main branches. The first branch, Video Bokeh Generation with Temporal Coherence, focuses on methods that maintain smooth, flicker-free bokeh across frames, often leveraging multi-plane image representations or video diffusion architectures to ensure consistency over time. The second branch, Depth Estimation for Spatial Image Effects, emphasizes robust monocular depth prediction as a foundation for spatially accurate blur rendering. These branches are complementary: temporal methods rely on good depth maps to guide coherent defocus, while depth estimation techniques provide the geometric scaffolding that spatial effects require.

Representative works like Any-to-Bokeh[0] and Any-to-Bokeh One-Step[1] illustrate how diffusion-based video models can integrate depth cues to produce temporally stable results. Within the temporal coherence branch, a particularly active line of work explores multi-plane image guided video diffusion, where depth layers are used to decompose the scene and apply selective blur in a temporally consistent manner. This contrasts with purely depth-driven approaches that may treat each frame independently, risking flicker or inconsistent bokeh shapes.

Any-to-Bokeh[0] sits squarely in this multi-plane guided cluster, emphasizing depth-aware diffusion to achieve smooth video bokeh. Its approach is closely related to Any-to-Bokeh One-Step[1], which shares the depth-aware diffusion philosophy but explores efficiency trade-offs by reducing inference steps. Meanwhile, methods in the depth estimation branch, such as Spatial Images Monocular[2], prioritize accurate single-image depth maps that can feed into any bokeh renderer. The main open question across these directions is how to balance temporal stability, depth accuracy, and computational cost in real-time or near-real-time video bokeh synthesis.
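
For contrast with the MPI-guided diffusion methods above, the sketch below shows the classical per-frame, depth-driven compositor they improve on: decompose each frame into soft depth planes, blur each plane in proportion to its distance from the focal plane, and composite. This is a minimal illustration, assuming a precomputed depth map; the plane count, soft-assignment scheme, and Gaussian blur model are illustrative choices, not parameters taken from any cited paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def per_frame_bokeh(frame, depth, focus_depth, intensity, n_planes=8):
    """frame: (H, W, 3) float image; depth: (H, W) map normalized to [0, 1]."""
    planes = np.linspace(0.0, 1.0, n_planes)
    out = np.zeros_like(frame)
    weight = np.zeros(depth.shape)
    for d in planes:
        # Soft mask assigning each pixel to this depth plane.
        mask = np.exp(-((depth - d) ** 2) / (2 * (1.0 / n_planes) ** 2))
        # Blur radius grows with distance from the focal plane.
        sigma = intensity * abs(d - focus_depth)
        blurred = gaussian_filter(frame, sigma=(sigma, sigma, 0))
        out += blurred * mask[..., None]
        weight += mask
    return out / np.maximum(weight[..., None], 1e-6)
```

Because nothing ties consecutive frames together, small depth-estimation jitter shifts the plane masks from frame to frame, which is exactly the flicker that the temporal-coherence branch targets.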

Claimed Contributions

First one-step diffusion framework for controllable video bokeh with MPI-guided conditioning

The authors introduce a novel one-step diffusion framework specifically designed for video bokeh generation. The framework uses multi-plane image (MPI) representation to condition the video diffusion model, providing explicit geometric guidance for generating depth-aware bokeh effects with spatial accuracy.

1 retrieved paper
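
As a rough architectural illustration of this contribution, the PyTorch sketch below conditions a one-step model by concatenating MPI layers to the input video along the channel axis. This is an assumption: channel concatenation is one common conditioning mechanism, and a single Conv3d stands in for the pretrained video diffusion backbone so the sketch runs as written; the paper's actual architecture may differ.

```python
import torch
import torch.nn as nn

class OneStepMPIBokeh(nn.Module):
    def __init__(self, mpi_planes=8, channels=3):
        super().__init__()
        # Stand-in for a pretrained video diffusion backbone (UNet/DiT).
        self.backbone = nn.Conv3d(channels + mpi_planes, channels,
                                  kernel_size=3, padding=1)

    def forward(self, video, mpi):
        # video: (B, 3, T, H, W); mpi: (B, P, T, H, W), focal-plane-aligned.
        x = torch.cat([video, mpi], dim=1)
        # One forward pass per clip: no iterative denoising loop.
        return self.backbone(x)

model = OneStepMPIBokeh()
out = model(torch.randn(1, 3, 8, 64, 64), torch.randn(1, 8, 8, 64, 64))
```
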
Progressive training strategy for temporal stability and detail preservation

The authors develop a three-stage progressive training approach that enhances temporal coherence, reduces flickering through extended temporal windows with data perturbations, and refines subject details using a VAE-based enhancement module to improve overall video quality.

5 retrieved papers
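
A schematic of how such a three-stage schedule could be wired together is shown below. The stage names, clip lengths, perturbation magnitude, and the sample_clips/refiner interfaces are all hypothetical; only the overall progression (longer temporal windows plus perturbations, then VAE-based refinement) follows the description above.

```python
import torch

# Hypothetical stage configs: stage 2 extends the temporal window and adds
# input perturbations; stage 3 switches on a VAE-based detail refiner.
STAGES = [
    {"name": "base",     "clip_len": 8,  "perturb": 0.00, "refine": False},
    {"name": "temporal", "clip_len": 24, "perturb": 0.05, "refine": False},
    {"name": "detail",   "clip_len": 24, "perturb": 0.05, "refine": True},
]

def progressive_train(model, refiner, sample_clips, optimizer, loss_fn,
                      steps_per_stage=1000):
    for stage in STAGES:
        for _ in range(steps_per_stage):
            clip, target = sample_clips(stage["clip_len"])
            if stage["perturb"] > 0:
                # Perturb inputs to encourage robustness to noisy depth
                # (stand-in for the paper's data perturbations).
                clip = clip + stage["perturb"] * torch.randn_like(clip)
            pred = model(clip)
            if stage["refine"]:
                pred = refiner(pred)  # VAE-based detail enhancement
            loss = loss_fn(pred, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```
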
Framework enabling arbitrary focal plane and bokeh intensity control

The framework provides users with explicit control mechanisms to customize both the focal plane location and bokeh intensity in arbitrary input videos, enabling flexible and controllable depth-of-field effects for various content creation applications.

10 retrieved papers (can refute)
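
The sketch below illustrates the kind of user-facing control surface this contribution implies: a focal depth and a blur strength chosen per video. The function names, parameter ranges, and the build_mpi helper are hypothetical, not the paper's API.

```python
import numpy as np

def build_mpi(depth, focus_depth, n_planes=8):
    """Placeholder: soft assignment of pixels to depth planes, with the
    plane axis ordered by distance from the chosen focal plane."""
    planes = np.linspace(0.0, 1.0, n_planes)
    dist = np.abs(depth[..., None] - planes)               # (T, H, W, P)
    layers = np.exp(-dist ** 2 / (2 * (1.0 / n_planes) ** 2))
    order = np.argsort(np.abs(planes - focus_depth))       # focus-aligned
    return layers[..., order]

def render_bokeh(model, video, depth, focus_depth=0.4, intensity=2.0):
    """video: (T, H, W, 3); depth: (T, H, W) in [0, 1]. focus_depth picks
    the in-focus plane; intensity scales blur, 0 keeps the input sharp."""
    mpi = build_mpi(depth, focus_depth)
    return model(video, mpi, intensity)

# Rack focus is just a sweep over focus_depth:
# clips = [render_bokeh(model, video, depth, f) for f in (0.2, 0.5, 0.8)]
```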

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: First one-step diffusion framework for controllable video bokeh with MPI-guided conditioning

Contribution 2: Progressive training strategy for temporal stability and detail preservation

Contribution 3: Framework enabling arbitrary focal plane and bokeh intensity control