Abstract:

Diffusion models have recently emerged as powerful tools for camera simulation, enabling both geometric transformations and realistic optical effects. Among these, image-based bokeh rendering has shown promising results, but diffusion-based video bokeh remains unexplored. Existing image-based methods suffer from temporal flickering and inconsistent blur transitions when applied to video, while current video editing methods lack explicit control over the focus plane and bokeh intensity. These issues limit their applicability to controllable video bokeh. In this work, we propose a one-step diffusion framework for generating temporally coherent, depth-aware video bokeh. The framework conditions the video diffusion model on a multi-plane image (MPI) representation adapted to the focal plane, enabling it to exploit the strong 3D priors of pretrained backbones. To further enhance temporal stability, depth robustness, and detail preservation, we introduce a progressive training strategy. Experiments on synthetic and real-world benchmarks show that our method outperforms prior baselines in temporal coherence, spatial accuracy, and controllability. This work represents the first dedicated diffusion framework for video bokeh generation, establishing a new baseline for temporally coherent and controllable depth-of-field effects. Code will be made publicly available.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a one-step diffusion framework for video bokeh rendering that uses multi-plane image (MPI) representations adapted to the focal plane. According to the taxonomy, this work sits in the 'Multi-Plane Image Guided Video Diffusion' leaf, which contains only two papers total (including this one). The sibling paper in this leaf is Any-to-Bokeh One-Step, indicating a sparse research direction. The taxonomy shows the broader field divides into video bokeh generation with temporal coherence versus depth estimation for spatial effects, with this work clearly positioned in the former category.

This paper's leaf sits within the temporal coherence branch, which emphasizes maintaining flicker-free bokeh across frames. The neighboring depth estimation branch (containing Spatial Images Monocular) focuses on robust monocular depth prediction as a foundation for blur rendering. The scope note for this leaf explicitly excludes methods without MPI-based conditioning and those focusing on single images, clarifying that the MPI-guided approach distinguishes this work from purely depth-driven, frame-independent methods.

Among the sixteen candidates examined across three contributions, the analysis reveals mixed novelty signals. The first contribution (one-step diffusion with MPI conditioning) was compared against one candidate, with no refutations. The second contribution (progressive training strategy) was compared against five candidates, again with no refutations. The third contribution (arbitrary focal plane and bokeh intensity control), however, was compared against ten candidates, four of which appear to provide overlapping prior work. This suggests the control mechanism has more substantial precedent, at least within the limited search scope, while the one-step MPI framework and the progressive training strategy appear less anticipated by the examined candidates.

Based on the limited search of sixteen candidates, the work appears to occupy a sparse research direction (only two papers in its taxonomy leaf) with mixed novelty across contributions. The MPI-guided one-step framework and progressive training show fewer overlaps in the examined set, while the controllability aspect encounters more prior work. The analysis covers top-K semantic matches and does not represent an exhaustive literature review of all video bokeh or diffusion-based rendering methods.

Taxonomy

Core-task Taxonomy Papers: 2
Claimed Contributions: 3
Contribution Candidate Papers Compared: 16
Refutable Papers: 4

Research Landscape Overview

Core task: Video bokeh rendering with depth-aware diffusion models. The field centers on generating realistic depth-of-field effects in video, where the taxonomy divides into two main branches. The first branch, Video Bokeh Generation with Temporal Coherence, focuses on methods that maintain smooth, flicker-free bokeh across frames, often leveraging multi-plane image representations or video diffusion architectures to ensure consistency over time. The second branch, Depth Estimation for Spatial Image Effects, emphasizes robust monocular depth prediction as a foundation for spatially accurate blur rendering. These branches are complementary: temporal methods rely on good depth maps to guide coherent defocus, while depth estimation techniques provide the geometric scaffolding that spatial effects require.

Representative works like Any-to-Bokeh[0] and Any-to-Bokeh One-Step[1] illustrate how diffusion-based video models can integrate depth cues to produce temporally stable results. Within the temporal coherence branch, a particularly active line of work explores multi-plane image guided video diffusion, where depth layers are used to decompose the scene and apply selective blur in a temporally consistent manner. This contrasts with purely depth-driven approaches that may treat each frame independently, risking flicker or inconsistent bokeh shapes.

Any-to-Bokeh[0] sits squarely in this multi-plane guided cluster, emphasizing depth-aware diffusion to achieve smooth video bokeh. Its approach is closely related to Any-to-Bokeh One-Step[1], which shares the depth-aware diffusion philosophy but explores efficiency trade-offs by reducing inference steps. Meanwhile, methods in the depth estimation branch, such as Spatial Images Monocular[2], prioritize accurate single-image depth maps that can feed into any bokeh renderer. The main open question across these directions is how to balance temporal stability, depth accuracy, and computational cost in real-time or near-real-time video bokeh synthesis.
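
For contrast with the MPI-guided diffusion methods above, the sketch below shows the classical per-frame, depth-driven compositor they improve on: decompose each frame into soft depth planes, blur each plane in proportion to its distance from the focal plane, and composite. This is a minimal illustration, assuming a precomputed depth map; the plane count, soft-assignment scheme, and Gaussian blur model are illustrative choices, not parameters taken from any cited paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def per_frame_bokeh(frame, depth, focus_depth, intensity, n_planes=8):
    """frame: (H, W, 3) float image; depth: (H, W) map normalized to [0, 1]."""
    planes = np.linspace(0.0, 1.0, n_planes)
    out = np.zeros_like(frame)
    weight = np.zeros(depth.shape)
    for d in planes:
        # Soft mask assigning each pixel to this depth plane.
        mask = np.exp(-((depth - d) ** 2) / (2 * (1.0 / n_planes) ** 2))
        # Blur radius grows with distance from the focal plane.
        sigma = intensity * abs(d - focus_depth)
        blurred = gaussian_filter(frame, sigma=(sigma, sigma, 0))
        out += blurred * mask[..., None]
        weight += mask
    return out / np.maximum(weight[..., None], 1e-6)
```

Because nothing ties consecutive frames together, small depth-estimation jitter shifts the plane masks from frame to frame, which is exactly the flicker that the temporal-coherence branch targets.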

Claimed Contributions

First one-step diffusion framework for controllable video bokeh with MPI-guided conditioning

The authors introduce a novel one-step diffusion framework specifically designed for video bokeh generation. The framework uses multi-plane image (MPI) representation to condition the video diffusion model, providing explicit geometric guidance for generating depth-aware bokeh effects with spatial accuracy.

1 retrieved paper
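
As a rough architectural illustration of this contribution, the PyTorch sketch below conditions a one-step model by concatenating MPI layers to the input video along the channel axis. This is an assumption: channel concatenation is one common conditioning mechanism, and a single Conv3d stands in for the pretrained video diffusion backbone so the sketch runs as written; the paper's actual architecture may differ.

```python
import torch
import torch.nn as nn

class OneStepMPIBokeh(nn.Module):
    def __init__(self, mpi_planes=8, channels=3):
        super().__init__()
        # Stand-in for a pretrained video diffusion backbone (UNet/DiT).
        self.backbone = nn.Conv3d(channels + mpi_planes, channels,
                                  kernel_size=3, padding=1)

    def forward(self, video, mpi):
        # video: (B, 3, T, H, W); mpi: (B, P, T, H, W), focal-plane-aligned.
        x = torch.cat([video, mpi], dim=1)
        # One forward pass per clip: no iterative denoising loop.
        return self.backbone(x)

model = OneStepMPIBokeh()
out = model(torch.randn(1, 3, 8, 64, 64), torch.randn(1, 8, 8, 64, 64))
```
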
Progressive training strategy for temporal stability and detail preservation

The authors develop a three-stage progressive training approach that enhances temporal coherence, reduces flickering through extended temporal windows with data perturbations, and refines subject details using a VAE-based enhancement module to improve overall video quality.

5 retrieved papers
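
A schematic of how such a three-stage schedule could be wired together is shown below. The stage names, clip lengths, perturbation magnitude, and the sample_clips/refiner interfaces are all hypothetical; only the overall progression (longer temporal windows plus perturbations, then VAE-based refinement) follows the description above.

```python
import torch

# Hypothetical stage configs: stage 2 extends the temporal window and adds
# input perturbations; stage 3 switches on a VAE-based detail refiner.
STAGES = [
    {"name": "base",     "clip_len": 8,  "perturb": 0.00, "refine": False},
    {"name": "temporal", "clip_len": 24, "perturb": 0.05, "refine": False},
    {"name": "detail",   "clip_len": 24, "perturb": 0.05, "refine": True},
]

def progressive_train(model, refiner, sample_clips, optimizer, loss_fn,
                      steps_per_stage=1000):
    for stage in STAGES:
        for _ in range(steps_per_stage):
            clip, target = sample_clips(stage["clip_len"])
            if stage["perturb"] > 0:
                # Perturb inputs to encourage robustness to noisy depth
                # (stand-in for the paper's data perturbations).
                clip = clip + stage["perturb"] * torch.randn_like(clip)
            pred = model(clip)
            if stage["refine"]:
                pred = refiner(pred)  # VAE-based detail enhancement
            loss = loss_fn(pred, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```
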
Framework enabling arbitrary focal plane and bokeh intensity control

The framework provides users with explicit control mechanisms to customize both the focal plane location and bokeh intensity in arbitrary input videos, enabling flexible and controllable depth-of-field effects for various content creation applications.

10 retrieved papers (can refute)
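
The sketch below illustrates the kind of user-facing control surface this contribution implies: a focal depth and a blur strength chosen per video. The function names, parameter ranges, and the build_mpi helper are hypothetical, not the paper's API.

```python
import numpy as np

def build_mpi(depth, focus_depth, n_planes=8):
    """Placeholder: soft assignment of pixels to depth planes, with the
    plane axis ordered by distance from the chosen focal plane."""
    planes = np.linspace(0.0, 1.0, n_planes)
    dist = np.abs(depth[..., None] - planes)               # (T, H, W, P)
    layers = np.exp(-dist ** 2 / (2 * (1.0 / n_planes) ** 2))
    order = np.argsort(np.abs(planes - focus_depth))       # focus-aligned
    return layers[..., order]

def render_bokeh(model, video, depth, focus_depth=0.4, intensity=2.0):
    """video: (T, H, W, 3); depth: (T, H, W) in [0, 1]. focus_depth picks
    the in-focus plane; intensity scales blur, 0 keeps the input sharp."""
    mpi = build_mpi(depth, focus_depth)
    return model(video, mpi, intensity)

# Rack focus is just a sweep over focus_depth:
# clips = [render_bokeh(model, video, depth, f) for f in (0.2, 0.5, 0.8)]
```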

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: First one-step diffusion framework for controllable video bokeh with MPI-guided conditioning

Contribution 2: Progressive training strategy for temporal stability and detail preservation

Contribution 3: Framework enabling arbitrary focal plane and bokeh intensity control