Abstract:

The challenge of camera-controlled novel view synthesis (NVS) lies in balancing high visual fidelity with strict faithfulness to the source scene. We argue that the current dominant approaches, which rely on finetuning large-scale diffusion models, often over-emphasize fidelity while struggling with faithfulness because of their generative nature. To address this, we propose a zero-shot NVS pipeline that prioritizes faithfulness and efficiency. Our method introduces two key contributions applied during inference: (1) Test-time Latent Homography Deformation, an on-the-fly homography optimization that deforms latents for global motion consistency, and (2) Spatially Adaptive RePaint (SA-RePaint), an extension of RePaint that achieves both structural consistency and texture fidelity through a mathematically grounded, region-wise balancing of the two objectives. Our evaluations demonstrate substantial improvements in faithfulness and camera accuracy with competitive perceptual scores, highlighting a successful integration of faithfulness, quality, and efficiency. This work offers a promising direction for NVS that rebalances the field's focus toward greater authenticity.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a zero-shot novel view synthesis pipeline combining test-time latent homography deformation and spatially adaptive RePaint to balance faithfulness and fidelity. It resides in the Camera-Conditioned Latent Diffusion leaf, which contains three papers including this work. This leaf sits within the broader Diffusion-Based Novel View Synthesis branch, indicating a moderately active research direction focused on explicit camera parameter conditioning in latent diffusion models. The taxonomy shows this is a growing but not overcrowded area, with sibling leaves exploring video diffusion and text-guided synthesis approaches.

The taxonomy reveals neighboring work in Video Diffusion for Multi-View Generation (five papers) and Zero-Shot Diffusion-Based View Synthesis (one paper), suggesting the field is divided between pursuing multi-view consistency and zero-shot generalization. The paper's emphasis on zero-shot inference without fine-tuning distinguishes it from camera-conditioned methods that require task-specific training. Nearby branches such as 3D Representation-Based approaches (Gaussian splatting, NeRF) and Learning-Based Image Warping offer alternative paradigms that prioritize explicit geometry over generative priors, underscoring the paper's choice to leverage pretrained diffusion models rather than reconstruct 3D structure.

Among the twenty-three candidates examined, none clearly refuted the claimed contributions. For Test-time Latent Homography Deformation, three candidates were examined with zero refutations, suggesting limited direct precedent for on-the-fly homography optimization in latent space. For Spatially Adaptive RePaint, ten candidates were examined with no refutations, indicating that the region-wise balancing mechanism may be a novel extension of existing inpainting techniques. The zero-shot pipeline contribution was likewise compared against ten candidates without refutation, though the limited search scope means potentially relevant work in the broader diffusion or warping literature may not have been captured.

Based on top-twenty-three semantic matches, the work appears to occupy a distinct position combining zero-shot inference, homography-based latent deformation, and spatially adaptive refinement. The taxonomy context suggests this sits at the intersection of camera-conditioned diffusion and zero-shot synthesis, an area with sparse prior exploration. However, the analysis does not cover exhaustive literature in related warping methods or broader diffusion inpainting techniques, leaving open questions about incremental versus transformative novelty relative to the full field.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 23
Refutable Papers: 0

Research Landscape Overview

Core task: camera-controlled novel view synthesis from single images.

The field has evolved into a rich landscape of approaches that can be broadly organized around how they represent and generate new viewpoints. Diffusion-Based Novel View Synthesis methods leverage powerful generative priors from pretrained diffusion models, often conditioning on camera parameters to guide the synthesis process, as seen in works like Zero-1-to-3[7] and SV3D[2]. In parallel, 3D Representation-Based approaches build explicit geometric structures such as neural radiance fields or Gaussian splats to enable view interpolation, while Generative Models with 3D Awareness embed camera control directly into GAN or transformer architectures. Learning-Based Image Warping and Refinement techniques focus on transforming input pixels through learned flow or depth, and Transformer-Based methods exploit attention mechanisms for multi-view reasoning. Additional branches address Sparse and Limited Observation scenarios, Specialized Domains like autonomous driving or hand-object interactions, 4D and Dynamic Scene extensions, and the critical challenge of Multi-View Consistency enforced through 3D priors or collaborative diffusion strategies.

Recent work has intensified around camera-conditioned latent diffusion, where models inject precise pose information into the generative process to achieve fine-grained viewpoint control. Adaptive Latent Modulation[0] sits squarely in this active cluster, proposing mechanisms to modulate latent features based on camera parameters within a diffusion framework. Neighboring efforts such as CamCtrl3D[4] and ReCamDriving[41] similarly emphasize explicit camera conditioning but may differ in their domain focus or architectural choices for encoding pose.
Compared to earlier warping-based methods like Genwarp[5] or TOSS[3], which rely on geometric transformations and refinement networks, Adaptive Latent Modulation[0] leverages the generative flexibility of diffusion models to handle larger viewpoint changes and more complex scene content. This positioning reflects a broader trend toward integrating strong generative priors with structured camera control, balancing photorealism with geometric plausibility across diverse single-image synthesis scenarios.

Claimed Contributions

Test-time Latent Homography Deformation

A lightweight optimization method that resolves drifting synthesis in inpainted regions by deforming the latent tensor during inference to align with rendered images, ensuring the entire scene moves coherently with the camera motion.

3 retrieved papers
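To make the idea above concrete, the deformation can be sketched as fitting a single homography that warps the latent tensor toward the latent of the rendered target view. The NumPy sketch below is an illustrative assumption, not the paper's implementation: the function names are hypothetical, and the finite-difference optimizer with backtracking stands in for what would more plausibly be an autograd-based optimization (e.g. PyTorch's `grid_sample`).

```python
# Illustrative sketch (not the paper's code): fit one homography that warps a
# latent tensor to best match the latent of the rendered target view.
import numpy as np

def warp_latent(latent, H):
    """Warp a (C, H, W) latent with a 3x3 homography via inverse mapping and
    bilinear sampling; samples falling outside the source contribute zero."""
    C, Hh, Ww = latent.shape
    ys, xs = np.mgrid[0:Hh, 0:Ww]
    tgt = np.stack([xs.ravel(), ys.ravel(), np.ones(Hh * Ww)]).astype(float)
    src = np.linalg.inv(H) @ tgt                    # back-project target pixels
    sx, sy = src[0] / src[2], src[1] / src[2]
    x0, y0 = np.floor(sx).astype(int), np.floor(sy).astype(int)
    fx, fy = sx - x0, sy - y0
    out = np.zeros((C, Hh * Ww))
    for dy in (0, 1):                               # four bilinear corners
        for dx in (0, 1):
            xi, yi = x0 + dx, y0 + dy
            ok = (xi >= 0) & (xi < Ww) & (yi >= 0) & (yi < Hh)
            w = np.abs(1 - dx - fx) * np.abs(1 - dy - fy) * ok
            out += w * latent[:, np.clip(yi, 0, Hh - 1), np.clip(xi, 0, Ww - 1)]
    return out.reshape(C, Hh, Ww)

def fit_homography(latent, target, steps=50, lr=1e-4, eps=1e-4):
    """Test-time optimization: descend on the 8 free homography parameters to
    minimize MSE between the warped latent and the target-view latent."""
    p = np.array([1., 0., 0., 0., 1., 0., 0., 0.])  # identity homography
    loss = lambda p: np.mean(
        (warp_latent(latent, np.append(p, 1.0).reshape(3, 3)) - target) ** 2)
    for _ in range(steps):
        g = np.array([(loss(p + eps * np.eye(8)[i]) - loss(p)) / eps
                      for i in range(8)])            # finite-difference gradient
        q = p - lr * g
        if loss(q) < loss(p):                        # backtracking: never worsen
            p = q
        else:
            lr *= 0.5
    return np.append(p, 1.0).reshape(3, 3)
```

The backtracking step guarantees the fitted homography is never worse than the identity, mirroring the contribution's framing as a lightweight, strictly inference-time refinement.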
Spatially Adaptive RePaint (SA-RePaint)

An extension to RePaint that overcomes the structure-texture trade-off by making the balance between reliance on rendered images and generative freedom spatially auto-adaptive, allowing globally coherent structures while producing rich new textures.

10 retrieved papers
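The spatially adaptive balance described above can be sketched as replacing RePaint's hard binary-mask paste with a continuous per-pixel weight that interpolates between the noised rendering (structural anchor) and the model's denoised sample (texture freedom). The following NumPy sketch is an illustrative assumption, not the paper's formulation: the function names are hypothetical, and the box-blur softening of the warp-validity mask stands in for the paper's mathematically grounded region-wise weighting.

```python
# Illustrative sketch (not the paper's code) of one spatially adaptive
# RePaint-style composition step with a continuous reliance map.
import numpy as np

def soften(valid_mask, iters=3):
    """Turn a binary warp-validity mask into a continuous reliance map:
    ~1 deep inside reliably rendered regions, 0 inside holes, and a smooth
    ramp near hole boundaries where hard-mask seams would otherwise appear."""
    w = valid_mask.astype(float)
    for _ in range(iters):                      # cross-shaped box blur
        p = np.pad(w, 1, mode="edge")
        w = (p[1:-1, 1:-1] + p[:-2, 1:-1] + p[2:, 1:-1]
             + p[1:-1, :-2] + p[1:-1, 2:]) / 5.0
    return w * valid_mask                       # holes stay fully generative

def sa_repaint_step(x_denoised, x_render_noised, weight):
    """Per-pixel convex combination replacing RePaint's binary-mask paste:
    weight=1 trusts the rendering, weight=0 gives the model full freedom."""
    return weight * x_render_noised + (1.0 - weight) * x_denoised
```

In a full pipeline this composition would run once per denoising step, so the render dominates where the warp is reliable while the diffusion model is free to synthesize texture in and around the holes.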
Zero-shot NVS pipeline prioritizing faithfulness and efficiency

A training-free novel view synthesis pipeline that achieves high faithfulness to the source scene and computational efficiency, without costly retraining or large-scale finetuning, running in under 11 GB of VRAM.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Test-time Latent Homography Deformation


Contribution

Spatially Adaptive RePaint (SA-RePaint)


Contribution

Zero-shot NVS pipeline prioritizing faithfulness and efficiency
