Generative View Stitching

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Video Generation, Camera-guided Video Generation, Video Diffusion Models
Abstract:

Autoregressive video diffusion models are capable of extremely long rollouts that are stable and consistent with history, but they are unable to guide the current generation with conditioning from the future. In camera-guided video generation with a predefined camera trajectory, this limitation leads to collisions with the generated scene, after which autoregression quickly collapses. To address this, we propose Generative View Stitching (GVS), which samples the entire sequence in parallel such that the generated scene is faithful to every part of the predefined camera trajectory. Our main contribution is a sampling algorithm that extends prior work on diffusion stitching for robot planning to video generation. While such stitching methods usually require a specially trained model, GVS is compatible with any off-the-shelf video model trained with Diffusion Forcing, a prevalent sequence diffusion framework that we show already provides the affordances necessary for stitching. We then introduce OmniGuidance, a technique that enhances the temporal consistency in stitching by conditioning on both the past and future, and that enables our proposed loop-closing mechanism for delivering long-range coherence. Overall, GVS achieves camera-guided video generation that is stable, collision-free, frame-to-frame consistent, and closes loops for a variety of predefined camera paths, including Oscar Reutersvärd’s Impossible Staircase.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Generative View Stitching (GVS), a sampling algorithm that enables parallel generation of entire video sequences conditioned on predefined camera trajectories, addressing collision and autoregression collapse issues in camera-guided video synthesis. It resides in the Explicit Camera Pose Conditioning leaf, which contains six papers including CameraCtrl, MotionCtrl, and CamViG. This leaf sits within the broader Camera Trajectory Parameterization and Control Mechanisms branch, indicating a moderately populated research direction focused on direct pose-based conditioning rather than implicit motion encodings.

The taxonomy reveals neighboring leaves addressing related but distinct challenges: Implicit Motion Representations (four papers using optical flow or learned encodings) and Training-Free and Plug-and-Play Control (two papers enabling inference-time guidance without fine-tuning). The paper's emphasis on parallel sampling and stitching connects it to Multi-View and 3D-Consistent Generation themes, particularly the Multi-Camera Synchronized Generation and Depth-Guided methods that also prioritize geometric coherence. The scope notes clarify that explicit pose conditioning excludes trajectory-based or flow-driven approaches, positioning GVS firmly within methods that directly parameterize camera extrinsics.

Among the three contributions analyzed, the GVS sampling algorithm was examined against zero candidates, while OmniGuidance (bidirectional temporal conditioning) was compared to ten candidates with none providing clear refutation, and the loop-closing mechanism was assessed against four candidates, also without refutation. The total search examined fourteen candidates across all contributions, a limited scope that captures immediate neighbors but does not constitute exhaustive coverage. The absence of refutable prior work among these candidates suggests that the specific combination of parallel stitching, bidirectional conditioning, and loop-closing for camera-guided generation may be relatively unexplored within the examined literature.

Based on the limited search scope of fourteen candidates, the work appears to introduce a distinct approach within its taxonomy leaf, particularly in addressing autoregression collapse through parallel sampling and future conditioning. However, the analysis does not cover the full fifty-paper taxonomy or broader diffusion stitching literature beyond top semantic matches. The novelty assessment is therefore constrained to the examined subset, and a more comprehensive search might reveal additional overlapping methods in related diffusion planning or multi-view synthesis domains.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 14
Refutable Papers: 0

Research Landscape Overview

Core task: camera-guided video generation with predefined trajectories. This field centers on enabling precise control over virtual camera motion in synthesized videos, allowing users to specify viewpoint changes independently of scene content.

The taxonomy reveals a rich landscape organized around several complementary themes. Camera Trajectory Parameterization and Control Mechanisms explores how camera paths are represented and injected into generation models, ranging from explicit pose conditioning (as in CameraCtrl[12] and MotionCtrl[9]) to more implicit or learned encodings. Object and Camera Motion Disentanglement addresses the challenge of separating foreground dynamics from viewpoint changes, ensuring that camera movement does not inadvertently alter object behavior. Multi-View and 3D-Consistent Generation focuses on maintaining geometric coherence across frames and viewpoints, often leveraging 3D representations or depth cues. Unified Motion Control Frameworks seek to integrate camera control with other motion signals such as object trajectories or user sketches (e.g., Direct-a-Video[3], Boximator[39]), while User Interaction and Controllability Enhancement emphasizes intuitive interfaces and flexible input modalities. Training Strategies and Optimization, Specialized Applications, and Supporting Methodologies round out the taxonomy by addressing learning paradigms, domain-specific extensions, and foundational tools like datasets or evaluation metrics.

Recent work has intensified efforts to achieve robust camera control without sacrificing content quality or temporal consistency. A central tension lies between training-free methods that adapt pretrained models via test-time guidance (Training-free Camera Control[18]) and approaches that fine-tune or train from scratch with camera annotations (CamViG[5], Gen3C[2]).
Another active line explores how to handle complex, non-linear trajectories and ensure 3D consistency, with methods like CamCo[45] and CamPVG[41] proposing novel conditioning schemes or geometric constraints. Generative View Stitching[0] situates itself within the Explicit Camera Pose Conditioning cluster, emphasizing direct parameterization of camera extrinsics to guide synthesis. Compared to neighbors such as CamViG[5], which also conditions on explicit poses, and CamCo[45], which integrates camera and object motion jointly, Generative View Stitching[0] appears to prioritize seamless multi-view stitching and coherent trajectory following. This positioning reflects broader debates in the field about the trade-offs between control granularity, generalization to diverse scenes, and computational efficiency.

Claimed Contributions

Generative View Stitching (GVS) sampling algorithm

A training-free diffusion stitching method for camera-guided video generation that extends prior diffusion stitching work to video by sampling all frames in parallel. GVS is compatible with any off-the-shelf video model trained with Diffusion Forcing and enables collision-free generation faithful to predefined camera trajectories.

0 retrieved papers
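The parallel-sampling idea described above can be illustrated with a minimal sketch of diffusion stitching: overlapping chunks of a long sequence are denoised at the same noise level on every step, and overlapping frames are averaged so neighboring windows agree. The `denoise_step` callable, the toy feature dimension, and the overlap-averaging rule are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def stitched_sample(num_chunks, chunk_len, overlap, denoise_step, num_steps, rng):
    """Sketch of parallel diffusion stitching over a long frame sequence.

    `denoise_step(window, t)` stands in for one reverse-diffusion step of an
    off-the-shelf Diffusion Forcing video model (hypothetical interface).
    """
    stride = chunk_len - overlap
    total = stride * (num_chunks - 1) + chunk_len
    seq = rng.standard_normal((total, 8))  # toy per-frame features

    for t in reversed(range(num_steps)):
        acc = np.zeros_like(seq)
        count = np.zeros((total, 1))
        for c in range(num_chunks):          # every chunk denoised in parallel
            s = c * stride
            acc[s:s + chunk_len] += denoise_step(seq[s:s + chunk_len], t)
            count[s:s + chunk_len] += 1
        seq = acc / count                    # average overlapping predictions
    return seq
```

Because every chunk is advanced through the same noise schedule, no chunk ever conditions on a fully generated past alone, which is what lets the sample stay faithful to future parts of the trajectory.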
OmniGuidance technique

A novel guidance mechanism that strengthens conditioning on both past and future frames to improve temporal consistency in stitched video generation. This technique modifies the score function to steer the joint distribution toward the desired conditional distribution and enables partial stochasticity to reduce oversmoothing.

10 retrieved papers
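A plausible reading of the score-function modification is a classifier-free-guidance-style composition: the noise prediction conditioned on both past and future is amplified relative to a past-only prediction by a guidance weight. The two-term form and the weight `w` below are assumptions for illustration, not the paper's exact formula.

```python
import numpy as np

def omni_guided_eps(eps_both, eps_past, w):
    """CFG-style sketch of bidirectional guidance.

    eps_both : noise prediction conditioned on past AND future context
    eps_past : noise prediction conditioned on past context only
    w        : guidance weight; w = 0 recovers the bidirectional prediction
    """
    # Steer the joint distribution toward the conditional one by pushing
    # the sample along the direction that the future conditioning adds.
    return eps_past + (1.0 + w) * (eps_both - eps_past)
```

With `w = 0` this reduces to the bidirectionally conditioned prediction; larger `w` strengthens the influence of future context, which is consistent with the description of conditioning on both past and future to improve temporal consistency.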
Loop-closing mechanism via cyclic conditioning

A mechanism that achieves long-range consistency and visual loop closure by alternating between temporal windows (conditioning on temporally neighboring chunks) and spatial windows (conditioning on temporally distant but spatially close chunks) during the denoising process.

4 retrieved papers
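The alternation between temporal and spatial windows can be sketched as a conditioning schedule over denoising steps. The even/odd alternation and the `loop_pairs` mapping (from a chunk to a temporally distant but spatially close chunk) are assumptions about how the schedule might be specified, purely for illustration.

```python
def conditioning_partners(step, chunk_id, num_chunks, loop_pairs):
    """Illustrative sketch of cyclic conditioning for loop closure.

    On even denoising steps, each chunk conditions on its temporal
    neighbours; on odd steps, it conditions on its loop-closing partner,
    a temporally distant but spatially close chunk (if one exists).
    """
    if step % 2 == 0:
        # temporal window: immediately neighbouring chunks
        return [c for c in (chunk_id - 1, chunk_id + 1) if 0 <= c < num_chunks]
    # spatial window: loop-closing partner, when the trajectory revisits
    # a location seen by this chunk
    return [loop_pairs[chunk_id]] if chunk_id in loop_pairs else []
```

Interleaving the two window types lets short-range consistency (temporal neighbours) and long-range consistency (spatial partners) both constrain the same denoising trajectory.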

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
