True Self-Supervised Novel View Synthesis is Transferable

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Novel View Synthesis, Self-Supervised, Unsupervised, Representation Learning
Abstract:

In this paper, we identify that the key criterion for determining whether a model is truly capable of novel view synthesis (NVS) is transferability: whether a pose representation extracted from one video sequence can be used to re-render the same camera trajectory in another. We analyze prior work on self-supervised NVS and find that their predicted poses do not transfer: the same set of poses leads to different camera trajectories in different 3D scenes. Here, we present XFactor, the first geometry-free self-supervised model capable of true NVS. XFactor combines pair-wise pose estimation with a simple augmentation scheme over the inputs and outputs that jointly enables disentangling camera pose from scene content and facilitates geometric reasoning. Remarkably, we show that XFactor achieves transferability with unconstrained latent pose variables, without any 3D inductive biases or concepts from multi-view geometry, such as an explicit parameterization of poses as elements of SE(3). We introduce a new metric to quantify transferability, and through large-scale experiments we demonstrate that XFactor significantly outperforms prior pose-free NVS transformers and show, via probing experiments, that its latent poses are highly correlated with real-world poses.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces XFactor, a self-supervised novel view synthesis model that learns transferable latent pose representations without explicit geometric parameterization. According to the taxonomy tree, this work resides in the 'Transferable Latent Pose Representations' leaf under 'Pose-Free Novel View Synthesis'. Notably, this leaf contains only the original paper itself, with no sibling papers listed, suggesting a relatively sparse or newly defined research direction within the twelve surveyed papers spread across multiple branches.

The taxonomy reveals that neighboring research directions include 'Self-Supervised Multi-View Reconstruction' (two papers) and 'Visibility-Aware View Synthesis' (one paper), both under the same parent branch. The scope notes clarify that transferable latent pose methods explicitly exclude non-transferable encodings and explicit SE(3) parameterizations, distinguishing them from pose estimation approaches and pose-conditioned synthesis branches. This positioning suggests the work bridges pose-free synthesis with viewpoint learning, diverging from methods requiring ground-truth poses or category-specific annotations.

Among thirty candidates examined, the contribution-level analysis shows mixed novelty signals. The transferability criterion (Contribution A) examined ten candidates with zero refutations, suggesting this framing may be relatively novel. However, the core XFactor model (Contribution B) examined ten candidates and found one refutable prior work, indicating some overlap in the geometry-free self-supervised NVS space. The stereo-monocular architecture with pose-preserving augmentations (Contribution C) also examined ten candidates with no refutations, suggesting this technical approach may be less explored in prior literature.

Given the limited search scope of thirty semantically similar papers, the analysis provides a snapshot rather than exhaustive coverage. The isolated taxonomy position and low refutation rates suggest the work occupies a relatively underexplored niche, though the single refutation for the core model indicates at least one prior effort in geometry-free self-supervised synthesis. The transferability framing and augmentation strategy appear more distinctive within the examined candidate set.

Taxonomy

- Core-task Taxonomy Papers: 12
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 30
- Refutable Papers: 1

Research Landscape Overview

Core task: self-supervised novel view synthesis with transferable pose representations. The field organizes around four main branches that reflect different strategies for learning viewpoint-aware generative models without full supervision. Pose-Free Novel View Synthesis explores methods that sidestep explicit pose estimation by learning latent representations that capture viewpoint transformations implicitly, often relying on consistency losses or contrastive objectives to disentangle identity from pose. Self-Supervised Viewpoint and Pose Estimation focuses on extracting camera parameters or object orientations from unlabeled data, enabling downstream rendering tasks. Pose-Conditioned Synthesis and Animation assumes access to pose labels or skeletons and emphasizes controllable generation, such as animating characters or objects given target poses. Cross-Modal Self-Supervised Learning leverages multiple modalities, such as images, point clouds, or text, to build richer representations that generalize across domains. Representative works like ViewCLR[3] and PointVST[4] illustrate how contrastive learning can bridge modalities, while earlier efforts such as ShapeCodes[11] and Viewpoint Learning[9] laid foundations for disentangling shape and viewpoint.

A particularly active line of work within Pose-Free Novel View Synthesis investigates transferable latent pose representations that can generalize across object categories without category-specific pose annotations. Transferable Novel View[0] sits squarely in this cluster, emphasizing the learning of pose codes that transfer to unseen classes, contrasting with approaches like STaR[5] that may rely on stronger geometric priors or ViewCLR[3], which uses contrastive objectives primarily for representation learning rather than explicit synthesis.

Meanwhile, methods such as Animatable Gaussians[2] and Weakly Supervised Pose Transfer[6] operate in the pose-conditioned regime, trading off the flexibility of unsupervised pose discovery for tighter control when skeletal or keypoint annotations are available. The central tension across these branches is whether to invest in explicit pose estimation, accept weaker supervision with latent codes, or exploit cross-modal signals to improve generalization, with each choice shaping the scalability and transferability of the resulting models.

Claimed Contributions

Transferability as the key criterion for true novel view synthesis

The authors propose that transferability—the ability to apply camera poses extracted from one scene to render the same trajectory in a different scene—is the fundamental property distinguishing true NVS from frame interpolation. They formalize this as a criterion and introduce the True Pose Similarity (TPS) metric to quantify it.

10 retrieved papers
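The paper's exact True Pose Similarity (TPS) definition is not reproduced in this report. As illustration only, the following is a minimal sketch, with hypothetical function names, of one plausible way to quantify whether a pose sequence transfers: if re-rendering the same latent poses in two scenes induces the same relative camera motions, a relative-pose discrepancy between the two recovered trajectories should be near zero.

```python
import numpy as np

def relative_transforms(traj):
    """Consecutive relative poses inv(T_t) @ T_{t+1} for a list of 4x4 poses."""
    return [np.linalg.inv(a) @ b for a, b in zip(traj, traj[1:])]

def trajectory_agreement(traj_a, traj_b):
    """Mean translation discrepancy between corresponding relative motions.

    If a latent pose sequence truly transfers, re-rendering it in two
    different scenes should induce the same relative camera motions,
    so this value should be near zero regardless of where each
    trajectory starts in absolute coordinates."""
    errs = []
    for ra, rb in zip(relative_transforms(traj_a), relative_transforms(traj_b)):
        d = np.linalg.inv(ra) @ rb          # residual motion between the two
        errs.append(np.linalg.norm(d[:3, 3]))  # translation part of residual
    return float(np.mean(errs))
```

Note that comparing relative (not absolute) transforms makes the score invariant to a global rigid offset between the two scenes, which is the property a transfer test needs; this is a standard relative-pose-error construction, not the paper's metric.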
XFactor: first geometry-free self-supervised model for true NVS

The authors introduce XFactor, a novel self-supervised NVS model that achieves transferability without relying on 3D inductive biases or multi-view geometry concepts. It combines a stereo-monocular architecture with a transferability training objective and pose-preserving augmentations to learn disentangled, transferable camera pose representations.

10 retrieved papers
Can Refute
Stereo-monocular model with transferability objective and pose-preserving augmentations

The authors propose a training approach that bootstraps from a two-view (stereo-monocular) model to prevent interpolation, combined with an explicit transferability objective and pose-preserving augmentations (such as inverse masking) that minimize pixel overlap while maintaining pose information. This design prevents information leakage and encourages learning of transferable pose representations.

10 retrieved papers
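The report does not spell out the augmentation mechanics. As a rough illustration, here is a minimal sketch, with hypothetical names and parameters, of one plausible "inverse masking" scheme: the two views receive complementary patch masks, so they share almost no pixels and the model cannot succeed by copying content, while coarse pose cues survive in the unmasked patches.

```python
import numpy as np

def inverse_mask_pair(img_a, img_b, patch=8, rng=None):
    """Apply complementary patch masks to a context/target image pair.

    img_a keeps exactly the patches that img_b drops and vice versa,
    so the two masked views overlap on no pixels; this discourages a
    model from interpolating pixel content instead of reasoning about
    the camera pose relating the views."""
    rng = np.random.default_rng(rng)
    h, w = img_a.shape[:2]
    gh, gw = h // patch, w // patch
    keep = rng.random((gh, gw)) < 0.5   # which patches img_a keeps
    mask_a = np.kron(keep.astype(np.uint8),
                     np.ones((patch, patch), dtype=np.uint8)).astype(bool)
    out_a = np.where(mask_a[..., None], img_a, 0.0)
    out_b = np.where(~mask_a[..., None], img_b, 0.0)  # complementary mask
    return out_a, out_b
```

Because the masks are exact complements, every pixel location is visible in exactly one of the two outputs; a finer `patch` size trades off how much local appearance leaks through against how much layout (and hence pose) information is preserved.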

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the currently retrieved top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution A

Transferability as the key criterion for true novel view synthesis

The authors propose that transferability—the ability to apply camera poses extracted from one scene to render the same trajectory in a different scene—is the fundamental property distinguishing true NVS from frame interpolation. They formalize this as a criterion and introduce the True Pose Similarity (TPS) metric to quantify it.

Contribution B

XFactor: first geometry-free self-supervised model for true NVS

The authors introduce XFactor, a novel self-supervised NVS model that achieves transferability without relying on 3D inductive biases or multi-view geometry concepts. It combines a stereo-monocular architecture with a transferability training objective and pose-preserving augmentations to learn disentangled, transferable camera pose representations.

Contribution C

Stereo-monocular model with transferability objective and pose-preserving augmentations

The authors propose a training approach that bootstraps from a two-view (stereo-monocular) model to prevent interpolation, combined with an explicit transferability objective and pose-preserving augmentations (such as inverse masking) that minimize pixel overlap while maintaining pose information. This design prevents information leakage and encourages learning of transferable pose representations.
