True Self-Supervised Novel View Synthesis is Transferable

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Novel View Synthesis, Self-Supervised, Unsupervised, Representation Learning
Abstract:

In this paper, we identify that the key criterion for determining whether a model is truly capable of novel view synthesis (NVS) is transferability: whether a pose representation extracted from one video sequence can be used to re-render the same camera trajectory in another. We analyze prior work on self-supervised NVS and find that their predicted poses do not transfer: the same set of poses leads to different camera trajectories in different 3D scenes. Here, we present XFactor, the first geometry-free self-supervised model capable of true NVS. XFactor combines pair-wise pose estimation with a simple augmentation scheme over the inputs and outputs that jointly enables disentangling camera pose from scene content and facilitates geometric reasoning. Remarkably, we show that XFactor achieves transferability with unconstrained latent pose variables, without any 3D inductive biases or concepts from multi-view geometry, such as an explicit parameterization of poses as elements of SE(3). We introduce a new metric to quantify transferability, and through large-scale experiments we demonstrate that XFactor significantly outperforms prior pose-free NVS transformers and show, via probing experiments, that its latent poses are highly correlated with real-world poses.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces XFactor, a self-supervised novel view synthesis model that learns transferable latent pose representations without explicit geometric parameterization. According to the taxonomy tree, this work resides in the 'Transferable Latent Pose Representations' leaf under 'Pose-Free Novel View Synthesis'. Notably, this leaf contains only the original paper itself, with no sibling papers listed, suggesting a relatively sparse or newly defined research direction within the twelve surveyed papers spread across multiple branches.

The taxonomy reveals that neighboring research directions include 'Self-Supervised Multi-View Reconstruction' (two papers) and 'Visibility-Aware View Synthesis' (one paper), both under the same parent branch. The scope notes clarify that transferable latent pose methods explicitly exclude non-transferable encodings and explicit SE(3) parameterizations, distinguishing them from pose estimation approaches and pose-conditioned synthesis branches. This positioning suggests the work bridges pose-free synthesis with viewpoint learning, diverging from methods requiring ground-truth poses or category-specific annotations.

Among thirty candidates examined, the contribution-level analysis shows mixed novelty signals. The transferability criterion (Contribution A) examined ten candidates with zero refutations, suggesting this framing may be relatively novel. However, the core XFactor model (Contribution B) examined ten candidates and found one refutable prior work, indicating some overlap in the geometry-free self-supervised NVS space. The stereo-monocular architecture with pose-preserving augmentations (Contribution C) also examined ten candidates with no refutations, suggesting this technical approach may be less explored in prior literature.

Given the limited search scope of thirty semantically similar papers, the analysis provides a snapshot rather than exhaustive coverage. The isolated taxonomy position and low refutation rates suggest the work occupies a relatively underexplored niche, though the single refutation for the core model indicates at least one prior effort in geometry-free self-supervised synthesis. The transferability framing and augmentation strategy appear more distinctive within the examined candidate set.

Taxonomy

- Core-task Taxonomy Papers: 12
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 30
- Refutable Papers: 1

Research Landscape Overview

Core task: self-supervised novel view synthesis with transferable pose representations. The field organizes around four main branches that reflect different strategies for learning viewpoint-aware generative models without full supervision. Pose-Free Novel View Synthesis explores methods that sidestep explicit pose estimation by learning latent representations that capture viewpoint transformations implicitly, often relying on consistency losses or contrastive objectives to disentangle identity from pose. Self-Supervised Viewpoint and Pose Estimation focuses on extracting camera parameters or object orientations from unlabeled data, enabling downstream rendering tasks. Pose-Conditioned Synthesis and Animation assumes access to pose labels or skeletons and emphasizes controllable generation, such as animating characters or objects given target poses. Cross-Modal Self-Supervised Learning leverages multiple modalities, such as images, point clouds, or text, to build richer representations that generalize across domains. Representative works like ViewCLR[3] and PointVST[4] illustrate how contrastive learning can bridge modalities, while earlier efforts such as ShapeCodes[11] and Viewpoint Learning[9] laid foundations for disentangling shape and viewpoint.

A particularly active line of work within Pose-Free Novel View Synthesis investigates transferable latent pose representations that can generalize across object categories without category-specific pose annotations. Transferable Novel View[0] sits squarely in this cluster, emphasizing the learning of pose codes that transfer to unseen classes, contrasting with approaches like STaR[5] that may rely on stronger geometric priors or ViewCLR[3], which uses contrastive objectives primarily for representation learning rather than explicit synthesis.

Meanwhile, methods such as Animatable Gaussians[2] and Weakly Supervised Pose Transfer[6] operate in the pose-conditioned regime, trading off the flexibility of unsupervised pose discovery for tighter control when skeletal or keypoint annotations are available. The central tension across these branches is whether to invest in explicit pose estimation, accept weaker supervision with latent codes, or exploit cross-modal signals to improve generalization, with each choice shaping the scalability and transferability of the resulting models.

Claimed Contributions

Transferability as the key criterion for true novel view synthesis

The authors propose that transferability—the ability to apply camera poses extracted from one scene to render the same trajectory in a different scene—is the fundamental property distinguishing true NVS from frame interpolation. They formalize this as a criterion and introduce the True Pose Similarity (TPS) metric to quantify it.

10 retrieved papers
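The paper's exact True Pose Similarity (TPS) definition is not reproduced in this report. As illustration only, the following is a minimal sketch, with hypothetical function names, of one plausible way to quantify whether a pose sequence transfers: if re-rendering the same latent poses in two scenes induces the same relative camera motions, a relative-pose discrepancy between the two recovered trajectories should be near zero.

```python
import numpy as np

def relative_transforms(traj):
    """Consecutive relative poses inv(T_t) @ T_{t+1} for a list of 4x4 poses."""
    return [np.linalg.inv(a) @ b for a, b in zip(traj, traj[1:])]

def trajectory_agreement(traj_a, traj_b):
    """Mean translation discrepancy between corresponding relative motions.

    If a latent pose sequence truly transfers, re-rendering it in two
    different scenes should induce the same relative camera motions,
    so this value should be near zero regardless of where each
    trajectory starts in absolute coordinates."""
    errs = []
    for ra, rb in zip(relative_transforms(traj_a), relative_transforms(traj_b)):
        d = np.linalg.inv(ra) @ rb          # residual motion between the two
        errs.append(np.linalg.norm(d[:3, 3]))  # translation part of residual
    return float(np.mean(errs))
```

Note that comparing relative (not absolute) transforms makes the score invariant to a global rigid offset between the two scenes, which is the property a transfer test needs; this is a standard relative-pose-error construction, not the paper's metric.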
XFactor: first geometry-free self-supervised model for true NVS

The authors introduce XFactor, a novel self-supervised NVS model that achieves transferability without relying on 3D inductive biases or multi-view geometry concepts. It combines a stereo-monocular architecture with a transferability training objective and pose-preserving augmentations to learn disentangled, transferable camera pose representations.

10 retrieved papers
Can Refute
Stereo-monocular model with transferability objective and pose-preserving augmentations

The authors propose a training approach that bootstraps from a two-view (stereo-monocular) model to prevent interpolation, combined with an explicit transferability objective and pose-preserving augmentations (such as inverse masking) that minimize pixel overlap while maintaining pose information. This design prevents information leakage and encourages learning of transferable pose representations.

10 retrieved papers
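The report does not spell out the augmentation mechanics. As a rough illustration, here is a minimal sketch, with hypothetical names and parameters, of one plausible "inverse masking" scheme: the two views receive complementary patch masks, so they share almost no pixels and the model cannot succeed by copying content, while coarse pose cues survive in the unmasked patches.

```python
import numpy as np

def inverse_mask_pair(img_a, img_b, patch=8, rng=None):
    """Apply complementary patch masks to a context/target image pair.

    img_a keeps exactly the patches that img_b drops and vice versa,
    so the two masked views overlap on no pixels; this discourages a
    model from interpolating pixel content instead of reasoning about
    the camera pose relating the views."""
    rng = np.random.default_rng(rng)
    h, w = img_a.shape[:2]
    gh, gw = h // patch, w // patch
    keep = rng.random((gh, gw)) < 0.5   # which patches img_a keeps
    mask_a = np.kron(keep.astype(np.uint8),
                     np.ones((patch, patch), dtype=np.uint8)).astype(bool)
    out_a = np.where(mask_a[..., None], img_a, 0.0)
    out_b = np.where(~mask_a[..., None], img_b, 0.0)  # complementary mask
    return out_a, out_b
```

Because the masks are exact complements, every pixel location is visible in exactly one of the two outputs; a finer `patch` size trades off how much local appearance leaks through against how much layout (and hence pose) information is preserved.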

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the currently retrieved top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution A

Transferability as the key criterion for true novel view synthesis

The authors propose that transferability—the ability to apply camera poses extracted from one scene to render the same trajectory in a different scene—is the fundamental property distinguishing true NVS from frame interpolation. They formalize this as a criterion and introduce the True Pose Similarity (TPS) metric to quantify it.

Contribution B

XFactor: first geometry-free self-supervised model for true NVS

The authors introduce XFactor, a novel self-supervised NVS model that achieves transferability without relying on 3D inductive biases or multi-view geometry concepts. It combines a stereo-monocular architecture with a transferability training objective and pose-preserving augmentations to learn disentangled, transferable camera pose representations.

Contribution C

Stereo-monocular model with transferability objective and pose-preserving augmentations

The authors propose a training approach that bootstraps from a two-view (stereo-monocular) model to prevent interpolation, combined with an explicit transferability objective and pose-preserving augmentations (such as inverse masking) that minimize pixel overlap while maintaining pose information. This design prevents information leakage and encourages learning of transferable pose representations.
