YoNoSplat: You Only Need One Model for Feedforward 3D Gaussian Splatting

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: 3D Gaussian splatting, feedforward model, novel view synthesis, pose-free
Abstract:

Fast and flexible 3D scene reconstruction from unstructured image collections remains a significant challenge. We present YoNoSplat, a feedforward model that reconstructs high-quality 3D Gaussian Splatting representations from an arbitrary number of images. Our model is highly versatile, operating effectively with both posed and unposed, calibrated and uncalibrated inputs. YoNoSplat predicts local Gaussians and camera poses for each view, which are aggregated into a global representation using either predicted or provided poses. To overcome the inherent difficulty of jointly learning 3D Gaussians and camera parameters, we introduce a novel mixing training strategy. This approach mitigates the entanglement between the two tasks by initially using ground-truth poses to aggregate local Gaussians and gradually transitioning to a mix of predicted and ground-truth poses, which prevents both training instability and exposure bias. We further resolve the scale ambiguity problem with a novel pairwise camera-distance normalization scheme and by embedding camera intrinsics into the network. Moreover, YoNoSplat also predicts intrinsic parameters, making it applicable to uncalibrated inputs. YoNoSplat demonstrates exceptional efficiency, reconstructing a scene from 100 views (at 280×518 resolution) in just 2.69 seconds on an NVIDIA GH200 GPU. It achieves state-of-the-art performance on standard benchmarks in both pose-free and pose-dependent settings. The code and pretrained models will be made public.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

YoNoSplat contributes a feedforward model that reconstructs 3D Gaussian Splatting representations from arbitrary numbers of images, handling both posed and unposed, calibrated and uncalibrated inputs. It resides in the 'Uncalibrated and Pose-Free Gaussian Reconstruction' leaf, which contains only three papers total (including YoNoSplat itself). This is a relatively sparse research direction within the broader Gaussian Splatting-Based Feedforward Reconstruction branch, indicating that joint learning of Gaussians and camera parameters without calibration remains an emerging and challenging area.

The taxonomy tree shows that YoNoSplat's leaf is nested under Multi-View Gaussian Reconstruction, which also includes sibling leaves for Sparse-View Gaussian Splatting (three papers assuming known poses) and Surround-View and Driving Scene Gaussian Reconstruction (one paper for vehicle-mounted scenarios). Neighboring branches address Enhanced Gaussian Reconstruction Techniques (three papers on voxel alignment and super-resolution) and Gaussian-Based Generative and Latent Modeling (three papers using Gaussians for generation). YoNoSplat diverges from these by tackling the uncalibrated setting, whereas sparse-view methods require known camera parameters and generative frameworks focus on synthesis rather than reconstruction from unstructured collections.

Among the 29 candidates examined, the first contribution, 'YoNoSplat: versatile feedforward model for 3D Gaussian Splatting,' had one refutable candidate out of nine examined, suggesting some overlap with prior work in the uncalibrated Gaussian reconstruction space. For the second contribution, the 'mix-forcing training strategy,' ten candidates were examined with zero refutations, indicating that this training approach appears more novel within the limited search scope. For the third contribution, 'scale ambiguity resolution through normalization and intrinsic prediction,' ten candidates were likewise examined with no refutations, suggesting these technical solutions may be less directly addressed in the candidate pool.

Based on the top-29 semantic matches, YoNoSplat's core architecture shows some prior overlap, while its training strategy and scale-ambiguity solutions appear more distinctive. The sparse population of its taxonomy leaf (three papers) and the limited search scope mean this assessment captures only a snapshot of the most semantically similar work, not an exhaustive survey of all uncalibrated Gaussian reconstruction methods.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 1

Research Landscape Overview

Core task: feedforward 3D scene reconstruction from unstructured images. The field has evolved from traditional optimization-based pipelines toward modern feedforward architectures that predict geometry in a single pass.

The taxonomy reveals several major branches: Gaussian Splatting-Based Feedforward Reconstruction leverages explicit point primitives for efficient rendering and reconstruction, while Volumetric and Implicit Feedforward Reconstruction employs neural fields or voxel grids for dense scene representation. Unified and Multi-Task Feedforward Frameworks integrate multiple objectives such as depth, pose, and appearance into joint models, whereas Traditional and Optimization-Based Reconstruction captures classical structure-from-motion and bundle adjustment methods. Surveys, Reviews, and Methodological Overviews (e.g., Feedforward Review[7], Deep Learning Survey[12]) synthesize progress across these paradigms, and Non-Visual and Alternative Sensing Modalities explore radar, WiFi, and other unconventional inputs (WiFi Scene Reconstruction[18], Radar Face Reconstruction[20]).

Within Gaussian splatting approaches, a particularly active line addresses multi-view reconstruction with varying degrees of camera calibration. Works like MVSplat[5] and Flash3d[1] assume known or partially known poses, enabling robust feedforward prediction of splat parameters from sparse views. In contrast, YoNoSplat[0] tackles the harder uncalibrated and pose-free setting, jointly inferring camera geometry and 3D Gaussians without prior calibration, a direction also explored by Pref3r[15] and UniForward[8]. This uncalibrated branch confronts fundamental ambiguities in scale and alignment that calibrated methods sidestep, yet it promises greater flexibility for in-the-wild image collections. YoNoSplat[0] sits squarely in this niche, emphasizing end-to-end learning where both scene structure and camera parameters emerge from unstructured input, distinguishing it from neighbors that rely on at least partial pose supervision.

Claimed Contributions

YoNoSplat: versatile feedforward model for 3D Gaussian Splatting

The authors introduce YoNoSplat, a feedforward model that reconstructs 3D Gaussian Splatting representations from an arbitrary number of unposed and uncalibrated images. The model operates effectively in both pose-free and pose-dependent settings, as well as with calibrated and uncalibrated inputs, achieving state-of-the-art performance across multiple benchmarks.

9 retrieved papers (1 can refute)
Mix-forcing training strategy

The authors propose a novel mix-forcing training strategy that addresses the entanglement between learning 3D Gaussians and camera parameters. The approach begins with teacher-forcing using ground-truth poses and gradually transitions to using a mixture of predicted and ground-truth poses, preventing training instability and exposure bias.
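The schedule described above can be sketched in code. This is a minimal, hypothetical illustration (the function name, linear ramp, and warmup fraction are assumptions for exposition, not taken from the paper): the model is teacher-forced with ground-truth poses during an initial warmup, after which the probability of substituting its own predicted pose ramps up toward 1.

```python
import random

def mixed_pose(gt_pose, pred_pose, step, total_steps, warmup_frac=0.2):
    """Scheduled mixing of ground-truth and predicted poses (illustrative).

    During the first `warmup_frac` of training, always use the ground-truth
    pose (pure teacher forcing). Afterwards, linearly ramp the probability
    of using the model's own prediction, so the model is gradually exposed
    to its own pose errors and exposure bias is reduced.
    """
    warmup_steps = warmup_frac * total_steps
    if step < warmup_steps:
        p_pred = 0.0  # pure teacher forcing
    else:
        progress = (step - warmup_steps) / (total_steps - warmup_steps)
        p_pred = min(1.0, progress)  # reaches 1.0 at the end of training
    return pred_pose if random.random() < p_pred else gt_pose
```

In an actual training loop the returned pose would be used to transform each view's local Gaussians into the global frame before rendering the loss; the exact schedule shape (linear, cosine, etc.) is a design choice the report does not specify.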

10 retrieved papers
Scale ambiguity resolution through normalization and intrinsic prediction

The authors resolve scale ambiguity through two mechanisms: a pairwise camera-distance normalization scheme that normalizes scenes by maximum pairwise distance between camera centers, and an Intrinsic Condition Embedding module that predicts and conditions on camera intrinsics, enabling reconstruction from uncalibrated images.
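The normalization half of this contribution can be sketched directly from its description: scale the scene so that the maximum pairwise distance between camera centers equals 1. The sketch below is an assumption-laden illustration (function name and array layout are ours, and the paper may apply the scale to more quantities than shown here).

```python
import numpy as np

def normalize_by_camera_distance(camera_centers, gaussian_means):
    """Pairwise camera-distance normalization (illustrative sketch).

    Divides all positions by the maximum pairwise distance between camera
    centers, so that distance becomes 1 and the global scale ambiguity of
    unposed inputs is fixed. `camera_centers` is (N, 3), `gaussian_means`
    is (M, 3); both are returned rescaled, along with the scale factor.
    """
    # All N x N pairwise differences between camera centers, via broadcasting.
    diffs = camera_centers[:, None, :] - camera_centers[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    scale = dists.max()  # assumes at least two distinct camera centers
    return camera_centers / scale, gaussian_means / scale, scale
```

Camera translations (and any predicted depths) would need the same scale applied so the whole reconstruction stays consistent; the Intrinsic Condition Embedding half of the contribution is a learned module and is not sketched here.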

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

YoNoSplat: versatile feedforward model for 3D Gaussian Splatting


Contribution

Mix-forcing training strategy


Contribution

Scale ambiguity resolution through normalization and intrinsic prediction
