FantasyWorld: Geometry-Consistent World Modeling via Unified Video and 3D Prediction
Overview
Overall Novelty Assessment
FantasyWorld proposes a geometry-enhanced framework that augments frozen video foundation models with a trainable geometric branch, enabling joint video and 3D field modeling in a single forward pass. The paper resides in the 'Cross-Modal Supervision for Joint Video-Geometry Learning' leaf, which contains only three papers total. This represents a relatively sparse research direction within the broader taxonomy of 29 papers across multiple branches, suggesting the specific approach of bidirectional supervision between video and geometry modules is still emerging rather than saturated.
The taxonomy reveals that FantasyWorld's leaf sits within the larger 'Unified Video-3D Representation Learning' branch, which also includes sibling directions like '4D Dynamic Scene Representation' and 'Video Diffusion with Explicit 3D Constraints'. Neighboring branches pursue related but distinct goals: 'Geometry-Conditioned Video Generation' uses pre-extracted geometry as input rather than learning it jointly, while 'Geometry Estimation from Video' focuses on the inverse problem of recovering structure from observations. The scope note for FantasyWorld's leaf explicitly excludes unidirectional geometry-to-video guidance, positioning this work in a narrower methodological space where video priors actively regularize 3D prediction.
Among the 30 candidates examined, the contribution-level analysis reveals mixed novelty signals. For both the core framework and the cross-branch supervision mechanism, the analysis examined 10 candidates each and identified 2 potentially refutable prior works apiece, suggesting some overlap with existing joint video-geometry approaches. For the third contribution (generalizable 3D features for downstream tasks without fine-tuning), the analysis examined 10 candidates and found zero refutable matches, indicating this aspect may be more distinctive. Because the search covered only the top-30 semantic matches rather than the full literature, additional related work may exist beyond this sample.
Given the sparse population of the specific taxonomy leaf and the modest search scale, FantasyWorld appears to occupy a relatively novel position within cross-modal video-geometry learning, though the framework-level contributions show some prior art overlap among the examined candidates. The downstream generalization aspect seems less explored in the sampled literature, while the bidirectional supervision mechanism aligns with a small cluster of recent works pursuing similar joint optimization strategies.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce FantasyWorld, a framework that extends frozen video foundation models by adding a trainable geometric branch. This enables simultaneous prediction of video latents and an implicit 3D field in one forward pass, bridging video generation and 3D perception without per-scene optimization.
The authors propose a cross-branch supervision strategy where geometric cues inform video generation while video priors regularize 3D prediction. This bidirectional constraint mechanism produces consistent and generalizable 3D-aware video representations.
The geometric branch produces latent representations that can serve as versatile features for downstream 3D tasks like novel view synthesis and navigation, eliminating the need for per-scene optimization or task-specific fine-tuning.
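To make the claimed single-pass design concrete, the following is a minimal numpy sketch of a frozen video backbone paired with a trainable geometric head that returns video latents and implicit-field values in one forward call. All dimensions, weight shapes, and the tanh/linear layers are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper).
T, D_VID, D_GEO = 8, 64, 32   # frames, video-latent dim, geometry-feature dim

# Frozen video backbone: weights stay fixed; no gradients would flow here.
W_backbone = rng.standard_normal((D_VID, D_VID)) / np.sqrt(D_VID)

# Trainable geometric branch: a lightweight head reading backbone features.
W_geo = rng.standard_normal((D_VID, D_GEO)) / np.sqrt(D_VID)
W_field = rng.standard_normal((D_GEO + 3, 1)) / np.sqrt(D_GEO + 3)

def forward(frames_latent, query_xyz):
    """One pass: video latents AND an implicit 3D field value per query point."""
    h = np.tanh(frames_latent @ W_backbone)        # frozen video features
    video_latents = h                              # branch 1: video prediction
    geo_feat = np.tanh(h.mean(axis=0) @ W_geo)     # branch 2: pooled geometry feature
    # Implicit field: condition each 3D query point on the geometry feature.
    cond = np.concatenate(
        [np.repeat(geo_feat[None, :], len(query_xyz), axis=0), query_xyz], axis=1)
    field_values = cond @ W_field                  # e.g. occupancy/SDF logits
    return video_latents, field_values

latents = rng.standard_normal((T, D_VID))
points = rng.standard_normal((5, 3))
vid, field = forward(latents, points)
print(vid.shape, field.shape)   # (8, 64) (5, 1)
```

The point of the sketch is the data flow: both outputs come from one forward pass over shared frozen features, with only the geometric head's parameters left trainable.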
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[2] Aether: Geometric-aware unified world modeling
[14] Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
Contribution Analysis
Detailed comparisons for each claimed contribution
FantasyWorld: Geometry-enhanced framework for unified video and 3D modeling
The authors introduce FantasyWorld, a framework that extends frozen video foundation models by adding a trainable geometric branch. This enables simultaneous prediction of video latents and an implicit 3D field in one forward pass, bridging video generation and 3D perception without per-scene optimization.
[2] Aether: Geometric-aware unified world modeling
[14] Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
[30] Geovideo: Introducing geometric regularization into video generation model
[31] Uni4D: Unifying Visual Foundation Models for 4D Modeling from a Single Video
[32] Harnessing Foundation Models for Robust and Generalizable 6-DOF Bronchoscopy Localization
[33] V3d: Video diffusion models are effective 3d generators
[34] VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction
[35] Bridging the Gap Between Multimodal Foundation Models and World Models
[36] Controllable video generation: A survey
[37] Can Video Diffusion Model Reconstruct 4D Geometry?
Cross-branch supervision mechanism between video and geometry
The authors propose a cross-branch supervision strategy where geometric cues inform video generation while video priors regularize 3D prediction. This bidirectional constraint mechanism produces consistent and generalizable 3D-aware video representations.
[14] Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
[48] JOG3R: Towards 3D-Consistent Video Generators
[49] Self-supervised learning with geometric constraints in monocular video: Connecting flow, depth, and camera
[50] MGSR: 2D/3D Mutual-boosted Gaussian Splatting for High-fidelity Surface Reconstruction under Various Light Conditions
[51] 3D-Aware Video Stabilization via Reconstruction and Rendering
[52] mmHand: 3D hand pose estimation using millimeter-wave radar
[53] GVLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning
[54] The 3D Modeling Technology for Substation Surface Temperature Distribution Based on Thermal-Geometric Joint
[55] AtlantaSDF: Neural 3D Indoor Scene Reconstruction with the Atlanta-world Assumption
[56] Consistency Enhanced Deep Learning for Visual Perception Data of Structural Health Monitoring
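As a rough illustration of how a bidirectional constraint between the two branches could be wired, the sketch below combines two primary losses with two cross-branch terms: geometry pulls on the video latents, and video similarity regularizes the depth prediction. The loss forms, shapes, and weightings are our own assumptions for illustration, not the paper's actual objectives.

```python
import numpy as np

def cross_branch_losses(video_latents, geo_features, depth_pred, depth_gt,
                        video_target, lam_g2v=0.1, lam_v2g=0.1):
    """Bidirectional supervision sketch: each branch constrains the other.
    All terms here are illustrative stand-ins, not the paper's losses."""
    # Per-branch primary objectives.
    video_loss = np.mean((video_latents - video_target) ** 2)
    geo_loss = np.mean((depth_pred - depth_gt) ** 2)
    # Geometry -> video: ask the mean video latent to agree with the
    # geometry feature on a shared subspace (a stand-in consistency term).
    g2v = np.mean((video_latents.mean(axis=0)[: geo_features.shape[0]]
                   - geo_features) ** 2)
    # Video -> geometry: where consecutive video latents barely change,
    # penalize depth changes, so video priors smooth the 3D prediction.
    frame_sim = np.exp(-np.mean(np.diff(video_latents, axis=0) ** 2, axis=1))
    v2g = np.mean(frame_sim * np.mean(np.diff(depth_pred, axis=0) ** 2, axis=1))
    return video_loss + geo_loss + lam_g2v * g2v + lam_v2g * v2g

rng = np.random.default_rng(0)
T, D, P = 6, 16, 4   # frames, latent dim, depth samples per frame
loss = cross_branch_losses(rng.standard_normal((T, D)), rng.standard_normal(8),
                           rng.standard_normal((T, P)), rng.standard_normal((T, P)),
                           rng.standard_normal((T, D)))
print(float(loss))
```

The design point is symmetry: neither branch is a passive conditioner, since both the g2v and v2g terms contribute gradients to the joint objective.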
Generalizable 3D features for downstream tasks without fine-tuning
The geometric branch produces latent representations that can serve as versatile features for downstream 3D tasks like novel view synthesis and navigation, eliminating the need for per-scene optimization or task-specific fine-tuning.
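One way to picture "versatile features without fine-tuning" is a frozen-feature linear probe: the geometric branch is never updated, and each downstream task fits only a small readout on top of its features. The sketch below fakes the frozen features with random vectors and a synthetic target; the feature source, task, and probe are hypothetical, used only to show the evaluation protocol.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for frozen geometric-branch features over N scenes (illustrative).
N, D = 100, 32
features = rng.standard_normal((N, D))

# Hypothetical downstream target, e.g. a per-scene navigation score that is
# (by construction here) linearly recoverable from the features.
true_w = rng.standard_normal(D)
targets = features @ true_w + 0.01 * rng.standard_normal(N)

# No fine-tuning of the branch: fit only a linear readout by least squares.
w, *_ = np.linalg.lstsq(features, targets, rcond=None)
pred = features @ w
mse = float(np.mean((pred - targets) ** 2))
print(mse < 0.01)   # True: the probe recovers the target from frozen features
```

Under this protocol, low probe error on a task is evidence that the frozen representation already encodes the task-relevant 3D structure, which is the kind of claim the third contribution makes.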