FantasyWorld: Geometry-Consistent World Modeling via Unified Video and 3D Prediction

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: World Model · Video Generation · 3D Generation · 3D-aware Video Generation
Abstract:

High-quality 3D world models are pivotal for embodied intelligence and Artificial General Intelligence (AGI), underpinning applications such as AR/VR content creation and robotic navigation. Although video foundation models encode strong imaginative priors, they lack explicit 3D grounding, which limits both their spatial consistency and their utility for downstream 3D reasoning tasks. In this work, we present FantasyWorld, a geometry-enhanced framework that augments a frozen video foundation model with a trainable geometric branch, enabling joint modeling of video latents and an implicit 3D field in a single forward pass. Our approach introduces cross-branch supervision, in which geometry cues guide video generation and video priors regularize 3D prediction, yielding consistent and generalizable 3D-aware video representations. Notably, the latents produced by the geometric branch can potentially serve as versatile representations for downstream 3D tasks such as novel view synthesis and navigation, without requiring per-scene optimization or fine-tuning. Extensive experiments show that FantasyWorld effectively bridges video imagination and 3D perception, outperforming recent geometry-consistent baselines in multi-view coherence and style consistency. Ablation studies further confirm that these gains stem from the unified backbone and cross-branch information exchange.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

FantasyWorld proposes a geometry-enhanced framework that augments frozen video foundation models with a trainable geometric branch, enabling joint video and 3D field modeling in a single forward pass. The paper resides in the 'Cross-Modal Supervision for Joint Video-Geometry Learning' leaf, which contains only three papers total. This represents a relatively sparse research direction within the broader taxonomy of 29 papers across multiple branches, suggesting the specific approach of bidirectional supervision between video and geometry modules is still emerging rather than saturated.

The taxonomy reveals that FantasyWorld's leaf sits within the larger 'Unified Video-3D Representation Learning' branch, which also includes sibling directions like '4D Dynamic Scene Representation' and 'Video Diffusion with Explicit 3D Constraints'. Neighboring branches pursue related but distinct goals: 'Geometry-Conditioned Video Generation' uses pre-extracted geometry as input rather than learning it jointly, while 'Geometry Estimation from Video' focuses on the inverse problem of recovering structure from observations. The scope note for FantasyWorld's leaf explicitly excludes unidirectional geometry-to-video guidance, positioning this work in a narrower methodological space where video priors actively regularize 3D prediction.

Among the 30 candidates examined, the contribution-level analysis reveals mixed novelty signals. For both the core framework and the cross-branch supervision mechanism, 10 candidates were examined and 2 potentially refutable prior works were found for each, suggesting some overlap with existing joint video-geometry approaches. For the third contribution, generalizable 3D features for downstream tasks without fine-tuning, all 10 examined candidates yielded zero refutable matches, indicating this aspect may be more distinctive. Because the search covers only the top-30 semantic matches rather than an exhaustive sweep, additional related work may exist beyond this sample.

Given the sparse population of the specific taxonomy leaf and the modest search scale, FantasyWorld appears to occupy a relatively novel position within cross-modal video-geometry learning, though the framework-level contributions show some prior art overlap among the examined candidates. The downstream generalization aspect seems less explored in the sampled literature, while the bidirectional supervision mechanism aligns with a small cluster of recent works pursuing similar joint optimization strategies.

Taxonomy

Core-task Taxonomy Papers: 29
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 4

Research Landscape Overview

Core task: Geometry-consistent world modeling via unified video and 3D prediction. This emerging field seeks to bridge the gap between temporal video generation and explicit 3D geometric reasoning, enabling models that can predict future visual observations while maintaining coherent spatial structure.

The taxonomy reveals several complementary research directions: Unified Video-3D Representation Learning explores joint embeddings and cross-modal supervision strategies that tie pixel-level dynamics to underlying geometry; Geometry-Conditioned Video Generation focuses on using depth, camera poses, or 3D scene representations to guide synthesis; Geometry Estimation from Video tackles the inverse problem of recovering spatial structure from temporal observations; World Models for Embodied Intelligence emphasizes predictive models for robotic planning and interaction; Controllable 3D Scene Generation addresses user-driven spatial content creation; and Temporal Prediction with Spatial Reasoning combines forecasting with geometric awareness.

Representative works like Aether[2] and Infinicube[3] demonstrate how explicit 3D representations can scaffold video prediction, while approaches such as Driving Future[1] and Gigaworld[5] show the value of geometry-aware modeling in autonomous driving contexts.

A particularly active line of inquiry centers on cross-modal supervision, where video and geometry signals mutually constrain one another during training. FantasyWorld[0] exemplifies this approach by jointly learning video generation and 3D prediction through shared representations, closely aligning with Aether[2] and Geometry Forcing[14], which similarly enforce geometric consistency across modalities. In contrast, works like Gen3c[6] and GeometryCrafter[7] emphasize conditioning video synthesis on pre-extracted or user-specified geometry rather than learning both modalities end-to-end.

Meanwhile, methods such as GaussianPrediction[12] and Novel View Forecasting[13] focus on predicting explicit 3D structures (e.g., Gaussian splats) to enable temporally coherent novel views. FantasyWorld[0] sits within the cross-modal supervision cluster, distinguishing itself by tightly coupling video and geometry learning rather than treating geometry as a fixed input or separate output, thereby enabling richer feedback between spatial and temporal reasoning.

Claimed Contributions

FantasyWorld: Geometry-enhanced framework for unified video and 3D modeling

The authors introduce FantasyWorld, a framework that extends frozen video foundation models by adding a trainable geometric branch. This enables simultaneous prediction of video latents and an implicit 3D field in one forward pass, bridging video generation and 3D perception without per-scene optimization.

10 retrieved papers · Can Refute
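As a rough illustration of this claimed design, the sketch below pairs a frozen video branch with a trainable geometric branch that share a single forward pass. Every name, dimension, and operation here is an assumption made for illustration; it is not FantasyWorld's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): T latent frames, D-dim features.
T, D = 4, 8

def frozen_video_branch(x):
    """Stand-in for the frozen video foundation model: weights never update."""
    W_frozen = np.eye(D)  # identity here purely for illustration
    return x @ W_frozen

def geometry_branch(video_feat, W_geo):
    """Trainable branch mapping video latents to an implicit 3D field latent."""
    return np.tanh(video_feat @ W_geo)

def forward(x, W_geo, alpha=0.5):
    """One forward pass yields BOTH video latents and geometry latents:
    video features condition the 3D prediction, and geometry cues are
    blended back into the video stream (a toy cross-branch exchange)."""
    v = frozen_video_branch(x)
    g = geometry_branch(v, W_geo)   # video priors condition 3D prediction
    v_refined = v + alpha * g       # geometry cues feed back into video latents
    return v_refined, g

x = rng.normal(size=(T, D))
W_geo = 0.1 * rng.normal(size=(D, D))
v_out, g_out = forward(x, W_geo)
print(v_out.shape, g_out.shape)  # (4, 8) (4, 8): both modalities, one pass
```

The key property this toy models is that only `W_geo` would receive gradients; the video backbone stays frozen, matching the report's description of the claimed framework.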
Cross-branch supervision mechanism between video and geometry

The authors propose a cross-branch supervision strategy where geometric cues inform video generation while video priors regularize 3D prediction. This bidirectional constraint mechanism produces consistent and generalizable 3D-aware video representations.

10 retrieved papers · Can Refute
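The bidirectional constraint described above can be caricatured as a single joint objective over both branches, so that geometry error shapes the shared video representation and vice versa. The loss terms and weighting below are illustrative assumptions, not the paper's actual losses.

```python
import numpy as np

def joint_loss(v_pred, v_target, g_pred, g_target, lam=0.1):
    """Toy cross-branch objective: one scalar combines a video term
    (e.g. denoising/reconstruction) and a geometry term (e.g. depth or
    point-map regression), so gradients from either modality flow into
    any shared parameters."""
    video_loss = np.mean((v_pred - v_target) ** 2)
    geo_loss = np.mean((g_pred - g_target) ** 2)
    return video_loss + lam * geo_loss

rng = np.random.default_rng(1)
v_pred, v_tgt = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
g_pred, g_tgt = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
loss = joint_loss(v_pred, v_tgt, g_pred, g_tgt)
print(float(loss))
```

The point of the sketch is only the coupling: neither term is optimized in isolation, which is what distinguishes bidirectional supervision from the unidirectional geometry-to-video guidance that this leaf's scope note excludes.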
Generalizable 3D features for downstream tasks without fine-tuning

The geometric branch produces latent representations that can serve as versatile features for downstream 3D tasks like novel view synthesis and navigation, eliminating the need for per-scene optimization or task-specific fine-tuning.

10 retrieved papers
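The "no fine-tuning" claim amounts to treating the geometric branch as a frozen feature extractor and fitting only a lightweight task head. The sketch below illustrates that pattern with a fixed projection and a closed-form least-squares readout; all names, shapes, and the readout choice are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def geometry_latents(frames):
    """Stand-in for the frozen geometric branch: a fixed projection that
    is never updated by the downstream task."""
    d_in = frames.shape[1]
    W_fixed = np.linspace(-1.0, 1.0, d_in * 8).reshape(d_in, 8)
    return frames @ W_fixed

frames = rng.normal(size=(16, 12))   # 16 observations, 12-dim inputs
targets = rng.normal(size=(16, 3))   # e.g. per-view regression targets

Z = geometry_latents(frames)         # frozen features, shape (16, 8)
# The only "training": a closed-form linear readout on top of Z.
W_head, *_ = np.linalg.lstsq(Z, targets, rcond=None)
pred = Z @ W_head
print(pred.shape)  # (16, 3)
```

No per-scene optimization touches `geometry_latents`; swapping in a new scene only re-fits the small head, which is the property the contribution claims for tasks like novel view synthesis and navigation.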

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

