FantasyWorld: Geometry-Consistent World Modeling via Unified Video and 3D Prediction

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: World Model · Video Generation · 3D Generation · 3D-aware Video Generation
Abstract:

High-quality 3D world models are pivotal for embodied intelligence and Artificial General Intelligence (AGI), underpinning applications such as AR/VR content creation and robotic navigation. Although video foundation models encode strong imaginative priors, they lack explicit 3D grounding, which limits both their spatial consistency and their utility for downstream 3D reasoning tasks. In this work, we present FantasyWorld, a geometry-enhanced framework that augments a frozen video foundation model with a trainable geometric branch, enabling joint modeling of video latents and an implicit 3D field in a single forward pass. Our approach introduces cross-branch supervision, in which geometry cues guide video generation and video priors regularize 3D prediction, yielding consistent and generalizable 3D-aware video representations. Notably, the latents produced by the geometric branch can potentially serve as versatile representations for downstream 3D tasks such as novel view synthesis and navigation, without requiring per-scene optimization or fine-tuning. Extensive experiments show that FantasyWorld effectively bridges video imagination and 3D perception, outperforming recent geometry-consistent baselines in multi-view coherence and style consistency. Ablation studies further confirm that these gains stem from the unified backbone and cross-branch information exchange.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

FantasyWorld proposes a geometry-enhanced framework that augments frozen video foundation models with a trainable geometric branch, enabling joint video and 3D field modeling in a single forward pass. The paper resides in the 'Cross-Modal Supervision for Joint Video-Geometry Learning' leaf, which contains only three papers total. This represents a relatively sparse research direction within the broader taxonomy of 29 papers across multiple branches, suggesting the specific approach of bidirectional supervision between video and geometry modules is still emerging rather than saturated.

The taxonomy reveals that FantasyWorld's leaf sits within the larger 'Unified Video-3D Representation Learning' branch, which also includes sibling directions like '4D Dynamic Scene Representation' and 'Video Diffusion with Explicit 3D Constraints'. Neighboring branches pursue related but distinct goals: 'Geometry-Conditioned Video Generation' uses pre-extracted geometry as input rather than learning it jointly, while 'Geometry Estimation from Video' focuses on the inverse problem of recovering structure from observations. The scope note for FantasyWorld's leaf explicitly excludes unidirectional geometry-to-video guidance, positioning this work in a narrower methodological space where video priors actively regularize 3D prediction.

Among the 30 candidates examined, the contribution-level analysis reveals mixed novelty signals. For both the core framework and the cross-branch supervision mechanism, 10 candidates were examined and 2 potentially refutable prior works were found for each, suggesting some overlap with existing joint video-geometry approaches. For the third contribution, generalizable 3D features for downstream tasks without fine-tuning, all 10 examined candidates yielded zero refutable matches, indicating this aspect may be more distinctive. Because the search covers only the top-30 semantic matches rather than an exhaustive sweep, additional related work may exist beyond this sample.

Given the sparse population of the specific taxonomy leaf and the modest search scale, FantasyWorld appears to occupy a relatively novel position within cross-modal video-geometry learning, though the framework-level contributions show some prior art overlap among the examined candidates. The downstream generalization aspect seems less explored in the sampled literature, while the bidirectional supervision mechanism aligns with a small cluster of recent works pursuing similar joint optimization strategies.

Taxonomy

Core-task Taxonomy Papers: 29
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 4

Research Landscape Overview

Core task: Geometry-consistent world modeling via unified video and 3D prediction. This emerging field seeks to bridge the gap between temporal video generation and explicit 3D geometric reasoning, enabling models that can predict future visual observations while maintaining coherent spatial structure.

The taxonomy reveals several complementary research directions: Unified Video-3D Representation Learning explores joint embeddings and cross-modal supervision strategies that tie pixel-level dynamics to underlying geometry; Geometry-Conditioned Video Generation focuses on using depth, camera poses, or 3D scene representations to guide synthesis; Geometry Estimation from Video tackles the inverse problem of recovering spatial structure from temporal observations; World Models for Embodied Intelligence emphasizes predictive models for robotic planning and interaction; Controllable 3D Scene Generation addresses user-driven spatial content creation; and Temporal Prediction with Spatial Reasoning combines forecasting with geometric awareness.

Representative works like Aether[2] and Infinicube[3] demonstrate how explicit 3D representations can scaffold video prediction, while approaches such as Driving Future[1] and Gigaworld[5] show the value of geometry-aware modeling in autonomous driving contexts.

A particularly active line of inquiry centers on cross-modal supervision, where video and geometry signals mutually constrain one another during training. FantasyWorld[0] exemplifies this approach by jointly learning video generation and 3D prediction through shared representations, closely aligning with Aether[2] and Geometry Forcing[14], which similarly enforce geometric consistency across modalities. In contrast, works like Gen3c[6] and GeometryCrafter[7] emphasize conditioning video synthesis on pre-extracted or user-specified geometry rather than learning both modalities end-to-end.

Meanwhile, methods such as GaussianPrediction[12] and Novel View Forecasting[13] focus on predicting explicit 3D structures (e.g., Gaussian splats) to enable temporally coherent novel views. FantasyWorld[0] sits within the cross-modal supervision cluster, distinguishing itself by tightly coupling video and geometry learning rather than treating geometry as a fixed input or separate output, thereby enabling richer feedback between spatial and temporal reasoning.

Claimed Contributions

FantasyWorld: Geometry-enhanced framework for unified video and 3D modeling

The authors introduce FantasyWorld, a framework that extends frozen video foundation models by adding a trainable geometric branch. This enables simultaneous prediction of video latents and an implicit 3D field in one forward pass, bridging video generation and 3D perception without per-scene optimization.

10 retrieved papers · Can Refute
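As a rough illustration of this claimed design, the sketch below pairs a frozen video branch with a trainable geometric branch that share a single forward pass. Every name, dimension, and operation here is an assumption made for illustration; it is not FantasyWorld's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): T latent frames, D-dim features.
T, D = 4, 8

def frozen_video_branch(x):
    """Stand-in for the frozen video foundation model: weights never update."""
    W_frozen = np.eye(D)  # identity here purely for illustration
    return x @ W_frozen

def geometry_branch(video_feat, W_geo):
    """Trainable branch mapping video latents to an implicit 3D field latent."""
    return np.tanh(video_feat @ W_geo)

def forward(x, W_geo, alpha=0.5):
    """One forward pass yields BOTH video latents and geometry latents:
    video features condition the 3D prediction, and geometry cues are
    blended back into the video stream (a toy cross-branch exchange)."""
    v = frozen_video_branch(x)
    g = geometry_branch(v, W_geo)   # video priors condition 3D prediction
    v_refined = v + alpha * g       # geometry cues feed back into video latents
    return v_refined, g

x = rng.normal(size=(T, D))
W_geo = 0.1 * rng.normal(size=(D, D))
v_out, g_out = forward(x, W_geo)
print(v_out.shape, g_out.shape)  # (4, 8) (4, 8): both modalities, one pass
```

The key property this toy models is that only `W_geo` would receive gradients; the video backbone stays frozen, matching the report's description of the claimed framework.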
Cross-branch supervision mechanism between video and geometry

The authors propose a cross-branch supervision strategy where geometric cues inform video generation while video priors regularize 3D prediction. This bidirectional constraint mechanism produces consistent and generalizable 3D-aware video representations.

10 retrieved papers · Can Refute
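The bidirectional constraint described above can be caricatured as a single joint objective over both branches, so that geometry error shapes the shared video representation and vice versa. The loss terms and weighting below are illustrative assumptions, not the paper's actual losses.

```python
import numpy as np

def joint_loss(v_pred, v_target, g_pred, g_target, lam=0.1):
    """Toy cross-branch objective: one scalar combines a video term
    (e.g. denoising/reconstruction) and a geometry term (e.g. depth or
    point-map regression), so gradients from either modality flow into
    any shared parameters."""
    video_loss = np.mean((v_pred - v_target) ** 2)
    geo_loss = np.mean((g_pred - g_target) ** 2)
    return video_loss + lam * geo_loss

rng = np.random.default_rng(1)
v_pred, v_tgt = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
g_pred, g_tgt = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
loss = joint_loss(v_pred, v_tgt, g_pred, g_tgt)
print(float(loss))
```

The point of the sketch is only the coupling: neither term is optimized in isolation, which is what distinguishes bidirectional supervision from the unidirectional geometry-to-video guidance that this leaf's scope note excludes.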
Generalizable 3D features for downstream tasks without fine-tuning

The geometric branch produces latent representations that can serve as versatile features for downstream 3D tasks like novel view synthesis and navigation, eliminating the need for per-scene optimization or task-specific fine-tuning.

10 retrieved papers
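The "no fine-tuning" claim amounts to treating the geometric branch as a frozen feature extractor and fitting only a lightweight task head. The sketch below illustrates that pattern with a fixed projection and a closed-form least-squares readout; all names, shapes, and the readout choice are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def geometry_latents(frames):
    """Stand-in for the frozen geometric branch: a fixed projection that
    is never updated by the downstream task."""
    d_in = frames.shape[1]
    W_fixed = np.linspace(-1.0, 1.0, d_in * 8).reshape(d_in, 8)
    return frames @ W_fixed

frames = rng.normal(size=(16, 12))   # 16 observations, 12-dim inputs
targets = rng.normal(size=(16, 3))   # e.g. per-view regression targets

Z = geometry_latents(frames)         # frozen features, shape (16, 8)
# The only "training": a closed-form linear readout on top of Z.
W_head, *_ = np.linalg.lstsq(Z, targets, rcond=None)
pred = Z @ W_head
print(pred.shape)  # (16, 3)
```

No per-scene optimization touches `geometry_latents`; swapping in a new scene only re-fits the small head, which is the property the contribution claims for tasks like novel view synthesis and navigation.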

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

