Streaming Visual Geometry Transformer

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: 3D reconstruction, geometry transformer
Abstract:

Perceiving and reconstructing 3D geometry from videos is a fundamental yet challenging computer vision task. To facilitate interactive and low-latency applications, we propose a streaming visual geometry transformer that shares a similar philosophy with autoregressive large language models. We explore a simple and efficient design and employ a causal transformer architecture to process the input sequence in an online manner. We use temporal causal attention and cache the historical keys and values as implicit memory to enable efficient streaming long-term 3D reconstruction. This design can handle low-latency 3D reconstruction by incrementally integrating historical information while maintaining high-quality spatial consistency. For efficient training, we propose to distill knowledge from the dense bidirectional visual geometry grounded transformer (VGGT) to our causal model. For inference, our model supports the migration of optimized efficient attention operators (e.g., FlashAttention) from large language models. Extensive experiments on various 3D geometry perception benchmarks demonstrate that our model enhances inference speed in online scenarios while maintaining competitive performance, thereby facilitating scalable and interactive 3D vision systems.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a streaming visual geometry transformer that processes video frames incrementally using temporal causal attention and cached historical keys/values. It resides in the 'Streaming and Sequential Reconstruction Architectures' leaf, which contains six papers total (including this one). This leaf represents a relatively focused research direction within the broader taxonomy of fifty papers spanning nine major categories. The concentration of only six papers in this leaf suggests a moderately sparse area compared to more crowded domains like Monocular Video Reconstruction, which encompasses multiple subtopics and over fifteen papers.

The taxonomy tree reveals that neighboring research directions include Monocular Video Reconstruction (with real-time dense reconstruction and dynamic object modeling), RGB-D and Depth-Based Reconstruction (leveraging sensor data for immediate geometry), and Stereo and Multi-View Reconstruction (exploiting multiple viewpoints). The paper's streaming architecture distinguishes it from these sensor-modality-focused branches by emphasizing causal processing and autoregressive design principles borrowed from large language models. While sibling papers like StreamSplat and STream3R also target low-latency incremental updates, the paper's explicit use of transformer-based causal attention and knowledge distillation from bidirectional models positions it at the intersection of streaming reconstruction and modern deep learning architectures.

Among twenty candidates examined across the three contributions, five refutable pairs were identified. The streaming transformer with temporal causal attention was compared against ten candidates, two of which potentially overlap. The cached historical token memory mechanism was compared against nine candidates, with two refutable matches. The knowledge distillation training strategy was compared against a single candidate, which itself appeared to constitute prior work. Within this limited search scope, each contribution thus encounters some degree of prior overlap, though the majority of examined candidates (fifteen of twenty) did not clearly refute the contributions. The small candidate pool means the analysis captures top semantic matches rather than exhaustive coverage.

Given the limited search scope of twenty candidates, the paper appears to operate in a moderately explored area where streaming architectures for 3D reconstruction are gaining traction but remain less saturated than traditional SLAM or RGB-D fusion methods. The contribution-level statistics suggest incremental novelty rather than entirely uncharted territory, with each technical component finding at least one overlapping prior work among the examined candidates. A more exhaustive literature search would be necessary to assess whether the specific combination of causal transformers, cached memory, and distillation training constitutes a genuinely novel synthesis or a natural extension of existing streaming reconstruction paradigms.

Taxonomy

50 Core-task Taxonomy Papers
3 Claimed Contributions
20 Contribution Candidate Papers Compared
5 Refutable Papers

Research Landscape Overview

Core task: streaming 3D geometry reconstruction from video. The field addresses how to recover three-dimensional structure incrementally as video frames arrive, rather than waiting for complete sequences.

The taxonomy reflects a rich landscape organized around input modality, architectural strategy, and application domain. Monocular Video Reconstruction tackles the challenge of depth ambiguity from single cameras, while RGB-D and Depth-Based Reconstruction exploits sensor data for denser, more immediate geometry (e.g., KinectFusion[33]). Stereo and Multi-View Reconstruction leverages multiple viewpoints to triangulate structure, and Urban and Large-Scale Scene Reconstruction focuses on city-scale environments where efficiency and scalability are paramount (Urban 3D Reconstruction[2]). Semantic and Open-Vocabulary 3D Reconstruction integrates language or category labels into geometry, and Specialized Application Domains span medical endoscopy (Endo3R[9], Colonoscopy Reconstruction[19]) and other niche settings. Facial and Human Body Reconstruction targets articulated or deformable subjects (Video People Models[10], Morphable Models Video[36]), while Foundational Techniques and System Surveys provide overarching methodological reviews (Dynamic Scenes Review[38]).

Within Streaming and Sequential Reconstruction Architectures, a central tension emerges between real-time responsiveness and reconstruction fidelity. Works such as StreamSplat[25] and STream3R[41] emphasize low-latency processing pipelines that update geometry on-the-fly, often trading off global consistency for immediate feedback. Streaming Visual Geometry[0] sits squarely in this branch, prioritizing incremental updates and efficient data structures that accommodate continuous frame arrival.
Compared to Streaming 4D Geometry[3], which extends the problem to dynamic scenes with temporal deformation, Streaming Visual Geometry[0] focuses on static or slowly changing environments where sequential integration is the main challenge. Meanwhile, ReCon-GS[46] explores Gaussian splatting representations for streaming contexts, highlighting how representation choice—voxel grids, surfels, or splats—shapes both speed and quality. These architectural decisions remain an active area of exploration, balancing the need for real-time operation against the desire for high-fidelity, globally consistent reconstructions.

Claimed Contributions

Streaming visual geometry transformer with temporal causal attention

The authors introduce StreamVGGT, a causal transformer architecture that replaces global self-attention with temporal causal attention to enable efficient, low-latency streaming 3D reconstruction. This design allows incremental processing of video frames without reprocessing entire sequences.

10 retrieved papers · Can Refute
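The temporal causal attention this contribution describes can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: it assumes a single head and one token per frame (the real model carries many tokens per frame), and simply masks the strictly-upper-triangular part of the score matrix so that frame t attends only to frames up to t.

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head attention where frame t attends only to frames <= t.

    q, k, v: (T, d) arrays with one token per frame (a simplification;
    the real model carries many tokens per frame).
    """
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)                     # (T, T) similarities
    future = np.triu(np.ones((T, T), dtype=bool), 1)  # strictly upper part
    scores[future] = -np.inf                          # mask out the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(4, 8))
v = rng.normal(size=(4, 8))
out = causal_attention(q, k, v)

# Frame 0 can only attend to itself, so its output is exactly v[0].
assert np.allclose(out[0], v[0])
```

Because the mask is causal, appending a new frame never changes the outputs already computed for earlier frames, which is what makes incremental processing possible without reprocessing the sequence.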
Cached historical token memory mechanism

The authors propose caching historical keys and values as implicit memory tokens, allowing the model to maintain long-term context while supporting efficient incremental reconstruction. This mechanism enables the model to replicate temporal causal attention behavior during streaming inference.

9 retrieved papers · Can Refute
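As a sanity check on the cached-KV idea, the toy sketch below (NumPy, one token per frame, hypothetical shapes) shows that attending from each new frame's query over an append-only key/value cache reproduces exactly what a full causal pass over the whole sequence would compute, i.e., the cache acts as lossless implicit memory rather than an approximation.

```python
import numpy as np

def attend(q_new, k_cache, v_cache):
    """Attention output for one query over all cached keys/values."""
    d = q_new.shape[-1]
    scores = q_new @ k_cache.T / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ v_cache

rng = np.random.default_rng(1)
T, d = 5, 8
q = rng.normal(size=(T, d))
k = rng.normal(size=(T, d))
v = rng.normal(size=(T, d))

# Streaming pass: append each new frame's K/V to the cache, attend once.
k_cache = np.empty((0, d))
v_cache = np.empty((0, d))
streamed = []
for t in range(T):
    k_cache = np.vstack([k_cache, k[t:t + 1]])
    v_cache = np.vstack([v_cache, v[t:t + 1]])
    streamed.append(attend(q[t], k_cache, v_cache))
streamed = np.stack(streamed)

# Offline causal pass: frame t attends to frames 0..t of the full sequence.
offline = np.stack([attend(q[t], k[:t + 1], v[:t + 1]) for t in range(T)])

# The cache reproduces the offline causal result exactly.
assert np.allclose(streamed, offline)
```

The practical gain is cost: each new frame requires one attention pass over the cache instead of recomputing attention over the entire sequence.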
Knowledge distillation training strategy from bidirectional teacher

The authors develop a distillation-based training approach that transfers geometric understanding from the bidirectional VGGT teacher to the causal student model. This strategy unifies multi-task supervision through teacher-generated pseudo-ground truth, reducing training costs while maintaining high accuracy.

1 retrieved paper · Can Refute
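The distillation recipe (teacher outputs used as pseudo-ground truth for the student) can be sketched in miniature. Everything below is a hypothetical stand-in: a fixed linear map plays the bidirectional teacher, another linear map plays the causal student, and an L2 loss on teacher pseudo-labels drives the student toward the teacher's predictions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical stand-ins: a fixed linear "teacher" produces pseudo-ground
# truth, and a linear "student" is fit to it with an L2 distillation loss.
d_in, d_out, n = 16, 3, 256
W_teacher = rng.normal(size=(d_in, d_out))
x = rng.normal(size=(n, d_in))       # stand-in for per-frame features
pseudo_gt = x @ W_teacher            # teacher predictions used as targets

W_student = np.zeros((d_in, d_out))
lr = 0.01
for step in range(1000):
    pred = x @ W_student
    grad = 2 * x.T @ (pred - pseudo_gt) / n   # gradient of mean L2 loss
    W_student -= lr * grad

final_loss = np.mean((x @ W_student - pseudo_gt) ** 2)
# The student converges toward the teacher's behavior on the training data.
```

Using teacher predictions as targets is what lets a single regression loss stand in for multi-task supervision: the student never needs the original ground-truth annotations, only the teacher's outputs.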

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: Streaming visual geometry transformer with temporal causal attention
Contribution: Cached historical token memory mechanism
Contribution: Knowledge distillation training strategy from bidirectional teacher