Quantized Visual Geometry Grounded Transformer

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Geometry Grounded, Model Quantization
Abstract:

Learning-based 3D reconstruction models, represented by Visual Geometry Grounded Transformers (VGGTs), have achieved remarkable progress with large-scale transformers, but their prohibitive computational and memory costs severely hinder real-world deployment. Post-Training Quantization (PTQ) has emerged as a common practice for compressing and accelerating models. However, we empirically observe that PTQ faces unique obstacles when compressing billion-scale VGGTs: the data-independent special tokens induce heavy-tailed activation distributions, while the multi-view nature of 3D data makes calibration sample selection highly unstable. This paper proposes the first quantization framework for VGGTs, namely QuantVGGT, which rests on two technical contributions. First, we introduce Dual-Smoothed Fine-Grained Quantization, which integrates pre-global Hadamard rotation and post-local channel smoothing to robustly mitigate heavy-tailed distributions and inter-channel variance. Second, we design Noise-Filtered Diverse Sampling, which filters outliers via deep-layer statistics and constructs frame-aware diverse calibration clusters to ensure stable quantization ranges. Comprehensive experiments demonstrate that QuantVGGT achieves state-of-the-art results across different benchmarks and bit-widths, surpassing the previous state-of-the-art generic quantization method by a large margin. We highlight that our 4-bit QuantVGGT delivers a 3.7× memory reduction and 2.5× acceleration in real-hardware inference while preserving over 98% of the reconstruction accuracy of its full-precision counterpart. This demonstrates the vast advantages and practicality of QuantVGGT in resource-constrained scenarios.
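For readers unfamiliar with PTQ, the sketch below shows generic symmetric 4-bit uniform quantization, the standard primitive that frameworks like the one described here build on. This is a minimal illustration of low-bit quantization in general, not the paper's exact scheme; the function names and the per-tensor scale choice are assumptions for clarity.

```python
import numpy as np

def quantize_uniform(x, num_bits=4):
    """Symmetric per-tensor uniform quantization (generic PTQ primitive,
    not QuantVGGT's exact scheme)."""
    qmax = 2 ** (num_bits - 1) - 1            # 7 for signed 4-bit
    scale = np.max(np.abs(x)) / qmax          # map largest magnitude to qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)
q, s = quantize_uniform(w, num_bits=4)
w_hat = dequantize(q, s)
err = np.abs(w - w_hat).max()                 # bounded by half the scale
```

Storing 4-bit integers instead of 16-bit floats gives an ideal 4× memory reduction; the reported 3.7× is consistent with that bound once packing and scale/metadata overheads are presumably accounted for.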

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces QuantVGGT, a post-training quantization framework specifically designed for billion-scale Visual Geometry Grounded Transformers. According to the taxonomy tree, this work resides in the 'Post-Training Quantization for Vision Models' leaf under 'Model Compression and Optimization'. Notably, this leaf contains only the original paper itself with no sibling papers, indicating a relatively sparse research direction within the broader 50-paper taxonomy. The work addresses unique challenges in quantizing large-scale 3D reconstruction models, particularly heavy-tailed activation distributions from special tokens and calibration instability from multi-view data.

The taxonomy reveals that the broader 'Model Compression and Optimization' branch contains only two leaves: the original paper's leaf and 'Test-Time Compute Scaling'. This suggests model compression for visual geometry models is an emerging area with limited prior work in the surveyed literature. Neighboring branches include 'Machine Learning Foundations' (covering foundation models and multi-task learning) and 'Application Domains' (spanning embodied AI to NLP shared tasks). The scope_note for the parent branch explicitly focuses on reducing computational costs while preserving performance, excluding training-time methods and architectural innovations, which helps contextualize this work's post-training focus.

Among the three contributions analyzed, the literature search examined 13 total candidates. The 'Dual-Smoothed Fine-Grained Quantization' contribution examined 4 candidates with 1 appearing to provide overlapping prior work, suggesting some precedent for smoothing-based quantization techniques. The 'Noise-Filtered Diverse Sampling' contribution examined only 1 candidate with no refutation, while the overarching 'First PTQ framework for VGGTs' claim examined 8 candidates with none providing clear refutation. These statistics reflect a limited search scope rather than exhaustive coverage, indicating the novelty assessment is based on top-K semantic matches within a constrained candidate pool.

Based on the limited 13-candidate search, the framework appears to occupy a relatively unexplored niche at the intersection of post-training quantization and large-scale visual geometry models. The sparse taxonomy leaf and absence of sibling papers suggest this specific application domain has received minimal attention in the surveyed literature. However, the partial overlap found for the smoothing technique indicates that while the overall framework may be novel, some underlying mechanisms draw from established quantization practices. The analysis covers semantic neighbors but cannot claim exhaustive field coverage.

Taxonomy

Core-task Taxonomy Papers
Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 13
Refutable Papers: 1

Research Landscape Overview

The field of 3D reconstruction and visual geometry encompasses a broad spectrum of research directions, organized into five major branches: Model Compression and Optimization, Machine Learning Foundations and Paradigms, Application Domains and Task-Specific Methods, Research Methodology and Meta-Science, and Domain-Specific Empirical Studies. The Model Compression and Optimization branch focuses on techniques to reduce computational and memory costs of vision models while preserving accuracy, including post-training quantization strategies that enable efficient deployment. Machine Learning Foundations explores core algorithmic paradigms such as multi-task learning frameworks and test-time compute scaling approaches like Scaling Test-Time Compute[2]. Application Domains spans diverse task-specific methods ranging from embodied AI systems to quality control in manufacturing, while Research Methodology addresses meta-scientific concerns including systematic review practices and research design frameworks. Domain-Specific Empirical Studies contribute datasets and findings from specialized areas such as medical imaging, agriculture, and cybersecurity.

Within the Model Compression and Optimization branch, post-training quantization for vision models represents a particularly active area addressing the challenge of deploying large-scale visual geometry systems under resource constraints. Quantized Visual Geometry[0] situates itself squarely in this compression-focused cluster, emphasizing efficient representation of geometric features through quantization techniques. This work contrasts with broader methodological studies like Systematic Literature Reviews[5] or Research Methodology Guide[15], which provide meta-level guidance on conducting research rather than proposing specific technical solutions.

While application-oriented papers such as Quality Control Defects[3] demonstrate quantization benefits in industrial settings, Quantized Visual Geometry[0] appears more concerned with the fundamental compression mechanisms themselves, exploring trade-offs between model size, inference speed, and geometric reconstruction fidelity that are central to practical deployment of 3D vision systems.

Claimed Contributions

Dual-Smoothed Fine-Grained Quantization (DSFQ)

A quantization architecture that combines pre-global Hadamard rotation to disperse outliers and smooth heavy-tailed distributions with post-local channel smoothing to normalize channel-level variance. This dual-stage approach addresses the skewed activation distributions caused by data-independent special tokens in VGGT.
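To make the two stages concrete, the sketch below illustrates the general pattern the contribution describes: an orthonormal Hadamard rotation that spreads heavy-tailed outliers across channels (computation-invariant, since the rotation cancels inside the matmul), followed by a SmoothQuant-style per-channel rescaling between activations and weights. This is a hedged illustration of the pattern, not the paper's actual DSFQ; `dual_smooth`, the migration strength `alpha`, and the power-of-two dimension requirement are assumptions of this sketch.

```python
import numpy as np

def hadamard(n):
    """Sylvester-construction Hadamard matrix; n must be a power of two.
    Normalized so that H @ H.T == I (orthonormal)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def dual_smooth(x, w, alpha=0.5):
    """Illustrative two-stage smoothing (not the paper's exact DSFQ):
    1) global Hadamard rotation to disperse activation outliers;
    2) per-channel scale migration between activations and weights.
    `alpha` is a hypothetical migration-strength hyperparameter."""
    d = x.shape[-1]
    H = hadamard(d)
    x_rot, w_rot = x @ H, H.T @ w        # (x H)(H^T w) == x w, so output is preserved
    a_max = np.abs(x_rot).max(axis=0)    # per-channel activation range
    w_max = np.abs(w_rot).max(axis=1)    # per-channel weight range
    s = (a_max ** alpha) / (w_max ** (1 - alpha) + 1e-8)
    return x_rot / s, w_rot * s[:, None]

rng = np.random.default_rng(1)
x = rng.normal(size=(8, 16))             # activations: 8 tokens, 16 channels
w = rng.normal(size=(16, 4))             # a small linear layer
x_s, w_s = dual_smooth(x, w)
```

Because the rotation is orthonormal and the scales cancel, `x_s @ w_s` equals `x @ w` exactly; the transforms only reshape the per-channel statistics that the quantizer sees.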

4 retrieved papers, 1 of which can refute
Noise-Filtered Diverse Sampling (NFDS)

A calibration dataset construction strategy that filters noisy outlier samples using deep-layer activation statistics and employs frame-aware clustering aligned with VGGT's inductive biases. This ensures a representative and stable calibration set for post-training quantization.
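The general shape of such a strategy can be sketched as follows: discard samples whose deep-layer feature statistics are outliers, then cluster the survivors and keep one representative per cluster for diversity. This is a generic illustration of filter-then-diversify calibration sampling, not the paper's actual NFDS; the median-distance filter, plain k-means, and the `keep_ratio`/`k` parameters are assumptions of this sketch.

```python
import numpy as np

def select_calibration_set(feats, k=8, keep_ratio=0.8):
    """Illustrative calibration sampling (not the paper's exact NFDS):
    filter outliers by deep-layer feature norm, then pick one
    representative per cluster. `feats` is (num_samples, feat_dim)."""
    norms = np.linalg.norm(feats, axis=1)
    order = np.argsort(np.abs(norms - np.median(norms)))    # distance from median norm
    kept = order[: int(len(feats) * keep_ratio)]            # drop statistical outliers
    f = feats[kept]
    # plain k-means to form diverse clusters over the surviving samples
    rng = np.random.default_rng(0)
    centers = f[rng.choice(len(f), size=k, replace=False)]
    for _ in range(10):
        dists = np.linalg.norm(f[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = f[labels == j].mean(axis=0)
    # return the original index of the sample closest to each center
    picks = [kept[np.argmin(np.linalg.norm(f - c, axis=1))] for c in centers]
    return sorted(set(picks))

feats = np.random.default_rng(3).normal(size=(100, 32))     # stand-in deep-layer features
calib_idx = select_calibration_set(feats, k=8, keep_ratio=0.8)
```

The returned indices form a small, outlier-free, spread-out calibration set, which is the property that keeps quantization ranges stable across multi-view inputs.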

1 retrieved paper
First PTQ framework for VGGTs (QuantVGGT)

The first systematic post-training quantization framework specifically designed for Visual Geometry Grounded Transformers. It addresses unique challenges in quantizing billion-scale 3D reconstruction models through specialized techniques for handling data-independent tokens and multi-view data complexity.

8 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Dual-Smoothed Fine-Grained Quantization (DSFQ)

A quantization architecture that combines pre-global Hadamard rotation to disperse outliers and smooth heavy-tailed distributions with post-local channel smoothing to normalize channel-level variance. This dual-stage approach addresses the skewed activation distributions caused by data-independent special tokens in VGGT.

Contribution

Noise-Filtered Diverse Sampling (NFDS)

A calibration dataset construction strategy that filters noisy outlier samples using deep-layer activation statistics and employs frame-aware clustering aligned with VGGT's inductive biases. This ensures a representative and stable calibration set for post-training quantization.

Contribution

First PTQ framework for VGGTs (QuantVGGT)

The first systematic post-training quantization framework specifically designed for Visual Geometry Grounded Transformers. It addresses unique challenges in quantizing billion-scale 3D reconstruction models through specialized techniques for handling data-independent tokens and multi-view data complexity.