Quantized Visual Geometry Grounded Transformer

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Geometry Grounded, Model Quantization
Abstract:

Learning-based 3D reconstruction models, represented by Visual Geometry Grounded Transformers (VGGTs), have achieved remarkable progress with large-scale transformers, but their prohibitive computational and memory costs severely hinder real-world deployment. Post-Training Quantization (PTQ) has emerged as a common practice for compressing and accelerating models. However, we empirically observe that PTQ faces unique obstacles when compressing billion-scale VGGTs: the data-independent special tokens induce heavy-tailed activation distributions, while the multi-view nature of 3D data makes calibration sample selection highly unstable. This paper proposes the first quantization framework for VGGTs, namely QuantVGGT, which rests on two technical contributions. First, we introduce Dual-Smoothed Fine-Grained Quantization, which integrates pre-global Hadamard rotation and post-local channel smoothing to robustly mitigate heavy-tailed distributions and inter-channel variance. Second, we design Noise-Filtered Diverse Sampling, which filters outliers via deep-layer statistics and constructs frame-aware diverse calibration clusters to ensure stable quantization ranges. Comprehensive experiments demonstrate that QuantVGGT achieves state-of-the-art results across different benchmarks and bit-widths, surpassing the previous state-of-the-art generic quantization method by a large margin. We highlight that our 4-bit QuantVGGT delivers a 3.7× memory reduction and 2.5× acceleration in real-hardware inference while preserving over 98% of the reconstruction accuracy of its full-precision counterpart. This demonstrates the vast advantages and practicality of QuantVGGT in resource-constrained scenarios.
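For readers unfamiliar with PTQ, the sketch below shows generic symmetric 4-bit uniform quantization, the standard primitive that frameworks like the one described here build on. This is a minimal illustration of low-bit quantization in general, not the paper's exact scheme; the function names and the per-tensor scale choice are assumptions for clarity.

```python
import numpy as np

def quantize_uniform(x, num_bits=4):
    """Symmetric per-tensor uniform quantization (generic PTQ primitive,
    not QuantVGGT's exact scheme)."""
    qmax = 2 ** (num_bits - 1) - 1            # 7 for signed 4-bit
    scale = np.max(np.abs(x)) / qmax          # map largest magnitude to qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)
q, s = quantize_uniform(w, num_bits=4)
w_hat = dequantize(q, s)
err = np.abs(w - w_hat).max()                 # bounded by half the scale
```

Storing 4-bit integers instead of 16-bit floats gives an ideal 4× memory reduction; the reported 3.7× is consistent with that bound once packing and scale/metadata overheads are presumably accounted for.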

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces QuantVGGT, a post-training quantization framework specifically designed for billion-scale Visual Geometry Grounded Transformers. According to the taxonomy tree, this work resides in the 'Post-Training Quantization for Vision Models' leaf under 'Model Compression and Optimization'. Notably, this leaf contains only the original paper itself with no sibling papers, indicating a relatively sparse research direction within the broader 50-paper taxonomy. The work addresses unique challenges in quantizing large-scale 3D reconstruction models, particularly heavy-tailed activation distributions from special tokens and calibration instability from multi-view data.

The taxonomy reveals that the broader 'Model Compression and Optimization' branch contains only two leaves: the original paper's leaf and 'Test-Time Compute Scaling'. This suggests model compression for visual geometry models is an emerging area with limited prior work in the surveyed literature. Neighboring branches include 'Machine Learning Foundations' (covering foundation models and multi-task learning) and 'Application Domains' (spanning embodied AI to NLP shared tasks). The scope_note for the parent branch explicitly focuses on reducing computational costs while preserving performance, excluding training-time methods and architectural innovations, which helps contextualize this work's post-training focus.

Among the three contributions analyzed, the literature search examined 13 total candidates. The 'Dual-Smoothed Fine-Grained Quantization' contribution examined 4 candidates with 1 appearing to provide overlapping prior work, suggesting some precedent for smoothing-based quantization techniques. The 'Noise-Filtered Diverse Sampling' contribution examined only 1 candidate with no refutation, while the overarching 'First PTQ framework for VGGTs' claim examined 8 candidates with none providing clear refutation. These statistics reflect a limited search scope rather than exhaustive coverage, indicating the novelty assessment is based on top-K semantic matches within a constrained candidate pool.

Based on the limited 13-candidate search, the framework appears to occupy a relatively unexplored niche at the intersection of post-training quantization and large-scale visual geometry models. The sparse taxonomy leaf and absence of sibling papers suggest this specific application domain has received minimal attention in the surveyed literature. However, the partial overlap found for the smoothing technique indicates that while the overall framework may be novel, some underlying mechanisms draw from established quantization practices. The analysis covers semantic neighbors but cannot claim exhaustive field coverage.

Taxonomy

Core-task Taxonomy Papers
Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 13
Refutable Papers: 1

Research Landscape Overview

The field of 3D reconstruction and visual geometry encompasses a broad spectrum of research directions, organized into five major branches: Model Compression and Optimization, Machine Learning Foundations and Paradigms, Application Domains and Task-Specific Methods, Research Methodology and Meta-Science, and Domain-Specific Empirical Studies. The Model Compression and Optimization branch focuses on techniques to reduce computational and memory costs of vision models while preserving accuracy, including post-training quantization strategies that enable efficient deployment. Machine Learning Foundations explores core algorithmic paradigms such as multi-task learning frameworks and test-time compute scaling approaches like Scaling Test-Time Compute[2]. Application Domains spans diverse task-specific methods ranging from embodied AI systems to quality control in manufacturing, while Research Methodology addresses meta-scientific concerns including systematic review practices and research design frameworks. Domain-Specific Empirical Studies contribute datasets and findings from specialized areas such as medical imaging, agriculture, and cybersecurity.

Within the Model Compression and Optimization branch, post-training quantization for vision models represents a particularly active area addressing the challenge of deploying large-scale visual geometry systems under resource constraints. Quantized Visual Geometry[0] situates itself squarely in this compression-focused cluster, emphasizing efficient representation of geometric features through quantization techniques. This work contrasts with broader methodological studies like Systematic Literature Reviews[5] or Research Methodology Guide[15], which provide meta-level guidance on conducting research rather than proposing specific technical solutions.

While application-oriented papers such as Quality Control Defects[3] demonstrate quantization benefits in industrial settings, Quantized Visual Geometry[0] appears more concerned with the fundamental compression mechanisms themselves, exploring trade-offs between model size, inference speed, and geometric reconstruction fidelity that are central to practical deployment of 3D vision systems.

Claimed Contributions

Dual-Smoothed Fine-Grained Quantization (DSFQ)

A quantization architecture that combines pre-global Hadamard rotation to disperse outliers and smooth heavy-tailed distributions with post-local channel smoothing to normalize channel-level variance. This dual-stage approach addresses the skewed activation distributions caused by data-independent special tokens in VGGT.
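To make the two stages concrete, the sketch below illustrates the general pattern the contribution describes: an orthonormal Hadamard rotation that spreads heavy-tailed outliers across channels (computation-invariant, since the rotation cancels inside the matmul), followed by a SmoothQuant-style per-channel rescaling between activations and weights. This is a hedged illustration of the pattern, not the paper's actual DSFQ; `dual_smooth`, the migration strength `alpha`, and the power-of-two dimension requirement are assumptions of this sketch.

```python
import numpy as np

def hadamard(n):
    """Sylvester-construction Hadamard matrix; n must be a power of two.
    Normalized so that H @ H.T == I (orthonormal)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def dual_smooth(x, w, alpha=0.5):
    """Illustrative two-stage smoothing (not the paper's exact DSFQ):
    1) global Hadamard rotation to disperse activation outliers;
    2) per-channel scale migration between activations and weights.
    `alpha` is a hypothetical migration-strength hyperparameter."""
    d = x.shape[-1]
    H = hadamard(d)
    x_rot, w_rot = x @ H, H.T @ w        # (x H)(H^T w) == x w, so output is preserved
    a_max = np.abs(x_rot).max(axis=0)    # per-channel activation range
    w_max = np.abs(w_rot).max(axis=1)    # per-channel weight range
    s = (a_max ** alpha) / (w_max ** (1 - alpha) + 1e-8)
    return x_rot / s, w_rot * s[:, None]

rng = np.random.default_rng(1)
x = rng.normal(size=(8, 16))             # activations: 8 tokens, 16 channels
w = rng.normal(size=(16, 4))             # a small linear layer
x_s, w_s = dual_smooth(x, w)
```

Because the rotation is orthonormal and the scales cancel, `x_s @ w_s` equals `x @ w` exactly; the transforms only reshape the per-channel statistics that the quantizer sees.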

4 retrieved papers, 1 of which can refute
Noise-Filtered Diverse Sampling (NFDS)

A calibration dataset construction strategy that filters noisy outlier samples using deep-layer activation statistics and employs frame-aware clustering aligned with VGGT's inductive biases. This ensures a representative and stable calibration set for post-training quantization.
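The general shape of such a strategy can be sketched as follows: discard samples whose deep-layer feature statistics are outliers, then cluster the survivors and keep one representative per cluster for diversity. This is a generic illustration of filter-then-diversify calibration sampling, not the paper's actual NFDS; the median-distance filter, plain k-means, and the `keep_ratio`/`k` parameters are assumptions of this sketch.

```python
import numpy as np

def select_calibration_set(feats, k=8, keep_ratio=0.8):
    """Illustrative calibration sampling (not the paper's exact NFDS):
    filter outliers by deep-layer feature norm, then pick one
    representative per cluster. `feats` is (num_samples, feat_dim)."""
    norms = np.linalg.norm(feats, axis=1)
    order = np.argsort(np.abs(norms - np.median(norms)))    # distance from median norm
    kept = order[: int(len(feats) * keep_ratio)]            # drop statistical outliers
    f = feats[kept]
    # plain k-means to form diverse clusters over the surviving samples
    rng = np.random.default_rng(0)
    centers = f[rng.choice(len(f), size=k, replace=False)]
    for _ in range(10):
        dists = np.linalg.norm(f[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = f[labels == j].mean(axis=0)
    # return the original index of the sample closest to each center
    picks = [kept[np.argmin(np.linalg.norm(f - c, axis=1))] for c in centers]
    return sorted(set(picks))

feats = np.random.default_rng(3).normal(size=(100, 32))     # stand-in deep-layer features
calib_idx = select_calibration_set(feats, k=8, keep_ratio=0.8)
```

The returned indices form a small, outlier-free, spread-out calibration set, which is the property that keeps quantization ranges stable across multi-view inputs.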

1 retrieved paper
First PTQ framework for VGGTs (QuantVGGT)

The first systematic post-training quantization framework specifically designed for Visual Geometry Grounded Transformers. It addresses unique challenges in quantizing billion-scale 3D reconstruction models through specialized techniques for handling data-independent tokens and multi-view data complexity.

8 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Dual-Smoothed Fine-Grained Quantization (DSFQ)

A quantization architecture that combines pre-global Hadamard rotation to disperse outliers and smooth heavy-tailed distributions with post-local channel smoothing to normalize channel-level variance. This dual-stage approach addresses the skewed activation distributions caused by data-independent special tokens in VGGT.

Contribution

Noise-Filtered Diverse Sampling (NFDS)

A calibration dataset construction strategy that filters noisy outlier samples using deep-layer activation statistics and employs frame-aware clustering aligned with VGGT's inductive biases. This ensures a representative and stable calibration set for post-training quantization.

Contribution

First PTQ framework for VGGTs (QuantVGGT)

The first systematic post-training quantization framework specifically designed for Visual Geometry Grounded Transformers. It addresses unique challenges in quantizing billion-scale 3D reconstruction models through specialized techniques for handling data-independent tokens and multi-view data complexity.