Scalable Training for Vector-Quantized Networks with 100% Codebook Utilization

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Generative Model, Image Quantization, Autoregressive Modeling, Image Generation, Image Synthesis
Abstract:

Vector quantization (VQ) is a key component in discrete tokenizers for image generation, but its training is often unstable due to straight-through estimation bias, one-step-behind updates, and sparse codebook gradients, which lead to suboptimal reconstruction performance and low codebook usage. In this work, we analyze these fundamental challenges and provide a simple yet effective solution. To maintain high codebook usage in VQ networks (VQN) during learning annealing and codebook size expansion, we propose VQBridge, a robust, scalable, and efficient projector based on the map function method. VQBridge optimizes code vectors through a compress–process–recover pipeline, enabling stable and effective codebook training. By combining VQBridge with learning annealing, our VQN achieves full (100%) codebook usage across diverse codebook configurations, which we refer to as FVQ (FullVQ). Through extensive experiments, we demonstrate that FVQ is effective, scalable, and generalizable: it attains 100% codebook usage even with a 262k-codebook, achieves state-of-the-art reconstruction performance, consistently improves with larger codebooks, higher vector channels, or longer training, and remains effective across different VQ variants. Moreover, when integrated with LlamaGen, FVQ significantly enhances image generation performance, surpassing visual autoregressive models (VAR) by 0.5 and diffusion models (DiT) by 0.2 rFID, highlighting the importance of high-quality tokenizers for strong autoregressive image generation.
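The training pathologies named above stem from the hard nearest-neighbour assignment at the core of VQ. The following is a minimal NumPy sketch of a standard VQ-VAE-style quantizer (not the paper's VQBridge; all sizes are illustrative) showing why codebook gradients are sparse: only codes selected by the argmin would receive updates, so a single batch touches a small fraction of a large codebook.

```python
import numpy as np

def vq_straight_through(z, codebook):
    """Nearest-neighbour vector quantization (generic VQ-VAE style, not the
    paper's method). Returns quantized vectors and the chosen code indices.
    In a real training loop the straight-through trick copies gradients from
    the quantized output back onto z, bypassing the non-differentiable argmin,
    which is the source of the estimation bias the paper analyses."""
    # squared distances between each latent and each code vector
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)   # hard assignment: only these codes
    z_q = codebook[idx]      # would receive gradient -> sparse updates
    return z_q, idx

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))   # 16 codes, 4-dim vectors (toy sizes)
z = rng.normal(size=(8, 4))           # a batch of 8 encoder outputs
z_q, idx = vq_straight_through(z, codebook)

# Codebook usage: fraction of codes hit by this batch. With 8 latents and
# 16 codes it is at most 50% here; at 262k codes the gap is far worse,
# which is the low-utilization problem FVQ targets.
usage = len(np.unique(idx)) / len(codebook)
```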

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes VQBridge, a projector-based method to stabilize vector quantization training, and FVQ, a framework achieving full codebook utilization even at 262k codes. It sits within the Codebook Learning and Utilization Enhancement leaf, which contains only three papers total. This is a relatively sparse research direction within the broader taxonomy, suggesting the specific problem of maximizing codebook usage during training has received focused but limited attention. The sibling papers in this leaf address related training stability and utilization challenges, indicating the work targets a recognized but not overcrowded niche.

The taxonomy reveals neighboring leaves addressing Quantization Scheme Design (four papers on regularization and stochastic methods) and Semantic and Multi-Modal Codebook Alignment (four papers on semantic supervision). The parent category, Vector Quantization Training Optimization, encompasses twelve papers across these three leaves. The paper's focus on training dynamics and codebook collapse connects it to optimization-centric work like Stochastic VQ Optimization, while remaining distinct from semantic alignment approaches. The taxonomy structure shows this work occupies a training-focused branch separate from architectural innovations (Encoder-Decoder Architecture Design) and application-specific adaptations (Video Tokenization, Communication Systems).

Among thirty candidates examined, the analysis found four refutable pairs across three contributions. The VQBridge projector contribution examined ten candidates with zero refutations, suggesting novelty in the specific compress-process-recover pipeline design. The FVQ framework contribution examined ten candidates with one refutation, indicating some prior work on full codebook utilization exists within the limited search scope. The analysis of fundamental VQ training challenges examined ten candidates with three refutations, reflecting that training instability, gradient estimation bias, and codebook collapse have been previously studied, though the specific combination and solutions may differ.

Based on the top-thirty semantic matches examined, the work appears to introduce novel technical mechanisms (VQBridge) while addressing well-recognized training challenges. The limited search scope means the analysis captures nearby prior work but cannot claim exhaustive coverage of all VQ training literature. The sparse taxonomy leaf and low refutation counts for the core technical contribution suggest meaningful novelty, though the broader problem space of VQ training stability has established foundations.

Taxonomy

50 Core-task Taxonomy Papers
3 Claimed Contributions
30 Contribution Candidate Papers Compared
4 Refutable Papers

Research Landscape Overview

Core task: Scalable training for vector quantization in discrete image tokenizers.

The field of discrete image tokenization via vector quantization has grown into a rich landscape with several major branches. Vector Quantization Training Optimization focuses on improving codebook learning, utilization, and convergence strategies, often addressing issues like codebook collapse and training instability. Tokenizer Architecture and Representation explores different encoder-decoder designs and multi-scale or hierarchical token structures. Objective Function and Reconstruction Quality investigates loss formulations and perceptual metrics that balance fidelity with compactness. Application-Specific Tokenization tailors VQ methods to domains such as video, medical imaging, or remote sensing, while Generation with Discrete Tokens examines how tokenized representations feed into autoregressive or diffusion models. Classical and Optimization-Based VQ Methods revisit foundational algorithms and heuristic search techniques, and Multi-Vector and Product Quantization decomposes representations into multiple codebooks for greater expressiveness. Domain-Specific Non-Image Applications extends VQ ideas to speech, gesture, and other modalities.

Representative works like Improved VQGAN[2] and BEiT v2[4] illustrate how tokenizers serve both reconstruction and self-supervised learning, while Unitok[5] and Vqtoken[7] push toward unified or more efficient token designs. A particularly active line of work centers on codebook learning and utilization enhancement, where methods like Stochastic VQ Optimization[3] and EdVAE[31] tackle training stability and codebook collapse through novel optimization strategies or entropy regularization. Scalable VQ Training[0] sits squarely in this cluster, emphasizing scalable techniques to improve codebook usage and convergence at larger scales.
Compared to Taming Visual Tokenizer[18], which also addresses codebook utilization and training dynamics, Scalable VQ Training[0] places greater emphasis on computational efficiency and scalability for high-resolution or large-batch regimes. Meanwhile, EdVAE[31] introduces entropy-driven regularization to encourage diverse code usage, a complementary angle to the optimization-centric focus of Scalable VQ Training[0]. Across these branches, key trade-offs involve balancing reconstruction quality against codebook size, training stability versus expressiveness, and domain-general versus application-specific designs. Open questions remain around optimal codebook initialization, dynamic codebook growth, and the interplay between tokenizer design and downstream generative model performance.

Claimed Contributions

VQBridge projector for stable vector quantization training

The authors introduce VQBridge, a novel projector module that uses a compress-process-recover pipeline with ViT blocks to enable stable and effective codebook training in vector-quantized networks. This projector addresses fundamental challenges in VQ training including straight-through estimation bias, one-step-behind updates, and sparse codebook gradients.

10 retrieved papers
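The compress-process-recover pipeline can be pictured as a three-stage map over the codebook. The sketch below is a hypothetical stand-in, not the authors' implementation: the paper uses ViT blocks in the processing stage, replaced here by a single linear layer with ReLU, and the dimensions (`d_hidden=8`, a 16x4 codebook) are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def projector(codebook, d_hidden=8):
    """Hypothetical compress-process-recover sketch. The real VQBridge
    processes the compressed representation with ViT blocks; a single
    linear + ReLU stands in here, since the paper's exact architecture
    is not reproduced."""
    n, d = codebook.shape
    W_down = rng.normal(scale=d ** -0.5, size=(d, d_hidden))        # compress
    W_mid = rng.normal(scale=d_hidden ** -0.5, size=(d_hidden, d_hidden))
    W_up = rng.normal(scale=d_hidden ** -0.5, size=(d_hidden, d))   # recover
    h = codebook @ W_down              # compress: (n, d) -> (n, d_hidden)
    h = np.maximum(h @ W_mid, 0.0)     # process (stand-in for ViT blocks)
    return h @ W_up                    # recover: back to code-vector space

codebook = rng.normal(size=(16, 4))
out = projector(codebook)              # refined code vectors, same shape
```

Because the projector transforms all code vectors jointly, every codebook entry receives gradient through it on every step, in contrast to the argmin-gated updates of plain VQ.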
FVQ framework achieving 100% codebook utilization

The authors develop FVQ, a training framework combining VQBridge with learning annealing that consistently achieves complete codebook usage even with very large codebooks (up to 262k entries), addressing the long-standing codebook collapse problem in vector quantization.

10 retrieved papers
Can Refute

Analysis of fundamental VQ training challenges and solutions

The authors provide a systematic analysis of three core challenges in vector quantization training (straight-through estimation bias, one-step-behind updates, and sparse codebook gradients) and derive key observations about learning annealing and projector design that inform their solution.

10 retrieved papers
Can Refute
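The "one-step-behind" dynamic is easiest to see in the classic EMA codebook update (the VQ-VAE-2 recipe, shown here as the baseline under analysis, not the authors' fix): code vectors move toward the encoder outputs assigned to them on the previous pass, so they always trail the encoder, and codes that receive no assignments barely move at all, which is how sparse updates starve parts of the codebook. All sizes below are toy values.

```python
import numpy as np

def ema_update(codebook_state, z, idx, n_codes=16, decay=0.99, eps=1e-5):
    """One EMA codebook step in the standard VQ-VAE-2 style. The new code
    vectors are running means of the latents assigned to them, i.e. they
    chase encoder outputs from the previous step (one-step-behind), and
    unassigned codes keep near-zero statistics (sparse updates)."""
    ema_count, ema_sum = codebook_state
    one_hot = np.eye(n_codes)[idx]                       # (batch, n_codes)
    ema_count = decay * ema_count + (1 - decay) * one_hot.sum(0)
    ema_sum = decay * ema_sum + (1 - decay) * (one_hot.T @ z)
    n = ema_count.sum()
    # Laplace smoothing keeps the division finite for rarely-used codes
    smoothed = (ema_count + eps) / (n + n_codes * eps) * n
    codebook = ema_sum / smoothed[:, None]
    return codebook, (ema_count, ema_sum)

rng = np.random.default_rng(2)
state = (np.zeros(16), np.zeros((16, 4)))   # EMA counts and sums
z = rng.normal(size=(8, 4))                 # encoder outputs
idx = rng.integers(0, 16, size=8)           # nearest-code assignments
codebook, state = ema_update(state, z, idx)
```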

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

VQBridge projector for stable vector quantization training

The authors introduce VQBridge, a novel projector module that uses a compress-process-recover pipeline with ViT blocks to enable stable and effective codebook training in vector-quantized networks. This projector addresses fundamental challenges in VQ training including straight-through estimation bias, one-step-behind updates, and sparse codebook gradients.

Contribution

FVQ framework achieving 100% codebook utilization

The authors develop FVQ, a training framework combining VQBridge with learning annealing that consistently achieves complete codebook usage even with very large codebooks (up to 262k entries), addressing the long-standing codebook collapse problem in vector quantization.

Contribution

Analysis of fundamental VQ training challenges and solutions

The authors provide a systematic analysis of three core challenges in vector quantization training (straight-through estimation bias, one-step-behind updates, and sparse codebook gradients) and derive key observations about learning annealing and projector design that inform their solution.
