Scalable Training for Vector-Quantized Networks with 100% Codebook Utilization

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Generative Model, Image Quantization, Autoregressive Modeling, Image Generation, Image Synthesis
Abstract:

Vector quantization (VQ) is a key component in discrete tokenizers for image generation, but its training is often unstable due to straight-through estimation bias, one-step-behind updates, and sparse codebook gradients, which lead to suboptimal reconstruction performance and low codebook usage. In this work, we analyze these fundamental challenges and provide a simple yet effective solution. To maintain high codebook usage in VQ networks (VQN) during learning annealing and codebook size expansion, we propose VQBridge, a robust, scalable, and efficient projector based on the map function method. VQBridge optimizes code vectors through a compress–process–recover pipeline, enabling stable and effective codebook training. By combining VQBridge with learning annealing, our VQN achieves full (100%) codebook usage across diverse codebook configurations, which we refer to as FVQ (FullVQ). Through extensive experiments, we demonstrate that FVQ is effective, scalable, and generalizable: it attains 100% codebook usage even with a 262k-codebook, achieves state-of-the-art reconstruction performance, consistently improves with larger codebooks, higher vector channels, or longer training, and remains effective across different VQ variants. Moreover, when integrated with LlamaGen, FVQ significantly enhances image generation performance, surpassing visual autoregressive models (VAR) by 0.5 and diffusion models (DiT) by 0.2 rFID, highlighting the importance of high-quality tokenizers for strong autoregressive image generation.
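The training pathologies named above stem from the hard nearest-neighbour assignment at the core of VQ. The following is a minimal NumPy sketch of a standard VQ-VAE-style quantizer (not the paper's VQBridge; all sizes are illustrative) showing why codebook gradients are sparse: only codes selected by the argmin would receive updates, so a single batch touches a small fraction of a large codebook.

```python
import numpy as np

def vq_straight_through(z, codebook):
    """Nearest-neighbour vector quantization (generic VQ-VAE style, not the
    paper's method). Returns quantized vectors and the chosen code indices.
    In a real training loop the straight-through trick copies gradients from
    the quantized output back onto z, bypassing the non-differentiable argmin,
    which is the source of the estimation bias the paper analyses."""
    # squared distances between each latent and each code vector
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)   # hard assignment: only these codes
    z_q = codebook[idx]      # would receive gradient -> sparse updates
    return z_q, idx

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))   # 16 codes, 4-dim vectors (toy sizes)
z = rng.normal(size=(8, 4))           # a batch of 8 encoder outputs
z_q, idx = vq_straight_through(z, codebook)

# Codebook usage: fraction of codes hit by this batch. With 8 latents and
# 16 codes it is at most 50% here; at 262k codes the gap is far worse,
# which is the low-utilization problem FVQ targets.
usage = len(np.unique(idx)) / len(codebook)
```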

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes VQBridge, a projector-based method to stabilize vector quantization training, and FVQ, a framework achieving full codebook utilization even at 262k codes. It sits within the Codebook Learning and Utilization Enhancement leaf, which contains only three papers total. This is a relatively sparse research direction within the broader taxonomy, suggesting the specific problem of maximizing codebook usage during training has received focused but limited attention. The sibling papers in this leaf address related training stability and utilization challenges, indicating the work targets a recognized but not overcrowded niche.

The taxonomy reveals neighboring leaves addressing Quantization Scheme Design (four papers on regularization and stochastic methods) and Semantic and Multi-Modal Codebook Alignment (four papers on semantic supervision). The parent category, Vector Quantization Training Optimization, encompasses twelve papers across these three leaves. The paper's focus on training dynamics and codebook collapse connects it to optimization-centric work like Stochastic VQ Optimization, while remaining distinct from semantic alignment approaches. The taxonomy structure shows this work occupies a training-focused branch separate from architectural innovations (Encoder-Decoder Architecture Design) and application-specific adaptations (Video Tokenization, Communication Systems).

Among thirty candidates examined, the analysis found four refutable pairs across three contributions. The VQBridge projector contribution examined ten candidates with zero refutations, suggesting novelty in the specific compress-process-recover pipeline design. The FVQ framework contribution examined ten candidates with one refutation, indicating some prior work on full codebook utilization exists within the limited search scope. The analysis of fundamental VQ training challenges examined ten candidates with three refutations, reflecting that training instability, gradient estimation bias, and codebook collapse have been previously studied, though the specific combination and solutions may differ.

Based on the top-thirty semantic matches examined, the work appears to introduce novel technical mechanisms (VQBridge) while addressing well-recognized training challenges. The limited search scope means the analysis captures nearby prior work but cannot claim exhaustive coverage of all VQ training literature. The sparse taxonomy leaf and low refutation counts for the core technical contribution suggest meaningful novelty, though the broader problem space of VQ training stability has established foundations.

Taxonomy

50 Core-task Taxonomy Papers
3 Claimed Contributions
30 Contribution Candidate Papers Compared
4 Refutable Papers

Research Landscape Overview

Core task: Scalable training for vector quantization in discrete image tokenizers.

The field of discrete image tokenization via vector quantization has grown into a rich landscape with several major branches. Vector Quantization Training Optimization focuses on improving codebook learning, utilization, and convergence strategies, often addressing issues like codebook collapse and training instability. Tokenizer Architecture and Representation explores different encoder-decoder designs and multi-scale or hierarchical token structures. Objective Function and Reconstruction Quality investigates loss formulations and perceptual metrics that balance fidelity with compactness. Application-Specific Tokenization tailors VQ methods to domains such as video, medical imaging, or remote sensing, while Generation with Discrete Tokens examines how tokenized representations feed into autoregressive or diffusion models. Classical and Optimization-Based VQ Methods revisit foundational algorithms and heuristic search techniques, and Multi-Vector and Product Quantization decomposes representations into multiple codebooks for greater expressiveness. Domain-Specific Non-Image Applications extends VQ ideas to speech, gesture, and other modalities.

Representative works like Improved VQGAN[2] and BEiT v2[4] illustrate how tokenizers serve both reconstruction and self-supervised learning, while Unitok[5] and Vqtoken[7] push toward unified or more efficient token designs. A particularly active line of work centers on codebook learning and utilization enhancement, where methods like Stochastic VQ Optimization[3] and EdVAE[31] tackle training stability and codebook collapse through novel optimization strategies or entropy regularization. Scalable VQ Training[0] sits squarely in this cluster, emphasizing scalable techniques to improve codebook usage and convergence at larger scales.
Compared to Taming Visual Tokenizer[18], which also addresses codebook utilization and training dynamics, Scalable VQ Training[0] places greater emphasis on computational efficiency and scalability for high-resolution or large-batch regimes. Meanwhile, EdVAE[31] introduces entropy-driven regularization to encourage diverse code usage, a complementary angle to the optimization-centric focus of Scalable VQ Training[0]. Across these branches, key trade-offs involve balancing reconstruction quality against codebook size, training stability versus expressiveness, and domain-general versus application-specific designs. Open questions remain around optimal codebook initialization, dynamic codebook growth, and the interplay between tokenizer design and downstream generative model performance.

Claimed Contributions

VQBridge projector for stable vector quantization training

The authors introduce VQBridge, a novel projector module that uses a compress-process-recover pipeline with ViT blocks to enable stable and effective codebook training in vector-quantized networks. This projector addresses fundamental challenges in VQ training including straight-through estimation bias, one-step-behind updates, and sparse codebook gradients.

10 retrieved papers
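The compress-process-recover pipeline can be pictured as a three-stage map over the codebook. The sketch below is a hypothetical stand-in, not the authors' implementation: the paper uses ViT blocks in the processing stage, replaced here by a single linear layer with ReLU, and the dimensions (`d_hidden=8`, a 16x4 codebook) are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def projector(codebook, d_hidden=8):
    """Hypothetical compress-process-recover sketch. The real VQBridge
    processes the compressed representation with ViT blocks; a single
    linear + ReLU stands in here, since the paper's exact architecture
    is not reproduced."""
    n, d = codebook.shape
    W_down = rng.normal(scale=d ** -0.5, size=(d, d_hidden))        # compress
    W_mid = rng.normal(scale=d_hidden ** -0.5, size=(d_hidden, d_hidden))
    W_up = rng.normal(scale=d_hidden ** -0.5, size=(d_hidden, d))   # recover
    h = codebook @ W_down              # compress: (n, d) -> (n, d_hidden)
    h = np.maximum(h @ W_mid, 0.0)     # process (stand-in for ViT blocks)
    return h @ W_up                    # recover: back to code-vector space

codebook = rng.normal(size=(16, 4))
out = projector(codebook)              # refined code vectors, same shape
```

Because the projector transforms all code vectors jointly, every codebook entry receives gradient through it on every step, in contrast to the argmin-gated updates of plain VQ.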
FVQ framework achieving 100% codebook utilization

The authors develop FVQ, a training framework combining VQBridge with learning annealing that consistently achieves complete codebook usage even with very large codebooks (up to 262k entries), addressing the long-standing codebook collapse problem in vector quantization.

10 retrieved papers
Can Refute

Analysis of fundamental VQ training challenges and solutions

The authors provide a systematic analysis of three core challenges in vector quantization training (straight-through estimation bias, one-step-behind updates, and sparse codebook gradients) and derive key observations about learning annealing and projector design that inform their solution.

10 retrieved papers
Can Refute
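The "one-step-behind" dynamic is easiest to see in the classic EMA codebook update (the VQ-VAE-2 recipe, shown here as the baseline under analysis, not the authors' fix): code vectors move toward the encoder outputs assigned to them on the previous pass, so they always trail the encoder, and codes that receive no assignments barely move at all, which is how sparse updates starve parts of the codebook. All sizes below are toy values.

```python
import numpy as np

def ema_update(codebook_state, z, idx, n_codes=16, decay=0.99, eps=1e-5):
    """One EMA codebook step in the standard VQ-VAE-2 style. The new code
    vectors are running means of the latents assigned to them, i.e. they
    chase encoder outputs from the previous step (one-step-behind), and
    unassigned codes keep near-zero statistics (sparse updates)."""
    ema_count, ema_sum = codebook_state
    one_hot = np.eye(n_codes)[idx]                       # (batch, n_codes)
    ema_count = decay * ema_count + (1 - decay) * one_hot.sum(0)
    ema_sum = decay * ema_sum + (1 - decay) * (one_hot.T @ z)
    n = ema_count.sum()
    # Laplace smoothing keeps the division finite for rarely-used codes
    smoothed = (ema_count + eps) / (n + n_codes * eps) * n
    codebook = ema_sum / smoothed[:, None]
    return codebook, (ema_count, ema_sum)

rng = np.random.default_rng(2)
state = (np.zeros(16), np.zeros((16, 4)))   # EMA counts and sums
z = rng.normal(size=(8, 4))                 # encoder outputs
idx = rng.integers(0, 16, size=8)           # nearest-code assignments
codebook, state = ema_update(state, z, idx)
```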

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

VQBridge projector for stable vector quantization training

The authors introduce VQBridge, a novel projector module that uses a compress-process-recover pipeline with ViT blocks to enable stable and effective codebook training in vector-quantized networks. This projector addresses fundamental challenges in VQ training including straight-through estimation bias, one-step-behind updates, and sparse codebook gradients.

Contribution

FVQ framework achieving 100% codebook utilization

The authors develop FVQ, a training framework combining VQBridge with learning annealing that consistently achieves complete codebook usage even with very large codebooks (up to 262k entries), addressing the long-standing codebook collapse problem in vector quantization.

Contribution

Analysis of fundamental VQ training challenges and solutions

The authors provide a systematic analysis of three core challenges in vector quantization training (straight-through estimation bias, one-step-behind updates, and sparse codebook gradients) and derive key observations about learning annealing and projector design that inform their solution.
