Compute-Optimal Quantization-Aware Training

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: quantization-aware training (QAT), neural network quantization, compute optimization, scaling laws, large language models (LLMs), model compression, compute budget allocation, training efficiency, model optimization, quantized neural networks, efficient deep learning
Abstract:

Quantization-aware training (QAT) is a leading technique for improving the accuracy of quantized neural networks. Previous work has shown that decomposing training into a full-precision (FP) phase followed by a QAT phase yields superior accuracy compared to QAT alone. However, the optimal allocation of compute between the FP and QAT phases remains unclear. We conduct extensive experiments with various compute budgets, QAT bit widths, and model sizes from 86.0M to 2.2B to investigate how different QAT durations impact final performance. We demonstrate that, contrary to previous findings, the loss-optimal ratio of QAT to FP training increases with the total amount of compute. Moreover, the optimal fraction can be accurately predicted for a wide range of model sizes and quantization widths using the tokens-per-parameter-byte statistic. From experimental data, we derive a loss scaling law that predicts both optimal QAT ratios and final model performance across different QAT/FP compute allocation strategies and QAT bit widths. We use the scaling law to make further predictions, which we verify experimentally, including which QAT bit width is optimal under a given memory constraint and how QAT accuracy with different bit widths compares to full-precision model accuracy. Additionally, we propose a novel cooldown and QAT fusion approach that performs learning rate decay jointly with quantization-aware training, eliminating redundant full-precision model updates and achieving significant compute savings. These findings provide practical insights into efficient QAT planning and enable the training of higher-quality quantized models with the same compute budget.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper investigates optimal compute allocation between full-precision pretraining and quantization-aware training phases, deriving scaling laws to predict loss-optimal QAT fractions across model sizes and bit-widths. It resides in the 'Multi-Phase Training Strategies with Explicit Compute Partitioning' leaf, which contains only two papers total. This represents a sparse research direction within the broader taxonomy, suggesting that principled compute allocation frameworks for QAT remain relatively underexplored despite the practical importance of balancing pretraining and quantization budgets.

The taxonomy reveals neighboring branches addressing mixed-precision bit-width search, uniform-precision gradient optimizations, and domain-specific QAT for edge devices or generative models. The original paper diverges from these by focusing on temporal compute partitioning rather than spatial precision assignment or application-specific tuning. Its sibling work examines domain-adapted continued pretraining before quantization, whereas this paper proposes general scaling laws applicable across architectures. The taxonomy's scope notes clarify that runtime quantization switching and single-phase QAT methods fall outside this leaf, emphasizing the focus on explicit multi-phase decomposition with compute budget considerations.

Of the eighteen candidate papers examined across the three contributions, the 'Comprehensive loss scaling law for QAT' contribution has one refutable candidate among its ten, indicating some prior work on scaling behavior in quantization contexts. The 'Compute-dependent optimal QAT fraction discovery' was compared against a single candidate with no refutation, while the 'QAT and learning rate cooldown fusion technique' was compared against seven candidates without clear prior overlap. Given the limited search scope (eighteen papers from semantic retrieval, not an exhaustive survey), these statistics suggest the core allocation framework and the fusion technique are relatively novel within the examined subset, though the scaling-law concept has partial precedent.

Based on top-eighteen semantic matches and the sparse taxonomy leaf, the work appears to occupy a relatively unexplored niche in compute allocation for multi-phase QAT. The analysis does not cover exhaustive prior art in adjacent fields like neural architecture search or hyperparameter optimization, where compute allocation principles may exist. The contribution-level statistics reflect limited overlap within the examined candidates, though broader literature may contain related insights not captured by this search.

Taxonomy

Core-task Taxonomy Papers: 26
Claimed Contributions: 3
Contribution Candidate Papers Compared: 18
Refutable Papers: 1

Research Landscape Overview

Core task: optimal compute allocation between full-precision and quantization-aware training. The field addresses how to distribute computational resources when preparing neural networks for deployment under strict precision constraints. The taxonomy reveals several main branches: one focuses explicitly on compute budget allocation and training phase optimization, exploring when and how long to train at different precisions; another examines mixed-precision quantization methods that assign varying bit-widths across layers or operations; a third covers uniform-precision QAT techniques that refine training procedures under a single target bit-width; additional branches address domain-specific applications, post-training quantization (PTQ) comparisons, multi-model compression scenarios, and novel quantization formats.

Works such as MPBRQ[2] and Mixed-Precision Federated[7] illustrate how mixed-precision strategies can be tailored to federated or resource-constrained settings, while methods like MobileQuant[9] and Edgeqat[3] demonstrate domain-targeted optimizations for mobile and edge devices.

A particularly active line of inquiry concerns multi-phase training strategies with explicit compute partitioning, where researchers investigate the trade-off between initial full-precision pretraining and subsequent quantization-aware fine-tuning. Compute-Optimal QAT[0] sits squarely within this branch, proposing a principled framework for deciding how much compute to allocate to each phase in order to maximize final model quality under a fixed budget. This contrasts with neighboring efforts like Domain-Adapted LLMs[21], which emphasize domain-specific continued pretraining before quantization, and with works such as Block Replacement QAT[5] or Autoencoder Partitioning QAT[1], which partition models spatially rather than temporally.
The central open question across these directions is whether compute should be front-loaded in high-precision pretraining or reserved for longer quantization-aware refinement, and Compute-Optimal QAT[0] contributes empirical scaling laws and allocation heuristics to guide this decision in practice.

Claimed Contributions

Compute-dependent optimal QAT fraction discovery

The authors demonstrate that the optimal fraction of training allocated to quantization-aware training (QAT) is not fixed but increases with total compute budget, specifically with the tokens-per-parameter-byte statistic. This challenges prior assumptions that a fixed percentage (e.g., 10%) is universally optimal and shows that suboptimal allocation can waste substantial compute.

1 retrieved paper
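The tokens-per-parameter-byte statistic that drives this prediction can be sketched as follows. The exact definition is an assumption here (total training tokens divided by the quantized model's size in bytes); the paper's precise formulation may differ:

```python
def tokens_per_parameter_byte(total_tokens: float, num_params: float,
                              bit_width: int) -> float:
    """Tokens per parameter-byte (assumed definition): training tokens
    divided by the model's quantized size in bytes."""
    bytes_per_param = bit_width / 8.0          # e.g. 4-bit -> 0.5 bytes
    return total_tokens / (num_params * bytes_per_param)

# Example: a 2.2B-parameter model trained on 100B tokens at 4-bit QAT.
stat = tokens_per_parameter_byte(100e9, 2.2e9, bit_width=4)
```

Under this reading, halving the bit width doubles the statistic for the same token budget, which is consistent with the claim that lower-precision runs reach the regime of larger optimal QAT fractions sooner.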

Comprehensive loss scaling law for QAT

The authors propose a loss scaling law that models final expected loss as a function of model parameter count, token counts for full-precision and QAT phases, and QAT bit width. This law captures the optimal QAT fraction phenomenon and enables predictions about which QAT bit width is optimal under memory constraints and how QAT accuracy compares to full-precision models.

10 retrieved papers (1 refutable)

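A loss law of this kind can be illustrated with a Chinchilla-style parametric form plus a quantization penalty. The functional form and every constant below are assumptions for illustration, not the paper's fitted law; the point is that such a law lets one scan QAT/FP splits under a fixed token budget:

```python
def qat_loss_sketch(N, D_fp, D_qat, bits,
                    E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28,
                    eta=0.7, C=0.5, gamma=0.1, delta=2.0):
    """Illustrative loss law over parameter count N, full-precision tokens
    D_fp, QAT tokens D_qat, and QAT bit width. QAT tokens are assumed less
    effective for the data term (factor eta), while a quantization penalty
    shrinks with bit width and with QAT tokens per parameter. All constants
    are made up for illustration."""
    d_eff = D_fp + eta * D_qat
    base = E + A / N**alpha + B / d_eff**beta
    penalty = (C / bits**delta) / (1.0 + D_qat / N)**gamma
    return base + penalty

# Scan QAT fractions for a fixed 50B-token budget on a 1B-parameter model.
D, N = 50e9, 1e9
fractions = [i / 100 for i in range(1, 100)]
best = min(fractions, key=lambda f: qat_loss_sketch(N, D * (1 - f), D * f, bits=4))
```

With this shape, the optimum is interior: spending no tokens on QAT leaves the full quantization penalty, while spending nearly all of them on QAT starves the more effective full-precision phase.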
QAT and learning rate cooldown fusion technique

The authors introduce a training scheme where learning rate decay is performed jointly with quantization-aware training rather than separately. This eliminates redundant full-precision updates during cooldown and achieves better accuracy for the same token count, suggesting that modifications to standard QAT pipelines can improve efficiency.

7 retrieved papers
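The fusion idea can be sketched as a schedule in which the learning-rate cooldown coincides exactly with the QAT phase, so no full-precision tokens are spent on decay. The schedule shape below (constant, then linear to zero) is an assumption for illustration, not the authors' exact recipe:

```python
def lr_fused(t: float, total_tokens: float, qat_fraction: float,
             peak_lr: float) -> float:
    """Learning rate at token t when the cooldown is fused with QAT:
    the full-precision phase runs at peak LR, and the entire linear
    decay happens inside the QAT phase (assumed schedule shape)."""
    switch = total_tokens * (1.0 - qat_fraction)   # QAT and decay start here
    if t < switch:
        return peak_lr                              # FP phase: constant LR
    return peak_lr * (total_tokens - t) / (total_tokens - switch)

# With a 20% QAT fraction, decay begins at the 80% mark and reaches
# zero at the end of training; there is no separate FP cooldown segment.
```

In the baseline pipeline, the cooldown runs in full precision before QAT begins, so those decay tokens update a model that will be quantized anyway; fusing the two phases spends them on quantization-aware updates instead.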

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Compute-dependent optimal QAT fraction discovery

Contribution

Comprehensive loss scaling law for QAT

Contribution

QAT and learning rate cooldown fusion technique