Bridging the Gap Between Promise and Performance for FP4 Quantization

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: efficiency, quantization, large language models
Abstract:

Recent hardware-accelerated microscaling 4-bit floating-point formats such as MXFP4 and NVFP4, supported on NVIDIA and AMD GPUs, promise to revolutionize large language model (LLM) inference. Yet their practical benefits remain unproven. We present the first comprehensive study of MXFP4 and NVFP4 for post-training quantization, revealing gaps between their promise and real-world performance. Our analysis shows that state-of-the-art methods struggle with FP4 due to two key issues: (1) NVFP4's small group size *provably* neutralizes traditional outlier-mitigation techniques; (2) MXFP4's power-of-two scale quantization severely degrades accuracy due to high induced error. To bridge this gap, we introduce Micro-Rotated-GPTQ (MR-GPTQ), a variant of the classic GPTQ quantization algorithm that tailors the quantization process to FP4's unique properties via block-wise Hadamard transforms and format-specific optimizations. We support our proposal with a set of high-performance GPU kernels that enable the MR-GPTQ format with negligible overhead, by fusing the rotations into the weights and computing the activation transforms quickly online. This yields speedups over FP16 of up to 3.6x layer-wise and 2.2x end-to-end on NVIDIA B200, and 6x layer-wise and 4x end-to-end on RTX5090. Our extensive empirical evaluation demonstrates that MR-GPTQ matches or outperforms state-of-the-art accuracy, significantly boosting MXFP4 to the point where it nears that of NVFP4. We conclude that, while FP4 is not an automatic upgrade over INT4, format-specialized methods like MR-GPTQ can unlock a new frontier of accuracy-performance trade-offs.
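To make the two scaling styles named in the abstract concrete, the NumPy sketch below quantizes one group of values onto the FP4 (E2M1) value grid with a shared scale that is either restricted to a power of two (MXFP4-style E8M0 scales) or left unrestricted (a rough stand-in for NVFP4's finer FP8 scales). The grid, group size, and scale-rounding rule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Representable magnitudes of the FP4 E2M1 format (plus a sign bit).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 6.0])

def quantize_fp4_group(x, power_of_two_scale):
    """Quantize one group of values to E2M1 with a shared scale.

    power_of_two_scale=True mimics MXFP4's E8M0 scales; False mimics
    NVFP4's finer-grained FP8 scales, approximated here by an
    unrestricted float scale (an assumption for illustration).
    """
    amax = np.abs(x).max()
    scale = amax / FP4_GRID[-1] if amax > 0 else 1.0
    if power_of_two_scale:
        # Round the scale up to the nearest power of two to avoid clipping.
        scale = 2.0 ** np.ceil(np.log2(scale))
    # Round each scaled magnitude to the nearest representable FP4 value.
    mags = np.abs(x) / scale
    idx = np.abs(mags[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    return np.sign(x) * FP4_GRID[idx] * scale

rng = np.random.default_rng(0)
x = rng.standard_normal(32)
err_mx = np.mean((x - quantize_fp4_group(x, True)) ** 2)
err_nv = np.mean((x - quantize_fp4_group(x, False)) ** 2)
print(f"MSE with power-of-two scale : {err_mx:.5f}")
print(f"MSE with unrestricted scale : {err_nv:.5f}")
```

Restricting the scale to a power of two typically coarsens the effective grid, which is the "high induced error" mechanism the abstract attributes to MXFP4.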

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task taxonomy papers: 49
Claimed contributions: 3
Contribution candidate papers compared: 14
Refutable papers: 0

Research Landscape Overview

Core task: 4-bit floating-point quantization for large language model inference. The field has organized itself around several complementary directions. Post-Training Quantization Methods form the largest branch, encompassing techniques that compress pre-trained models without retraining—ranging from weight-only schemes to joint weight-activation quantization, with a notable sub-cluster dedicated to floating-point format designs. Training-Integrated Quantization explores how quantization-aware training and parameter-efficient fine-tuning (e.g., QLoRA[10]) can preserve or even enhance model quality at ultra-low precision. Hardware-Software Co-Design addresses the end-to-end system challenge, optimizing kernels and memory hierarchies to realize the latency and throughput gains promised by reduced bit-widths. Specialized Quantization Applications target domains such as long-context inference and trustworthiness, while Evaluation and Analysis works provide benchmarking frameworks and comparative studies that help the community understand trade-offs across methods.

Within the post-training landscape, a particularly active line of work focuses on floating-point format quantization for weights and activations. Papers such as LLM-FP4[25], ZeroQuant-FP[36], and Integer or Floating[22] investigate whether floating-point representations better capture the dynamic range of LLM parameters than integer formats, often concluding that careful exponent and mantissa allocation can reduce accuracy loss. FP4 Promise Performance[0] sits squarely in this cluster, emphasizing the practical viability of 4-bit floating-point schemes and their interplay with block-level scaling strategies. Nearby works like Lossless FP4[6] and Microscaling FP4 Gap[48] explore lossless compression paths and the remaining performance gaps under microscaling formats, highlighting that while FP4 shows promise, bridging the gap to full-precision accuracy remains an open question.
Across these studies, the central tension is balancing representational flexibility against hardware efficiency, with ongoing debate over optimal exponent-mantissa splits and the role of fine-grained versus coarse-grained quantization granularities.
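The fine- versus coarse-granularity tension mentioned above can be illustrated with a toy experiment: a hypothetical 4-bit absmax integer quantizer (not any of the surveyed methods) applied at several group sizes to a tensor with occasional outliers, the regime typical of LLM activations.

```python
import numpy as np

def groupwise_absmax_error(x, group):
    """Mean squared error of toy 4-bit absmax quantization at a group size."""
    g = x.reshape(-1, group)
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0  # int4 range [-7, 7]
    scale[scale == 0] = 1.0
    q = np.clip(np.round(g / scale), -7, 7) * scale
    return float(np.mean((g - q) ** 2))

rng = np.random.default_rng(3)
x = rng.standard_normal(4096)
x[::512] *= 20.0  # sprinkle in outliers, as observed in LLM activations
for group in (16, 32, 128, 1024):
    print(f"group size {group:4d}: MSE {groupwise_absmax_error(x, group):.5f}")
```

Finer groups confine each outlier's inflated scale to its own small neighborhood, so error drops as granularity shrinks; the price is more scale metadata and more hardware complexity, which is exactly the trade-off microscaling formats negotiate.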

Claimed Contributions

Comprehensive study of MXFP4 and NVFP4 quantization formats

The authors conduct the first thorough empirical and analytical investigation of the recently introduced MXFP4 and NVFP4 microscaling 4-bit floating-point formats for LLM quantization, identifying key limitations of existing methods when applied to these formats and documenting the gap between theoretical promise and practical accuracy.

10 retrieved papers
Micro-Rotated-GPTQ (MR-GPTQ) quantization algorithm

The authors propose MR-GPTQ, a novel variant of the GPTQ algorithm specifically designed for FP4 formats. It incorporates block-wise Hadamard transforms, format-specific scale optimizations, and static activation reordering to maximize accuracy for both MXFP4 and NVFP4.

0 retrieved papers
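The block-wise Hadamard transform claimed above can be sketched as follows: an orthonormal Hadamard rotation applied independently to each contiguous group of weights, which spreads outlier energy across the group before quantization. The group size and Sylvester construction here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an orthonormal n x n Hadamard matrix
    (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # H @ H.T == I, and H is symmetric

def blockwise_hadamard(w, block=16):
    """Rotate each contiguous group of `block` weights independently.

    Rotating per group (rather than per whole row) keeps the transform
    aligned with micro-scaled formats, whose scales are also per group.
    """
    H = hadamard(block)
    return (w.reshape(-1, block) @ H).reshape(w.shape)

rng = np.random.default_rng(1)
w = rng.standard_normal(64)
w[3] = 50.0  # inject a single large outlier
w_rot = blockwise_hadamard(w, block=16)
print("max |w|     :", np.abs(w).max())
print("max |w_rot| :", np.abs(w_rot).max())  # outlier energy spread over its group
```

Because the Sylvester matrix is symmetric and orthonormal, applying the same transform twice recovers the original weights, and the Euclidean norm is preserved; only the per-group dynamic range changes.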
QuTLASS GPU kernel library for efficient FP4 inference

The authors develop QuTLASS, a high-performance GPU kernel library optimized for NVIDIA Blackwell architecture that implements MR-GPTQ with minimal runtime overhead. The kernels fuse rotations into weights and enable fast online activation computation, achieving near-ideal speedups on B200 and RTX5090 GPUs.

4 retrieved papers
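The algebra behind the rotation-fusion claim above can be sketched directly. Assuming a linear layer y = x Wᵀ and an orthonormal block-diagonal rotation R (one Hadamard block per micro-scaling group; all names here are illustrative), fusing R into the weights offline and rotating activations online leaves the layer output unchanged:

```python
import numpy as np

def hadamard(n):
    """Orthonormal Sylvester Hadamard matrix (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

d_in, d_out, block = 64, 8, 16
rng = np.random.default_rng(2)
W = rng.standard_normal((d_out, d_in))
x = rng.standard_normal((4, d_in))

# Block-diagonal rotation: one Hadamard block per micro-scaling group.
R = np.kron(np.eye(d_in // block), hadamard(block))

W_fused = W @ R        # computed once, offline; this is what gets quantized
x_rot = x @ R          # cheap online transform of the activations

y_ref = x @ W.T
y_rot = x_rot @ W_fused.T   # == x R R^T W^T == x W^T, since R R^T == I
print("max deviation:", np.abs(y_ref - y_rot).max())
```

This identity is why the rotation adds negligible runtime cost: the weight-side rotation disappears into the stored (quantized) weights, leaving only the small block-diagonal activation transform to compute online.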

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Comprehensive study of MXFP4 and NVFP4 quantization formats

The authors conduct the first thorough empirical and analytical investigation of the recently introduced MXFP4 and NVFP4 microscaling 4-bit floating-point formats for LLM quantization, identifying key limitations of existing methods when applied to these formats and documenting the gap between theoretical promise and practical accuracy.

Contribution

Micro-Rotated-GPTQ (MR-GPTQ) quantization algorithm

The authors propose MR-GPTQ, a novel variant of the GPTQ algorithm specifically designed for FP4 formats. It incorporates block-wise Hadamard transforms, format-specific scale optimizations, and static activation reordering to maximize accuracy for both MXFP4 and NVFP4.

Contribution

QuTLASS GPU kernel library for efficient FP4 inference

The authors develop QuTLASS, a high-performance GPU kernel library optimized for NVIDIA Blackwell architecture that implements MR-GPTQ with minimal runtime overhead. The kernels fuse rotations into weights and enable fast online activation computation, achieving near-ideal speedups on B200 and RTX5090 GPUs.