Bridging the Gap Between Promise and Performance for FP4 Quantization

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: efficiency, quantization, large language models
Abstract:

Recent hardware-accelerated microscaling 4-bit floating-point formats such as MXFP4 and NVFP4, supported on NVIDIA and AMD GPUs, promise to revolutionize large language model (LLM) inference. Yet their practical benefits remain unproven. We present the first comprehensive study of MXFP4 and NVFP4 for post-training quantization, revealing gaps between their promise and real-world performance. Our analysis shows that state-of-the-art methods struggle with FP4 due to two key issues: (1) NVFP4's small group size *provably* neutralizes traditional outlier-mitigation techniques; (2) MXFP4's power-of-two scale quantization severely degrades accuracy due to high induced error. To bridge this gap, we introduce Micro-Rotated-GPTQ (MR-GPTQ), a variant of the classic GPTQ quantization algorithm that tailors the quantization process to FP4's unique properties via block-wise Hadamard transforms and format-specific optimizations. We support our proposal with a set of high-performance GPU kernels that enable the MR-GPTQ format with negligible overhead, by fusing the rotations into the weights and computing the activation transforms quickly online. This yields speedups over FP16 of up to 3.6x layer-wise and 2.2x end-to-end on NVIDIA B200, and 6x layer-wise and 4x end-to-end on RTX5090. Our extensive empirical evaluation demonstrates that MR-GPTQ matches or outperforms state-of-the-art accuracy, significantly boosting MXFP4 to the point where it nears that of NVFP4. We conclude that, while FP4 is not an automatic upgrade over INT4, format-specialized methods like MR-GPTQ can unlock a new frontier of accuracy-performance trade-offs.
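To make the two scaling styles named in the abstract concrete, the NumPy sketch below quantizes one group of values onto the FP4 (E2M1) value grid with a shared scale that is either restricted to a power of two (MXFP4-style E8M0 scales) or left unrestricted (a rough stand-in for NVFP4's finer FP8 scales). The grid, group size, and scale-rounding rule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Representable magnitudes of the FP4 E2M1 format (plus a sign bit).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 6.0])

def quantize_fp4_group(x, power_of_two_scale):
    """Quantize one group of values to E2M1 with a shared scale.

    power_of_two_scale=True mimics MXFP4's E8M0 scales; False mimics
    NVFP4's finer-grained FP8 scales, approximated here by an
    unrestricted float scale (an assumption for illustration).
    """
    amax = np.abs(x).max()
    scale = amax / FP4_GRID[-1] if amax > 0 else 1.0
    if power_of_two_scale:
        # Round the scale up to the nearest power of two to avoid clipping.
        scale = 2.0 ** np.ceil(np.log2(scale))
    # Round each scaled magnitude to the nearest representable FP4 value.
    mags = np.abs(x) / scale
    idx = np.abs(mags[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    return np.sign(x) * FP4_GRID[idx] * scale

rng = np.random.default_rng(0)
x = rng.standard_normal(32)
err_mx = np.mean((x - quantize_fp4_group(x, True)) ** 2)
err_nv = np.mean((x - quantize_fp4_group(x, False)) ** 2)
print(f"MSE with power-of-two scale : {err_mx:.5f}")
print(f"MSE with unrestricted scale : {err_nv:.5f}")
```

Restricting the scale to a power of two typically coarsens the effective grid, which is the "high induced error" mechanism the abstract attributes to MXFP4.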

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task taxonomy papers: 49
Claimed contributions: 3
Contribution candidate papers compared: 14
Refutable papers: 0

Research Landscape Overview

Core task: 4-bit floating-point quantization for large language model inference. The field has organized itself around several complementary directions. Post-Training Quantization Methods form the largest branch, encompassing techniques that compress pre-trained models without retraining—ranging from weight-only schemes to joint weight-activation quantization, with a notable sub-cluster dedicated to floating-point format designs. Training-Integrated Quantization explores how quantization-aware training and parameter-efficient fine-tuning (e.g., QLoRA[10]) can preserve or even enhance model quality at ultra-low precision. Hardware-Software Co-Design addresses the end-to-end system challenge, optimizing kernels and memory hierarchies to realize the latency and throughput gains promised by reduced bit-widths. Specialized Quantization Applications target domains such as long-context inference and trustworthiness, while Evaluation and Analysis works provide benchmarking frameworks and comparative studies that help the community understand trade-offs across methods.

Within the post-training landscape, a particularly active line of work focuses on floating-point format quantization for weights and activations. Papers such as LLM-FP4[25], ZeroQuant-FP[36], and Integer or Floating[22] investigate whether floating-point representations better capture the dynamic range of LLM parameters than integer formats, often concluding that careful exponent and mantissa allocation can reduce accuracy loss. FP4 Promise Performance[0] sits squarely in this cluster, emphasizing the practical viability of 4-bit floating-point schemes and their interplay with block-level scaling strategies. Nearby works like Lossless FP4[6] and Microscaling FP4 Gap[48] explore lossless compression paths and the remaining performance gaps under microscaling formats, highlighting that while FP4 shows promise, bridging the gap to full-precision accuracy remains an open question.
Across these studies, the central tension is balancing representational flexibility against hardware efficiency, with ongoing debate over optimal exponent-mantissa splits and the role of fine-grained versus coarse-grained quantization granularities.
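The fine- versus coarse-granularity tension mentioned above can be illustrated with a toy experiment: a hypothetical 4-bit absmax integer quantizer (not any of the surveyed methods) applied at several group sizes to a tensor with occasional outliers, the regime typical of LLM activations.

```python
import numpy as np

def groupwise_absmax_error(x, group):
    """Mean squared error of toy 4-bit absmax quantization at a group size."""
    g = x.reshape(-1, group)
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0  # int4 range [-7, 7]
    scale[scale == 0] = 1.0
    q = np.clip(np.round(g / scale), -7, 7) * scale
    return float(np.mean((g - q) ** 2))

rng = np.random.default_rng(3)
x = rng.standard_normal(4096)
x[::512] *= 20.0  # sprinkle in outliers, as observed in LLM activations
for group in (16, 32, 128, 1024):
    print(f"group size {group:4d}: MSE {groupwise_absmax_error(x, group):.5f}")
```

Finer groups confine each outlier's inflated scale to its own small neighborhood, so error drops as granularity shrinks; the price is more scale metadata and more hardware complexity, which is exactly the trade-off microscaling formats negotiate.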

Claimed Contributions

Comprehensive study of MXFP4 and NVFP4 quantization formats

The authors conduct the first thorough empirical and analytical investigation of the recently introduced MXFP4 and NVFP4 microscaling 4-bit floating-point formats for LLM quantization, identifying key limitations of existing methods when applied to these formats and documenting the gap between theoretical promise and practical accuracy.

10 retrieved papers
Micro-Rotated-GPTQ (MR-GPTQ) quantization algorithm

The authors propose MR-GPTQ, a novel variant of the GPTQ algorithm specifically designed for FP4 formats. It incorporates block-wise Hadamard transforms, format-specific scale optimizations, and static activation reordering to maximize accuracy for both MXFP4 and NVFP4.

0 retrieved papers
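The block-wise Hadamard transform claimed above can be sketched as follows: an orthonormal Hadamard rotation applied independently to each contiguous group of weights, which spreads outlier energy across the group before quantization. The group size and Sylvester construction here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an orthonormal n x n Hadamard matrix
    (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # H @ H.T == I, and H is symmetric

def blockwise_hadamard(w, block=16):
    """Rotate each contiguous group of `block` weights independently.

    Rotating per group (rather than per whole row) keeps the transform
    aligned with micro-scaled formats, whose scales are also per group.
    """
    H = hadamard(block)
    return (w.reshape(-1, block) @ H).reshape(w.shape)

rng = np.random.default_rng(1)
w = rng.standard_normal(64)
w[3] = 50.0  # inject a single large outlier
w_rot = blockwise_hadamard(w, block=16)
print("max |w|     :", np.abs(w).max())
print("max |w_rot| :", np.abs(w_rot).max())  # outlier energy spread over its group
```

Because the Sylvester matrix is symmetric and orthonormal, applying the same transform twice recovers the original weights, and the Euclidean norm is preserved; only the per-group dynamic range changes.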
QuTLASS GPU kernel library for efficient FP4 inference

The authors develop QuTLASS, a high-performance GPU kernel library optimized for NVIDIA Blackwell architecture that implements MR-GPTQ with minimal runtime overhead. The kernels fuse rotations into weights and enable fast online activation computation, achieving near-ideal speedups on B200 and RTX5090 GPUs.

4 retrieved papers
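The algebra behind the rotation-fusion claim above can be sketched directly. Assuming a linear layer y = x Wᵀ and an orthonormal block-diagonal rotation R (one Hadamard block per micro-scaling group; all names here are illustrative), fusing R into the weights offline and rotating activations online leaves the layer output unchanged:

```python
import numpy as np

def hadamard(n):
    """Orthonormal Sylvester Hadamard matrix (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

d_in, d_out, block = 64, 8, 16
rng = np.random.default_rng(2)
W = rng.standard_normal((d_out, d_in))
x = rng.standard_normal((4, d_in))

# Block-diagonal rotation: one Hadamard block per micro-scaling group.
R = np.kron(np.eye(d_in // block), hadamard(block))

W_fused = W @ R        # computed once, offline; this is what gets quantized
x_rot = x @ R          # cheap online transform of the activations

y_ref = x @ W.T
y_rot = x_rot @ W_fused.T   # == x R R^T W^T == x W^T, since R R^T == I
print("max deviation:", np.abs(y_ref - y_rot).max())
```

This identity is why the rotation adds negligible runtime cost: the weight-side rotation disappears into the stored (quantized) weights, leaving only the small block-diagonal activation transform to compute online.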

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Comprehensive study of MXFP4 and NVFP4 quantization formats

The authors conduct the first thorough empirical and analytical investigation of the recently introduced MXFP4 and NVFP4 microscaling 4-bit floating-point formats for LLM quantization, identifying key limitations of existing methods when applied to these formats and documenting the gap between theoretical promise and practical accuracy.

Contribution

Micro-Rotated-GPTQ (MR-GPTQ) quantization algorithm

The authors propose MR-GPTQ, a novel variant of the GPTQ algorithm specifically designed for FP4 formats. It incorporates block-wise Hadamard transforms, format-specific scale optimizations, and static activation reordering to maximize accuracy for both MXFP4 and NVFP4.

Contribution

QuTLASS GPU kernel library for efficient FP4 inference

The authors develop QuTLASS, a high-performance GPU kernel library optimized for NVIDIA Blackwell architecture that implements MR-GPTQ with minimal runtime overhead. The kernels fuse rotations into weights and enable fast online activation computation, achieving near-ideal speedups on B200 and RTX5090 GPUs.