Bridging the Gap Between Promise and Performance for FP4 Quantization
Research Landscape Overview
Claimed Contributions
The authors conduct the first thorough empirical and analytical investigation of the recently introduced MXFP4 and NVFP4 microscaling 4-bit floating-point formats for LLM quantization, identifying key limitations of existing methods when applied to these formats and documenting the gap between theoretical promise and practical accuracy.
The authors propose MR-GPTQ, a novel variant of the GPTQ algorithm specifically designed for FP4 formats. It incorporates block-wise Hadamard transforms, format-specific scale optimizations, and static activation reordering to maximize accuracy for both MXFP4 and NVFP4.
The authors develop QuTLASS, a high-performance GPU kernel library for the NVIDIA Blackwell architecture that implements MR-GPTQ with minimal runtime overhead. The kernels fuse the rotations into the weights and apply the activation-side transform online at low cost, achieving near-ideal speedups on B200 and RTX 5090 GPUs.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[6] FP4-Quantization: Lossless 4bit Quantization for Large Language Models
[22] Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models
[25] LLM-FP4: 4-Bit Floating-Point Quantized Transformers
[36] ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats
Contribution Analysis
Detailed comparisons for each claimed contribution
Comprehensive study of MXFP4 and NVFP4 quantization formats
The authors conduct the first thorough empirical and analytical investigation of the recently introduced MXFP4 and NVFP4 microscaling 4-bit floating-point formats for LLM quantization, identifying key limitations of existing methods when applied to these formats and documenting the gap between theoretical promise and practical accuracy.
[1] OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models
[22] Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models
[50] LLM-QAT: Data-Free Quantization Aware Training for Large Language Models
[51] QuIP: 2-Bit Quantization of Large Language Models With Guarantees
[52] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
[53] Mixed-Precision Graph Neural Quantization for Low Bit Large Language Models
[54] Evaluating Quantized Large Language Models
[55] RPTQ: Reorder-based Post-training Quantization for Large Language Models
[56] QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models
[57] OutlierTune: Efficient Channel-Wise Quantization for Large Language Models
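The two formats studied above differ mainly in block size and scale datatype: MXFP4 shares one power-of-two (E8M0) scale across 32 elements, while NVFP4 shares a finer-grained FP8 (E4M3) scale across 16. A minimal NumPy sketch of the round-trip (fake) quantization illustrates the microscaling idea; the scale rule is simplified, and the E4M3 rounding of NVFP4 scales (plus its second per-tensor scale) is omitted:

```python
import numpy as np

# Magnitudes representable by FP4 E2M1 (sign is a separate bit).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fp4_round(x):
    """Round each element to the nearest E2M1 value (fake quantization)."""
    mag = np.abs(x)
    idx = np.argmin(np.abs(mag[:, None] - FP4_GRID[None, :]), axis=1)
    return np.sign(x) * FP4_GRID[idx]

def microscale_quantize(x, block_size, pow2_scale):
    """Quantize-dequantize with one shared scale per contiguous block.

    pow2_scale=True mimics MXFP4's E8M0 (power-of-two) block scale;
    pow2_scale=False mimics NVFP4's higher-precision block scale.
    """
    out = np.empty_like(x)
    for i in range(0, x.size, block_size):
        blk = x[i:i + block_size]
        amax = np.abs(blk).max()
        if amax == 0:
            out[i:i + block_size] = 0.0
            continue
        scale = amax / FP4_GRID[-1]                 # map the block max to 6.0
        if pow2_scale:
            scale = 2.0 ** np.ceil(np.log2(scale))  # round up to a power of two
        out[i:i + block_size] = fp4_round(blk / scale) * scale
    return out

mxfp4 = lambda x: microscale_quantize(x, block_size=32, pow2_scale=True)
nvfp4 = lambda x: microscale_quantize(x, block_size=16, pow2_scale=False)
```

With ceil rounding, no scaled value exceeds 6.0, the largest E2M1 magnitude; the power-of-two constraint is what makes MXFP4 block scales coarser, and typically less accurate, than NVFP4's.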
Micro-Rotated-GPTQ (MR-GPTQ) quantization algorithm
The authors propose MR-GPTQ, a novel variant of the GPTQ algorithm specifically designed for FP4 formats. It incorporates block-wise Hadamard transforms, format-specific scale optimizations, and static activation reordering to maximize accuracy for both MXFP4 and NVFP4.
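The role of the block-wise Hadamard transform can be illustrated in a few lines of NumPy. An orthonormal Hadamard matrix H satisfies H Hᵀ = I, so rotating each block of weight columns by H while rotating the matching activation block by Hᵀ leaves the layer output unchanged, while spreading outliers across the block before quantization. This is an illustrative sketch of the idea, not the paper's implementation:

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix via Sylvester's construction (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def rotate_columns_blockwise(W, block_size=32):
    """Fold a Hadamard rotation into the weights, one block of columns at a time."""
    H = hadamard(block_size)
    out = W.copy()
    for j in range(0, W.shape[1], block_size):
        out[:, j:j + block_size] = W[:, j:j + block_size] @ H
    return out

def rotate_activations_blockwise(x, block_size=32):
    """Apply H.T to each activation block, so that (W H)(H.T x) = W x."""
    H = hadamard(block_size)
    out = x.copy()
    for j in range(0, x.size, block_size):
        out[j:j + block_size] = H.T @ x[j:j + block_size]
    return out
```

Because (W H)(Hᵀ x) = W x exactly, the weight-side rotation can be applied once offline; only the activation-side transform must run at inference time.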
QuTLASS GPU kernel library for efficient FP4 inference
The authors develop QuTLASS, a high-performance GPU kernel library for the NVIDIA Blackwell architecture that implements MR-GPTQ with minimal runtime overhead. The kernels fuse the rotations into the weights and apply the activation-side transform online at low cost, achieving near-ideal speedups on B200 and RTX 5090 GPUs.
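Fast online activation computation is feasible because a Hadamard rotation does not require a dense matrix multiply: the fast Walsh-Hadamard transform applies it in O(n log n) additions. A NumPy sketch of the textbook butterfly algorithm (illustrative only; the actual QuTLASS kernels are fused CUDA implementations for Blackwell):

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform: equivalent to multiplying by the
    orthonormal Sylvester Hadamard matrix, but in O(n log n) additions.
    len(x) must be a power of two."""
    y = np.asarray(x, dtype=float).copy()
    n = y.size
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):           # one butterfly stage
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    return y / np.sqrt(n)                      # orthonormal scaling
```

Per 32-element block the transform costs 32 · log2(32) = 160 additions, cheap enough to apply to activations on the fly before they are quantized to FP4.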