ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: quantization, large language models, model compression
Abstract:

Weight-only post-training quantization (PTQ) compresses the weights of Large Language Models (LLMs) into low-precision representations to reduce memory footprint and accelerate inference. However, the presence of outliers in weights and activations often leads to large quantization errors and severe accuracy degradation, especially in recent reasoning LLMs where errors accumulate across long chains of thought. Existing PTQ methods either fail to sufficiently suppress outliers or introduce significant overhead during inference. In this paper, we propose Pairwise Rotation Quantization (ParoQuant), a weight-only PTQ method that combines hardware-efficient and optimizable independent Givens rotations with channel-wise scaling to even out the magnitude across channels and narrow the dynamic range within each quantization group. We also co-design the inference kernel to fully exploit GPU parallelism and keep the rotations and scaling lightweight at runtime. ParoQuant achieves an average 2.4% accuracy improvement over AWQ on reasoning tasks with less than 10% overhead. This paves the way for more efficient and accurate deployment of reasoning LLMs.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

ParoQuant proposes a weight-only post-training quantization method combining independent Givens rotations with channel-wise scaling to equalize weight distributions before quantization. The paper resides in the 'Rotation-Based Distribution Equalization' leaf, which contains only two papers: ParoQuant itself and one sibling (RPTQ). This is a relatively sparse research direction within the broader taxonomy of fifty papers across thirty-six topics, suggesting that rotation-based approaches constitute a focused but less crowded area compared to outlier-aware methods or optimization-based techniques.

The rotation-based leaf sits within the 'Distribution Transformation and Smoothing Techniques' branch, which also includes activation smoothing methods like SmoothQuant and channel reordering approaches. Neighboring branches pursue different strategies: outlier-aware methods explicitly preserve salient weights through mixed-precision or masking, while optimization-based approaches learn quantization parameters through reconstruction objectives. ParoQuant's rotation paradigm diverges from these by globally reshaping distributions rather than selectively protecting outliers or iteratively optimizing parameters, positioning it as a complementary approach to salience-driven techniques like AWQ.

Among the twenty-eight candidates examined through semantic search and citation expansion, the core ParoQuant method shows overlap with two prior works; for the scaled pairwise rotation transform and the co-designed inference kernel, ten and nine candidates respectively were examined, with no clear refutations found. The method-level analysis suggests that while rotation-based quantization has precedent within the limited search scope, the specific combination of pairwise Givens rotations with hardware-efficient kernel design may represent a less-explored configuration. The statistics indicate moderate prior-work density for the overall approach but sparser coverage for the implementation-focused contributions.

Based on the limited search scope of top-thirty semantic matches, ParoQuant appears to occupy a moderately novel position within rotation-based quantization, though the small size of this research direction makes comprehensive novelty assessment challenging. The analysis covers rotation-based and distribution transformation methods but may not capture all relevant work in hardware-efficient quantization or kernel optimization, which could provide additional context for the inference kernel contribution.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 2

Research Landscape Overview

Core task: weight-only post-training quantization for large language models. The field has organized itself around several complementary strategies for compressing LLM weights without retraining. Outlier-aware and salience-based methods such as AWQ[5] and OWQ[4] identify and protect critical weights that disproportionately affect model accuracy. Distribution transformation techniques like SmoothQuant[9] and rotation-based approaches including RPTQ[10] reshape weight distributions to make them more amenable to low-bit representation. Optimization-based methods learn quantization parameters through careful calibration, while extreme low-bit quantization pushes toward ternary or binary weights as seen in BiLLM[8] and OneBit[7]. Integer-only and hardware-efficient designs target deployment constraints, mixed-precision strategies allocate bits adaptively across layers, and unified frameworks integrate quantization with other compression techniques. Empirical studies such as Benchmarking PTQ[20] and Evaluating Quantized LLMs[15] systematically compare these diverse approaches.

A central tension runs between methods that rely on careful weight selection and those that transform the entire distribution. Rotation-based equalization methods like RPTQ[10] and DAQ[44] apply learned or fixed rotations to homogenize weight magnitudes before quantization, reducing the burden on per-channel scaling. ParoQuant[0] sits squarely in this rotation-based cluster, emphasizing distribution equalization through orthogonal transformations. Compared to salience-driven approaches like AWQ[5] that preserve outliers through non-uniform scaling, ParoQuant[0] and its rotation-based neighbors pursue a more global reshaping strategy. This contrasts with optimization-heavy methods such as OmniQuant[2] that jointly tune multiple quantization parameters, and with extreme quantization works like BiLLM[8] that accept higher approximation error in exchange for maximal compression. The rotation paradigm offers a middle ground: it avoids expensive per-weight decisions while achieving better uniformity than naive round-to-nearest schemes.

Claimed Contributions

Pairwise Rotation Quantization (ParoQuant) method

ParoQuant is a weight-only post-training quantization method that uses independent Givens rotations combined with channel-wise scaling to suppress outliers in weights. This transform evens out magnitude across channels and narrows the dynamic range within quantization groups, making weights more quantization-friendly.

9 retrieved papers
Can Refute
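To make the claimed mechanism concrete, the following is a minimal NumPy sketch of a Givens-rotation-plus-scaling transform applied to a weight matrix. The pair layout, angles, and scales here are hypothetical illustrations, not the paper's learned parameters; the point is only that rotating an outlier channel into its partner and then rescaling each channel narrows the magnitude spread across channels.

```python
import numpy as np

def pairwise_givens_scale(W, pairs, thetas, scales):
    """Apply independent Givens rotations to disjoint column pairs of W,
    then multiply each column by a per-channel scale. Illustrative only."""
    W = W.copy()
    for (i, j), t in zip(pairs, thetas):
        c, s = np.cos(t), np.sin(t)
        wi, wj = W[:, i].copy(), W[:, j].copy()
        W[:, i] = c * wi - s * wj   # 2x2 rotation mixes the two channels,
        W[:, j] = s * wi + c * wj   # spreading an outlier over the pair
    return W * scales               # channel-wise scaling

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 8))
W[:, 3] *= 25.0                               # inject an outlier channel
pairs = [(0, 1), (2, 3), (4, 5), (6, 7)]      # disjoint pairs: no dependencies
thetas = np.full(len(pairs), np.pi / 4)       # hypothetical fixed 45-degree angles

rot = pairwise_givens_scale(W, pairs, thetas, np.ones(8))
scales = 1.0 / np.abs(rot).max(axis=0)        # equalize per-channel magnitude
W_t = pairwise_givens_scale(W, pairs, thetas, scales)

# Ratio of largest to smallest per-channel max magnitude
spread = lambda M: np.abs(M).max(axis=0).max() / np.abs(M).max(axis=0).min()
print(f"channel magnitude spread before: {spread(W):.1f}, after: {spread(W_t):.1f}")
```

Because the pairs are disjoint, the rotations commute and can all be applied in parallel, which is what makes this transform cheap relative to dense rotation matrices.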
Scaled pairwise rotation transform

The scaled pairwise rotation is a novel transform that applies a series of independent Givens rotations (pairwise rotations with no dependencies) combined with channel-wise scaling. This design enables effective outlier suppression while maintaining computational efficiency through GPU parallelism.

10 retrieved papers
Co-designed inference kernel for efficient transform computation

The authors developed a specialized CUDA kernel that exploits three levels of parallelism (token, channel group, and pair) to efficiently compute the scaled pairwise rotation transform during inference. This system-level design ensures minimal overhead while maintaining quantization accuracy.

9 retrieved papers
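The CUDA kernel itself is not reproduced in this report. As a rough stand-in, the NumPy sketch below shows why the transform parallelizes so well: every (token, pair) element can be computed independently, which mirrors the token- and pair-level parallelism the kernel is said to exploit. Applying the transform to activations at runtime, and the ordering of scaling before rotation, are assumptions of this sketch, not details from the paper.

```python
import numpy as np

def apply_pairwise_transform(X, pairs, cos_t, sin_t, scales):
    """Batched scaled pairwise rotation over activations X (tokens x channels).
    Vectorization over tokens and pairs stands in for CUDA thread parallelism."""
    X = X * scales                        # channel-wise scaling (assumed order)
    i, j = pairs[:, 0], pairs[:, 1]
    xi, xj = X[:, i], X[:, j]             # gather both halves of every pair
    out = X.copy()
    out[:, i] = cos_t * xi - sin_t * xj   # all tokens x all pairs at once
    out[:, j] = sin_t * xi + cos_t * xj
    return out

tokens, channels = 4, 8
rng = np.random.default_rng(1)
X = rng.normal(size=(tokens, channels))
pairs = np.array([(0, 1), (2, 3), (4, 5), (6, 7)])
theta = rng.uniform(0.0, np.pi, size=len(pairs))
Y = apply_pairwise_transform(X, pairs, np.cos(theta), np.sin(theta),
                             np.ones(channels))

# With unit scales the transform is orthogonal, so per-token norms survive:
print(np.allclose(np.linalg.norm(Y, axis=1), np.linalg.norm(X, axis=1)))
```

In a real kernel these two vectorized axes would map onto thread blocks and threads, with the channel-group level falling out of how quantization groups are tiled; this sketch only demonstrates the data-parallel structure, not the memory layout.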

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
