SSDi8: Accurate and Efficient 8-bit Quantization for State Space Duality

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Mamba-2, State Space Duality (SSD), Quantization
Abstract:

Recent advances in sequence modeling have highlighted Mamba as a state space architecture that offers efficient long-range dependency modeling and provides a viable alternative to Transformers. Building upon this, Mamba-2 introduces Structured State Space Duality (SSD), which integrates recurrent and attention modes to achieve both efficiency and scalability. However, this architectural expansion substantially increases memory and latency overhead, underscoring the need for efficient compression strategies tailored to SSD. In this work, we present SSDi8, the first post-training quantization framework specifically designed for SSD that maintains a persistent INT8 path. SSDi8 introduces a reformulation that decouples element-wise multiplications from matrix multiplications, enabling quantized activations to be reused across modules. Moreover, SSDi8 adaptively quantizes channel-varying activations at cost-effective points, further reducing latency. On the accuracy side, SSDi8 explicitly leverages the intrinsic dimensional decomposition of SSD, exploiting distinct outlier distributions across axes, and incorporates an error correction term based on per-channel error statistics. Comprehensive experiments demonstrate that SSDi8 achieves accuracy comparable to FP16 while delivering up to a 1.4× speedup in W4A8 and W8A8 settings. We further validate its robustness in resource-constrained environments by deploying it on an NVIDIA Jetson Orin Nano device.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces SSDi8, a post-training quantization framework targeting Structured State Space Duality architectures with persistent INT8 paths. According to the taxonomy, it resides in the 'SSD-Specific INT8 Quantization Frameworks' leaf, which contains only two papers total. This leaf sits within the broader 'Selective State Space Model Quantization' branch, indicating a relatively sparse research direction focused on architecture-specific optimizations for SSD variants rather than general Mamba models.

The taxonomy reveals neighboring work in 'General Mamba Post-Training Quantization' (containing Quamba and Q-Mamba) and 'Small-Scale and Edge-Optimized SSM Quantization' (Quantizing Edge SSM). These sibling leaves address post-training compression for Mamba variants but differ in scope: general Mamba methods handle multi-bit precision without SSD-specific optimizations, while edge-focused work prioritizes resource constraints over architectural duality. The taxonomy explicitly excludes quantization-aware training methods and vision-specific Mamba variants, positioning SSDi8 within a narrower post-training context.

Of the 21 candidate papers examined in total, the core contribution (the SSDi8 PTQ framework for SSD) was compared against 10 candidates, with no clear refutation found. The sparse-aware reformulation was compared against only 1 candidate, with no overlap found. However, the error correction mechanism based on per-channel statistics encountered 2 refutable candidates among its 10 examined, suggesting this component has more substantial prior work in the quantization literature. Given the limited search scope, these statistics reflect top semantic matches rather than exhaustive coverage.

Given the sparse taxonomy leaf and limited refutation signals, the work appears to occupy a relatively novel position within SSD-specific quantization. The analysis covers top-21 semantic candidates and does not claim exhaustive field coverage. The error correction component shows more overlap with existing techniques, while the SSD-tailored framework and reformulation appear more distinctive within the examined scope.

Taxonomy

Core-task taxonomy papers: 9
Claimed contributions: 3
Contribution candidate papers compared: 21
Refutable papers: 2

Research Landscape Overview

Core task: 8-bit quantization for Structured State Space Duality models. The field organizes around three main branches that reflect different stages and scopes of model compression. Post-Training Quantization Methods for State Space Models reduce precision after training is complete, with works like Quamba[2] and Q-Mamba[3] developing selective strategies that identify which components of state space architectures are most sensitive to quantization. Quantization-Aware Training and Co-Design Approaches integrate precision reduction directly into the training loop or jointly optimize architecture and bit-width, as seen in QAT Survey[7] and AutoNeural[8]. Specialized Applications and Hybrid Architectures address domain-specific constraints or combine state space models with other paradigms, exemplified by TVMamba[9] and Spikingbrain[5], which adapt quantization to particular deployment contexts or neuromorphic settings.

Recent activity has concentrated on refining post-training methods that balance compression ratio against the unique computational patterns of selective state space models. A central tension emerges between aggressive uniform quantization and more nuanced layer-wise or context-sensitive schemes, with Context-Aware Quantization[6] exploring adaptive strategies and Slender-Mamba[4] pursuing extreme sparsity alongside reduced precision.

SSDi8[0] sits squarely within the selective state space model quantization cluster, proposing an INT8 framework tailored to the duality structure of SSD architectures. Compared to broader approaches like Q-Mamba[3], which targets general Mamba variants, SSDi8[0] emphasizes architecture-specific optimizations that exploit the mathematical properties of duality layers. Meanwhile, Quantizing Edge SSM[1] addresses similar post-training challenges but prioritizes edge deployment constraints, highlighting an open question about whether specialized frameworks or unified methods will prove more effective as state space models diversify.

Claimed Contributions

SSDi8 post-training quantization framework for SSD

The authors introduce SSDi8, a novel post-training quantization framework tailored for Structured State Space Duality (SSD) in Mamba-2. This framework maintains a persistent INT8 execution path through the SSD architecture, addressing the unique computational organization and challenges of quantizing SSD layers.

10 retrieved papers
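The report does not include SSDi8's actual kernels, so as a rough illustration of what a "persistent INT8 path" buys, the following NumPy sketch quantizes an activation once and performs the matrix multiply entirely in integer arithmetic, dequantizing only at the output. The per-tensor symmetric scheme and all function names here are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def quantize_int8(x):
    # Illustrative symmetric per-tensor quantization (assumed, not SSDi8's scheme).
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(q_x, s_x, q_w, s_w):
    # INT8 x INT8 with INT32 accumulation; dequantize once at the output.
    acc = q_x.astype(np.int32) @ q_w.astype(np.int32)
    return acc.astype(np.float32) * (s_x * s_w)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64), dtype=np.float32)   # activation
w = rng.standard_normal((64, 32), dtype=np.float32)  # weight

q_x, s_x = quantize_int8(x)   # quantize the activation once...
q_w, s_w = quantize_int8(w)
y = int8_matmul(q_x, s_x, q_w, s_w)
# ...and keep q_x in INT8 for downstream modules ("persistent INT8 path"),
# instead of dequantizing to FP16 and re-quantizing between operations.
```

Keeping `q_x` resident in INT8 is what avoids the repeated quantize/dequantize round-trips that the framework's reuse claim targets.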
Sparse-aware reformulation for element-wise operations

The authors propose a sparse-aware reformulation that separates element-wise multiplications from matrix multiplications within SSD. This reformulation enables quantized activation reuse across multiple modules and maintains the INT8 execution path, with formal mathematical guarantees provided through theoretical analysis.

1 retrieved paper
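As a toy picture of the decoupling this contribution describes, the sketch below uses the identity (d ⊙ X)W = d ⊙ (XW), valid when the element-wise factor d broadcasts per row, to pull the element-wise multiply out of the matmul: the activation is then quantized once and reused by two integer matmuls. This is a minimal sketch of the separation idea only; the actual SSDi8 reformulation, its sparsity handling, and its formal guarantees are not reproduced here.

```python
import numpy as np

def quantize_int8(x):
    # Illustrative symmetric per-tensor quantization.
    scale = np.abs(x).max() / 127.0
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8), scale

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 64), dtype=np.float32)
W1 = rng.standard_normal((64, 32), dtype=np.float32)
W2 = rng.standard_normal((64, 32), dtype=np.float32)
d = rng.random((8, 1), dtype=np.float32)   # per-token element-wise factor

qX, sX = quantize_int8(X)
qW1, sW1 = quantize_int8(W1)
qW2, sW2 = quantize_int8(W2)

# Identity: (d * X) @ W == d * (X @ W) when d broadcasts over rows.
# Pulling d outside means X is quantized once and both matmuls stay INT8;
# naively, d * X would need a fresh quantization before every matmul.
XW1 = (qX.astype(np.int32) @ qW1.astype(np.int32)).astype(np.float32) * (sX * sW1)
XW2 = (qX.astype(np.int32) @ qW2.astype(np.int32)).astype(np.float32) * (sX * sW2)

y1 = d * XW1   # element-wise factor applied after the INT8 matmul
y2 = XW2       # second branch reuses the same quantized activation qX
```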
Error correction based on per-channel statistics

The authors introduce a mean correction strategy that compensates for quantization errors using per-channel error statistics. This correction term is derived in closed form and applied through a layer-wise sequential update strategy to mitigate error accumulation across SSD layers.

10 retrieved papers (2 can refute)
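The generic form of such a mean correction can be sketched as follows: estimate, on calibration data, the per-output-channel mean of the error that weight quantization introduces, then subtract it at inference (equivalently, fold it into the bias). SSDi8's specific closed-form term and layer-wise sequential update are not reproduced; this is the standard bias-correction idea under assumed per-tensor quantization.

```python
import numpy as np

def quantize_int8(x):
    # Illustrative symmetric per-tensor quantization.
    scale = np.abs(x).max() / 127.0
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8), scale

rng = np.random.default_rng(1)
W = rng.standard_normal((64, 32), dtype=np.float32)
qW, sW = quantize_int8(W)
W_hat = qW.astype(np.float32) * sW           # dequantized weights

# Calibration: per-output-channel mean of the quantization-induced error.
# Inputs are given a nonzero mean so the systematic bias is visible.
X_cal = rng.standard_normal((512, 64), dtype=np.float32) + 0.5
correction = (X_cal @ W_hat - X_cal @ W).mean(axis=0)   # one value per channel

# Inference: subtract the expected error (or fold it into the layer bias).
X = rng.standard_normal((2048, 64), dtype=np.float32) + 0.5
y_naive = X @ W_hat
y_corrected = y_naive - correction
```

The correction removes the systematic per-channel offset while leaving the zero-mean part of the quantization noise untouched, which is why per-channel error statistics suffice for a closed-form term.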

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: SSDi8 post-training quantization framework for SSD
Contribution 2: Sparse-aware reformulation for element-wise operations
Contribution 3: Error correction based on per-channel statistics