MaskPro: Linear-Space Probabilistic Learning for Strict (N:M)-Sparsity on LLMs

ICLR 2026 Conference Submission · Anonymous Authors
Semi-structured sparsity · Policy gradient learning · Probabilistic relaxation
Abstract:

The rapid scaling of large language models (LLMs) has made inference efficiency a primary bottleneck in practical deployment. Semi-structured sparsity offers a promising remedy by strategically retaining N elements out of every M consecutive weights, thereby enabling hardware-friendly acceleration and reduced memory. However, existing (N:M)-compatible approaches typically fall into two categories: rule-based layerwise greedy search, which suffers from considerable errors, and gradient-driven combinatorial learning, which incurs prohibitive training costs. To tackle these challenges, we propose MaskPro, a novel linear-space probabilistic framework that learns a prior categorical distribution for every M consecutive weights and then leverages this distribution to generate (N:M)-sparsity via N-way sampling without replacement. Furthermore, to mitigate the training instability induced by the high variance of policy gradients in the extremely large combinatorial space, we propose a novel update method that replaces the vanilla loss with a moving-average tracker of loss residuals. Finally, we conduct comprehensive theoretical analysis and extensive experiments to validate the superior performance of MaskPro, as well as its excellent scalability in memory efficiency and exceptional robustness to data samples.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes MaskPro, a probabilistic framework for learning N:M sparsity patterns in large language models through categorical distributions and Gumbel-Softmax sampling. It resides in the 'Gumbel-Softmax and Categorical Distribution Learning' leaf, which contains only three papers total, including the original work. This leaf sits within the broader 'Learnable Semi-Structured Sparsity Methods' branch, indicating a moderately sparse research direction focused on gradient-based mask optimization rather than post-training heuristics.

The taxonomy reveals neighboring approaches in sibling leaves: 'Continuous Differentiable Sparsity Training' (one paper using proximal operators) and 'Regularized Optimization for Mask Selection' (one paper transforming mask selection into regularized problems). These alternatives avoid discrete sampling in favor of continuous relaxations or global feedback mechanisms. The broader field includes dense clusters in 'One-Shot Post-Training Pruning Methods' (eleven papers across three leaves) and 'Hybrid Compression Techniques' (seven papers), suggesting that learnable categorical methods occupy a smaller but distinct niche compared to post-training magnitude-based or quantization-hybrid strategies.

Among twenty candidates examined, the core MaskPro framework (Contribution A) shows no clear refutation across seven candidates, suggesting novelty in its linear-space categorical prior formulation. However, the enhanced policy gradient estimator with loss residuals (Contribution B, three candidates examined) and theoretical variance analysis (Contribution C, ten candidates examined) both encounter refutable prior work. Contribution C in particular faces eight candidates providing overlapping variance reduction theory, indicating that while the overall framework may be novel, its theoretical underpinnings draw heavily on established policy gradient literature.

Given the limited search scope of twenty semantically similar papers, this assessment captures local novelty within the immediate research neighborhood but cannot claim exhaustive coverage. The framework's positioning in a sparse taxonomy leaf and the absence of refutation for its core mechanism suggest meaningful differentiation from existing categorical sparsity methods, though the theoretical contributions appear more incremental relative to broader reinforcement learning and variance reduction literature.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Papers: 9

Research Landscape Overview

Core task: Semi-structured sparsity learning for large language models. The field organizes around several complementary strategies for reducing model size and computational cost while preserving performance. Learnable semi-structured sparsity methods train differentiable masks or use gradient-based optimization to discover sparse patterns during fine-tuning or continued training, often employing techniques like Gumbel-Softmax relaxations (e.g., MaskLLM[1], MaskPro[0]) or proximal operators (ProxSparse[14]). One-shot post-training pruning methods such as SparseGPT[17] apply magnitude- or Hessian-based criteria without retraining, trading speed for potential accuracy loss. Pruning-aware pretraining and retraining branches explore training models from scratch with sparsity constraints or recovering performance through extended fine-tuning. Hybrid compression techniques combine pruning with quantization or low-rank decomposition, while activation and contextual sparsity exploit runtime sparsity in activations or attention. Specialized architectures adapt sparsity to non-Transformer models (SparseSSM[21]), and parameter-efficient fine-tuning with sparsity integrates methods like LoRA with pruning (SparseLoRA[36]). Evaluation and analysis branches provide benchmarks and theoretical insights (Beyond Perplexity[23]).

Within learnable methods, a particularly active line uses categorical distributions and Gumbel-Softmax to enable end-to-end mask learning, balancing differentiability with discrete sparsity patterns. MaskPro[0] sits squarely in this cluster, employing Gumbel-Softmax relaxations to learn N:M sparsity masks during training. It shares conceptual ground with MaskLLM[1], which similarly uses learnable masks but may differ in optimization details or mask parameterization. A closely related variant, MaskPro Linear Space[5], explores memory-efficient mask representations, suggesting that scalability and parameter overhead remain open challenges.

These learnable approaches contrast with one-shot methods like SparseGPT[17], which sacrifice adaptability for deployment simplicity, and with hybrid techniques that layer sparsity atop quantization or low-rank factorization. The central trade-off across branches involves training cost, final accuracy, and hardware compatibility, with learnable methods offering flexibility at the expense of longer optimization cycles.

Claimed Contributions

MaskPro: Linear-space probabilistic framework for (N:M)-sparsity

The authors introduce MaskPro, a probabilistic framework that reformulates semi-structured sparsity learning as N-way sampling without replacement from categorical distributions over every M consecutive weights. This approach requires only linear O(d) memory for storing logits, in contrast to the exponentially sized O((M choose N)^(d/M)) mask space handled by prior methods like MaskLLM.

7 retrieved papers
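The linear-space sampling mechanism can be sketched with the Gumbel-top-k trick, which realizes N-way sampling without replacement from the softmax of per-weight logits while storing only one logit per weight, i.e., O(d) memory. This is a hypothetical reconstruction for illustration; function and variable names are ours, not the paper's.

```python
import numpy as np

def sample_nm_mask(logits, N, M, rng):
    """Sample a strict (N:M) mask from per-weight logits.

    Each group of M consecutive weights keeps exactly N entries.
    Adding i.i.d. Gumbel noise to the logits and keeping the N
    largest entries per group is equivalent to N-way sampling
    without replacement from the group's softmax distribution.
    """
    d = logits.size
    assert d % M == 0, "weight count must be divisible by M"
    groups = logits.reshape(-1, M)                       # (d/M, M)
    gumbel = -np.log(-np.log(rng.uniform(size=groups.shape)))
    scores = groups + gumbel
    # indices of the N largest noisy scores in each group
    topk = np.argpartition(-scores, N - 1, axis=1)[:, :N]
    mask = np.zeros_like(groups)
    np.put_along_axis(mask, topk, 1.0, axis=1)
    return mask.reshape(d)

# Toy usage: 16 weights, 2:4 sparsity -> exactly 2 kept per group of 4.
rng = np.random.default_rng(0)
logits = rng.normal(size=16)
mask = sample_nm_mask(logits, N=2, M=4, rng=rng)
assert mask.reshape(-1, 4).sum(axis=1).tolist() == [2.0, 2.0, 2.0, 2.0]
```

Because the Gumbel perturbation is resampled on every draw, the same d logits parameterize a distribution over all strict (N:M) masks rather than a single fixed pattern.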
Enhanced policy gradient estimator with loss residuals and moving-average baseline

The authors develop a refined policy gradient estimator that replaces the vanilla loss metric with loss residuals computed against an initial mask, and stabilizes training by incorporating a moving-average tracker. This modification addresses the high variance problem in policy gradient updates caused by the vast combinatorial space.

3 retrieved papers
Can Refute
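A minimal REINFORCE-style sketch of the residual-plus-baseline idea, reduced to a single categorical group (the 1:M case, where the score function is simply onehot − softmax). Names, hyperparameters, and the toy loss are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_step(logits, loss_fn, state, rng, lr=0.5, beta=0.9):
    """One policy-gradient update on the logits of one categorical group.

    The raw loss is replaced by its residual against the loss of a fixed
    initial mask, and an exponential moving average of past residuals is
    subtracted as a baseline, which damps gradient variance without
    biasing the expected update.
    """
    p = softmax(logits)
    k = rng.choice(len(logits), p=p)          # sample which weight to keep
    onehot = np.eye(len(logits))[k]
    residual = loss_fn(onehot) - state["init_loss"]   # loss residual
    advantage = residual - state["ema"]               # baseline-corrected
    state["ema"] = beta * state["ema"] + (1 - beta) * residual
    logits = logits - lr * advantage * (onehot - p)   # score-function step
    return logits, state

# Toy usage: keeping weight 2 gives zero loss, all others give loss 1.
rng = np.random.default_rng(1)
logits = np.zeros(4)
state = {"init_loss": 1.0, "ema": 0.0}   # loss of the initial mask
toy_loss = lambda m: 1.0 - m[2]
for _ in range(500):
    logits, state = reinforce_step(logits, toy_loss, state, rng)
```

Because the baseline is independent of the sampled mask, subtracting it leaves the expected gradient unchanged while cancelling the offset shared by all samples.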
Theoretical analysis of unbiasedness and variance reduction

The authors present rigorous theoretical analysis proving that their proposed policy gradient estimators are unbiased and demonstrate variance reduction properties. The analysis establishes conditions under which the enhanced estimator achieves lower variance than vanilla policy gradients.

10 retrieved papers
Can Refute
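The unbiasedness claim for a mask-independent baseline follows the standard score-function argument; as a sketch in generic notation (with $\pi_\theta$ the mask distribution, $L$ the loss, $m$ the sampled mask, and $b$ any baseline independent of $m$):

```latex
\mathbb{E}_{m \sim \pi_\theta}\!\left[(L(m) - b)\,\nabla_\theta \log \pi_\theta(m)\right]
= \mathbb{E}\!\left[L(m)\,\nabla_\theta \log \pi_\theta(m)\right]
  - b\,\mathbb{E}\!\left[\nabla_\theta \log \pi_\theta(m)\right],
\quad\text{where}\quad
\mathbb{E}\!\left[\nabla_\theta \log \pi_\theta(m)\right]
= \sum_m \pi_\theta(m)\,\frac{\nabla_\theta \pi_\theta(m)}{\pi_\theta(m)}
= \nabla_\theta \sum_m \pi_\theta(m)
= \nabla_\theta 1 = 0.
```

Hence any baseline that does not depend on the drawn mask, such as a moving average of past loss residuals, preserves unbiasedness; the variance comparison against the vanilla estimator is where the paper's conditions come in.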

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

MaskPro: Linear-space probabilistic framework for (N:M)-sparsity

The authors introduce MaskPro, a probabilistic framework that reformulates semi-structured sparsity learning as N-way sampling without replacement from categorical distributions over every M consecutive weights. This approach requires only linear O(d) memory for storing logits, in contrast to the exponentially sized O((M choose N)^(d/M)) mask space handled by prior methods like MaskLLM.

Contribution

Enhanced policy gradient estimator with loss residuals and moving-average baseline

The authors develop a refined policy gradient estimator that replaces the vanilla loss metric with loss residuals computed against an initial mask, and stabilizes training by incorporating a moving-average tracker. This modification addresses the high variance problem in policy gradient updates caused by the vast combinatorial space.

Contribution

Theoretical analysis of unbiasedness and variance reduction

The authors present rigorous theoretical analysis proving that their proposed policy gradient estimators are unbiased and demonstrate variance reduction properties. The analysis establishes conditions under which the enhanced estimator achieves lower variance than vanilla policy gradients.
