ARMOR: High-Performance Semi-Structured Pruning via Adaptive Matrix Factorization

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Pruning, Model Compression, Block Coordinate Descent
Abstract:

Large language models (LLMs) present significant deployment challenges due to their immense computational and memory requirements. While semi-structured pruning, particularly 2:4 sparsity, offers a path to practical hardware acceleration, existing methods often incur substantial performance degradation. To bridge this gap, we introduce ARMOR (Adaptive Representation with Matrix-factORization), a novel one-shot post-training pruning algorithm. Instead of directly pruning weights, ARMOR factorizes each weight matrix into a 2:4 sparse core wrapped by two low-overhead, block diagonal matrices. These wrappers act as efficient pre- and post-transformation error correctors, offering greater flexibility to preserve model quality compared to conventional 2:4 pruning techniques. The sparse core and block diagonal wrappers are chosen through a block coordinate descent algorithm that minimizes a layer-wise proxy loss. We theoretically prove this optimization is guaranteed to converge to a solution with a proxy loss less than or equal to that of state-of-the-art pruning algorithms. Experiments on the Llama (Touvron et al., 2023; Dubey et al., 2024) and Qwen (Yang et al., 2025) model families demonstrate that ARMOR consistently and significantly outperforms state-of-the-art 2:4 pruning methods across a wide range of downstream tasks and perplexity evaluations. ARMOR achieves this superior performance while retaining the inference speedups and substantial memory usage reductions of 2:4 pruning, establishing a more effective trade-off between model compression and task accuracy.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces ARMOR, a one-shot post-training pruning method that factorizes weight matrices into a 2:4 sparse core wrapped by block-diagonal error correctors. This work resides in the 'Learnable N:M Sparsity Mask Optimization' leaf, which contains five papers including the original submission. This leaf sits within the broader 'Semi-Structured Sparsity Pattern Methods' branch, indicating a moderately populated research direction focused on adaptive mask selection mechanisms. The taxonomy reveals that semi-structured pruning is an active area with multiple competing approaches across ten major branches.

The taxonomy tree shows that ARMOR's immediate neighbors include methods like MaskLLM and CAST, which also optimize N:M masks but through different training or fine-tuning strategies. Adjacent leaves explore 'Block-Wise and Structured Sparsity Patterns' (five papers) and 'Dependency-Aware and GLU-Specific Pruning' (one paper), suggesting that block-level transformations and architectural specialization are related but distinct research threads. The 'One-Shot Importance-Based Pruning' branch (nine papers across three leaves) represents an alternative paradigm using magnitude or Hessian-based metrics without learnable masks, highlighting a fundamental methodological divide in the field.

Among 29 candidates examined, none clearly refute any of ARMOR's three core contributions. The matrix factorization approach (10 candidates examined, 0 refutable) appears distinct from prior learnable mask methods in its use of block-diagonal wrappers rather than direct mask optimization. The block coordinate descent algorithm (9 candidates examined, 0 refutable) and convergence guarantee (10 candidates examined, 0 refutable) similarly show no substantial overlap within the limited search scope. These statistics suggest that ARMOR's combination of factorization, block-diagonal transformations, and theoretical guarantees may represent a novel synthesis, though the search examined only top-K semantic matches rather than an exhaustive literature review.

Based on the limited search scope of 29 candidates, ARMOR appears to occupy a relatively unexplored niche within learnable N:M sparsity optimization. The absence of refutable prior work across all three contributions, combined with its position in a moderately populated taxonomy leaf, suggests meaningful differentiation from existing approaches. However, this assessment is constrained by the top-K semantic search methodology and does not account for potentially relevant work outside the examined candidate set or in adjacent compression domains.

Taxonomy

- Core-task Taxonomy Papers: 50
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 29
- Refutable Papers: 0

Research Landscape Overview

Core task: semi-structured pruning of large language models. The field has organized itself around several complementary strategies for reducing model size and computational cost while preserving performance. At the highest level, Semi-Structured Sparsity Pattern Methods explore learnable or fixed N:M patterns that balance hardware efficiency with flexibility, while One-Shot Importance-Based Pruning techniques such as SparseGPT[23] and Simple Effective Pruning[4] remove weights in a single pass using gradient or activation-based metrics. Depth and Layer-Level Pruning approaches like Shortened LLaMA[5] target entire layers or blocks, and Hybrid Compression with Pruning methods combine sparsity with quantization or low-rank factorization to achieve greater compression ratios. Meanwhile, Activation Sparsity and Inference Optimization branches focus on runtime efficiency, Specialized Pruning Frameworks adapt techniques to particular architectures, and Global and Coordinated Pruning Strategies enforce consistency across modules. Post-Training and Retraining-Free Methods aim to minimize calibration overhead, and dedicated evaluation branches assess the impact of pruning on downstream tasks and model behavior.

Within the Semi-Structured Sparsity Pattern Methods branch, a particularly active line of work centers on learnable N:M sparsity mask optimization, where methods such as MaskLLM[1] and CAST[36] dynamically adjust which weights to retain during training or fine-tuning. ARMOR[0] sits squarely in this cluster, emphasizing adaptive mask learning to optimize the trade-off between sparsity ratio and task performance. Compared to ProxSparse[39], which employs proximal gradient techniques for mask updates, ARMOR[0] explores alternative optimization strategies that may offer faster convergence or better final accuracy. Similarly, Semi-Structural Adaptive[40] investigates adaptive sparsity patterns but differs in how it balances global versus local importance signals.

Across these branches, key open questions include how to select optimal sparsity ratios without extensive retraining, how to coordinate pruning decisions across layers, and whether learnable masks can generalize across diverse tasks and model scales.
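The N:M patterns that anchor this branch are easy to illustrate: under 2:4 sparsity, every contiguous group of four weights keeps at most two nonzeros, the pattern that sparse tensor-core hardware accelerates. A minimal magnitude-based sketch (an illustration of the pattern, not any particular paper's selection rule):

```python
import numpy as np

def prune_2_4(w: np.ndarray) -> np.ndarray:
    """Magnitude-based 2:4 pruning: in every contiguous group of 4
    weights along the last axis, zero the 2 smallest-magnitude entries."""
    rows, cols = w.shape
    assert cols % 4 == 0, "width must be a multiple of 4"
    groups = w.reshape(rows, cols // 4, 4)
    drop = np.argsort(np.abs(groups), axis=-1)[..., :2]  # 2 smallest per group
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)
    return (groups * mask).reshape(rows, cols)

w = np.array([[0.9, -0.1, 0.4, 0.05, -2.0, 0.3, 0.0, 1.1]])
print(prune_2_4(w))  # keeps 0.9, 0.4, -2.0, 1.1; zeros the rest
```

Any selection rule that respects this per-group budget (magnitude, Hessian-based, or learned masks) yields a matrix the hardware can exploit; the methods above differ only in how they pick the surviving pair.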

Claimed Contributions

ARMOR matrix factorization for semi-structured pruning

The authors propose a novel weight representation that factorizes each weight matrix into a 2:4 sparse core surrounded by block diagonal wrapper matrices. These wrappers act as efficient pre- and post-transformation error correctors, offering greater flexibility to preserve model quality compared to conventional 2:4 pruning techniques.

10 retrieved papers
Block coordinate descent optimization algorithm

The authors develop a block coordinate descent optimization algorithm that alternates between updating continuous parameters (the block diagonal matrices and dense weights) and updating the sparse core. This algorithm is designed to minimize a layer-wise proxy loss while respecting the 2:4 sparsity constraint.

9 retrieved papers
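A toy sketch of the alternating scheme (the learning rate, the gradient parameterization, and the accept-if-better core update are my assumptions; the paper's actual update rules will differ): one block takes projected gradient steps on the block-diagonal wrappers with the sparse core fixed, the other re-selects the 2:4 core only when doing so lowers the proxy loss, so the loss is non-increasing across iterations:

```python
import numpy as np

def sparsify_2_4(w):
    """Keep the 2 largest-magnitude weights in every group of 4 (along rows)."""
    g = w.reshape(w.shape[0], -1, 4)
    drop = np.argsort(np.abs(g), axis=-1)[..., :2]
    mask = np.ones_like(g, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)
    return (g * mask).reshape(w.shape)

def block_diag_mask(n, b):
    """Boolean mask of the block-diagonal support with block size b."""
    m = np.zeros((n, n), dtype=bool)
    for i in range(0, n, b):
        m[i:i + b, i:i + b] = True
    return m

rng = np.random.default_rng(0)
n, batch = 8, 32
W = rng.normal(size=(n, n))          # dense target weights
X = rng.normal(size=(batch, n))      # calibration activations
mask = block_diag_mask(n, 2)
A, B = np.eye(n), np.eye(n)          # wrappers start at identity
S = sparsify_2_4(W)                  # initial 2:4 sparse core

def proxy_loss(A, S, B):
    """Layer-wise reconstruction error on the calibration activations."""
    return np.linalg.norm(X @ (A @ S @ B).T - X @ W.T) ** 2

initial = proxy_loss(A, S, B)
for step in range(200):
    # Block 1: gradient step on the wrappers, projected onto the
    # block-diagonal support, with the sparse core S held fixed.
    E = X @ (A @ S @ B).T - X @ W.T
    gA = 2 * E.T @ X @ (S @ B).T
    gB = 2 * (A @ S).T @ E.T @ X
    A -= 1e-4 * np.where(mask, gA, 0.0)
    B -= 1e-4 * np.where(mask, gB, 0.0)
    # Block 2: re-select the 2:4 core, accepting only improvements
    # (a stand-in for ARMOR's core update, which guarantees descent).
    cand = sparsify_2_4(np.linalg.solve(A, W) @ np.linalg.inv(B))
    if proxy_loss(A, cand, B) < proxy_loss(A, S, B):
        S = cand

print(proxy_loss(A, S, B) < initial)
```

The accept-if-better rule in block 2 is the simplest way to make the combinatorial update monotone; it is what makes a convergence argument of the kind claimed in Theorem 3.1 plausible for this family of algorithms.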
Theoretical convergence guarantee

The authors establish a theoretical result (Theorem 3.1) proving that their optimization algorithm converges and achieves a proxy loss no worse than state-of-the-art methods like NoWag-P, providing formal guarantees for their approach.

10 retrieved papers
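As summarized above, the guarantee compares ARMOR's layer-wise proxy loss against that of a baseline pruner. Assuming the standard activation-based reconstruction objective used throughout the one-shot pruning literature (the paper's exact objective may differ), the claim can be stated as:

```latex
% Layer-wise proxy loss for a compressed weight \hat{W}, with X the
% calibration activations (a standard choice, assumed here):
\mathcal{L}(\hat{W}) \;=\; \bigl\lVert X \hat{W}^{\top} - X W^{\top} \bigr\rVert_F^2

% Theorem 3.1 (as described): the block coordinate descent iterates
% converge, and the returned factorization (A, S, B) satisfies
\mathcal{L}(A\,S\,B) \;\le\; \mathcal{L}\bigl(\hat{W}_{\text{NoWag-P}}\bigr)
```

The inequality follows the report's description that the algorithm is initialized from (or dominates) a state-of-the-art solution and never increases the proxy loss thereafter.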

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: ARMOR matrix factorization for semi-structured pruning
Contribution 2: Block coordinate descent optimization algorithm
Contribution 3: Theoretical convergence guarantee