SNaX: Sparse Narrow Accelerated Mixture of Experts

ICLR 2026 Conference Submission — Anonymous Authors
Keywords: Mixture of Experts, GPU, kernel
Abstract:

Mixture of Experts (MoE) models have emerged as the de facto architecture for scaling up language models without significantly increasing computational cost. Existing MoE methods optimize system efficiency or model architecture independently. We show that as MoE models become more granular and sparser, they become more memory-bound, and that jointly optimizing the algorithms and the kernel design leads to a major improvement in MoE training throughput. We first propose a memory-efficient algorithm that computes the MoE forward and backward passes while saving minimal activations. We then design GPU kernels that overlap memory IO latency with compute, benefiting all MoE architectures. Finally, we propose a novel "token rounding" method that minimizes the compute wasted by tile quantization. As a result, our method SNaX reduces activation memory by 45% and achieves a 1.80x throughput improvement on NVIDIA H100 GPUs compared to ScatterMoE for a fine-grained 7B MoE. Moreover, SNaX on 64 H100s achieves a training throughput of 213 billion tokens per day, comparable to ScatterMoE's 225 billion tokens per day on 96 H100s, for training a 7B MoE model with token-choice routing under FSDP-2. Under high MoE sparsity settings, our tile-aware token rounding algorithm yields an additional 1.18x speedup in kernel execution time compared to vanilla top-K routing while maintaining similar downstream performance.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes SNaX, a system co-designing memory-efficient algorithms and GPU kernels for training fine-grained sparse MoE models. It resides in the 'Block-Sparse and Kernel-Level Optimization' leaf under 'Training Systems and Distributed Optimization', which contains only two papers, including this work. This leaf represents a specialized but relatively sparse research direction focused on low-level computational optimizations for MoE training, distinct from the higher-level parallelism strategies and architectural design choices explored in neighboring branches.

The taxonomy reveals that SNaX sits adjacent to several related but distinct research directions. The sibling leaf 'Distributed Parallelism and Communication' addresses expert-level scheduling and All-to-All reduction, while 'Dynamic Resource Management' tackles load balancing across devices. Neighboring branches like 'Fine-Grained Expert Granularity and Scaling' explore architectural configurations that create the sparsity patterns SNaX optimizes for. The scope note for this leaf explicitly excludes high-level parallelism and device placement, positioning SNaX as a hardware-aware execution layer beneath those system-level concerns.

Among the 27 candidates examined through semantic search, none clearly refute the three core contributions. The memory-efficient forward-backward algorithm was assessed against 10 candidates with no overlapping prior work identified. Similarly, the IO-aware GPU kernels with compute-memory overlap and the token rounding method were assessed against 10 and 7 candidates, respectively, with no refutations found. This suggests that, within the limited search scope, the specific combination of activation memory reduction, kernel-level IO overlap, and tile quantization handling appears distinct from the examined prior systems.

The analysis reflects a focused literature search rather than exhaustive coverage of all MoE training systems. The taxonomy structure indicates SNaX occupies a niche intersection of fine-grained sparsity and kernel optimization, with limited direct competition in this specific leaf. However, the broader 'Training Systems' branch contains multiple related approaches that address overlapping efficiency goals through different mechanisms, suggesting the novelty lies primarily in the particular synthesis of algorithmic and kernel-level techniques rather than entirely new problem formulation.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 27
Refutable Papers: 0

Research Landscape Overview

Core task: efficient training of sparse fine-grained mixture of experts models. The field has evolved around several interconnected branches that address different facets of building and deploying MoE systems at scale. Architectural Design and Scaling Laws explore how to configure expert counts and granularity, with works like Scaling Laws Fine-Grained MoE[1] and DeepSeekMoE[8] investigating optimal parameter allocation. Training Systems and Distributed Optimization tackle the computational challenges of parallelizing expert computation across devices, exemplified by systems such as Megablocks[3] and Fsmoe[5] that optimize communication patterns and load balancing. Routing Mechanisms and Expert Selection focus on how tokens are assigned to experts, while Model Compression and Efficiency Enhancement address post-training optimizations like quantization and pruning. Fine-Tuning and Adaptation Methods examine how to specialize pre-trained MoE models efficiently, and Deployment and Inference Optimization handle serving constraints. Interpretability and Analysis, Domain-Specific Applications, and Theoretical Foundations round out the taxonomy by studying expert behavior, applying MoE to vision or multimodal tasks, and formalizing design principles.

Within Training Systems and Distributed Optimization, a particularly active line of work targets block-sparse and kernel-level optimizations that exploit the fine-grained sparsity patterns inherent in MoE architectures. SNaX[0] sits squarely in this cluster, proposing specialized hardware-aware kernels to accelerate training by reducing memory overhead and communication bottlenecks at a granular level. This approach contrasts with Megablocks[3], which focuses on dynamic batching and efficient GPU utilization for variable expert loads, and complements system-level frameworks like Fsmoe[5] that emphasize distributed scheduling.

The central trade-off in this branch revolves around balancing fine-grained computational efficiency against the complexity of custom kernel development and hardware portability. SNaX[0] addresses this by co-designing sparsity patterns with low-level execution primitives, positioning itself as a bridge between architectural innovations in fine-grained MoE design and practical training system constraints.

Claimed Contributions

Memory-efficient MoE forward and backward algorithm

The authors introduce an algorithm that reorders the backward pass computation to avoid caching large activations, reducing activation memory by 45% for fine-grained MoE models. This approach keeps activation size constant regardless of expert granularity.

10 retrieved papers
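To make the claimed mechanism concrete, the following is a minimal plain-Python sketch of the general idea behind recompute-in-backward for an expert MLP: instead of caching the large intermediate h = relu(W1 @ x), only the smaller input x is saved and h is recomputed during the backward pass. All function names are illustrative assumptions; this is not the paper's exact algorithm.

```python
# Illustrative sketch: save only x in forward, recompute h in backward.

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, u) for u in v]

def forward(W1, W2, x):
    h = relu(matvec(W1, x))
    y = matvec(W2, h)
    return y, x                        # cache only x, not the larger h

def backward(W1, W2, x, dy):
    h = relu(matvec(W1, x))            # recomputed here, not cached
    # grad wrt h: W2^T @ dy, masked by relu's derivative
    dh = [sum(W2[r][c] * dy[r] for r in range(len(dy)))
          * (1.0 if h[c] > 0 else 0.0) for c in range(len(h))]
    dW2 = [[dy[r] * h[c] for c in range(len(h))] for r in range(len(dy))]
    dW1 = [[dh[r] * x[c] for c in range(len(x))] for r in range(len(dh))]
    return dW1, dW2

W1 = [[1.0, -1.0], [0.5, 0.5]]
W2 = [[1.0, 2.0]]
y, cache = forward(W1, W2, [2.0, 1.0])
dW1, dW2 = backward(W1, W2, cache, dy=[1.0])
```

Because only x is cached per expert, the saved activation size stays fixed as experts get narrower and more numerous, which matches the report's claim that activation size is constant regardless of expert granularity.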
IO-aware GPU kernels with overlapped memory and compute

The authors design GPU kernels that exploit asynchronous operations on modern GPUs to overlap memory IO with computation, achieving 1.80x throughput improvement on H100 GPUs for fine-grained MoE training compared to existing methods.

10 retrieved papers
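The overlap idea can be illustrated with a double-buffering sketch: while tile i is being processed, tile i+1 is fetched concurrently. On GPUs the same pattern uses asynchronous copies (e.g., cp.async or TMA on Hopper-class hardware); the thread-based Python version below is only an analogy, with hypothetical load/compute stand-ins.

```python
# Conceptual double-buffering pipeline: prefetch tile i+1 while computing
# on tile i. load_tile/compute are stand-ins for a global->shared memory
# copy and a per-tile GEMM, respectively.
import threading

def load_tile(i):
    # stands in for a memory fetch of tile i
    return list(range(i * 4, i * 4 + 4))

def compute(tile):
    # stands in for the matmul on one tile
    return sum(tile)

def pipelined(n_tiles):
    results, buf = [], {}
    t = threading.Thread(target=lambda: buf.__setitem__(0, load_tile(0)))
    t.start()
    for i in range(n_tiles):
        t.join()                           # wait for tile i's prefetch
        tile = buf.pop(i)
        if i + 1 < n_tiles:                # start fetching tile i+1 ...
            t = threading.Thread(
                target=lambda j=i + 1: buf.__setitem__(j, load_tile(j)))
            t.start()
        results.append(compute(tile))      # ... while computing on tile i
    return results

assert pipelined(3) == [6, 22, 38]
```

The benefit is largest exactly in the memory-bound regime the report describes: when fetch time rivals compute time, hiding one behind the other approaches a 2x reduction in per-tile latency.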
Token rounding routing method

The authors propose a tile-aware token rounding algorithm that rounds the number of tokens routed to each expert to multiples of GEMM tile sizes, reducing wasted computation from padding while maintaining downstream performance. This yields an additional 1.18x speedup under high sparsity settings.

7 retrieved papers
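A minimal sketch of the rounding step, under assumed behavior: each expert's token count is rounded to a multiple of the GEMM tile size so that no tile is partially filled. The function name and the rebalancing policy (returning freed whole tiles to the experts that lost the most tokens) are illustrative assumptions, not the paper's exact routing rule.

```python
# Hypothetical tile-aware token rounding: round per-expert counts to
# multiples of the tile size, never exceeding the original token budget.

def round_to_tiles(counts, tile=128):
    """Round each expert's token count down to a multiple of `tile`, then
    redistribute the freed budget in whole tiles to the experts with the
    largest remainders."""
    floors = [c - c % tile for c in counts]
    dropped = sum(counts) - sum(floors)
    # Hand whole tiles back to experts that lost the most tokens.
    order = sorted(range(len(counts)),
                   key=lambda i: counts[i] % tile, reverse=True)
    for i in order:
        if dropped < tile:
            break
        floors[i] += tile
        dropped -= tile
    return floors

counts = [300, 130, 70, 500]               # tokens routed to 4 experts
rounded = round_to_tiles(counts, tile=128)
assert all(c % 128 == 0 for c in rounded)  # every GEMM tile fully used
assert sum(rounded) <= sum(counts)         # never exceeds the budget
```

With a 128-wide tile, an expert receiving 130 tokens would otherwise launch two tiles and waste 126 slots of compute in the second; rounding to 128 eliminates that padding, which is where the reported speedup under high sparsity comes from.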

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Memory-efficient MoE forward and backward algorithm

The authors introduce an algorithm that reorders the backward pass computation to avoid caching large activations, reducing activation memory by 45% for fine-grained MoE models. This approach keeps activation size constant regardless of expert granularity.

Contribution

IO-aware GPU kernels with overlapped memory and compute

The authors design GPU kernels that exploit asynchronous operations on modern GPUs to overlap memory IO with computation, achieving 1.80x throughput improvement on H100 GPUs for fine-grained MoE training compared to existing methods.

Contribution

Token rounding routing method

The authors propose a tile-aware token rounding algorithm that rounds the number of tokens routed to each expert to multiples of GEMM tile sizes, reducing wasted computation from padding while maintaining downstream performance. This yields an additional 1.18x speedup under high sparsity settings.