SNaX: sparse narrow accelerated mixture of experts
Overview
Overall Novelty Assessment
The paper proposes SNaX, a system that co-designs memory-efficient algorithms and GPU kernels for training fine-grained sparse MoE models. It resides in the 'Block-Sparse and Kernel-Level Optimization' leaf under 'Training Systems and Distributed Optimization', a leaf that contains only two papers, including this work. This leaf represents a specialized but relatively sparse research direction focused on low-level computational optimizations for MoE training, distinct from the higher-level parallelism strategies and architectural design choices explored in neighboring branches.
The taxonomy places SNaX adjacent to several related but distinct research directions. The sibling leaf 'Distributed Parallelism and Communication' addresses expert-level scheduling and reducing All-to-All communication overhead, while 'Dynamic Resource Management' tackles load balancing across devices. Neighboring branches such as 'Fine-Grained Expert Granularity and Scaling' explore the architectural configurations that create the sparsity patterns SNaX optimizes for. The scope note for this leaf explicitly excludes high-level parallelism and device placement, positioning SNaX as a hardware-aware execution layer beneath those system-level concerns.
Among the 27 candidates examined through semantic search, none clearly refutes the three core contributions. The memory-efficient forward-backward algorithm was assessed against 10 candidates, with no overlapping prior work identified. The IO-aware GPU kernels with compute-memory overlap and the token rounding method were assessed against 10 and 7 candidates respectively, again with no refutations found. This suggests that, within the limited search scope, the specific combination of activation memory reduction, kernel-level IO overlap, and tile quantization handling is distinct from the examined prior systems.
The analysis reflects a focused literature search rather than exhaustive coverage of all MoE training systems. The taxonomy structure indicates SNaX occupies a niche intersection of fine-grained sparsity and kernel optimization, with limited direct competition in this specific leaf. However, the broader 'Training Systems and Distributed Optimization' branch contains multiple related approaches that address overlapping efficiency goals through different mechanisms, suggesting the novelty lies primarily in the particular synthesis of algorithmic and kernel-level techniques rather than in an entirely new problem formulation.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce an algorithm that reorders the backward pass computation to avoid caching large activations, reducing activation memory by 45% for fine-grained MoE models. This approach keeps activation size constant regardless of expert granularity.
The authors design GPU kernels that exploit asynchronous operations on modern GPUs to overlap memory IO with computation, achieving 1.80x throughput improvement on H100 GPUs for fine-grained MoE training compared to existing methods.
The authors propose a tile-aware token rounding algorithm that rounds the number of tokens routed to each expert to multiples of GEMM tile sizes, reducing wasted computation from padding while maintaining downstream performance. This yields an additional 1.18x speedup under high sparsity settings.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[3] MegaBlocks: Efficient sparse training with mixture-of-experts
Contribution Analysis
Detailed comparisons for each claimed contribution
Memory-efficient MoE forward and backward algorithm
The authors introduce an algorithm that reorders the backward pass computation to avoid caching large activations, reducing activation memory by 45% for fine-grained MoE models. This approach keeps activation size constant regardless of expert granularity.
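The general recomputation idea behind this contribution can be illustrated with a minimal NumPy sketch: rather than caching the large intermediate activation of an expert MLP, the backward pass recomputes it from the much smaller cached input, trading extra FLOPs for activation memory. The ReLU expert and function names below are illustrative assumptions; the paper's actual reordered backward pass is not reproduced here.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def expert_forward(x, w1, w2):
    """Forward pass that caches only the expert input, not the
    large intermediate activation h = relu(x @ w1)."""
    h = relu(x @ w1)
    y = h @ w2
    return y, x  # cache: just the input

def expert_backward(dy, cache, w1, w2):
    """Backward pass that recomputes h from the cached input
    instead of reading a stored copy."""
    x = cache
    h = relu(x @ w1)      # recomputed, never stored
    dw2 = h.T @ dy
    dh = dy @ w2.T
    dz = dh * (h > 0)     # ReLU gradient
    dw1 = x.T @ dz
    dx = dz @ w1.T
    return dx, dw1, dw2
```

The cached tensor scales with the token and model dimensions only, not with the expert's hidden width, which is how a recomputation scheme can keep activation size roughly constant as expert granularity changes.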
[19] Taming Latency-Memory Trade-Off in MoE-Based LLM Serving via Fine-Grained Expert Offloading
[58] MPMoE: Memory efficient MoE for pre-trained models with adaptive pipeline parallelism
[67] The ultimate guide to fine-tuning LLMs from basics to breakthroughs: An exhaustive review of technologies, research, best practices, applied research challenges and …
[68] MPipeMoE: Memory Efficient MoE for Pre-trained Models with Adaptive Pipeline Parallelism
[69] Parallelization Techniques for Large Language Models: A Review from Training to Inference
[70] ProMoE: Fast MoE-based LLM serving using proactive caching
[71] Diff-MoE: Efficient Batched MoE Inference with Priority-Driven Differential Expert Caching
[72] Scaling beyond the GPU memory limit for large mixture-of-experts model training
[73] Pangu Ultra MoE: How to train your big MoE on Ascend NPUs
[74] SpikingBrain: Spiking Brain-inspired Large Models
IO-aware GPU kernels with overlapped memory and compute
The authors design GPU kernels that exploit asynchronous operations on modern GPUs to overlap memory IO with computation, achieving 1.80x throughput improvement on H100 GPUs for fine-grained MoE training compared to existing methods.
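The underlying overlap pattern is software pipelining with double buffering: while tile i is being computed, tile i+1 is already in flight. The host-side Python sketch below illustrates only the scheduling idea; the paper's kernels realize it on-GPU with asynchronous copies, and all names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def process_tiles(load_tile, compute, num_tiles):
    """Software-pipelined loop: tile i+1 is fetched on a background
    thread while tile i is computed on the main thread, so memory IO
    and compute overlap instead of alternating."""
    if num_tiles == 0:
        return []
    results = []
    with ThreadPoolExecutor(max_workers=1) as io:
        nxt = io.submit(load_tile, 0)            # prefetch first tile
        for i in range(num_tiles):
            tile = nxt.result()                  # wait for current tile
            if i + 1 < num_tiles:
                nxt = io.submit(load_tile, i + 1)  # overlap next load...
            results.append(compute(tile))        # ...with this compute
    return results
```

In steady state the loop's latency is dominated by max(load, compute) per tile rather than their sum, which is the source of the throughput gain this style of kernel targets.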
[5] FSMoE: A flexible and scalable training system for sparse mixture-of-experts models
[51] Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts
[52] Pre-gated MoE: An algorithm-system co-design for fast and scalable mixture-of-expert inference
[53] Accelerating mixture-of-expert inference with adaptive expert split mechanism
[54] Harnessing inter-GPU shared memory for seamless MoE communication-computation fusion
[55] HeterMoE: Efficient training of mixture-of-experts models on heterogeneous GPUs
[56] MegaScale-Infer: Efficient mixture-of-experts model serving with disaggregated expert parallelism
[57] MoE-Gen: High-Throughput MoE Inference on a Single GPU with Module-Based Batching
[58] MPMoE: Memory efficient MoE for pre-trained models with adaptive pipeline parallelism
[59] MoE-Infinity: Offloading-efficient MoE model serving
Token rounding routing method
The authors propose a tile-aware token rounding algorithm that rounds the number of tokens routed to each expert to multiples of GEMM tile sizes, reducing wasted computation from padding while maintaining downstream performance. This yields an additional 1.18x speedup under high sparsity settings.
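The rounding step itself is simple to sketch: per-expert token counts are snapped to multiples of the GEMM tile size so that no tile is launched partially full. The `mode` options and the drop-versus-pad policy below are illustrative assumptions, not the paper's exact algorithm.

```python
def round_to_tiles(counts, tile=128, mode="nearest"):
    """Round each expert's token count to a multiple of the GEMM
    tile size. 'up' is plain padding; 'down' and 'nearest' avoid
    launching partially filled tiles by dropping excess tokens."""
    rounded = []
    for c in counts:
        if mode == "nearest":
            r = round(c / tile) * tile
        elif mode == "down":
            r = (c // tile) * tile
        else:  # "up": pad to the next full tile
            r = -(-c // tile) * tile
        rounded.append(r)
    return rounded
```

For intuition with a 128-wide tile: an expert routed only 5 tokens would otherwise be padded up to a 128-row tile, wasting 123 slots of compute; nearest rounding drops those tokens instead, so every launched tile is fully utilized. The benefit grows with sparsity, where many experts receive small token counts.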