FSA: An Alternative Efficient Implementation of Native Sparse Attention Kernel

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Efficient attention · GPUs · Long-context LLMs · Sparse attention
Abstract:

Recent advances in sparse attention mechanisms have demonstrated strong potential for reducing the computational cost of long-context training and inference in large language models (LLMs). Native Sparse Attention (NSA), one state-of-the-art approach, introduces natively trainable, hardware-aligned sparse attention that delivers a substantial system-level performance boost while maintaining accuracy comparable to full attention. However, the kernel implementation of NSA forces a loop order that is only efficient with a relatively large number of query heads in each Grouped Query Attention (GQA) group, whereas existing LLMs widely adopt much smaller GQA groups --- an inconsistency that significantly limits the applicability of this sparse algorithmic advance. In this work, we propose Flash Sparse Attention (FSA), an alternative kernel implementation that enables efficient NSA computation on modern GPUs across a wide range of popular LLMs with smaller numbers of query heads per GQA group. Compared to the vanilla NSA kernel implementation, our empirical evaluation demonstrates that FSA achieves (i) up to 3.5x and on average 1.6x kernel-level latency reduction, (ii) up to 1.25x and on average 1.09x end-to-end training speedup on state-of-the-art LLMs, and (iii) up to 1.36x and on average 1.11x prefill-phase speedup in LLM generative inference.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Flash Sparse Attention (FSA), a GPU kernel implementation optimized for Native Sparse Attention (NSA) that accommodates models with small query head counts per Grouped Query Attention group. It resides in the 'GPU Kernel Implementation and Optimization' leaf, which contains five papers including the original work. This leaf sits within the broader 'Implementation and System Optimization' branch, indicating a moderately populated research direction focused on translating sparse attention algorithms into efficient low-level code. The taxonomy reveals that kernel-level optimization is a distinct but active subfield, separate from algorithmic pattern design and higher-level serving systems.

The taxonomy structure shows that FSA's leaf neighbors include 'Serving Systems and Inference Optimization', which addresses batching and memory management at the system level rather than kernel internals. Sibling papers in the same leaf likely tackle related challenges such as memory access patterns, warp utilization, or register allocation for sparse primitives. The broader 'Implementation and System Optimization' branch connects to 'Sparse Attention Algorithm Design', where methods like NSA define the sparsity patterns that FSA aims to execute efficiently. The taxonomy's scope and exclude notes clarify that FSA belongs strictly to kernel-level concerns, not algorithmic pattern selection or end-to-end serving frameworks.

Among the three contributions analyzed, the literature search examined 24 candidates total, with no refutable pairs identified. The core FSA kernel implementation examined 4 candidates with 0 refutations, while optimizations for non-contiguous memory access and empirical evaluation each examined 10 candidates with 0 refutations. This suggests that within the limited search scope—top-K semantic matches plus citation expansion—no prior work directly overlaps with FSA's specific approach to handling small query head counts in sparse attention kernels. However, the search scale is modest, and the absence of refutations reflects the examined sample rather than exhaustive coverage of all related kernel optimization literature.

Given the limited search scope of 24 candidates, the analysis indicates that FSA addresses a relatively specific gap in sparse attention kernel design. The taxonomy context reveals a moderately active research area with five papers in the same leaf, suggesting that while kernel optimization for sparse attention is an established concern, FSA's focus on small query head counts may represent a narrower technical contribution. The lack of refutable candidates among examined papers does not guarantee absolute novelty but suggests that FSA's particular optimization strategy is not prominently addressed in the top semantic matches and their citations.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 23
Refutable Papers: 0

Research Landscape Overview

Core task: efficient sparse attention kernel implementation for long-context language models. The field has evolved into several interconnected branches that address complementary aspects of scaling transformers to longer sequences. Sparse Attention Algorithm Design explores structured and learned sparsity patterns that reduce computational complexity while preserving model quality, with works ranging from fixed patterns to dynamic, content-aware selection strategies. Implementation and System Optimization focuses on translating these algorithmic ideas into high-performance GPU kernels and system-level optimizations, exemplified by foundational efforts like FlashAttention[28] and specialized frameworks such as FlashInfer[16]. KV Cache Management and Compression tackles memory bottlenecks through quantization, eviction policies, and retrieval-augmented approaches. Training and Fine-Tuning for Long Context addresses the challenge of extending pretrained models to handle longer sequences efficiently, while Theoretical Analysis and Empirical Studies provide rigorous understanding of sparsity trade-offs. Alternative Efficiency Approaches explore orthogonal methods such as linear attention or state-space models, and Production Systems integrate these techniques into deployable inference engines.

Recent work has intensified around bridging algorithmic sparsity with practical kernel efficiency. Many studies propose dynamic sparse patterns that adapt to input content, trading off selection overhead against attention savings, as seen in works like Minference[9] and SeerAttention[24]. Others emphasize co-design of sparsity and low-level optimizations, such as SparseD[3] and DynaX[41], which tailor kernel implementations to specific sparse structures. FSA[0] sits within the GPU Kernel Implementation and Optimization cluster, focusing on efficient execution of sparse attention primitives.
Compared to FlashAttention[28], which pioneered IO-aware dense attention, FSA[0] extends these principles to handle irregular sparse access patterns with minimal overhead. Relative to SparseD[3], which also targets sparse kernel efficiency, FSA[0] may emphasize different sparsity formats or hardware utilization strategies. The central tension across this branch remains balancing the generality of sparse patterns against the predictability required for peak GPU performance.

Claimed Contributions

Flash Sparse Attention (FSA) kernel implementation

FSA is an alternative kernel implementation for Native Sparse Attention that inverts the loop order compared to vanilla NSA. It processes KV blocks in the outer loop and query tokens in the inner loop, eliminating padding inefficiencies when GQA groups contain few query heads, thereby enabling efficient sparse attention across diverse modern LLMs.
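Why the loop order matters can be illustrated with a back-of-the-envelope tile-utilization model. This is a hedged sketch: `TILE_ROWS` and the utilization formulas are illustrative assumptions, not taken from the FSA code.

```python
import math

# Assumed minimum row height of a tensor-core MMA tile; the real value
# depends on GPU architecture and kernel configuration.
TILE_ROWS = 16

def nsa_tile_utilization(gqa_group_size):
    """Vanilla NSA outer-loops over query tokens: each tile holds the
    query heads of ONE token's GQA group, padded up to TILE_ROWS."""
    padded = math.ceil(gqa_group_size / TILE_ROWS) * TILE_ROWS
    return gqa_group_size / padded

def fsa_tile_utilization(num_queries_for_block):
    """FSA inverts the loops: the outer loop walks KV blocks, and each
    tile is filled with the many query tokens that selected that block."""
    padded = max(TILE_ROWS,
                 math.ceil(num_queries_for_block / TILE_ROWS) * TILE_ROWS)
    return num_queries_for_block / padded

# With a GQA group of 4 query heads, the vanilla loop order keeps only
# 25% of the tile rows busy, while the inverted order reaches full
# utilization once TILE_ROWS query tokens have selected the same block.
print(nsa_tile_utilization(4))    # 0.25
print(fsa_tile_utilization(16))   # 1.0
```

Under this toy model, the vanilla order only becomes efficient when the GQA group itself approaches the tile height, which matches the report's observation that NSA favors large groups.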

3 retrieved papers
Optimizations for non-contiguous memory access and reduction

The authors introduce specialized optimizations to handle challenges arising from FSA's inverted loop order, including index tensors for non-contiguous query token access, a separate reduction kernel to avoid atomic operations, and an online softmax kernel to ensure numerical correctness across distributed partial results.
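The online-softmax reduction described above can be sketched in plain Python (a minimal illustration of the standard rescaling identity; function names and the two-block split are illustrative, not taken from the FSA kernels):

```python
import math

def partial_attn(q, K, V):
    """Online-softmax partial state for one KV block (unnormalized).
    Returns (acc, m, l): weighted value accumulator, running max of
    the scores, and running sum of stabilized exponentials."""
    s = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]  # scores q.k_j
    m = max(s)
    p = [math.exp(x - m) for x in s]                       # stabilized exp
    acc = [sum(p[j] * V[j][d] for j in range(len(V)))
           for d in range(len(V[0]))]
    return acc, m, sum(p)

def merge(acc1, m1, l1, acc2, m2, l2):
    """Combine partial states from disjoint KV blocks; rescaling by
    exp(m_i - m) keeps the result exact, which is why a separate
    reduction kernel can do this without atomic operations."""
    m = max(m1, m2)
    a1, a2 = math.exp(m1 - m), math.exp(m2 - m)
    acc = [x * a1 + y * a2 for x, y in zip(acc1, acc2)]
    return acc, m, l1 * a1 + l2 * a2

# Two KV blocks attended by the same query token:
q = [0.5, -1.0, 2.0]
K = [[1.0, 0.0, 0.5], [0.2, 0.3, -0.1], [-0.4, 1.1, 0.7], [0.9, -0.2, 0.3]]
V = [[1.0, 2.0], [0.0, -1.0], [0.5, 0.5], [2.0, 0.0]]

acc, m, l = merge(*partial_attn(q, K[:2], V[:2]),
                  *partial_attn(q, K[2:], V[2:]))
out = [a / l for a in acc]

# Reference: full softmax attention over all four keys.
s = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
mx = max(s)
w = [math.exp(x - mx) for x in s]
ref = [sum(w[j] * V[j][d] for j in range(4)) / sum(w) for d in range(2)]
assert all(abs(o - r) < 1e-9 for o, r in zip(out, ref))
```

The merged result matches full-attention softmax exactly (up to floating-point error), which is the numerical-correctness property the separate reduction kernel must preserve.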

10 retrieved papers
Empirical evaluation demonstrating FSA performance gains

The authors provide comprehensive empirical evaluation across kernel-level and end-to-end scenarios, demonstrating that FSA achieves substantial speedups over vanilla NSA and full attention in training and inference on state-of-the-art LLMs with various GQA configurations.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
