TileLang: Bridge Programmability and Performance in Modern Neural Kernels

ICLR 2026 Conference Submission

Anonymous Authors

OpenReview Score: 7.0

Keywords: compiler; AI; programming model

Abstract:

Modern AI algorithms increasingly adopt fused kernels for performance, but implementing them remains complex because existing compilers such as Triton lack fine-grained control. We introduce TileLang, a controllable programming system for fused neural kernels. TileLang provides explicit tile-level primitives for memory placement, data movement, and parallel scheduling. To guide developers in hardware-aware programming, TileLang introduces two key techniques: tile inference, which models tile programs as fused graphs and automatically deduces tile configurations from partial annotations; and tile recommendation, which suggests efficient tile configurations based on hardware profiles and heuristics. TileLang makes it easy to express a wide range of fused attention kernels in under 80 lines of Python code, reducing code size by up to 90% compared to manual implementations. Evaluations show that TileLang achieves up to 5x speedup over Triton on NVIDIA H100 and up to 6x on AMD GPUs, demonstrating its ability to bridge programmability and performance.

Disclaimer

This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.

NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.

If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces TileLang, a programmable tile-level system providing explicit primitives for memory placement, data movement, and parallel scheduling in fused neural kernels. Within the taxonomy, it resides in the 'Explicit Tile-Level Programming Languages' leaf, which contains only two papers total. This sparse population suggests the research direction—domain-specific languages offering fine-grained tile control—remains relatively underexplored compared to broader hardware architecture branches. The sibling paper (Triton Compiler) shares the goal of GPU tile programming but emphasizes ease of use over explicit control, indicating TileLang occupies a distinct niche prioritizing programmability with hardware awareness.

The taxonomy reveals TileLang sits adjacent to 'Intermediate Compilation Frameworks and Overlays,' which abstract hardware details through compiler infrastructures rather than exposing tile primitives. Neighboring branches include 'Coarse-Grained Reconfigurable Arrays' (3 papers) and 'Reconfigurable Processing Elements' (3 papers), focusing on physical architectures rather than programming abstractions. The 'Optimization Techniques' branch addresses scheduling and fusion strategies but assumes existing programming models. TileLang's explicit tile-level primitives distinguish it from compiler-only frameworks while its programmability separates it from hardware-centric designs, positioning it at the intersection of abstraction and control.

Among 20 candidates examined across three contributions, the 'Unified fused tile-level dataflow graph (FTG) representation' shows one refutable candidate from 10 examined, suggesting some overlap in graph-based modeling approaches. The 'Programmable tile-level abstractions' contribution examined 10 candidates with zero refutations, indicating potential novelty in the explicit primitive design. The 'Tile recommendation and inference framework' was not evaluated against prior work in this limited search. The modest search scope (20 papers, not exhaustive) means substantial related work may exist beyond top-K semantic matches, particularly in compiler optimization or dataflow modeling domains.

Based on the limited 20-candidate search, TileLang appears to contribute novel explicit tile primitives in a sparsely populated research direction, though the FTG representation shows some prior overlap. The analysis covers top semantic matches and immediate taxonomy neighbors but does not exhaustively survey compiler frameworks, GPU programming models, or dataflow graph literature. A broader search might reveal additional related work in tensor compiler design or hardware-software co-design that was not captured in this scope.

Taxonomy

[Figure: taxonomy diagram. Legend: Core-task Taxonomy Papers; Claimed Contributions; Contribution Candidate Papers Compared; Refutable Paper]

Research Landscape Overview

Core task: programmable tile-level system for fused neural kernels.

The field centers on designing systems that partition neural computations into tiles (spatial or logical blocks), enabling efficient execution of fused operations on specialized hardware. The taxonomy reveals three main branches: Tile-Level Programming Abstractions and Compilation focuses on languages and compilers that expose tile-level parallelism to programmers, exemplified by explicit programming models like Triton Compiler[3] and TileLang Bridge[0]; Hardware Architecture Design for Tile-Based Acceleration explores reconfigurable and domain-specific architectures such as coarse-grained reconfigurable arrays (CGRAs) like Opal CGRA[1] and Onyx CGRA[4], as well as specialized accelerators like Venus Accelerator[8]; and Optimization Techniques for Tile-Based Systems addresses scheduling, memory management, and fusion strategies, including works on autonomous tiling (Autonomous Task Tiling[22]) and layer-centric fusion (Layer Centric Fusion[23]). These branches are tightly coupled: programming abstractions must map efficiently to hardware substrates, while optimization techniques bridge the gap between high-level code and low-level execution.

A particularly active line of work involves explicit tile-level languages that give developers fine-grained control over data movement and compute scheduling, balancing productivity with performance. TileLang Bridge[0] sits squarely in this space, offering a programming interface that abstracts tile-level operations while remaining close to hardware semantics. It shares conceptual ground with Triton Compiler[3], which similarly targets GPU tile programming but emphasizes ease of use for kernel fusion. In contrast, hardware-centric approaches like Opal CGRA[1] and Reconfigurable Hardware Acceleration[5] prioritize architectural flexibility and energy efficiency, often requiring more specialized compilation flows. The tension between programmer-friendly abstractions and hardware-specific optimizations remains a central theme: some works lean toward domain-specific languages with rich semantics, while others favor lower-level primitives that expose more control. TileLang Bridge[0] appears to navigate this trade-off by providing a structured yet expressive tile abstraction, positioning itself as a bridge between high-level neural frameworks and tile-based execution models.

Claimed Contributions

Programmable tile-level abstractions for neural kernel development

10 retrieved papers

The authors introduce a tile-level programming model that provides explicit primitives for memory placement, data movement, and parallel scheduling. Unlike existing compilers that rely on opaque optimization passes, TileLang gives developers direct control over hardware resources through user-visible intrinsics for buffer allocation, data transfer orchestration, custom memory layouts, and parallelism strategies.
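To make the idea of explicit tile-level control concrete, the following is a minimal pure-Python sketch of a tiled matrix multiply in which the staging buffers, the copies into them, and the tile loop order are all written out by hand. It does not use TileLang's actual API; the names `copy_tile` and `tiled_matmul` and the default block sizes are hypothetical and chosen only for illustration.

```python
def copy_tile(src, row0, col0, rows, cols):
    """Explicit data movement: copy a rows x cols tile out of src, zero-padding past the edges."""
    n_rows, n_cols = len(src), len(src[0])
    return [
        [src[r][c] if r < n_rows and c < n_cols else 0.0
         for c in range(col0, col0 + cols)]
        for r in range(row0, row0 + rows)
    ]


def tiled_matmul(a, b, block_m=2, block_n=2, block_k=2):
    """Compute a @ b one output tile at a time, with hand-placed staging buffers."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    # The two outer loops stand in for a parallel grid of thread blocks.
    for m0 in range(0, m, block_m):
        for n0 in range(0, n, block_n):
            # Accumulator tile: stands in for a register-resident fragment.
            acc = [[0.0] * block_n for _ in range(block_m)]
            for k0 in range(0, k, block_k):
                # Staging tiles: stand in for shared-memory buffers.
                a_tile = copy_tile(a, m0, k0, block_m, block_k)
                b_tile = copy_tile(b, k0, n0, block_k, block_n)
                for i in range(block_m):
                    for j in range(block_n):
                        for p in range(block_k):
                            acc[i][j] += a_tile[i][p] * b_tile[p][j]
            # Write the finished tile back, skipping padded rows and columns.
            for i in range(min(block_m, m - m0)):
                for j in range(min(block_n, n - n0)):
                    out[m0 + i][n0 + j] = acc[i][j]
    return out
```

In this sketch the programmer, not an optimization pass, decides where each tile lives and when it is copied, which is the kind of control the contribution describes; a real GPU kernel would map these buffers to shared memory and registers.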

Unified fused tile-level dataflow graph (FTG) representation

Can Refute

10 retrieved papers

The system represents tile-level programs as a unified FTG that captures dataflow and tiling structure, where nodes represent tile operators and edges encode data dependencies. This graph-based representation enables systematic analysis and transformation at tile granularity, supporting both tile recommendation and tile inference optimization techniques.
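As a rough illustration of a graph whose nodes are tile operators and whose edges are data dependencies, the sketch below builds such a graph for an attention-like kernel and derives a valid execution order from it. This is not the paper's FTG data structure; `TileOp`, `dependency_map`, `schedule`, `consumers`, and the operator names are hypothetical, and the ordering uses Python's standard `graphlib.TopologicalSorter`.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class TileOp:
    """One graph node: a tile operator plus the names of the operators it reads from."""
    name: str
    kind: str  # illustrative labels such as "copy", "gemm", "softmax"
    inputs: list = field(default_factory=list)


def dependency_map(ops):
    """Encode edges as a mapping from each operator to the set of its producers."""
    return {op.name: set(op.inputs) for op in ops}


def schedule(ops):
    """Return one execution order in which every producer precedes its consumers."""
    return list(TopologicalSorter(dependency_map(ops)).static_order())


def consumers(ops, name):
    """List the operators that read the output of `name`."""
    return sorted(op.name for op in ops if name in op.inputs)


# Hypothetical tile operators for a fused attention kernel.
ATTENTION_TILE_OPS = [
    TileOp("load_q", "copy"),
    TileOp("load_k", "copy"),
    TileOp("load_v", "copy"),
    TileOp("scores", "gemm", ["load_q", "load_k"]),
    TileOp("probs", "softmax", ["scores"]),
    TileOp("output", "gemm", ["probs", "load_v"]),
]
```

Calling `schedule(ATTENTION_TILE_OPS)` yields an order that respects every edge, and `consumers` answers the kind of question an analysis pass would ask when deciding which intermediate tiles can stay on chip.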

Tile recommendation and inference framework

0 retrieved papers

The authors develop a two-stage optimization workflow: tile recommendation analyzes the FTG to provide hardware-aware defaults for tile shapes, memory placement, and warp partitions, while tile inference propagates constraints through the graph to automatically complete remaining configurations including memory layouts, software pipelining, and tensorization. This design blends flexible user control with automated optimization.
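The two stages can be pictured with a small sketch, under strong simplifying assumptions: recommendation is reduced to picking the largest block size whose staging tiles fit a shared-memory budget, and inference is reduced to propagating user-annotated tile dimensions across equality constraints until a fixed point is reached. Neither function is the authors' algorithm; `recommend_block`, `infer_shapes`, the candidate sizes, and the 48 KiB budget used in the usage note are all hypothetical.

```python
def recommend_block(shared_bytes, dtype_bytes=2, candidates=(32, 64, 128, 256)):
    """Pick the largest square block whose two staging tiles fit in the budget."""
    best = None
    for block in sorted(candidates):
        needed = 2 * block * block * dtype_bytes  # one A tile plus one B tile
        if needed <= shared_bytes:
            best = block
    if best is None:
        raise ValueError("no candidate block fits the shared-memory budget")
    return best


def infer_shapes(known, equalities):
    """Propagate known tile dimensions across equality constraints.

    known:      dict mapping (buffer, axis) -> size, the user's partial annotations.
    equalities: list of ((buffer, axis), (buffer, axis)) pairs that must match.
    """
    shapes = dict(known)
    changed = True
    while changed:
        changed = False
        for left, right in equalities:
            for src, dst in ((left, right), (right, left)):
                if src not in shapes:
                    continue
                if dst not in shapes:
                    shapes[dst] = shapes[src]
                    changed = True
                elif shapes[dst] != shapes[src]:
                    raise ValueError(f"conflicting sizes for {src} and {dst}")
    return shapes


# Dimension constraints for an attention-like kernel: S = Q @ K^T, P = softmax(S), O = P @ V.
ATTENTION_EQUALITIES = [
    (("Q", 0), ("S", 0)), (("K", 0), ("S", 1)), (("Q", 1), ("K", 1)),
    (("S", 0), ("P", 0)), (("S", 1), ("P", 1)),
    (("P", 1), ("V", 0)), (("P", 0), ("O", 0)), (("V", 1), ("O", 1)),
]
```

For example, with a hypothetical 48 KiB budget, `recommend_block(48 * 1024)` selects a block of 64. Annotating only `("Q", 0)`, `("Q", 1)`, `("K", 0)`, and `("V", 1)` is then enough for `infer_shapes` to fill in every remaining dimension of `S`, `P`, and `O`, which mirrors the blend of user control and automatic completion described above.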

Core Task Comparisons

Comparisons with papers in the same taxonomy category

[3] Triton: an intermediate language and compiler for tiled neural network computations

Philippe Tillet, H. T. Kung, David Cox (2019)

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Programmable tile-level abstractions for neural kernel development

[13] A Machine Learning Approach to Optimizing CNN Deployment on Tile-Based Systems-on-Chip

Cannot Refute

[19] Aero: Design space exploration framework for resource-constrained CNN mapping on tile-based accelerators

Cannot Refute

[27] Register tiling for unstructured sparsity in neural network inference

Cannot Refute

[28] TileLang: A Composable Tiled Programming Model for AI Systems

Cannot Refute

[29] TileLink: Generating efficient compute-communication overlapping kernels using tile-centric primitives

Cannot Refute

[30] Tile-based architecture exploration for convolutional accelerators in deep neural networks

Cannot Refute

[31] Exact tile-based segmentation inference for images larger than GPU memory

Cannot Refute

[32] Training of deep learning pipelines on memory-constrained GPUs via segmented fused-tiled execution

Cannot Refute

[33] The next 700 accelerated layers: From mathematical expressions of network computation graphs to accelerated GPU kernels, automatically

Cannot Refute

[34] UMDAM: A Unified Data Layout and DRAM Address Mapping for Heterogenous NPU-PIM

Cannot Refute

Contribution

Unified fused tile-level dataflow graph (FTG) representation

[37] Welder: Scheduling deep learning memory access via tile-graph

Can Refute

[35] Sparsepipe: Sparse Inter-operator Dataflow Architecture with Cross-Iteration Reuse

Cannot Refute

[36] Optimizing OpenVX graphs for data movement

Cannot Refute

[38] Leda: Leveraging Tiling Dataflow to Accelerate SpMM on HBM-Equipped FPGAs for GNNs

Cannot Refute

[39] Rethinking Tiling and Dataflow for SpMM Acceleration: A Graph Transformation Framework

Cannot Refute

[40] PIMapping: A tile-level dataflow optimization framework for PIM-architecture

Cannot Refute

[41] Kitsune: Enabling Dataflow Execution on GPUs with Spatial Pipelines

Cannot Refute

[42] A Unified Synthesis Framework for Dataflow Accelerators Through Multi-level Software and Hardware Intermediate Representations

Cannot Refute

[43] FlatAttention: Dataflow and Fabric Collectives Co-Optimization for Efficient Multi-Head Attention on Tile-Based Many-PE Accelerators

Cannot Refute

[44] Stateful Dataflow Multigraphs: A Data-Centric Model for Performance Portability on Heterogeneous Architectures

Cannot Refute

Contribution

Tile recommendation and inference framework

0 retrieved papers
