TileLang: Bridge Programmability and Performance in Modern Neural Kernels

ICLR 2026 Conference Submission
Anonymous Authors
compiler; AI; programming model
Abstract:

Modern AI algorithms increasingly adopt fused kernels for performance, but implementing them remains complex due to the lack of fine-grained control in existing compilers like Triton. We introduce TileLang, a controllable programming system for fused neural kernels. TileLang provides explicit tile-level primitives for memory placement, data movement, and parallel scheduling. To guide developers in hardware-aware programming, TileLang introduces two key techniques: tile inference, which models tile programs as fused graphs and automatically deduces tile configurations from partial annotations, and tile recommendation, which suggests efficient tile configurations based on hardware profiles and heuristics. TileLang makes it easy to express a wide range of fused attention kernels in under 80 lines of Python code, reducing code size by up to 90% compared to manual implementations. Evaluations show that TileLang achieves up to 5x speedup over Triton on NVIDIA H100 and up to 6x on AMD GPUs, demonstrating its ability to bridge programmability and performance.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces TileLang, a programmable tile-level system providing explicit primitives for memory placement, data movement, and parallel scheduling in fused neural kernels. Within the taxonomy, it resides in the 'Explicit Tile-Level Programming Languages' leaf, which contains only two papers total. This sparse population suggests the research direction—domain-specific languages offering fine-grained tile control—remains relatively underexplored compared to broader hardware architecture branches. The sibling paper (Triton Compiler) shares the goal of GPU tile programming but emphasizes ease of use over explicit control, indicating TileLang occupies a distinct niche prioritizing programmability with hardware awareness.

The taxonomy reveals TileLang sits adjacent to 'Intermediate Compilation Frameworks and Overlays,' which abstract hardware details through compiler infrastructures rather than exposing tile primitives. Neighboring branches include 'Coarse-Grained Reconfigurable Arrays' (3 papers) and 'Reconfigurable Processing Elements' (3 papers), focusing on physical architectures rather than programming abstractions. The 'Optimization Techniques' branch addresses scheduling and fusion strategies but assumes existing programming models. TileLang's explicit tile-level primitives distinguish it from compiler-only frameworks while its programmability separates it from hardware-centric designs, positioning it at the intersection of abstraction and control.

Among 20 candidates examined across three contributions, the 'Unified fused tile-level dataflow graph (FTG) representation' shows one refutable candidate from 10 examined, suggesting some overlap in graph-based modeling approaches. The 'Programmable tile-level abstractions' contribution examined 10 candidates with zero refutations, indicating potential novelty in the explicit primitive design. The 'Tile recommendation and inference framework' was not evaluated against prior work in this limited search. The modest search scope (20 papers, not exhaustive) means substantial related work may exist beyond top-K semantic matches, particularly in compiler optimization or dataflow modeling domains.

Based on the limited 20-candidate search, TileLang appears to contribute novel explicit tile primitives in a sparsely populated research direction, though the FTG representation shows some prior overlap. The analysis covers top semantic matches and immediate taxonomy neighbors but does not exhaustively survey compiler frameworks, GPU programming models, or dataflow graph literature. A broader search might reveal additional related work in tensor compiler design or hardware-software co-design that was not captured in this scope.

Taxonomy

Core-task Taxonomy Papers: 26
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Paper: 1

Research Landscape Overview

Core task: programmable tile-level system for fused neural kernels. The field centers on designing systems that partition neural computations into tiles (spatial or logical blocks), enabling efficient execution of fused operations on specialized hardware. The taxonomy reveals three main branches. Tile-Level Programming Abstractions and Compilation focuses on languages and compilers that expose tile-level parallelism to programmers, exemplified by explicit programming models like Triton Compiler[3] and TileLang Bridge[0]. Hardware Architecture Design for Tile-Based Acceleration explores reconfigurable and domain-specific architectures such as coarse-grained reconfigurable arrays (CGRAs) like Opal CGRA[1] and Onyx CGRA[4], as well as specialized accelerators like Venus Accelerator[8]. Optimization Techniques for Tile-Based Systems addresses scheduling, memory management, and fusion strategies, including works on autonomous tiling (Autonomous Task Tiling[22]) and layer-centric fusion (Layer Centric Fusion[23]). These branches are tightly coupled: programming abstractions must map efficiently to hardware substrates, while optimization techniques bridge the gap between high-level code and low-level execution.

A particularly active line of work involves explicit tile-level languages that give developers fine-grained control over data movement and compute scheduling, balancing productivity with performance. TileLang Bridge[0] sits squarely in this space, offering a programming interface that abstracts tile-level operations while remaining close to hardware semantics. It shares conceptual ground with Triton Compiler[3], which similarly targets GPU tile programming but emphasizes ease of use for kernel fusion. In contrast, hardware-centric approaches like Opal CGRA[1] and Reconfigurable Hardware Acceleration[5] prioritize architectural flexibility and energy efficiency, often requiring more specialized compilation flows.

The tension between programmer-friendly abstractions and hardware-specific optimizations remains a central theme: some works lean toward domain-specific languages with rich semantics, while others favor lower-level primitives that expose more control. TileLang Bridge[0] appears to navigate this trade-off by providing a structured yet expressive tile abstraction, positioning itself as a bridge between high-level neural frameworks and tile-based execution models.

Claimed Contributions

Programmable tile-level abstractions for neural kernel development

The authors introduce a tile-level programming model that provides explicit primitives for memory placement, data movement, and parallel scheduling. Unlike existing compilers that rely on opaque optimization passes, TileLang gives developers direct control over hardware resources through user-visible intrinsics for buffer allocation, data transfer orchestration, custom memory layouts, and parallelism strategies.

10 retrieved papers
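
To make the three kinds of primitive named in this contribution concrete, the following is a minimal pure-Python sketch of a tiled matrix multiply. It is not TileLang code and does not use the TileLang API. The function name `tiled_matmul`, the `block_*` parameters, and the use of Python lists as stand-ins for shared memory and register fragments are all assumptions made for illustration.

```python
def tiled_matmul(a, b, block_m=2, block_n=2, block_k=2):
    """Multiply two list-of-lists matrices one tile at a time."""
    m, k, n = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    c = [[0.0] * n for _ in range(m)]

    # Parallel scheduling: each (i0, j0) pair is one independent output tile.
    # On a GPU these iterations would map to thread blocks.
    for i0 in range(0, m, block_m):
        for j0 in range(0, n, block_n):
            i1, j1 = min(i0 + block_m, m), min(j0 + block_n, n)

            # Memory placement: the accumulator stands in for a register fragment.
            acc = [[0.0] * (j1 - j0) for _ in range(i1 - i0)]

            for k0 in range(0, k, block_k):
                k1 = min(k0 + block_k, k)

                # Data movement: explicit copies into stand-ins for shared memory.
                a_tile = [row[k0:k1] for row in a[i0:i1]]
                b_tile = [row[j0:j1] for row in b[k0:k1]]

                # Tile-level compute on the staged operands.
                for i, a_row in enumerate(a_tile):
                    for kk, a_val in enumerate(a_row):
                        for j, b_val in enumerate(b_tile[kk]):
                            acc[i][j] += a_val * b_val

            # Write the finished tile back to the output matrix.
            for i, acc_row in enumerate(acc):
                c[i0 + i][j0:j1] = acc_row
    return c
```

Changing `block_m`, `block_n`, or `block_k` changes how much data is staged per step without changing the result, which is the kind of decision an explicit tile-level model leaves to the developer.
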
Unified fused tile-level dataflow graph (FTG) representation

The system represents tile-level programs as a unified FTG that captures dataflow and tiling structure, where nodes represent tile operators and edges encode data dependencies. This graph-based representation enables systematic analysis and transformation at tile granularity, supporting both tile recommendation and tile inference optimization techniques.

10 retrieved papers
Can Refute
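
The graph described in this contribution, with tile operators as nodes and data dependencies as edges, can be sketched as a small data structure. This is an illustrative assumption, not the paper's implementation: the class names `TileOp` and `FusedTileGraph`, the operator kinds, and the attention-style example are hypothetical.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Optional


@dataclass
class TileOp:
    name: str
    kind: str                            # hypothetical kinds: "copy", "gemm", "reduce"
    tile_shape: Optional[tuple] = None   # None means "not annotated yet"


@dataclass
class FusedTileGraph:
    ops: dict = field(default_factory=dict)        # op name -> TileOp
    producers: dict = field(default_factory=dict)  # op name -> set of producer names

    def add_op(self, op, inputs=()):
        """Add a node; each entry in `inputs` becomes a data-dependency edge."""
        for src in inputs:
            if src not in self.ops:
                raise KeyError(f"unknown producer: {src}")
        self.ops[op.name] = op
        self.producers[op.name] = set(inputs)

    def execution_order(self):
        """Return one valid order in which producers precede consumers."""
        return list(TopologicalSorter(self.producers).static_order())


def build_attention_graph():
    """Hypothetical fused-attention example with one partial annotation."""
    g = FusedTileGraph()
    g.add_op(TileOp("load_q", "copy", (64, 64)))
    g.add_op(TileOp("load_k", "copy"))
    g.add_op(TileOp("load_v", "copy"))
    g.add_op(TileOp("qk_gemm", "gemm"), inputs=["load_q", "load_k"])
    g.add_op(TileOp("softmax", "reduce"), inputs=["qk_gemm"])
    g.add_op(TileOp("pv_gemm", "gemm"), inputs=["softmax", "load_v"])
    return g
```

Only `load_q` carries a tile shape here. The remaining nodes are left unannotated, which is the situation that graph-level analysis is meant to resolve.
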
Tile recommendation and inference framework

The authors develop a two-stage optimization workflow: tile recommendation analyzes the FTG to provide hardware-aware defaults for tile shapes, memory placement, and warp partitions, while tile inference propagates constraints through the graph to automatically complete remaining configurations including memory layouts, software pipelining, and tensorization. This design blends flexible user control with automated optimization.

0 retrieved papers
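
The two stages can be illustrated with a deliberately simplified sketch. Both functions are assumptions for illustration and are not the paper's algorithms. `recommend_gemm_tile` uses a hypothetical heuristic (maximize arithmetic intensity within a shared-memory budget), and `infer_tiles` uses a single hypothetical constraint (the producer and consumer of an edge share one tile shape).

```python
from itertools import product


def recommend_gemm_tile(smem_bytes, dtype_bytes=2, candidates=(16, 32, 64, 128)):
    """Pick a (block_m, block_n, block_k) that fits the shared-memory budget.

    Score is FLOPs per staged byte; ties are broken in favor of larger block_k.
    Returns None if no candidate fits.
    """
    best, best_score = None, (-1.0, 0)
    for bm, bn, bk in product(candidates, repeat=3):
        staged = (bm * bk + bk * bn) * dtype_bytes  # A tile plus B tile
        if staged > smem_bytes:
            continue
        score = ((2 * bm * bn * bk) / staged, bk)
        if score > best_score:
            best, best_score = (bm, bn, bk), score
    return best


def infer_tiles(edges, annotated):
    """Propagate tile shapes along edges until nothing changes.

    `edges` is a list of (producer, consumer) names; `annotated` maps some
    names to tile shapes. Raises ValueError on conflicting annotations.
    """
    shapes = dict(annotated)
    changed = True
    while changed:
        changed = False
        for src, dst in edges:
            for known, unknown in ((src, dst), (dst, src)):
                if known in shapes and unknown not in shapes:
                    shapes[unknown] = shapes[known]
                    changed = True
            if shapes.get(src) != shapes.get(dst):
                raise ValueError(f"conflicting tile shapes on edge {src} -> {dst}")
    return shapes
```

As a usage example under these assumptions, `recommend_gemm_tile(48 * 1024)` supplies a default for one operator, and `infer_tiles([("scores", "softmax"), ("softmax", "probs")], {"softmax": (64, 64)})` fills in the two unannotated neighbors from a single user annotation.
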
