README.md

June 10, 2026

Tile Lang Benchmark (Archived, please check out TileOps)

Benchmark Summary

TileLang delivers strong performance across a variety of computational patterns. Selected results for attention, matrix multiplication, and dequantized matrix multiplication are shown below:

  • Flash Attention Performance on H100

    [Figure: operator performance on H100]
  • Matmul Performance on GPUs (RTX 4090, A100, H100, MI300X)

    [Figure: GEMM FP16 performance on GPUs]
  • Dequantize Matmul Performance on A100

    [Figure: dequantize GEMV performance on A100]

Benchmark OP Set


Table 1: Matrix shapes in our benchmark

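|   | V0 | V1 | V2 | V3 | V4 | V5 | V6 | V7 |
|---|----|----|----|----|----|----|----|----|
| m | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| n | 16384 | 43008 | 14336 | 57344 | 14336 | 9216 | 36864 | 9216 |
| k | 16384 | 14336 | 14336 | 14336 | 57344 | 9216 | 9216 | 36864 |

|   | M0 | M1 | M2 | M3 | M4 | M5 | M6 | M7 |
|---|----|----|----|----|----|----|----|----|
| m | 4096 | 4096 | 4096 | 4096 | 8192 | 8192 | 8192 | 8192 |
| n | 1024 | 8192 | 28672 | 8192 | 1024 | 8192 | 28672 | 8192 |
| k | 8192 | 8192 | 8192 | 28672 | 8192 | 8192 | 8192 | 28672 |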
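
To turn a measured latency for one of the Table 1 shapes into a throughput figure, the usual count of 2·m·n·k floating-point operations per matrix multiply can be used. The sketch below is only an illustration: the names `GEMM_SHAPES`, `gemm_flops`, and `to_tflops` are hypothetical helpers and are not part of TileLang or of this benchmark's scripts.

```python
# Shapes copied from Table 1 as name -> (m, n, k).
# V* are matrix-vector shapes (m = 1); M* are matrix-matrix shapes.
GEMM_SHAPES = {
    "V0": (1, 16384, 16384),
    "V1": (1, 43008, 14336),
    "V2": (1, 14336, 14336),
    "V3": (1, 57344, 14336),
    "V4": (1, 14336, 57344),
    "V5": (1, 9216, 9216),
    "V6": (1, 36864, 9216),
    "V7": (1, 9216, 36864),
    "M0": (4096, 1024, 8192),
    "M1": (4096, 8192, 8192),
    "M2": (4096, 28672, 8192),
    "M3": (4096, 8192, 28672),
    "M4": (8192, 1024, 8192),
    "M5": (8192, 8192, 8192),
    "M6": (8192, 28672, 8192),
    "M7": (8192, 8192, 28672),
}


def gemm_flops(m: int, n: int, k: int) -> int:
    """FLOPs for C[m, n] = A[m, k] @ B[k, n]: one multiply and one add per term."""
    return 2 * m * n * k


def to_tflops(flops: float, seconds: float) -> float:
    """Convert a FLOP count and a measured latency in seconds to TFLOP/s."""
    return flops / seconds / 1e12


if __name__ == "__main__":
    # Print the FLOP count for every shape in Table 1.
    for name, (m, n, k) in GEMM_SHAPES.items():
        print(f"{name}: m={m} n={n} k={k} flops={gemm_flops(m, n, k)}")
```

For the m = 1 shapes the kernel is typically limited by memory bandwidth rather than arithmetic, so a bytes-per-second figure may be a more informative metric there than TFLOP/s.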

Table 2: FlashAttention shapes in our benchmark

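|          | FA0 | FA1 | FA2 | FA3 | FA4 |
|----------|-----|-----|-----|-----|-----|
| batch    | 1 | 1 | 1 | 1 | 1 |
| nheads   | 32 | 32 | 32 | 32 | 32 |
| seq_len  | 512 | 512 | 1024 | 1024 | 4096 |
| head_dim | 128 | 128 | 128 | 128 | 128 |
| causal   | true | false | true | false | true |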
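
The Table 2 shapes can be converted to a forward-pass FLOP count with a widely used convention: two matmuls (QKᵀ and PV) of 2·seq_len²·head_dim FLOPs each per head, halved when the causal mask skips the upper triangle. This README does not say which convention its plots use, so the sketch below is an assumption, and `AttnShape`, `FA_SHAPES`, and `attention_fwd_flops` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttnShape:
    batch: int
    nheads: int
    seq_len: int
    head_dim: int
    causal: bool


# Shapes copied from Table 2.
FA_SHAPES = {
    "FA0": AttnShape(1, 32, 512, 128, True),
    "FA1": AttnShape(1, 32, 512, 128, False),
    "FA2": AttnShape(1, 32, 1024, 128, True),
    "FA3": AttnShape(1, 32, 1024, 128, False),
    "FA4": AttnShape(1, 32, 4096, 128, True),
}


def attention_fwd_flops(s: AttnShape) -> int:
    """Forward FLOPs under the assumed convention (softmax cost ignored)."""
    # QK^T and PV each cost 2 * seq_len^2 * head_dim FLOPs per head.
    flops = 4 * s.batch * s.nheads * s.seq_len ** 2 * s.head_dim
    # Assumption: a causal mask roughly halves the useful work.
    return flops // 2 if s.causal else flops


if __name__ == "__main__":
    for name, shape in FA_SHAPES.items():
        print(f"{name}: {attention_fwd_flops(shape)} FLOPs")
```

Dividing this count by a measured latency in seconds and by 1e12 gives TFLOP/s, which makes causal and non-causal runs of the same sequence length comparable.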

Table 3: Linear Attention shapes in our benchmark

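|          | CC0 | CC1 | CC2 | CC3 | CC4 | CC5 |
|----------|-----|-----|-----|-----|-----|-----|
| batch    | 1 | 1 | 1 | 64 | 64 | 64 |
| nheads   | 64 | 64 | 64 | 64 | 64 | 64 |
| seq_len  | 1024 | 2048 | 8192 | 1024 | 2048 | 8192 |
| head_dim | 64 | 64 | 64 | 64 | 64 | 64 |
| d_state  | 128 | 128 | 128 | 128 | 128 | 128 |

|          | CT0 | CT1 | CT2 | CT3 | CT4 | CT5 |
|----------|-----|-----|-----|-----|-----|-----|
| batch    | 1 | 1 | 1 | 64 | 64 | 64 |
| nheads   | 64 | 64 | 64 | 64 | 64 | 64 |
| seq_len  | 1024 | 2048 | 8192 | 1024 | 2048 | 8192 |
| head_dim | 64 | 64 | 64 | 64 | 64 | 64 |
| d_state  | 128 | 128 | 128 | 128 | 128 | 128 |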