README.md

April 18, 2026 · View on GitHub

📚 LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners 🐑

Featured|HelloGitHub

📚 LeetCUDA: It includes Tensor/CUDA Cores, TF32/F16/BF16/F8, 📖200+ CUDA Kernels🔥 with PyTorch, 📖100+ LLM/CUDA🔥 blogs, 📖HGEMM⚡️ which can achieve 98%~100% TFLOPS of cuBLAS, and 📖flash-attn⚡️ using Tensor Cores with pure MMA PTX. ♥️ Please consider to leave a ⭐️ Star to support me, my bro ~ ♥️

🔥🔥 PR Welcome: Add Your Kernel to LeetCUDA! Let's make it Awesome together! 🎉🎉

©️Citations🎉🎉

@misc{LeetCUDA@2025,
  title={LeetCUDA: A Modern CUDA Learn Notes with PyTorch for Beginners},
  url={https://github.com/xlite-dev/LeetCUDA.git},
  note={Open-source software available at https://github.com/xlite-dev/LeetCUDA.git},
  author={DefTruth and Many Others},
  year={2025}
}

📖 News 🔥🔥

  • [2026/03] Cache-DiT 🎉v1.3.0 release is ready, the major updates including: Ring Attention w/ batched P2P, USP (Hybrid Ring and Ulysses), Hybrid 2D and 3D Parallelism (💥USP + TP), VAE-P Comm overhead reduce.

arch

  • [2026/04]: 🤖ffpa-attn is released! Yet another Faster Flash Prefill Attention with O(1)🎉SRAM complexity for large headdim, **1.8x3x↑**🎉 vs SDPA EA: 📈L20 ~1.9x↑🎉, 📈A30 ~1.8x↑🎉,📈4090 ~2.1x↑🎉. Currently, FFPA supports self-attention, cross-attention, grouped/multi-query attention, causal attention with large headdim (D=3201024). While the standard FlashAttention-2 only support headdim <= 256.
image
  • [2024/12]: ⚡️HGEMM is released! Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, achieve peak🎉 performance.

📖 Contents

📖 HGEMM Benchmark 🎉🎉

Currently, on NVIDIA L20, RTX 4090 and RTX 3080 Laptop, compared with cuBLAS's default Tensor Cores algorithm, the HGEMM (WMMA/MMA/CuTe) in this repo (blue🔵) can achieve 98%~100% of its (orange🟠) performance. Please check toy-hgemm library⚡️⚡️ or HGEMM⚡️⚡️ repo for more details.

toy-hgemm-library

📚Feature📚Feature📚Feature📚Feature
✔️CUDA/Tensor Cores✔️Loop over K✔️Tile Block(BMxBK)✔️Tile Threads(T 8x8)
✔️WMMA(m16n16k16)✔️MMA(m16n8k16)✔️Pack LDST(128 bits)✔️SMEM Padding
✔️Copy Async✔️Tile MMAs✔️Tile Warps✔️Multi Stages(2~4)
✔️Register Double Buffers✔️Block Swizzle✔️Warp Swizzle✔️SMEM Swizzle(CuTe/MMA)
✔️Collective Store(Shfl)✔️Layout NN✔️Layout TN✔️SGEMM FP32/TF32

📖 FA2-MMA Benchmark 🎉🎉

I have also implemented FlashAttention-2 using pure MMA PTX instructions, which supports features such as Multi-Stages, Tile MMA, Tile Warp, Shared KV SMEM, Fully Shared QKV SMEM, Prefetch Q s2r, Prefetch K/V g2s, QKV Fine-grained Tiling, Collective Store, etc. Please refer to flash-attn⚡️⚡️ for more details.

flash-attn-mma

📚Feature📚Feature📚Feature📚Feature
✔️Tensor Cores✔️Loop over N/D✔️Tile Block(Br, Bc)✔️MMA(m16n8k16)
✔️Pack LDST(128 bits)✔️SMEM Swizzle/Padding✔️Copy Async✔️Tile MMAs
✔️Tile Warps✔️Multi Stages(1/2)✔️Collective Store(Shfl)✔️Split KV/Q
✔️Shared QKV SMEM✔️Prefetch Q s2r✔️Prefetch KV g2s✔️QKV Fine-grained Tiling

Currently, for small-scale attention (B<=4, H <=48, SeqLen <= 8192, D <= 64) it can run faster than FA2/SDPA on some Devices. For example, on NVIDIA RTX 3080 Laptop, 📚 Split Q + Fully Shared QKV SMEM method can achieve 55 TFLOPS (D=64) that almost ~1.5x 🎉 faster than FA2. On NVIDIA L20, 🤖ffpa-attn method can achieve 104 TFLOPS (D=512) that almost ~1.8x 🎉 faster than SDPA (EFFICIENT ATTENTION). However, for large-scale attention, there remains a performance gap. Stay tuned for updates ~ (MMA Acc F16/F32, softmax Acc F32 vs FA2 MMA/softmax Acc F32, 👇Benchmark)

Algorithm(B,H,N,D)RTX 3080 LaptopL20RTX 4090
FlashAttention-2(1,8,8192,64)37 TFLOPS100 TFLOPS145 TFLOPS
share-qkv+stage2(1,8,8192,64)55 TFLOPS99 TFLOPS221 TFLOPS
FlashAttention-2(1,48,8192,64)37 TFLOPS109 TFLOPS163 TFLOPS
share-qkv+stage2(1,48,8192,64)48 TFLOPS107 TFLOPS224 TFLOPS
SDPA(EFFICIENT ATTENTION)(1,48,8192,512)16 TFLOPS58 TFLOPS85 TFLOPS
🤖ffpa-attn(1,48,8192,512)39 TFLOPS104 TFLOPS200 TFLOPS
Precision Errors vs FA2/SDPA/max: < ~1e-3min: ~0.0mean: < ~1e-5

The Split KV and Split Q implementations have been carried out in flash-attn⚡️⚡️ for performance comparison. The Split KV method, which involves splitting all QKV across MMA (Warps), is slower than Split Q method, which splitting Q across MMA(Warps) and keep access KV for all MMA(Warps).

  • 📚 Split KV (Basic, FlashAttention-1)
// Split QKV across MMA(Warps) using naive matmul MMA&Warp tiling policy.
// case: The layout of 8 MMA(2x4)  [after] kWarpTileSeqLenQxkWarpTileSeqLenK(2x2) -> 32x2,32x2=64x64:
// |  [64,64]  |    warp_KV 0    |    warp_KV 1    |    warp_KV 2    |    warp_KV 3    |
// | warp_QP 0 |-- MMA 0,MMA 0 --|-- MMA 2,MMA 2 --|-- MMA 4,MMA 4 --|-- MMA 6,MMA 6 --|
// | warp_QP 0 |-- MMA 0,MMA 0 --|-- MMA 2,MMA 2 --|-- MMA 4,MMA 4 --|-- MMA 6,MMA 6 --|
// | warp_QP 1 |-- MMA 1,MMA 1 --|-- MMA 3,MMA 2 --|-- MMA 5,MMA 5 --|-- MMA 7,MMA 7 --|
// | warp_QP 1 |-- MMA 1,MMA 1 --|-- MMA 3,MMA 2 --|-- MMA 5,MMA 5 --|-- MMA 7,MMA 7 --|
__global__ void // Q, K, V, O -> [B, H, N, D]
flash_attn_mma_stages_split_kv_kernel(half* Q, half* K, half* V, half* O, ...);
  • 📚 Split Q (Faster, FlashAttention-2)
// Split Q across MMA(Warps) and keep access KV for all MMA(Warps),
// in order to reduce the comm between warps via smem and warp shuffle.
// case: MMA = m16n8k16, Br=16x4=64, Bc=8x8=64, layout: 4 warps
// |   64x64   |      warp_KV 0       |
// | warp_QP 0 | MMA 0 ... MMA 0 (x8) |
// | warp_QP 1 | MMA 1 ... MMA 1 (x8) |
// | warp_QP 2 | MMA 2 ... MMA 2 (x8) |
// | warp_QP 3 | MMA 3 ... MMA 3 (x8) |
__global__ void // Q, K, V, O -> [B, H, N, D]
flash_attn_mma_stages_split_q_kernel(half* Q, half* K, half* V, half* O, ...);
  • 📚 Split Q + Shared KV SMEM (1/2 SRAM vs FA2)
// K, V shared the same shared memory, improve block occupancy.
__global__ void // Q, K, V, O -> [B, H, N, D]
flash_attn_mma_stages_split_q_shared_kv_kernel(half* Q, half* K, half* V, half* O, ...);
  • 📚 Split Q + Fully Shared QKV SMEM (1/4 SRAM vs FA2)
// Q, K, V fully shared the same shared memory and prefetch Q s2r, improve block occupancy
// and reduce Q SMEM IO-Access.
__global__ void // Q, K, V, O -> [B, H, N, D]
flash_attn_mma_stages_split_q_shared_qkv_kernel(half* Q, half* K, half* V, half* O, ...);
  • 📚 Split Q + QK Fine-grained Tiling (O(16xd) SRAM vs FA2 O(4xBrxd) SRAM, Headdim -> 1024)
// Fine-grained tiling at the MMA level for Q@K^T results in a constant SRAM usage of
// 64 * kMmaAtomK for Q and K. For V, the SRAM complexity is O(kMmaAtomK * d), leading to
// an overall SRAM complexity of O(kMmaAtomK * d). Consequently, this approach allows us to
// extend D (head dimension) up to 1024.
__global__ void // Q, K, V, O -> [B, H, N, D]
flash_attn_mma_stages_split_q_tiling_qk_kernel(half* Q, half* K, half* V, half* O, ...);
  • 📚 Split Q + Fully QKV Fine-grained Tiling (O(2xBrx16)~O(1) SRAM vs FA2 O(4xBrxd) SRAM)
// Fine-grained tiling at the MMA level for all Q@K^T and P@V results in a constant SRAM usage of
// Br * 16 or Bc * 16 for Q, K, V, leading to an overall SRAM complexity of O(Br * 16). Consequently,
// this approach allows us to run faster than SDPA w or w/o MMA Acc F32.
__global__ void // Q, K, V, O -> [B, H, N, D]
flash_attn_mma_stages_split_q_tiling_qkv_kernel(half* Q, half* K, half* V, half* O, ...);

💡NOTE: 📚Split Q + Fully QKV Fine-grained Tiling has been refactored into 🤖ffpa-attn.

📖 200+ CUDA Kernels 🔥🔥 (Easy -> Hard++) (©️back👆🏻)

The kernels listed here will guide you through a step-by-step progression, ranging from easy to very challenging topics. The workflow for each topic will be as follows: custom CUDA kernel implementation -> PyTorch Python bindings -> Run tests. 👉TIPS: * = Tensor Cores (WMMA, MMA, CuTe), otherwise, CUDA Cores; / = not supported; ✔️ = supported; = TODO. Contents are listed as follows:

📚 Easy and 📚 Medium sections cover operations such as element-wise, mat_trans, warp/block reduce, nms, relu, gelu, swish, layer-norm, rms-norm, online-softmax, dot-prod, embedding and basic usage for FP32, FP16, BF16 and FP8 . 📚 Hard, 📚 Hard+ and 📚 Hard++ sections delve deeper into advanced topics, primarily focusing on operations like sgemv, sgemm, hgemv, hgemm and flash-attention. These sections also provide numerous kernels implemented using Tensor Cores with pure MMA PTX.

📚 Easy ⭐️ & Medium ⭐️⭐️ (©️back👆🏻)

📖 CUDA Kernel📖 Elem DType📖 Acc DType📖 Docs📖 Level
✔️ elementwise_f32f32/link⭐️
✔️ elementwise_f32x4f32/link⭐️
✔️ elementwise_f16f16/link⭐️
✔️ elementwise_f16x2f16/link⭐️
✔️ elementwise_f16x8f16/link⭐️
✔️ elementwise_f16x8_packf16/link⭐️⭐️
✔️ histogram_i32i32/link⭐️
✔️ histogram_i32x4i32/link⭐️
✔️ sigmoid_f32f32/link⭐️
✔️ sigmoid_f32x4f32/link⭐️
✔️ sigmoid_f1616/link⭐️
✔️ sigmoid_f16x2f16/link⭐️
✔️ sigmoid_f16x8f16/link⭐️
✔️ sigmoid_f16x8_packf16/link⭐️⭐️
✔️ relu_f32f32/link⭐️
✔️ relu_f32x4f32/link⭐️
✔️ relu_f16f16/link⭐️
✔️ relu_f16x2f16/link⭐️
✔️ relu_f16x8f16/link⭐️
✔️ relu_f16x8_packf16/link⭐️⭐️
✔️ elu_f32f32/link⭐️
✔️ elu_f32x4f32/link⭐️
✔️ elu_f16f16/link⭐️
✔️ elu_f16x2f16/link⭐️
✔️ elu_f16x8f16/link⭐️
✔️ elu_f16x8_packf16/link⭐️⭐️
✔️ gelu_f32f32/link⭐️
✔️ gelu_f32x4f32/link⭐️
✔️ gelu_f16f16/link⭐️
✔️ gelu_f16x2f16/link⭐️
✔️ gelu_f16x8f16/link⭐️
✔️ gelu_f16x8_packf16/link⭐️⭐️
✔️ swish_f32f32/link⭐️
✔️ swish_f32x4f32/link⭐️
✔️ swish_f16f16/link⭐️
✔️ swish_f16x2f16/link⭐️
✔️ swish_f16x8f16/link⭐️
✔️ swish_f16x8_packf16/link⭐️⭐️
✔️ hardswish_f32f32/link⭐️
✔️ hardswish_f32x4f32/link⭐️
✔️ hardswish_f16f16/link⭐️
✔️ hardswish_f16x2f16/link⭐️
✔️ hardswish_f16x8f16/link⭐️
✔️ hardswish_f16x8_packf16/link⭐️⭐️
✔️ hardshrink_f32f32/link⭐️
✔️ hardshrink_f32x4f32/link⭐️
✔️ hardshrink_f16f16/link⭐️
✔️ hardshrink_f16x2f16/link⭐️
✔️ hardshrink_f16x8f16/link⭐️
✔️ hardshrink_f16x8_packf16/link⭐️⭐️
✔️ embedding_f32f32/link⭐️
✔️ embedding_f32x4f32/link⭐️
✔️ embedding_f32x4_packf32/link⭐️
✔️ embedding_f16f16/link⭐️
✔️ embedding_f16x2f16/link⭐️
✔️ embedding_f16x8f16/link⭐️
✔️ embedding_f16x8_packf16/link⭐️⭐️
✔️ mat_trans_f32_col2row{2d}f32/link⭐️
✔️ mat_trans_f32_row2col{2d}f32/link⭐️
✔️ mat_trans_f32_diagonal2df32/link⭐️⭐️
✔️ mat_trans_f32x4_col2row{2d}f32/link⭐️⭐️
✔️ mat_trans_f32x4_row2col{2d}f32/link⭐️⭐️
✔️ mat_trans_cutef32/link⭐️⭐️
✔️ warp_reduce_{all}allalllink⭐️⭐️
✔️ block_all_reduce_f32_f32f32f32link⭐️⭐️
✔️ block_all_reduce_f32x4_f32f32f32link⭐️⭐️
✔️ block_all_reduce_f16_f16f16f16link⭐️⭐️
✔️ block_all_reduce_f16_f32f16f32link⭐️⭐️
✔️ block_all_reduce_f16x2_f16f16f16link⭐️⭐️
✔️ block_all_reduce_f16x2_f32f16f32link⭐️⭐️
✔️ block_all_reduce_f16x8_pack_f16f16f16link⭐️⭐️
✔️ block_all_reduce_f16x8_pack_f32f16f32link⭐️⭐️
✔️ block_all_reduce_bf16_bf16bf16bf16link⭐️⭐️
✔️ block_all_reduce_bf16_f32bf16f32link⭐️⭐️
✔️ block_all_reduce_bf16x2_bf16bf16bf16link⭐️⭐️
✔️ block_all_reduce_bf16x2_f32bf16f32link⭐️⭐️
✔️ block_all_reduce_bf16x8_pack_bf16bf16bf16link⭐️⭐️
✔️ block_all_reduce_bf16x8_pack_f32bf16f32link⭐️⭐️
✔️ block_all_reduce_fp8_e4m3_f16fp8_e4m3f16link⭐️⭐️⭐️
✔️ block_all_reduce_fp8_e5m2_f16fp8_e5m2f16link⭐️⭐️⭐️
✔️ block_all_reduce_fp8_e4m3x16_pack_f16fp8_e4m3f16link⭐️⭐️⭐️
✔️ block_all_reduce_fp8_e5m2x16_pack_f16fp8_e5m2f16link⭐️⭐️⭐️
✔️ block_all_reduce_i8_i32i8i32link⭐️⭐️
✔️ block_all_reduce_i8x16_pack_i32i8i32link⭐️⭐️
✔️ dot_product_f32f32f32link⭐️⭐️
✔️ dot_product_f32x4f32f32link⭐️⭐️
✔️ dot_product_f16_f32f16f32link⭐️⭐️
✔️ dot_product_f16x2_f32f16f32link⭐️⭐️
✔️ dot_product_f16x8_pack_f32f16f32link⭐️⭐️
✔️ softmax_f32_per_tokf32f32link⭐️⭐️
✔️ softmax_f32x4_per_tokf32f32link⭐️⭐️
✔️ safe_softmax_f32_per_tokf32f32link⭐️⭐️
✔️ safe_softmax_f32x4_per_tokf32f32link⭐️⭐️
✔️ safe_softmax_f16_f32_per_tokf16f32link⭐️⭐️
✔️ safe_softmax_f16x2_f32_per_tokf16f32link⭐️⭐️
✔️ safe_softmax_f16x8_pack_f32_per_tokf16f32link⭐️⭐️
✔️ online_safe_softmax_f32_per_tokenf32f32link⭐️⭐️
✔️ online_safe_softmax_f32x4_pack_per_tokf32f32link⭐️⭐️
✔️ rope_f32f32f32link⭐️⭐️
✔️ rope_f32x4_packf32f32link⭐️⭐️
✔️ layer_norm_f32f32f32link⭐️⭐️
✔️ layer_norm_f32x4f32f32link⭐️⭐️
✔️ layer_norm_f16_f16f16f16link⭐️⭐️
✔️ layer_norm_f16x2_f16f16f16link⭐️⭐️
✔️ layer_norm_f16x8_f16f16f16link⭐️⭐️
✔️ layer_norm_f16x8_pack_f16f16f16link⭐️⭐️
✔️ layer_norm_f16x8_pack_f32f16f32link⭐️⭐️
✔️ layer_norm_f16_f32f16f32link⭐️⭐️
✔️ rms_norm_f32f32f32link⭐️⭐️
✔️ rms_norm_f32x4f32f32link⭐️⭐️
✔️ rms_norm_f16_f16f16f16link⭐️⭐️
✔️ rms_norm_f16x2_f16f16f16link⭐️⭐️
✔️ rms_norm_f16x8_f16f16f16link⭐️⭐️
✔️ rms_norm_f16x8_f32f16f32link⭐️⭐️
✔️ rms_norm_f16x8_pack_f16f16f16link⭐️⭐️
✔️ rms_norm_f16x8_pack_f32f16f32link⭐️⭐️
✔️ rms_norm_f16_f32f16f32link⭐️⭐️
✔️ nms_f32f32/link⭐️⭐️
✔️ merge_attn_statesf16/bf16/f32f32link⭐️⭐️
✔️ notes v1(deprecated)f32f32/⭐️⭐️
✔️ How to use nsys/ncu(timeline/ptx/sass)//link⭐️⭐️

📚 Hard ⭐⭐⭐️ (©️back👆🏻)

📖 CUDA Kernel📖 Elem DType📖 Acc DType📖 Docs📖 Level
✔️ sgemv_k32_f32f32f32link⭐️⭐️⭐️
✔️ sgemv_k128_f32x4f32f32link⭐️⭐️⭐️
✔️ sgemv_k16_f32f32f32link⭐️⭐️⭐️
✔️ hgemv_k32_f16f16f16link⭐️⭐️⭐️
✔️ hgemv_k128_f16x4f16f16link⭐️⭐️⭐️
✔️ hgemv_k16_f16f16f16link⭐️⭐️⭐️
✔️ sgemm_naive_f32f32f32link⭐️⭐️
✔️ sgemm_sliced_k_f32f32f32link⭐️⭐️⭐️
✔️ sgemm_t_8x8_sliced_k_f32x4f32f32link⭐️⭐️⭐️
✔️ sgemm_t_8x8_sliced_k...bcff32f32link⭐️⭐️⭐️
✔️ sgemm_t_8x8_sliced_k...dbuff32f32link⭐️⭐️⭐️
✔️ sgemm_t_8x8_sliced_k16...dbuff32f32link⭐️⭐️⭐️
✔️ sgemm_t_8x8_sliced_k16...asyncf32f32link⭐️⭐️⭐️
✔️ sgemm_wmma_m16n16k8...stages*tf32f32link⭐️⭐️⭐️
✔️ sgemm_wmma_m16n16k8...swizzle*tf32f32link⭐️⭐️⭐️
✔️ hgemm_naive_f16f16f16link⭐️⭐️
✔️ hgemm_sliced_k_f16f16f16link⭐️⭐️⭐️
✔️ hgemm_t_8x8_sliced_k_f16x4f16f16link⭐️⭐️⭐️
✔️ hgemm_t_8x8_sliced_k_f16x4_packf16f16link⭐️⭐️⭐️
✔️ hgemm_t_8x8_sliced_k_f16x8_packf16f16link⭐️⭐️⭐️
✔️ hgemm_t_8x8_sliced_k...dbuff16f16link⭐️⭐️⭐️
✔️ hgemm_t_8/16x8...k16/32...dbuff16f16link⭐️⭐️⭐️
✔️ hgemm_t_8/16x8...k16/32...asyncf16f16link⭐️⭐️⭐️
✔️ hgemm_wmma_m16n16k16...naive*f16f16link⭐️⭐️⭐️
✔️ hgemm_wmma_m16n16k16...mma4x2*f16f16link⭐️⭐️⭐️
✔️ hgemm_wmma_m16n16k16...mma4x4*f16f16link⭐️⭐️⭐️
✔️ hgemm_wmma_m16n16k16...dbuf*f16f16link⭐️⭐️⭐️
✔️ hgemm_wmma_m32n8k16....dbuf*f16f16link⭐️⭐️⭐️
✔️ hgemm_wmma_m16n16k16...stages*f16f16link⭐️⭐️⭐️
✔️ hgemm_wmma_m16n16k16...swizzle*f16f16link⭐️⭐️⭐️
✔️ hgemm_mma_m16n8k16...naive*f16f16link⭐️⭐️⭐️
✔️ hgemm_mma_m16n8k16...mma2x4*f16f16link⭐️⭐️⭐️
✔️ hgemm_mma_m16n8k16...stages*f16f16link⭐️⭐️⭐️
✔️ hgemm_mma_m16n8k16...swizzle*f16f16link⭐️⭐️⭐️
✔️ hgemm_mma_m16n8k16...swizzle{smem}*f16f16link⭐️⭐️⭐️
✔️ hgemm_mma_m16n8k16...swizzle{tn}{smem}*f16f16link⭐️⭐️⭐️
✔️ hgemm_mma_stages_swizzle{smem}...cute*f16f16link⭐️⭐️⭐️
✔️ hgemm_mma_cublas*f16f16link⭐️⭐️
✔️ hgemm_wgmma_m64n128k16...tma{ws}{tn}*f16f16link⭐️⭐️⭐️
✔️ hgemm_wgmma_m64n128k16_fp32...tma*f16f32link⭐️⭐️⭐️

📚 Hard+ ⭐️⭐️⭐️⭐️ & Hard++ ⭐️⭐️⭐️⭐️⭐️ (©️back👆🏻)

  • 📚 FlashAttention-2 MMA (MMA Acc F32/F16, swizzle, QKV smem share, fine-grained tiling, etc.🎉)
📖 CUDA Kernel📖 Elem DType📖 Acc DType📖 Docs📖 Level
✔️ flash_attn_cute(naive)f16f32link⭐️⭐️⭐️
✔️ How to implement MMA smem swizzle*f16f16link⭐️⭐️⭐️
✔️ flash_attn_mma_stages_split_kv*f16f16link⭐️⭐️⭐️⭐️
✔️ flash_attn_mma_stages_split_q*f16f16link⭐️⭐️⭐️⭐️
✔️ flash_attn_mma_stages...shared_kv*f16f16link⭐️⭐️⭐️⭐️
✔️ flash_attn_mma_stages...shared_qkv*f16f16link⭐️⭐️⭐️⭐️
✔️ flash_attn_mma_stages...tiling_qk*f16f16link⭐️⭐️⭐️⭐️
✔️ flash_attn_mma_stages...tiling_qkv*f16f16link⭐️⭐️⭐️⭐️
✔️ flash_attn_mma_stages...shared_kv{f32}*f16f32link⭐️⭐️⭐️⭐️
✔️ flash_attn_mma_stages...shared_qkv{f32}*f16f32link⭐️⭐️⭐️⭐️
✔️ flash_attn_mma_stages...tiling_qk{f32}*f16f32link⭐️⭐️⭐️⭐️
✔️ flash_attn_mma_stages...tiling_qkv{f32}*f16f32link⭐️⭐️⭐️⭐️
✔️ flash_attn_mma...shared_kv{f32}{rr}*f16f32link⭐️⭐️⭐️⭐️
✔️ flash_attn_mma...shared_qkv{f32}{rr}*f16f32link⭐️⭐️⭐️⭐️
✔️ flash_attn_mma...shared_kv_swizzle{q}*f16f16link⭐️⭐️⭐️⭐️
✔️ flash_attn_mma...shared_kv_swizzle{qk}*f16f16link⭐️⭐️⭐️⭐️
✔️ flash_attn_mma...shared_kv_swizzle{qkv}*f16f16link⭐️⭐️⭐️⭐️
✔️ flash_attn_mma...shared_qkv_swizzle{q}*f16f16link⭐️⭐️⭐️⭐️
✔️ flash_attn_mma...shared_qkv_swizzle{qk}*f16f16link⭐️⭐️⭐️⭐️
✔️ flash_attn_mma...shared_qkv_swizzle{qkv}*f16f16link⭐️⭐️⭐️⭐️
✔️ flash_attn_mma...tiling_qk_swizzle{q}*f16f16link⭐️⭐️⭐️⭐️
✔️ flash_attn_mma...tiling_qk_swizzle{qk}*f16f16link⭐️⭐️⭐️⭐️
✔️ flash_attn_mma...tiling_qk_swizzle{qkv}*f16f16link⭐️⭐️⭐️⭐️
✔️ flash_attn_mma...tiling_qkv_swizzle{q}*f16f16link⭐️⭐️⭐️⭐️
✔️ flash_attn_mma...tiling_qkv_swizzle{qk}*f16f16link⭐️⭐️⭐️⭐️
✔️ flash_attn_mma...tiling_qkv_swizzle{qkv}*f16f16link⭐️⭐️⭐️⭐️
✔️ flash_attn...tiling_qkv_swizzle{q}{f32}*f16f32link⭐️⭐️⭐️⭐️
✔️ flash_attn...tiling_qkv_swizzle{qk}{f32}*f16f32link⭐️⭐️⭐️⭐️
✔️ flash_attn...tiling_qkv_swizzle{qkv}{f32}*f16f32link⭐️⭐️⭐️⭐️

💡NOTE: rr: means reduce registers usage (for d>128); f32: means MMA accumulate with FP32 dtype, otherwise, FP16. softmax Acc dtype is always be FP32 for high precision; swizzle: now, only support smem swizzle for MMA.

  • 📚 FFPA Attention MMA (1.8x~3x🎉faster vs SDPA EA, D > 256, FA2 not supported)
📖 CUDA Kernel📖 Elem DType📖 Acc DType📖 Docs📖 Level
✔️ ffpa_mma_stages_split_q_L1_F16F16F16f16f16link⭐️⭐️⭐️⭐️
✔️ ffpa_mma_stages_split_q_L1_F16F16F32f16f32link⭐️⭐️⭐️⭐️
✔️ ffpa_mma_stages_split_q_L1_mixed_accf16QK f32, PV f16link⭐️⭐️⭐️⭐️
⚠️ ffpa_mma_stages_split_q_L2_F16F16F16f16f16link⭐️⭐️⭐️⭐️
⚠️ ffpa_mma_stages_split_q_L2_F16F16F32f16f32link⭐️⭐️⭐️⭐️
⚠️ ffpa_mma_stages_split_q_L2_mixed_accf16QK f32, PV f16link⭐️⭐️⭐️⭐️
⚠️ ffpa_mma_stages_split_q_L3_F16F16F16f16f16link⭐️⭐️⭐️⭐️
⚠️ ffpa_mma_stages_split_q_L3_F16F16F32f16f32link⭐️⭐️⭐️⭐️
⚠️ ffpa_mma_stages_split_q_L3_mixed_accf16QK f32, PV f16link⭐️⭐️⭐️⭐️

💡NOTE: 🤖ffpa-attn: 📚FFPA - Yet another Faster Flash Prefill Attention with O(1)🎉SRAM complexity for headdim > 256, 1.8x~3x🎉faster than SDPA EA: 📈L20 ~1.9x↑🎉, 📈 A30 ~1.8x↑🎉, 📈3080 ~2.9x↑🎉, 📈4090 ~2.1x↑🎉.

📚 Triton Kernel (OpenAI Triton) ⭐️⭐️⭐️ (©️back👆🏻)

📖 Triton Kernel📖 Elem DType📖 Acc DType📖 Docs📖 Level
✔️ triton_vector_add_kernelallalllink⭐️⭐️
✔️ triton_fused_softmax(multi-stages)f16/bf16/f32f32link⭐️⭐️⭐️
✔️ triton_fused_layer_norm(forward-pass)f16/bf16/f32f32link⭐️⭐️⭐️
✔️ triton_fused_layer_norm(backward-pass)f16/bf16/f32f32link⭐️⭐️⭐️
✔️ triton_merge_attn_states_kernel(w/ CUDA)f16/bf16/f32f32link⭐️⭐️⭐️

📚 CUTLASS/CuTe Kernel ⭐️⭐️⭐️ (©️back👆🏻)

📖 CUTLASS/CuTe Kernel📖 Elem DType📖 Acc DType📖 Docs📖 Level
✔️ mat_transpose_cutef32/link⭐️⭐️
✔️ flash_attn_cute(naive)f16f32link⭐️⭐️⭐️
✔️ hgemv_f16_cute_kernelf16f16link⭐️⭐️⭐️
✔️ hgemv_f16x8_cute_kernelf16f16link⭐️⭐️⭐️
✔️ hgemv_tensor_core_cute_kernelf16f16link⭐️⭐️⭐️
✔️ hgemm_mma_stages_swizzle{smem}...cute*f16f16link⭐️⭐️⭐️
✔️ ws_hgemm_naive_cute_kernelf16f16link⭐️⭐️⭐️

📖 100+ 高性能计算与分布式-技术博客

📚 高性能计算与分布式-个人技术专栏 (©️back👆🏻)

📖 类型-标题📖 作者📖 推荐
[Diffusion推理]📖简短的2025年总结,写在Cache-DiT v1.2.1之际@DefTruth⭐️⭐️
[Diffusion推理]📖CacheDiT支持Z-Image分布式推理和缓存加速​​@DefTruth⭐️⭐️
[Diffusion推理]📖cache-dit支持FLUX.2分布式推理和Cache@DefTruth⭐️⭐️
[Diffusion推理]📖Cache加速-FoCa公式理解记录@DefTruth⭐️⭐️⭐
[Diffusion推理]📖cache-dit: BlockAdapter支持HunyuanImage-2.1 Cache加速!@DefTruth⭐️⭐️⭐
[Diffusion推理]📖cache-dit + Qwen-Image-Lightning 实现 3.5 steps 推理!@DefTruth⭐️⭐️⭐
[Diffusion推理]📖cache-dit: Wan2.2-MoE 2.4x 推理加速!@DefTruth⭐️⭐️⭐
[Diffusion推理]📖cache-dit: Qwen-Image-Edit 2x 无损加速!@DefTruth⭐️⭐️⭐
[Diffusion推理]📖cache-dit: Qwen-Image 1.5x 无损加速!@DefTruth⭐️⭐️⭐
[Diffusion推理]📖Cache加速-TaylorSeer算法简析@DefTruth⭐️⭐️⭐
[Diffusion推理]📖DiT推理加速综述: Caching@DefTruth⭐️⭐️⭐
[Triton编程][基础]📖Triton极简入门: Triton Vector Add@DefTruth⭐️⭐️⭐
[Triton编程][基础]📖Triton Fused Softmax Kernel详解: 从Python源码到PTX@DefTruth⭐️⭐️⭐
[Triton编程][基础]📖vLLM Triton Merge Attention States Kernel详解@DefTruth⭐️⭐️⭐
[Triton编程][进阶]📖vLLM Prefix Prefill Triton Kernel图解@DefTruth⭐️⭐️⭐️
[张量/序列并行]📖序列并行: BPT、Ring-Attention及Striped-Attention笔记@DefTruth⭐️⭐️⭐
[vLLM实践][算子]📖vLLM算子开发流程:”保姆级“详细记录@DefTruth⭐️⭐️⭐
[vLLM实践][万字]📖vLLM + DeepSeek-R1 671B 多机部署及修Bug笔记@DefTruth⭐️⭐️⭐
[Attention优化]📖FFPA(Split-D): FA2无限HeadDim扩展,2x↑🎉 vs SDPA EA@DefTruth⭐️⭐️⭐️
[CUDA基础][开篇]📖LeetCUDA: v3.0 大升级-面试刷题不迷路@DefTruth⭐️⭐️⭐⭐️
[分布式训推][张量/序列并行]📖图解DeepSpeed-Ulysses&Megatron-LM TP/SP@DefTruth⭐️⭐️
[VLM推理优化][InternVL系列]📖InternLM2/.../InternVL1.5系列笔记: 核心点解析@DefTruth⭐️⭐️
[LLM推理优化][TensorRT-LLM][5w字]📖TensorRT-LLM部署调优-指北@DefTruth⭐️⭐️⭐️
[LLM推理优化][KV Cache优化]📖GQA/YOCO/CLA/MLKV: 层内和层间KV Cache共享@DefTruth⭐️⭐️
[LLM推理优化][Prefill优化][万字]📖图解vLLM Automatic Prefix Caching: TTFT优化@DefTruth⭐️⭐️⭐️
[LLM推理优化][Attention优化]📖图解:从Online-Softmax到FlashAttention V1/V2/V3@DefTruth⭐️⭐️⭐️
[LLM推理优化][Decoding优化]📖原理&图解FlashDecoding/FlashDecoding++@DefTruth⭐️⭐️
[VLM推理优化][LLaVA系列]📖CLIP/LLaVA/LLaVA1.5/VILA笔记: 核心点解析@DefTruth⭐️⭐️
[LLM推理优化][Attention优化][万字]📖TensorRT MHA/Myelin vs FlashAttention-2@DefTruth⭐️⭐️⭐️
[LLM推理优化][PTX汇编]📖CUDA 12 PTX汇编: PRMT指令详解-通用模式@DefTruth⭐️
[LLM推理优化][PTX汇编]📖CUDA 12 PTX汇编: LOP3指令详解@DefTruth⭐️
[LLM推理优化][CUDA][3w字]📖高频面试题汇总-大模型手撕CUDA@DefTruth⭐️⭐️⭐️
[LLM推理优化][Weight Only]📖WINT8/4-(00): 通俗易懂讲解-快速反量化算法@DefTruth⭐️⭐️
[LLM推理优化][Weight Only]📖WINT8/4-(01): PRMT指令详解及FT源码解析@DefTruth⭐️⭐️
[LLM推理优化][Weight Only]📖WINT8/4-(02): 快速反量化之INT8转BF16@DefTruth⭐️⭐️
[LLM推理优化][Weight Only]📖WINT8/4-(03): LOP3指令详解及INT4转FP16/BF16@DefTruth⭐️⭐️
[LLM推理优化][LLM Infra整理]📖100+篇: 大模型推理各方向新发展整理@DefTruth⭐️⭐️
[LLM推理优化][LLM Infra整理]📖30+篇: LLM推理论文集-500页PDF@DefTruth⭐️⭐️
[LLM推理优化][LLM Infra整理]📖FlashDecoding++: 比FlashDecoding还要快!@DefTruth⭐️
[LLM推理优化][LLM Infra整理]📖TensorRT-LLM开源,TensorRT 9.1也来了@DefTruth⭐️
[LLM推理优化][LLM Infra整理]📖20+篇: LLM推理论文集-300页PDF@DefTruth⭐️⭐️
[LLM推理优化][LLM Infra整理]📖PagedAttention论文新鲜出炉@DefTruth⭐️
[推理部署][CV/NLP]📖FastDeploy三行代码搞定150+ CV、NLP模型部署@DefTruth⭐️
[推理部署][CV]📖如何在lite.ai.toolkit(3.6k+ stars)中增加您的模型?@DefTruth⭐️⭐️
[推理部署][CV]📖美团 YOLOv6 ORT/MNN/TNN/NCNN C++推理部署@DefTruth⭐️⭐️
[推理部署][ONNX]📖ONNX推理加速技术文档-杂记@DefTruth⭐️
[推理部署][TensorFlow]📖Mac源码编译TensorFlow C++指北@DefTruth⭐️
[推理部署][CV]📖1Mb!头部姿态估计: FSANet,一个小而美的模型(C++)@DefTruth⭐️
[推理部署][CV]📖opencv+ffmpeg编译打包全解指南@DefTruth⭐️⭐️
[推理部署][CV]📖RobustVideoMatting视频抠图静态ONNX模型转换@DefTruth⭐️
[推理部署][CV]📖190Kb!SSRNet年龄检测详细解读(含C++工程)@DefTruth⭐️
[推理部署][CV]📖MGMatting(CVPR2021)人像抠图C++应用记录@DefTruth⭐️
[推理部署][CV]📖超准确人脸检测(带关键点)YOLO5Face C++工程详细记录@DefTruth⭐️⭐️
[推理部署][ORT]📖解决: ONNXRuntime(Python) GPU 部署配置记录@DefTruth⭐️
[推理部署][CV]📖记录SCRFD(CVPR2021)人脸检测C++工程化(含docker镜像)@DefTruth⭐️⭐️
[推理部署][NCNN]📖野路子:记录一个解决onnx转ncnn时op不支持的trick@DefTruth⭐️
[推理部署][CV]📖升级版NanoDet-Plus MNN/TNN/NCNN/ORT C++工程记录@DefTruth⭐️⭐️
[推理部署][CV]📖超轻量级NanoDet MNN/TNN/NCNN/ORT C++工程记录@DefTruth⭐️
[推理部署][CV]📖详细记录MGMatting之MNN、TNN和ORT C++移植@DefTruth⭐️⭐️
[推理部署][CV]📖YOLOX NCNN/MNN/TNN/ONNXRuntime C++工程简记@DefTruth⭐️
[推理部署][TNN]📖手动修改YoloX的tnnproto记录-TNN@DefTruth⭐️
[推理部署][ORT]📖全网最详细 ONNXRuntime C++/Java/Python 资料!@DefTruth⭐️
[推理部署][CV]📖RobustVideoMatting: C++工程化记录-实现篇@DefTruth⭐️⭐️
[推理部署][CV]📖RobustVideoMatting: C++工程化记录-应用篇@DefTruth⭐️⭐️
[推理部署][ORT]📖ONNXRuntime C++ CMake 工程分析及编译@DefTruth⭐️⭐️
[推理部署][ORT]📖如何使用ORT C++ API处理NCHW和NHWC输入?@DefTruth⭐️
[推理部署][TNN]📖tnn-convert搭建简记-YOLOP转TNN@DefTruth⭐️
[推理部署][CV]📖YOLOP ONNXRuntime C++工程化记录@DefTruth⭐️⭐️
[推理部署][NCNN]📖超有用NCNN参考资料整理@DefTruth⭐️
[推理部署][MNN]📖超有用MNN参考资料整理@DefTruth⭐️
[推理部署][TNN]📖超有用TNN参考资料整理@DefTruth⭐️
[推理部署][ONNX]📖超有用ONNX参考资料整理@DefTruth⭐️
[推理部署][ONNX]📖超有用ONNX模型结构参考资料整理@DefTruth⭐️
[推理部署][OpenCV-DNN]📖超有用OpenCV-DNN参考资料整理@DefTruth⭐️
[推理部署][Tensorflow]📖超有用Tensorflow C++工程化知识点@DefTruth⭐️
[推理部署][模型转换]📖深度学习模型转换资料整理@DefTruth⭐️
[技术随笔][C++][CMake]📖超有用CMake参考资料整理@DefTruth⭐️⭐️
[技术随笔][C++][3W字]📖静态链接和静态库实践指北-原理篇@DefTruth⭐️⭐️⭐️
[技术随笔][C++]📖Mac下C++内存检查指北(Valgrind VS Asan)@DefTruth⭐️
[技术随笔][CV]📖torchlm: 人脸关键点检测库@DefTruth⭐️⭐️
[技术随笔][ML]📖《统计学习方法-李航: 笔记-从原理到实现-基于R》@DefTruth⭐️⭐️
[技术随笔][Git]📖如何优雅地git clone和git submodule?@DefTruth⭐️
[技术随笔][3D]📖人脸重建3D参考资料整理@DefTruth⭐️
[技术随笔][3D]📖BlendShapes参考资料整理@DefTruth⭐️
[技术随笔][3D]📖从源码安装Pytorch3D详细记录及学习资料@DefTruth⭐️
[技术随笔][ML]📖200页:《统计学习方法:李航》笔记 -从原理到实现@DefTruth⭐️⭐️

📚 高性能计算与分布式-技术博客推荐 (©️back👆🏻)

💡说明: 本小节整理一些自己比较喜欢的文章。欢迎大家提PR推荐更多优秀的文章!

📖 类型-标题📖 作者📖 推荐
[cute系列详解][入门]📖cutlass cute 101@朱小霖⭐️⭐️⭐️
[cute系列详解][入门]📖CUTLASS 2.x & CUTLASS 3.x Intro 学习笔记@BBuf⭐️⭐️⭐️
[cute系列详解][入门]📖写给大家看的 CuTe 教程:tiled copy@竹熙佳处⭐️⭐️⭐️
[cute系列详解][入门]📖写给大家看的 CuTe 教程:tiled mma@竹熙佳处⭐️⭐️⭐️
[cute系列详解][入门]📖写给大家看的 CuTe 教程:Layout Compose & Inverse@竹熙佳处⭐️⭐️⭐️
[cute系列详解][入门]📖写给大家看的 CuTe 教程: Layout Product & Divide@竹熙佳处⭐️⭐️⭐️
[cute系列详解][入门]📖写给大家看的 CuTe 教程:TMA Copy@竹熙佳处⭐️⭐️⭐️
[cute系列详解][入门]📖写给进阶开发的 CuTe 笔记:permutationMNK 参数@竹熙佳处⭐️⭐️⭐️
[cute系列详解][Layout]📖cute 之 Layout@reed⭐️⭐️⭐️
[cute系列详解][Layout]📖cute Layout 的代数和几何解释@reed⭐️⭐️⭐️
[cute系列详解][Tensor]📖cute 之 Tensor@reed⭐️⭐️⭐️
[cute系列详解][MMA]📖cute 之 MMA抽象@reed⭐️⭐️⭐️
[cute系列详解][Copy]📖cute 之 Copy抽象@reed⭐️⭐️⭐️
[cute系列详解][Swizzle]📖cute 之 Swizzle@reed⭐️⭐️⭐️
[cute系列详解][Swizzle]📖cute Swizzle细谈@进击的Killua⭐️⭐️⭐️
[cute系列详解][Swizzle]📖cutlass swizzle机制解析(一)@Titus⭐️⭐️⭐️
[cute系列详解][Swizzle]📖cutlass swizzle机制解析(二)@Titus⭐️⭐️⭐️
[cute系列详解][Swizzle]📖CUDA避免smem bank conflict的swizzle机制解析@frankshi⭐️⭐️⭐️
[cute系列详解][Swizzle]📖布局代数实战:Swizzle自动推导@melonedo⭐️⭐️⭐️
[cute系列详解][GEMM]📖cute 之 简单GEMM实现@reed⭐️⭐️⭐️
[cute系列详解][GEMM]📖cute 之 GEMM流水线@reed⭐️⭐️⭐️
[cute系列详解][GEMM]📖cute 之 高效GEMM实现@reed⭐️⭐️⭐️
[cute系列详解][GEMM]📖GEMM流水线: single/multi-stage、pipeline@Titus⭐️⭐️⭐️
[cute系列详解][GEMM]📖GEMM细节分析(一): ldmatrix的选择@Anonymous⭐️⭐️⭐️
[cute系列详解][GEMM]📖GEMM细节分析(二): TiledCopy与cp.async@Anonymous⭐️⭐️⭐️
[cute系列详解][GEMM]📖GEMM细节分析(三): Swizzle<B,M,S>参数取值@Anonymous⭐️⭐️⭐️
[cute系列详解][实践]📖Hopper Mixed GEMM的CUTLASS实现笔记@BBuf⭐️⭐️⭐️
[cute系列详解][实践]📖CUTLASS CuTe实战(一): 基础@进击的Killua⭐️⭐️⭐️
[cute系列详解][实践]📖CUTLASS CuTe实战(二): 应用@进击的Killua⭐️⭐️⭐️
[cute系列详解][实践]📖FlashAttention fp8实现(ada架构)@shengying.wei⭐️⭐️⭐️
[cute系列详解][实践]📖FlashAttention 笔记: tiny-flash-attention解读@shengying.wei⭐️⭐️⭐️
[cute系列详解][实践]📖使用cutlass cute复现flash attention@66RING⭐️⭐️⭐️
[cutlass教程][入门]📖cutlass 基本认知@JoeNomad⭐️⭐️⭐️
[cutlass教程][入门]📖cutlass 软件架构@JoeNomad⭐️⭐️⭐️
[cutlass教程][入门]📖CUTLASS 基础介绍@进击的Killua⭐️⭐️⭐️
[cutlass教程][入门]📖乱谈CUTLASS GTC2020 SLIDES@zzk again⭐️⭐️⭐️
[cutlass教程][深入]📖cutlass block swizzle 和 tile iterator@JoeNomad⭐️⭐️⭐️
[cutlass教程][深入]📖cutlass bank conflict free的smem layout@JoeNomad⭐️⭐️⭐️
[cutlass教程][深入]📖cutlass 多级流水线@JoeNomad⭐️⭐️⭐️
[GPU指令集架构][精解]📖NVidia GPU指令集架构-前言@reed⭐️⭐️⭐️
[GPU指令集架构][精解]📖NVidia GPU指令集架构-寄存器@reed⭐️⭐️⭐️
[GPU指令集架构][精解]📖NVidia GPU指令集架构-Load和Cache@reed⭐️⭐️⭐️
[GPU指令集架构][精解]📖NVidia GPU指令集架构-浮点运算@reed⭐️⭐️⭐️
[GPU指令集架构][精解]📖NVidia GPU指令集架构-整数运算@reed⭐️⭐️⭐️
[GPU指令集架构][精解]📖NVidia GPU指令集架构-比特和逻辑操作@reed⭐️⭐️⭐️
[GPU指令集架构][精解]📖NVidia GPU指令集架构-Warp级和Uniform操作@reed⭐️⭐️⭐️
[CUDA优化][入门]📖CUDA 入门的正确姿势:how-to-optimize-gemm@白牛⭐️⭐️⭐️
[CUDA优化][入门]📖CUDA(一):CUDA 编程基础@紫气东来⭐️⭐️⭐️
[CUDA优化][入门]📖CUDA(二):GPU的内存体系及其优化指南@紫气东来⭐️⭐️⭐️
[CUDA优化][实践]📖CUDA(三):通用矩阵乘法:从入门到熟练@紫气东来⭐️⭐️⭐️
[CUDA优化][实践]📖ops(1):LayerNorm 算子的 CUDA 实现与优化@紫气东来⭐️⭐️⭐️
[CUDA优化][实践]📖ops(2):SoftMax算子的 CUDA 实现@紫气东来⭐️⭐️⭐️
[CUDA优化][实践]📖ops(3):Cross Entropy 的 CUDA 实现@紫气东来⭐️⭐️⭐️
[CUDA优化][实践]📖ops(4):AdamW 优化器的 CUDA 实现@紫气东来⭐️⭐️⭐️
[CUDA优化][实践]📖ops(5):激活函数与残差连接的 CUDA 实现@紫气东来⭐️⭐️⭐️
[CUDA优化][实践]📖ops(6):embedding 层与 LM head 层的 CUDA 实现@紫气东来⭐️⭐️⭐️
[CUDA优化][实践]📖ops(7):self-attention 的 CUDA 实现及优化 (上)@紫气东来⭐️⭐️⭐️
[CUDA优化][实践]📖ops(8):self-attention 的 CUDA 实现及优化 (下)@紫气东来⭐️⭐️⭐️
[CUDA优化][实践]📖CUDA(四):使用 CUDA 实现 Transformer 结构@紫气东来⭐️⭐️⭐️
[CUDA优化][Copy]📖Async Copy及Memory Barrier指令的功能与实现@Frank Wang⭐️⭐️⭐️
[CUDA优化][GEMV]📖深入浅出GPU优化系列:gemv优化@有了琦琦的棍子⭐️⭐️⭐️
[CUDA优化][实践]📖CUDA element-wise 算子详解@懒蚂蚁呀不嘿⭐️⭐️⭐️
[CUDA优化][实践]📖CUDA transpose 算子详解@懒蚂蚁呀不嘿⭐️⭐️⭐️
[CUDA优化][实践]📖CUDA reduce 算子详解@懒蚂蚁呀不嘿⭐️⭐️⭐️
[CUDA优化][实践]📖CUDA GEMM 算子详解@懒蚂蚁呀不嘿⭐️⭐️⭐️
[Tensor Cores]📖Nvidia Tensor Core初探@木子知⭐️⭐️⭐️
[Tensor Cores]📖Nvidia Tensor Core-WMMA API编程入门@木子知⭐️⭐️⭐️
[Tensor Cores]📖Nvidia Tensor Core-MMA PTX编程入门@木子知⭐️⭐️⭐️
[Tensor Cores]📖CUDA Ampere Tensor Core HGEMM 矩阵乘法优化@nicholaswilde⭐️⭐️⭐️
[GPU通信架构][精解]📖NVIDIA GPGPU(四)- 通信架构@Bruce⭐️⭐️⭐️
[torch.compile][原理]📖Torch.compile流程解析: 介绍@StarCap⭐️⭐️⭐️
[torch.compile][原理]📖Torch.compile流程解析: TorchDynamo@StarCap⭐️⭐️⭐️
[torch.compile][原理]📖Torch.compile流程解析: AOTAutograd@StarCap⭐️⭐️⭐️
[torch.compile][原理]📖Torch.compile流程解析: TorchInductor@StarCap⭐️⭐️⭐️
[torch.compile][原理]📖Torch.compile流程解析: 算子融合@StarCap⭐️⭐️⭐️
[torch.compile][实践]📖Torch.compile使用指南@jhang⭐️⭐️⭐️
[torch.compile][实践]📖Torch.compile详细示例解析教程@Bbuf⭐️⭐️⭐️
[torch.compile][原理]📖一文搞懂TorchDynamo原理@吾乃阿尔法⭐️⭐️⭐️
[torch.compile][原理]📖理解torch.compile基本原理和使用方式@俯仰⭐️⭐️⭐️

©️License (©️back👆🏻)

GNU General Public License v3.0

🎉Contribute (©️back👆🏻)

How to contribute? Star this repo or check 🌤🌤CONTRIBUTE🎉🎉.

Star History Chart

📖 References (©️back👆🏻)