how-to-optim-algorithm-in-cuda

May 20, 2026 ยท View on GitHub

CUDA, GPU kernel, and AI infrastructure optimization notes.

This repository collects hands-on CUDA kernels, CUTLASS/CuTe notes, Triton examples, PTX ISA notes, PyTorch internals notes, and LLM inference/training optimization material. It is one of my main public study and engineering notebooks for GPU systems work.

Repository Map

  • cuda-kernels/: handwritten CUDA kernels for reduce, softmax, elementwise, GEMV, indexing, atomic add, upsampling, and linear attention.
  • cuda-mode/: notes and code from the CUDA-MODE lecture series.
  • cutlass/: CUTLASS and CuTe DSL notes, including GEMM, TMA, WGMMA, swizzling, and instruction-level material.
  • triton/: Triton kernels, PyTorch interop examples, and meetup notes.
  • large-language-model/: LLM serving, training, and systems optimization notes.
  • pytorch/: PyTorch internals and CUDA-related notes.
  • papers/: GPU architecture and ML systems paper notes.
  • ptx-isa/: PTX ISA study notes.
  • tools/: small helper scripts.
  • deprecated/: older material kept for reference.

Status

Actively curated around CUDA kernels, LLM inference optimization, and AI infrastructure. Older Chinese-language notes are being consolidated or replaced with English entry points.

Star History

Star History Chart