how-to-optim-algorithm-in-cuda

May 20, 2026 · View on GitHub

CUDA, GPU kernel, and AI infrastructure optimization notes.

This repository collects hands-on CUDA kernels, CUTLASS/CuTe notes, Triton examples, PTX ISA notes, PyTorch internals notes, and LLM inference/training optimization material. It is one of my main public study and engineering notebooks for GPU systems work.

Repository Map

cuda-kernels/: handwritten CUDA kernels for reduce, softmax, elementwise, GEMV, indexing, atomic add, upsampling, and linear attention.
cuda-mode/: notes and code from the CUDA-MODE lecture series.
cutlass/: CUTLASS and CuTe DSL notes, including GEMM, TMA, WGMMA, swizzling, and instruction-level material.
triton/: Triton kernels, PyTorch interop examples, and meetup notes.
large-language-model/: LLM serving, training, and systems optimization notes.
pytorch/: PyTorch internals and CUDA-related notes.
papers/: GPU architecture and ML systems paper notes.
ptx-isa/: PTX ISA study notes.
tools/: small helper scripts.
deprecated/: older material kept for reference.

CUDA and GPU optimization: this repository.
Deep learning compiler notes: https://github.com/BBuf/tvm_mlir_learn
Deep learning framework notes: https://github.com/BBuf/how-to-learn-deep-learning-framework

Status

Actively curated around CUDA kernels, LLM inference optimization, and AI infrastructure. Older Chinese-language notes are being consolidated or replaced with English entry points.

Repository Map

Related Repositories

Status

Star History