CUDA GEMM Optimization

July 19, 2024

Introduction

This repository contains CUDA kernels for general matrix-matrix multiplication (GEMM) and the corresponding performance analysis. The kernels produce correct results for any matrix size. Their parameters are slightly tuned for a 4096 x 4096 x 4096 GEMM on an NVIDIA GeForce RTX 3090 GPU. The kernels should be compatible with any NVIDIA GPU with compute capability 7.0 or higher.

Usage

Docker is used to build and run the CUDA kernels. The custom Docker image is based on the NVIDIA NGC CUDA 12.2.2 Docker image.

If the host computer has a different CUDA version, adjust the CUDA version of the base Docker image accordingly. Otherwise, unexpected compilation and runtime errors may occur.

Build Docker Images

To build the custom Docker image, please run the following command.

$ docker build -f docker/gemm-cuda.Dockerfile --no-cache --tag=gemm-cuda:12.2.2 .

Run Docker Container

To run the custom Docker container, please run the following command.

$ docker run -it --rm --gpus device=0 -v $(pwd):/mnt gemm-cuda:12.2.2

To profile the CUDA kernels using NVIDIA Nsight Compute, add the flags --cap-add=SYS_ADMIN and --security-opt seccomp=unconfined to the docker run command.

Build CUDA Kernels

To build the CUDA kernels, please run the following commands inside the Docker container.

$ cmake -B build
$ cmake --build build --config Release --parallel
$ cmake --install build

Run CUDA Kernels

To run the FP32 and FP16 GEMM CUDA kernels, please run the following commands inside the Docker container.

$ ./build/src/profile_cuda_gemm_fp32
$ ./build/src/profile_cuda_gemm_fp16

Performance

All experiments were conducted on a single NVIDIA GeForce RTX 3090 GPU. Measured performance can vary from one run to another, sometimes by up to 25%.
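
For reference, a TFLOPS figure for an m x n x k GEMM is conventionally derived from the measured latency by counting 2 * m * n * k floating-point operations (one multiply and one add per inner-product term). The C++ sketch below assumes that convention. The function name `compute_tflops` and the millisecond latency unit are illustrative and are not taken from the repository.

```cpp
#include <cassert>
#include <cstddef>

// Effective TFLOPS of an m x n x k GEMM, assuming the conventional
// count of 2 * m * n * k floating-point operations.
// `latency_ms` is a measured kernel latency in milliseconds
// (hypothetical input, not a value produced by the repository).
double compute_tflops(std::size_t m, std::size_t n, std::size_t k,
                      double latency_ms)
{
    assert(latency_ms > 0.0);
    double const flops = 2.0 * static_cast<double>(m) *
                         static_cast<double>(n) * static_cast<double>(k);
    double const seconds = latency_ms * 1e-3;
    return flops / seconds / 1e12;
}
```

Under this convention, a 4096 x 4096 x 4096 GEMM performs about 1.37e11 operations, so a kernel that finishes in 10 ms would be rated at roughly 13.7 TFLOPS.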

FP32 GEMM

None of the FP32 GEMM kernels utilize the NVIDIA Tensor Cores.

| GEMM Kernel | TFLOPS | Kernel Description |
| --- | --- | --- |
| cuBLAS GEMM Kernel | 24.5971 | cuBLAS implementation |
| Custom GEMM Kernel V00 | 0.278129 | Non-coalesced global memory access |
| Custom GEMM Kernel V01 | 1.7218 | Coalesced global memory access |
| Custom GEMM Kernel V02 | 2.66157 | 2D block tiling |
| Custom GEMM Kernel V02 Vectorized | 1.90514 | 2D block tiling with vectorized memory access |
| Custom GEMM Kernel V03 | 8.91318 | 2D block tiling and 1D thread tiling |
| Custom GEMM Kernel V03 Vectorized | 4.04796 | 2D block tiling and 1D thread tiling with vectorized memory access |
| Custom GEMM Kernel V04 | 13.0247 | 2D block tiling and 2D thread tiling |
| Custom GEMM Kernel V04 Vectorized | 15.027 | 2D block tiling and 2D thread tiling with vectorized memory access |
| Custom GEMM Kernel V05 | 11.1448 | 2D block tiling and 2D thread tiling and matrix transpose |
| Custom GEMM Kernel V05 Vectorized | 19.6688 | 2D block tiling and 2D thread tiling and matrix transpose with vectorized memory access |
| Custom GEMM Kernel V06 | 11.0703 | 2D block tiling and 2D warp tiling and 2D thread tiling and matrix transpose |
| Custom GEMM Kernel V06 Vectorized | 20.1649 | 2D block tiling and 2D warp tiling and 2D thread tiling and matrix transpose with vectorized memory access |
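
The gap between kernels V00 and V01 in the table comes down to which addresses neighbouring threads touch. As a host-side illustration, the sketch below computes the distance in elements between the addresses accessed by two consecutive threads. It assumes row-major storage and a hypothetical mapping in which the thread index selects either the row or the column of the accessed element; the repository's actual index mapping is not shown here, and all function names are made up for this example.

```cpp
#include <cassert>
#include <cstddef>

// Linear offset of element (row, col) in a row-major matrix with
// leading dimension `ld` (number of elements per row).
std::size_t row_major_offset(std::size_t row, std::size_t col, std::size_t ld)
{
    assert(col < ld);
    return row * ld + col;
}

// Address distance between thread x and thread x + 1 when the thread
// index selects the ROW: neighbours are a full row apart (non-coalesced).
std::size_t stride_when_x_selects_row(std::size_t x, std::size_t col,
                                      std::size_t ld)
{
    return row_major_offset(x + 1, col, ld) - row_major_offset(x, col, ld);
}

// Address distance when the thread index selects the COLUMN:
// neighbours touch adjacent elements (coalesced).
std::size_t stride_when_x_selects_col(std::size_t row, std::size_t x,
                                      std::size_t ld)
{
    return row_major_offset(row, x + 1, ld) - row_major_offset(row, x, ld);
}
```

For a leading dimension of 4096, the first function returns 4096 and the second returns 1, so only the second mapping places the accesses of consecutive threads in contiguous memory.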

FP16 GEMM

The FP16 custom GEMM kernels V00 to V06 do not utilize the NVIDIA Tensor Cores. The FP16 cuBLAS GEMM kernel and custom GEMM kernels V07 utilize the NVIDIA Tensor Cores.

| GEMM Kernel | TFLOPS | Kernel Description |
| --- | --- | --- |
| cuBLAS GEMM Kernel | 138.955 | cuBLAS implementation |
| Custom GEMM Kernel V00 | 0.284095 | Non-coalesced global memory access |
| Custom GEMM Kernel V01 | 1.7316 | Coalesced global memory access |
| Custom GEMM Kernel V02 | 2.46677 | 2D block tiling GEMM |
| Custom GEMM Kernel V02 Vectorized | 1.93088 | 2D block tiling with vectorized memory access |
| Custom GEMM Kernel V03 | 8.67563 | 2D block tiling and 1D thread tiling GEMM |
| Custom GEMM Kernel V03 Vectorized | 2.14047 | 2D block tiling and 1D thread tiling with vectorized memory access |
| Custom GEMM Kernel V04 | 20.2746 | 2D block tiling and 2D thread tiling GEMM |
| Custom GEMM Kernel V04 Vectorized | 22.9001 | 2D block tiling and 2D thread tiling with vectorized memory access |
| Custom GEMM Kernel V05 | 18.3736 | 2D block tiling and 2D thread tiling and matrix transpose GEMM |
| Custom GEMM Kernel V05 Vectorized | 27.962 | 2D block tiling and 2D thread tiling and matrix transpose with vectorized memory access |
| Custom GEMM Kernel V06 | 14.7622 | 2D block tiling and 2D warp tiling and 2D thread tiling and matrix transpose GEMM |
| Custom GEMM Kernel V06 Vectorized | 28.4588 | 2D block tiling and 2D warp tiling and 2D thread tiling and matrix transpose with vectorized memory access |
| Custom GEMM Kernel V07 | 35.2312 | 2D block tiling and 2D warp tiling and WMMA and matrix transpose |
| Custom GEMM Kernel V07 Vectorized | 55.0298 | 2D block tiling and 2D warp tiling and WMMA and matrix transpose and vectorized memory access |
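
Most of the custom kernels listed above build on 2D block tiling. As a rough CPU analogue only (this is not the repository's CUDA code), the following C++ sketch partitions the output matrix into tiles and walks the reduction dimension in tile-sized chunks, clamping at the edges so that sizes that are not multiples of the tile size still work. The function name `gemm_tiled`, the row-major layout, and the default tile size of 32 are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// C (m x n) = A (m x k) * B (k x n), all matrices row-major.
// Hypothetical CPU analogue of block tiling, not the CUDA kernel itself.
std::vector<float> gemm_tiled(std::vector<float> const& A,
                              std::vector<float> const& B,
                              std::size_t m, std::size_t n, std::size_t k,
                              std::size_t tile = 32)
{
    assert(tile > 0 && A.size() == m * k && B.size() == k * n);
    std::vector<float> C(m * n, 0.0f);
    for (std::size_t i0 = 0; i0 < m; i0 += tile)
    {
        for (std::size_t j0 = 0; j0 < n; j0 += tile)
        {
            for (std::size_t p0 = 0; p0 < k; p0 += tile)
            {
                // Clamp tile bounds so partial edge tiles are handled.
                std::size_t const i1 = std::min(i0 + tile, m);
                std::size_t const j1 = std::min(j0 + tile, n);
                std::size_t const p1 = std::min(p0 + tile, k);
                for (std::size_t i = i0; i < i1; ++i)
                {
                    for (std::size_t p = p0; p < p1; ++p)
                    {
                        float const a = A[i * k + p];
                        for (std::size_t j = j0; j < j1; ++j)
                        {
                            C[i * n + j] += a * B[p * n + j];
                        }
                    }
                }
            }
        }
    }
    return C;
}
```

On a GPU, each output tile would typically be assigned to one thread block and the A and B tiles staged in shared memory; the loop structure here only mirrors the partitioning, not the memory hierarchy.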

References