CK Tile Dispatcher

May 17, 2026 · View on GitHub

A unified kernel dispatch system for AMD GPUs with C++ and Python frontends, supporting GEMM and Grouped Convolution operations.

Validated Platform: AMD Instinct MI300 series (gfx942)


Table of Contents

  1. Quick Start
  2. Docker Setup
  3. Prerequisites
  4. Step-by-Step Build Guide
  5. Running Examples
  6. External Integration
  7. Core Concepts
  8. Operation Support Matrix
  9. Troubleshooting
  10. File Structure

Quick Start

Complete setup from scratch (5 minutes):

# From the composable_kernel root directory
cd dispatcher

# Step 1: Create build directory
mkdir -p build && cd build

# Step 2: Configure CMake
cmake .. \
  -DCMAKE_PREFIX_PATH=/opt/rocm \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_BUILD_TYPE=Release \
  -DGPU_TARGETS="gfx942" \
  -DBUILD_DISPATCHER_EXAMPLES=ON

# Step 3: Generate kernels and build (CMake handles this automatically)
make -j$(nproc)

# Step 4: Run C++ examples
./examples/gemm_01_basic

# Step 5: Build Python libraries (required for Python examples)
make python_libs

# Step 6: Run Python examples (from dispatcher directory)
cd ..
python3 examples/gemm/python/01_basic_gemm.py

For a reproducible build environment, use the official ROCm Docker image:

Step 1: Pull and Run Container

# Pull the CK Docker image
docker pull rocm/composable_kernel:ck_ub24.04_rocm7.0.1

# Run container with GPU access
docker run \
  -it \
  --privileged \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  --group-add render \
  -w /root/workspace \
  -v $(pwd):/root/workspace \
  rocm/composable_kernel:ck_ub24.04_rocm7.0.1 \
  /bin/bash

Note: Omit --device flags if building without GPU access.

Step 2: Clone and Build

# Inside the container
git clone https://github.com/ROCm/composable_kernel.git
cd composable_kernel
git checkout builder-dispatch-tile-gemm

# Set up Python environment
python3 -m venv .venv
source .venv/bin/activate
pip install numpy

# Build dispatcher
cd dispatcher
mkdir -p build && cd build
cmake .. \
  -DCMAKE_PREFIX_PATH=/opt/rocm \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_BUILD_TYPE=Release \
  -DGPU_TARGETS="gfx942" \
  -DBUILD_DISPATCHER_EXAMPLES=ON

make -j$(nproc)

One-Liner Build (inside container)

git clone https://github.com/ROCm/composable_kernel.git && \
cd composable_kernel && git checkout builder-dispatch-tile-gemm && \
python3 -m venv .venv && source .venv/bin/activate && pip install numpy && \
cd dispatcher && mkdir -p build && cd build && \
cmake .. -DCMAKE_PREFIX_PATH=/opt/rocm -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_BUILD_TYPE=Release -DGPU_TARGETS="gfx942" -DBUILD_DISPATCHER_EXAMPLES=ON && \
make -j$(nproc)

Prerequisites

Required Software

SoftwareMinimum VersionCheck Command
ROCm6.4+rocminfo
CMake3.16+cmake --version
Python3.8+python3 --version
NumPy1.20+pip show numpy
hipcc(from ROCm)/opt/rocm/bin/hipcc --version

Note: Newer GPU targets (gfx950, gfx1201) require ROCm 6.3+. For ROCm 6.4+, you can also use amdclang++ instead of hipcc.

Check Your GPU Architecture

# Find your GPU architecture
rocminfo | grep -i "gfx"
# Example output: "gfx942"

Supported architectures:

  • gfx942 - MI300X, MI300A, MI308, MI325 (Instinct MI300 series)
  • gfx90a - MI200 series (MI250, MI250X)
  • gfx950 - MI350 series
  • gfx1101 - RDNA3 series
  • gfx1201 - RDNA4 series

Install Python Dependencies

Core Dependencies (Required)

NumPy is required for Python examples and kernel generation. We recommend using a virtual environment:

Option 1: Using standard venv

# Create virtual environment
python3 -m venv .venv

# Activate virtual environment
source .venv/bin/activate  # Linux/macOS
# .venv\Scripts\activate   # Windows

# Install core dependencies
pip install -r python/requirements.txt

Option 2: Using uv (faster alternative)

# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create and activate virtual environment
uv venv .venv
source .venv/bin/activate  # Linux/macOS
# .venv\Scripts\activate   # Windows

# Install core dependencies
uv pip install -r python/requirements.txt

Option 3: System-wide install (not recommended)

pip install -r python/requirements.txt

Note: Always activate your virtual environment before running CMake or Python examples.

ML Heuristics Dependencies (Optional)

For ML-based kernel selection (examples 09-11), install additional dependencies:

# Activate your virtual environment first
source .venv/bin/activate

# Install ML dependencies (LightGBM, pandas, pyarrow, scikit-learn)
pip install -r requirements-ml.txt

Why separate? ML dependencies are large (especially pyarrow) and not needed for basic dispatcher usage. Install only if you need:

  • ML-based kernel selection (examples/gemm/python/09_ml_heuristic.py)
  • Model training (heuristics/train.py)
  • Model evaluation (heuristics/evaluate.py)
  • Automated benchmark analysis

Core dependencies: ~50 MB (NumPy only) With ML dependencies: ~500 MB (includes LightGBM, pandas, pyarrow, scikit-learn)

Supported Data Types

CK Tile supports a wide range of data types for GEMM operations:

A dtypeB dtypeAcc dtypeWarp Tile SizesNotes
fp32fp32fp3216x16x4, 16x16x16Full precision
fp16fp16fp3232x32x8, 32x32x16, 16x16x16, 16x16x32Standard half
bf16bf16fp3232x32x8, 32x32x16, 16x16x16, 16x16x32Brain float 16
fp8fp8fp3232x32x16, 32x32x32, 16x16x32, 16x16x64FP8 E4M3
fp8bf8fp3232x32x16, 16x16x32Mixed FP8/BF8
bf8fp8fp3232x32x16, 16x16x128Mixed BF8/FP8
bf8bf8fp3232x32x16, 32x32x32, 16x16x32BF8 E5M2
int8int8int3232x32x16, 16x16x32, 16x16x16Integer GEMM
pk_fp4pk_fp4fp3216x16x128Packed 4-bit float

Notes:

  • Accumulator is always fp32 except for int8 which uses int32
  • FP8 types: fp8 = E4M3, bf8 = E5M2
  • pk_fp4 = Packed 4-bit float (2 values per byte)
  • Some dtypes require specific GPU architectures (e.g., FP8 requires MI300+)

Step-by-Step Build Guide

Step 1: Navigate to Dispatcher Directory

# From composable_kernel root
cd dispatcher

# Verify you're in the right place
ls CMakeLists.txt  # Should exist

Step 2: Create Build Directory

mkdir -p build
cd build

Step 3: Configure CMake

Basic configuration (library only):

cmake .. \
  -DCMAKE_PREFIX_PATH=/opt/rocm \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_BUILD_TYPE=Release \
  -DGPU_TARGETS="gfx942"

Full configuration (with examples and tests):

cmake .. \
  -DCMAKE_PREFIX_PATH=/opt/rocm \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_BUILD_TYPE=Release \
  -DGPU_TARGETS="gfx942" \
  -DBUILD_DISPATCHER_EXAMPLES=ON \
  -DBUILD_DISPATCHER_TESTS=ON

Expected output:

-- Found hip: /opt/rocm (found suitable version "6.x.x")
-- Generating GEMM kernels...
-- Built: gemm_01 through gemm_06, dispatcher_gemm_lib.so
-- Configuring done

Step 4: Build

# Build all targets (generates kernels automatically, then compiles)
make -j$(nproc)

# Or build specific targets
make gemm_01_basic          # Single GEMM example
make dispatcher_gemm_lib    # GEMM shared library for Python

# Build ONLY Python libraries (faster if you don't need C++ examples)
make python_libs -j$(nproc)

Kernel Generation Targets

Kernels are generated automatically during make, but you can also control generation explicitly:

# Generate all kernels only (no compilation)
make generate_all_kernels

# Generate GEMM kernels only
make generate_gemm_kernels

# Force regenerate (even if kernels exist)
make regenerate_all_kernels
make regenerate_gemm_kernels

# Generate for specific GPU architecture
make generate_kernels_gfx942    # MI300X
make generate_kernels_gfx90a    # MI200
make generate_kernels_gfx1100   # RDNA3

Step 5: Verify Build

# Check executables were built
ls examples/gemm_*

# Check shared libraries were built
ls examples/libdispatcher_gemm_lib.so

CMake Options Reference

FlagDefaultDescription
CMAKE_BUILD_TYPEDebugUse Release for performance!
GPU_TARGETSNoneTarget GPU: "gfx942", "gfx90a", etc.
BUILD_DISPATCHER_EXAMPLESOFFBuild C++ examples and Python libs
BUILD_DISPATCHER_TESTSOFFBuild unit tests
CMAKE_PREFIX_PATH-ROCm installation path
CMAKE_CXX_COMPILER-Path to hipcc compiler

WARNING: Important: Always use -DCMAKE_BUILD_TYPE=Release for benchmarking. Debug builds are slower. WARNING: Important: Note that the current system provides single GPU target support for architecture-based kernel filtering, please do not use multiple GPU targets at a time (if necessary, please compile into different build directories).


Running Examples

C++ Examples

After building, executables are in build/examples/:

cd build/examples

# GEMM Examples
./gemm_01_basic              # Basic GEMM with autofill/autocorrect
./gemm_02_multi_size         # Wildcard expansion
./gemm_03_benchmark_validation  # Benchmarking + validation
./gemm_04_heuristics         # Heuristic kernel selection
./gemm_05_json_export        # Registry JSON export
./gemm_06_multi_registry     # Multiple registries

# Grouped Convolution Examples
./grouped_conv_01_basic         # Declaration patterns + GPU execution
./grouped_conv_02_all_dirs      # Forward/BwdData/BwdWeight with GPU
./grouped_conv_03_bench_val     # Benchmark + CPU reference validation
./grouped_conv_04_registry_json # Heuristic selection + JSON export
./grouped_conv_05_bwd_data      # Backward data + CPU validation
./grouped_conv_06_bwd_weight    # Backward weight + CPU validation
./grouped_conv_07_benchmark     # Multi-tile ResNet benchmark

Python Examples

Run from the dispatcher directory:

cd /path/to/composable_kernel/dispatcher

# GEMM Examples
python3 examples/gemm/python/01_basic_gemm.py     # Basic multi-kernel GEMM
python3 examples/gemm/python/04_validation.py     # CPU reference validation
python3 examples/gemm/python/07_stress_test.py    # Stress test
python3 examples/gemm/python/08_heuristics.py     # Heuristic selection

# Grouped Convolution Examples
python3 examples/grouped_conv/python/01_basic_grouped_conv.py  # Config patterns + registry + GPU
python3 examples/grouped_conv/python/02_forward.py             # Forward 2D/3D + CPU ref
python3 examples/grouped_conv/python/03_bwd_data.py            # Backward data + CPU ref
python3 examples/grouped_conv/python/04_bwd_weight.py          # Backward weight + CPU ref
python3 examples/grouped_conv/python/05_benchmark.py           # Multi-problem benchmark
python3 examples/grouped_conv/python/06_registry_json.py       # Heuristic selection + JSON

# FMHA Examples (JIT-compiled on the fly)
python3 examples/fmha/python/01_basic_fmha.py      # Basic forward attention
python3 examples/fmha/python/12_masks_fmha.py       # Causal masks
python3 examples/fmha/python/18_backward_fmha.py    # Backward pass
python3 examples/fmha/python/16_splitkv_fmha.py     # Split-KV for long sequences

Example Output

Expected C++ output (gemm_01_basic):

======================================================================
Example 01: Basic GEMM with Declarative Kernel Definition
======================================================================

Step 1: Declared Kernels
------------------------
Kernel Set: fp16_gemm_kernels
  Architecture: gfx942
  Configurations: 1
    - gemm_fp16_rcr_compv4_cshuffle_intrawave_128x128x32

Step 2: Create Registry and Dispatcher
--------------------------------------
  Registered 1 kernels

Step 3: Define Problem
----------------------
  M=1024, N=1024, K=1024

Step 4: GPU Execution
---------------------
  *** GPU EXECUTION ***
  Time:   <varies> ms
  TFLOPS: <varies>

Note: Timing values vary by GPU model and system configuration.


Benchmark Parameters

The dispatcher supports fine-grained control over benchmarking, matching CK Tile's stream_config:

Available Parameters

ParameterTypeDefaultDescription
warmupint5Warmup iterations (discarded from timing)
repeatint20Benchmark iterations (averaged)
flush_cacheboolfalseFlush GPU L2 cache between iterations
rotating_countint1Rotating buffer count (for cache simulation)
timerstring"gpu"Timer type: "gpu" (HIP events) or "cpu"
initstring"random"Matrix initialization: "random", "linear", "constant"
split_kint1Split-K parallelism factor

Python Usage

from ctypes_utils import DispatcherLib

# Basic usage (default benchmark settings)
lib = DispatcherLib.load()

# Advanced benchmark settings via command line
python3 examples/gemm/python/10_advanced_benchmark.py \
    --warmup 10 \
    --repeat 100 \
    --flush-cache

C++ Usage

// Basic timing
ck_tile::stream_config cfg{nullptr, true};

// Advanced benchmark settings
ck_tile::stream_config cfg{
    nullptr,          // stream_id (nullptr = default stream)
    true,             // time_kernel
    1,                // log_level
    10,               // cold_niters (warmup)
    100,              // nrepeat
    true,             // is_gpu_timer
    true,             // flush_cache
    4                 // rotating_count
};

float avg_time = kernel.run(args, cfg);

Command Line (Python Examples)

# Basic run
python3 examples/gemm/python/10_advanced_benchmark.py

# With benchmark parameters
python3 examples/gemm/python/10_advanced_benchmark.py \
    --warmup 10 \
    --repeat 100 \
    --flush-cache \
    --rotating-count 4 \
    --timer gpu

When to Use Each Parameter

Use CaseRecommended Settings
Quick testwarmup=1, repeat=3
Stable benchmarkwarmup=10, repeat=100
Memory-bound analysisflush_cache=True, rotating_count=4
Compute-bound analysisflush_cache=False (default)
Debug timingtimer="cpu"
Productiontimer="gpu" (default)

ML-Based Kernel Selection (Optional)

The dispatcher includes ML heuristics for automated kernel selection using trained LightGBM models.

Prerequisites: Install ML dependencies first:

pip install -r requirements-ml.txt  # ~500 MB (LightGBM, pandas, pyarrow, scikit-learn)

Documentation: See heuristics/README.md for:

  • Training and evaluating models
  • Feature engineering (72 features)
  • Using pre-trained models
  • Python API reference

Examples:

python3 examples/gemm/python/09_ml_heuristic.py      # ML-based kernel selection
python3 examples/gemm/python/10_rank_kernels.py      # Kernel ranking

Model Compression: Trained models are stored in compressed .lgbm.gz format to save space (~67% size reduction). Python tools automatically decompress models on first use. For C++ examples, decompress manually:

# If you have compressed models
cd heuristics/models/gemm_universal_fp16_gfx950
gunzip model_tflops.lgbm.gz

# Then use in C++ example
cd ../../../build
./gemm_09_ml_heuristic --model ../heuristics/models/gemm_universal_fp16_gfx950/model_tflops.lgbm

External Integration

Using Dispatcher in Your Own Project

Add to your CMakeLists.txt:

# Set path to composable_kernel
set(CK_ROOT "/path/to/composable_kernel")

# Add dispatcher subdirectory
add_subdirectory(${CK_ROOT}/dispatcher dispatcher_build)

# Link to your target
target_link_libraries(your_target PRIVATE ck_tile_dispatcher)
target_include_directories(your_target PRIVATE 
    ${CK_ROOT}/dispatcher/include
    ${CK_ROOT}/include
)

Option 2: Include as Pre-built Library

# Find the pre-built library
find_library(CK_DISPATCHER ck_tile_dispatcher 
    PATHS /path/to/composable_kernel/dispatcher/build)

# Include directories
set(CK_INCLUDE_DIRS
    /path/to/composable_kernel/include
    /path/to/composable_kernel/dispatcher/include
)

target_link_libraries(your_target PRIVATE ${CK_DISPATCHER})
target_include_directories(your_target PRIVATE ${CK_INCLUDE_DIRS})

Option 3: Python Integration

import sys
sys.path.insert(0, "/path/to/composable_kernel/dispatcher/examples/gemm/python")

# For GEMM
from ctypes_utils import DispatcherLib, Dispatcher, KernelConfig

Required Include Paths

When integrating, you need these include paths:

/path/to/composable_kernel/include              # CK Tile core headers
/path/to/composable_kernel/dispatcher/include   # Dispatcher headers
/path/to/composable_kernel/dispatcher/build/generated_kernels  # Generated kernels

Required Compile Flags

# Minimum flags for hipcc
-std=c++17
-D__HIP_PLATFORM_AMD__=1
--offload-arch=gfx942  # Your target GPU

# Recommended flags
-O3
-mllvm -enable-noalias-to-md-conversion=0
-Wno-undefined-func-template
-Wno-float-equal
-Wall 
-Werror

Python Path Setup

For Python scripts outside the dispatcher directory:

# Option 1: Environment variable
export PYTHONPATH="/path/to/composable_kernel/dispatcher/examples/gemm/python:$PYTHONPATH"

# Option 2: In your Python script
import sys
sys.path.insert(0, "/path/to/composable_kernel/dispatcher/examples/gemm/python")

Library Search Paths

The Python utilities search for the shared library in these locations:

# For GEMM (ctypes_utils.py)
SEARCH_PATHS = [
    "build/examples/libdispatcher_gemm_lib.so",
    "../build/examples/libdispatcher_gemm_lib.so",
    "../../build/examples/libdispatcher_gemm_lib.so",
]

If using from a different location, set the library path explicitly:

# GEMM
from ctypes_utils import DispatcherLib
lib = DispatcherLib.load("/absolute/path/to/libdispatcher_gemm_lib.so")

Core Concepts

Data Flow

KernelConfig -> Registry -> Dispatcher -> GPU Execution
  1. KernelConfig: Defines kernel parameters (tile sizes, data types, layouts)
  2. Registry: Stores multiple kernel configurations
  3. Dispatcher: Selects best kernel for a given problem and executes it

GEMM Layouts

LayoutABCUse Case
RCRRowColRowMost common (PyTorch default)
RRRRowRowRowBoth inputs row-major
CRRColRowRowA transposed
CCRColColRowBoth inputs column-major

Split-K Support

Split-K divides the K dimension across multiple thread blocks, useful for large K dimensions.

Usage (C++):

// GEMM with 4-way K split
auto problem = ProblemBuilder()
    .m(1024).n(1024).k(8192)
    .split_k(4)
    .build();

Operation Support Matrix

This matrix shows all CK Tile operations with per-data-type, per-layout, and per-GPU support status. It uses a three-state convention: ✅ = supported by both CK Tile and the dispatcher, ❌ = supported by CK Tile but not yet in the dispatcher, blank = not supported by CK Tile itself.

Data TypesLayoutsGPU Targets
OpCK Tile Kernelfp16fp8bf16bf8int8fp4fp6rcrrrrccrcrr90a9429501201
GEMMgemm_multi_d [5]
engine: dispatcher/
example: 19_gemm_multi_d/
GEMMgemm_preshuffle [1][2]
engine: dispatcher/
GEMMgemm_universal [3][4][7][8]
engine: dispatcher/
example: 03_gemm/
GEMMbatched_contraction
example: 41_batched_contraction/
GEMMbatched_gemm
example: 16_batched_gemm/
GEMMblock_scale_gemm
example: 38_block_scale_gemm/
GEMMflatmm
example: 18_flatmm/
GEMMgemm_multi_abd
example: 22_gemm_multi_abd/
GEMMgemm_quant
GEMMgrouped_gemm
example: 17_grouped_gemm/
GEMMgrouped_gemm_quant
GEMMstreamk_gemm
example: 40_streamk_gemm/
Reducemulti_reduce2d
example: 05_reduce/
Reducereduce2d
example: 05_reduce/
Attentionfmha
example: 01_fmha/
Attentionsparse_attn
example: 50_sparse_attn/
Activationsoftmax
Activationtopk_softmax
example: 09_topk_softmax/
Convgrouped_conv [6]
example: 20_grouped_convolution/
Data Movebatched_transpose
example: 35_batched_transpose/
Data Moveimage_to_column
example: 04_img2col/
Data Movepermute
example: 06_permute/
Elementwiseelementwise
example: 21_elementwise/
MoEfused_moe
example: 15_fused_moe/
Normadd_rmsnorm2d_rdquant
example: 11_add_rmsnorm2d_rdquant/
Normlayernorm2d
example: 02_layernorm2d/
Normnorm_reduce
Normrmsnorm2d
example: 10_rmsnorm2d/
Poolingpooling
example: 36_pooling/
Quantsmoothquant
example: 12_smoothquant/

Notes:

  • [1] gemm_preshuffle: Supports only rcr layout. Uses fixed preshufflev2 pipeline, Auto scheduler, and cshuffle epilogue.
  • [2] gemm_preshuffle: int8 preshuffle support is limited to gfx942 and gfx950 (entries in preshuffle_warp_tile_combos).
  • [3] gemm_universal: fp4 (pk_fp4) support is only available on gfx950.
  • [4] gemm_universal: fp32 GEMM is supported by the dispatcher (fp32_fp32_fp32 warp tile combos exist) but is omitted from matrix columns for consistency with the tile engine matrix format.
  • [5] gemm_multi_d: Codegen supports MultiDAdd and MultiDMultiply element-wise ops. Preselected kernel sets also test Relu, Gelu, FastGelu.
  • [6] grouped_conv: arch_filter.py defines conv operator types (CONV_FWD, CONV_BWD_DATA, CONV_BWD_WEIGHT, CONV3D_*) but dispatcher infrastructure is incomplete (ctypes bindings are stubs, conv_utils.hpp does not exist).
  • [7] (all dispatcher ops): gfx908, gfx1100, and gfx1200 also have warp_tile_combos in arch_specs.json but are not shown in the matrix's 4 GPU columns.
  • [8] (all dispatcher ops): int4, fp32, fp64 are valid dispatcher data types (defined in kernel_key.hpp DataType enum) but have no dedicated matrix columns.

Dispatcher GEMM Configuration Detail

Per-Variant Configuration

GEMM VariantPipelinesSchedulersEpiloguesElement-wise OpsOutput Dtype
gemm_universalmem, compv3, compv4intrawave, interwavecshuffle, defaultPassThroughSame as input (fp8/bf8 -> fp16)
gemm_preshufflepreshufflev2AutocshufflePassThroughSame as input (fp8/bf8 -> fp16)
gemm_multi_dmem, compv3, compv4intrawave, interwavecshuffle, defaultMultiDAdd, MultiDMultiplySame as input (fp8/bf8 -> fp16)

Warp Tile Combinations per GPU

GPUfp16bf16fp8bf8int8pk_fp4
gfx110016x16x1616x16x16----16x16x16--
gfx120016x16x1616x16x1616x16x1616x16x1616x16x16--
gfx120116x16x1616x16x1616x16x1616x16x1616x16x16--
gfx90832x32x8, 16x16x16, 32x32x16, 16x16x3232x32x8, 16x16x16, 32x32x16, 16x16x32----32x32x16, 16x16x32--
gfx90a32x32x8, 16x16x16, 32x32x16, 16x16x32, 4x64x16, 64x4x1632x32x8, 16x16x16, 32x32x16, 16x16x32, 4x64x16, 64x4x1632x32x16, 32x32x3232x32x16, 32x32x3232x32x16, 16x16x32--
gfx94232x32x8, 16x16x16, 32x32x16, 16x16x32, 4x64x16, 64x4x1632x32x8, 16x16x16, 32x32x16, 16x16x32, 4x64x16, 64x4x1632x32x16, 32x32x32, 16x16x32, 16x16x6432x32x16, 32x32x32, 16x16x32, 16x16x6432x32x16, 16x16x32--
gfx95032x32x8, 16x16x16, 32x32x16, 16x16x32, 4x64x16, 64x4x1632x32x8, 16x16x16, 32x32x16, 16x16x32, 4x64x16, 64x4x1632x32x16, 32x32x32, 16x16x32, 16x16x64, 16x16x128, 32x32x6432x32x16, 32x32x32, 16x16x32, 16x16x64, 16x16x128, 32x32x6432x32x16, 16x16x3216x16x128

Preshuffle Warp Tile Combinations

GPUfp16bf16fp8bf8int8
gfx90a32x32x8, 16x16x16, 32x32x16, 16x16x32, 64x4x1632x32x8, 16x16x16, 32x32x16, 16x16x32, 64x4x1632x32x16, 32x32x3232x32x16, 32x32x32--
gfx94232x32x8, 16x16x16, 32x32x16, 16x16x32, 64x4x1632x32x8, 16x16x16, 32x32x16, 16x16x32, 64x4x1632x32x16, 32x32x32, 16x16x32, 16x16x6432x32x16, 32x32x32, 16x16x64, 16x16x3216x16x32, 32x32x16
gfx95032x32x8, 16x16x16, 32x32x16, 16x16x32, 64x4x1632x32x8, 16x16x16, 32x32x16, 16x16x32, 64x4x1632x32x16, 32x32x32, 16x16x32, 16x16x64, 16x16x128, 32x32x6432x32x16, 32x32x32, 16x16x64, 16x16x32, 16x16x128, 32x32x64--

Legend:

  • CK Tile Kernel column: First line is the kernel name. Lines prefixed with "engine:" show the dispatcher directory. Lines prefixed with "example:" show the CK Tile example directory under example/ck_tile/.
  • Green cell (✅): CK Tile implementation exists and the dispatcher supports it.
  • Red cell (❌): CK Tile implementation exists but the dispatcher does not support it.
  • Grey cell (blank): No CK Tile implementation exists for this combination.

Layout codes: Each 3-character layout code specifies the memory layout for tensors A, B, and C:

  • r = row-major, c = column-major
  • Example: rcr means A is row-major, B is column-major, C is row-major
  • gemm_multi_d uses 4-character codes internally (e.g., rcrr) where the 4th character is the D tensor layout (always r). The matrix shows only the 3-character A/B/C portion.

Data type mapping per config label:

Config LabelA (source)B (source)AccC (output)
fp16fp16fp16fp32fp16
bf16bf16bf16fp32bf16
int8int8int8int32int32
fp8fp8fp8fp32fp16
bf8bf8bf8fp32fp16
fp6fp6fp6fp32fp32
fp4fp16 or bf16fp4fp32fp16 or bf16

Troubleshooting

Build Issues

ProblemSolution
hipcc not foundSet -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc
hip not foundSet -DCMAKE_PREFIX_PATH=/opt/rocm
Very slow performanceUse -DCMAKE_BUILD_TYPE=Release
gfx942 not supportedCheck ROCm version (need 6.0+)
Kernel generation failsEnsure Python 3.8+ with NumPy installed in active venv
Build errorsFirst verify CK builds without dispatcher (see main CK README)

Runtime Issues

ProblemSolution
Library not foundBuild with -DBUILD_DISPATCHER_EXAMPLES=ON
No kernel foundCheck GPU arch matches build target
Python ModuleNotFoundErrorAdd paths to PYTHONPATH (see above)
Wrong resultsVerify layout matches your data

Debug Commands

# Check ROCm installation
rocminfo | head -20

# Check GPU architecture
rocminfo | grep "Name:"

# Verify library exists
ls -la build/examples/libdispatcher_*.so

# Run with verbose output
./build/examples/gemm_01_basic 2>&1

# Python: Check library loading
python3 -c "
import ctypes
lib = ctypes.CDLL('/path/to/libdispatcher_gemm_lib.so')
print('Library loaded successfully')
"

Clean Rebuild

If you encounter issues, try a clean rebuild:

cd dispatcher
rm -rf build
mkdir build && cd build
cmake .. [your options]
make -j$(nproc)

File Structure

dispatcher/
|---- README.md                    # This file
|---- CMakeLists.txt              # Build configuration
|
|---- include/ck_tile/dispatcher/  # C++ headers
|   |---- dispatcher.hpp           # Main dispatcher include
|   |---- registry.hpp             # GEMM kernel registry
|   |---- kernel_key.hpp           # Kernel configuration
|   |---- grouped_conv_config.hpp  # Grouped conv configuration
|   |---- grouped_conv_problem.hpp # Grouped conv problem (with builder)
|   |---- grouped_conv_kernel_decl.hpp  # Grouped conv kernel declarations
|   |---- grouped_conv_registry.hpp     # Grouped conv registry (thread-safe)
|   |---- grouped_conv_utils.hpp        # Grouped conv utilities
|   |---- fmha_types.hpp           # FMHA fwd/bwd args and traits structs
|   |---- fmha_problem.hpp         # FmhaProblem, FmhaProblemBuilder
|   |---- fmha_kernel_key.hpp      # FmhaKernelKey (Signature + Algorithm)
|   |---- fmha_kernel_instance.hpp # FmhaKernelInstance virtual interface
|   |---- fmha_kernel_decl.hpp     # Declarative FmhaSignature/FmhaAlgorithm
|   |---- fmha_registry.hpp        # FmhaRegistry (thread-safe)
|   +---- fmha_dispatcher.hpp      # FmhaDispatcher (plan, select, run)
|
|---- src/                        # C++ implementation
|
|---- codegen/                    # Kernel generation
|   |---- codegen_common.py       # Shared: TileConfig, TraitConfigBase, type mappings
|   |---- unified_gemm_codegen.py # GEMM kernel generator
|   |---- unified_grouped_conv_codegen.py  # Grouped conv kernel generator
|   |---- unified_fmha_codegen.py # FMHA kernel generator
|   |---- fmha_arch_specs.json    # FMHA per-arch tile/pipeline specs
|   |---- fmha_rules.py           # FMHA validation rules
|   |---- fmha_profiles.py        # FMHA named profiles/receipts
|   +---- arch_specs.json         # GPU specifications
|
|---- python/                     # Python utilities
|   |---- dispatcher_common.py    # Shared: paths, validation, Colors, phased output
|   |---- ctypes_utils.py         # GEMM ctypes utilities
|   |---- grouped_conv_utils.py   # Grouped conv utilities
|   +---- fmha_utils.py           # FMHA: JIT compile, FmhaRunner, FmhaKernelConfig
|
|---- scripts/                    # Build scripts
|   |---- compile_gemm_examples.py           # GEMM build script
|   +---- compile_grouped_conv_examples.py   # Grouped conv build script
|
|---- bindings/ctypes/            # Python ctypes interface
|   |---- gemm_ctypes_lib.cpp     # GEMM Python library
|   |---- conv_ctypes_lib.cpp     # Grouped conv Python library
|   +---- fmha_ctypes_lib.cpp     # FMHA Python library
|
|---- examples/                   # Examples
|   |---- gemm/
|   |   |---- cpp/                # C++ GEMM examples (01-07)
|   |   +---- python/             # Python GEMM examples (01-11)
|   |---- grouped_conv/
|   |   |---- cpp/                # C++ Grouped Conv examples (01-07)
|   |   +---- python/             # Python Grouped Conv examples (01-06)
|   +---- fmha/
|       |---- cpp/                # C++ FMHA examples (01-35)
|       +---- python/             # Python FMHA examples (01-38)
|
+---- tests/                      # Unit tests (C++ and Python)

Example Documentation

DirectoryREADME
GEMM C++examples/gemm/cpp/README.md
GEMM Pythonexamples/gemm/python/README.md
FMHA C++examples/fmha/cpp/ (35 examples covering all FMHA variants)
FMHA Pythonexamples/fmha/python/ (38 examples with JIT compilation)
Codegencodegen/README.md
Python Utilspython/README.md
C++ Headersinclude/ck_tile/dispatcher/README.md

Grouped Convolution Support

Grouped convolution is fully supported alongside GEMM, with shared infrastructure to eliminate duplication.

Python

# Generate grouped conv kernels
python3 codegen/unified_grouped_conv_codegen.py \
    --output-dir build/generated_kernels \
    --datatype fp16 --variant forward --ndim-spatial 2

# Build grouped conv examples
python3 scripts/compile_grouped_conv_examples.py examples/grouped_conv/cpp/01_basic_grouped_conv.cpp

Key Files

ComponentFile
C++ Headersinclude/ck_tile/dispatcher/grouped_conv_*.hpp
Python Codegencodegen/unified_grouped_conv_codegen.py
Python Utilspython/grouped_conv_utils.py
Build Scriptscripts/compile_grouped_conv_examples.py
Shared Codegencodegen/codegen_common.py
Shared Utilspython/dispatcher_common.py

Variants

  • Forward (grouped_conv_fwd) - Standard grouped convolution
  • Backward Data (grouped_conv_bwd_data) - Gradient w.r.t. input
  • Backward Weight (grouped_conv_bwd_weight) - Gradient w.r.t. weights

Shared Infrastructure

GEMM and grouped convolution share common code to avoid duplication:

  • codegen/codegen_common.py - TileConfig, TraitConfigBase, type mappings, parallel generation, arch-aware expansion
  • python/dispatcher_common.py - Path helpers, validation, auto-correction, Colors, phased output

License

MIT License - Copyright (c) 2025, Advanced Micro Devices, Inc.