MoA Kernel

October 9, 2024

This is the CUDA kernel implementation for MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression.

Installation

We tested the kernel with CUDA 12.4 and PyTorch 2.4. Install the required environment for MoA before installing the kernel.

cd python
FLASHINFER_LOGITS_POST_HOOKS=0 FLASHINFER_HEAD_DIMS=64,128 FLASHINFER_POS_ENCODING_MODES=0 python setup.py install
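Before building, it can help to confirm your toolchain matches the tested versions above (CUDA 12.4, PyTorch 2.4). A minimal standard-library sketch; the `at_least` helper is ours, not part of MoA:

```python
# Hypothetical helper: compare dotted version strings numerically
# against the versions this kernel was tested with.
def at_least(installed: str, tested: str) -> bool:
    """Return True if `installed` is at least `tested` (e.g. "12.6" >= "12.4")."""
    to_tuple = lambda v: tuple(int(p) for p in v.split("."))
    return to_tuple(installed) >= to_tuple(tested)

# Example: a CUDA 12.6 / PyTorch 2.4.1 environment meets both minimums.
print(at_least("12.6", "12.4"))   # CUDA version check -> True
print(at_least("2.4.1", "2.4"))   # PyTorch version check -> True
```

In practice you would pass in `torch.version.cuda` and `torch.__version__` from your environment instead of the literals shown here.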

Quick Test

python accuracy_test.py

Acknowledgement

Our kernel is built upon the FlashInfer project.

TODO

  • support batch size > 1
  • support multi-GPU inference
  • support GQA