MoA Kernel

October 9, 2024

This is the CUDA kernel implementation for MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression.

Installation

We tested the kernel with CUDA 12.4 and PyTorch 2.4. Install the required environment for MoA before installing the kernel.

cd python
FLASHINFER_LOGITS_POST_HOOKS=0 FLASHINFER_HEAD_DIMS=64,128 FLASHINFER_POS_ENCODING_MODES=0 python setup.py install
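Before building, it can help to confirm your toolchain matches the tested versions above (CUDA 12.4, PyTorch 2.4). A minimal standard-library sketch; the `at_least` helper is ours, not part of MoA:

```python
# Hypothetical helper: compare dotted version strings numerically
# against the versions this kernel was tested with.
def at_least(installed: str, tested: str) -> bool:
    """Return True if `installed` is at least `tested` (e.g. "12.6" >= "12.4")."""
    to_tuple = lambda v: tuple(int(p) for p in v.split("."))
    return to_tuple(installed) >= to_tuple(tested)

# Example: a CUDA 12.6 / PyTorch 2.4.1 environment meets both minimums.
print(at_least("12.6", "12.4"))   # CUDA version check -> True
print(at_least("2.4.1", "2.4"))   # PyTorch version check -> True
```

In practice you would pass in `torch.version.cuda` and `torch.__version__` from your environment instead of the literals shown here.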

Quick Test

python accuracy_test.py

Acknowledgement

Our kernel is built upon the FlashInfer project.

TODO

  • support batch size > 1
  • support multi-GPU inference
  • support GQA