MoA Kernel
October 9, 2024 · View on GitHub
This is the CUDA kernel implementation for MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression.
Installation
We test our kernel with CUDA 12.4 and PyTorch 2.4. Install the environment required by MoA before installing the kernel.
```shell
cd python
FLASHINFER_LOGITS_POST_HOOKS=0 FLASHINFER_HEAD_DIMS=64,128 FLASHINFER_POS_ENCODING_MODES=0 python setup.py install
```
Quick Test
```shell
python accuracy_test.py
```
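The accuracy test compares the kernel's output against a dense reference. As a rough illustration of the idea (not the actual test, whose internals are not shown here), the sketch below implements masked dense attention in NumPy and checks that a sliding-window mask, the kind of per-head sparse pattern MoA uses, changes the result relative to full causal attention once the window is smaller than the sequence. The function names and the window size are hypothetical.

```python
import numpy as np

def dense_attention(q, k, v, mask):
    # Reference implementation: masked softmax attention, O(n^2) memory.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def sliding_window_mask(n, window):
    # Causal mask restricted to the most recent `window` positions.
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - window)

rng = np.random.default_rng(0)
n, d = 16, 8
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))

full = dense_attention(q, k, v, sliding_window_mask(n, n))    # full causal
sparse = dense_attention(q, k, v, sliding_window_mask(n, 4))  # window of 4
print(np.abs(full - sparse).max())  # nonzero: the sparse mask changes outputs
```

An actual kernel test would run the same inputs through the CUDA kernel and assert closeness to the dense reference under the *same* mask, rather than comparing different masks as this sketch does.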
Acknowledgement
Our kernel is built upon the FlashInfer project.
TODO
- support batch size > 1
- support multi-GPU inference
- support GQA