ReadME.md
August 27, 2024
Note: This is only a reproduction effort, not an official release!
This is the code to reproduce the results from the paper QCQA: Quality and Capacity-aware grouped Query Attention [https://arxiv.org/abs/2406.10247].
A sample command for running on a Hugging Face model:
python expt.py --model facebook/opt-125m --num_heads 12 --num_layers 12 --num_groups 6 --KV_parse_strings 'model.decoder.layers.{}.self_attn.k_proj.weight;model.decoder.layers.{}.self_attn.v_proj.weight' --n_gen 10
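The `--KV_parse_strings` argument in the command above looks like a `;`-separated list of parameter-name templates, with `{}` standing in for the layer index. The sketch below shows one way such a string could be expanded into per-layer parameter names. The helper `expand_kv_names` is hypothetical and is not taken from `expt.py`; splitting on `;` and filling `{}` with a 0-based layer index are assumptions inferred from the sample command.

```python
def expand_kv_names(kv_parse_strings: str, num_layers: int) -> list[list[str]]:
    """Expand ';'-separated name templates into per-layer parameter names.

    Hypothetical helper: assumes each template contains one '{}' placeholder
    that is filled with a 0-based layer index.
    """
    templates = [t for t in kv_parse_strings.split(";") if t]
    return [[t.format(layer) for t in templates] for layer in range(num_layers)]


# Same template string as in the sample command for facebook/opt-125m.
kv_templates = (
    "model.decoder.layers.{}.self_attn.k_proj.weight;"
    "model.decoder.layers.{}.self_attn.v_proj.weight"
)
names = expand_kv_names(kv_templates, num_layers=12)

# One [key_name, value_name] pair per layer.
for key_name, value_name in names[:2]:
    print(key_name, value_name)
```

For a different architecture, the templates would need to match that model's own parameter names, which can be listed with `model.state_dict().keys()` in transformers.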
Installation
- No special installation is required.
Dependencies (Python packages)
- NumPy
- pymoo
- pandas
- transformers
Current support
- Algorithm 1 implementation from the paper.
- Algorithm 2 implementation from the paper.
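For orientation, here is a minimal sketch of the generic step that any head grouping relies on: merging the key (or value) heads of one layer by mean-pooling within each group. This is not the paper's Algorithm 1 or 2 and is not code from this repository. The function `merge_kv_heads`, the toy sizes, and the uniform pairing of neighbouring heads are illustrative assumptions, as is the weight layout (`out_features` x `in_features`, with heads contiguous along the output dimension).

```python
import numpy as np


def merge_kv_heads(weight: np.ndarray, num_heads: int, groups) -> np.ndarray:
    """Mean-pool per-head slices of a K or V projection weight within each group.

    Assumes `weight` has shape (num_heads * head_dim, in_dim) and that head h
    owns rows h * head_dim through (h + 1) * head_dim - 1.
    """
    out_dim, in_dim = weight.shape
    head_dim = out_dim // num_heads
    per_head = weight.reshape(num_heads, head_dim, in_dim)
    # One averaged head per group.
    merged = np.stack([per_head[list(g)].mean(axis=0) for g in groups])
    return merged.reshape(len(groups) * head_dim, in_dim)


# Toy sizes: 12 heads merged into 6 groups, as in the sample command.
num_heads, head_dim, hidden = 12, 4, 48
rng = np.random.default_rng(0)
k_weight = rng.standard_normal((num_heads * head_dim, hidden))

# Uniform grouping of neighbouring heads: (0, 1), (2, 3), ..., (10, 11).
uniform_groups = [(2 * g, 2 * g + 1) for g in range(6)]
merged = merge_kv_heads(k_weight, num_heads, uniform_groups)
print(merged.shape)  # (24, 48)
```

With 6 groups instead of 12 heads, the merged projection produces half as many key/value heads, so the KV cache for that layer is half the size. A non-uniform grouping can be passed in the same way, as a list of tuples of head indices.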
TODO:
Stay tuned for the following updates!
- Add support for multi-processing for parallel execution!
- Add support for percentile-based collation of results.
- Add torchtune support for evaluating and finetuning LLMs.
- (Bonus) Add support for the Hugging Face API for evaluating and finetuning LLMs.
- Add support for long context tasks.
- Add support for integration with other KV-cache methods for token-based compression (e.g., H2O).