May 10, 2024
Consistency Large Language Models: A Family of Efficient Parallel Decoders
Consistency large language models (CLLMs) are a new family of models that reduce inference latency by efficiently decoding tokens in parallel. This decoding method, called Jacobi decoding, improves inference efficiency compared with conventional auto-regressive (AR) decoding. CLLMs are trained to perform efficient Jacobi decoding: they map any randomly initialized $n$-token sequence to the same result as AR decoding in as few steps as possible.
Experiment results have demonstrated the effectiveness of CLLMs, showing $2.4\times$ to $3.4\times$ improvements in generation speed on a variety of tasks.
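To make the Jacobi decoding idea concrete, here is a minimal sketch on a toy problem. It is not the repository's implementation: `toy_next_token` is a hypothetical stand-in for a greedy LLM, and `ar_decode` and `jacobi_decode` are illustrative names. The sketch refines a randomly initialized $n$-token guess until it stops changing, then checks that the result matches AR decoding.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical interface: a greedy "model" mapping a token prefix to the next token.
NextToken = Callable[[Sequence[int]], int]


def toy_next_token(prefix: Sequence[int]) -> int:
    """Toy stand-in for an LLM's greedy next-token prediction (an assumption)."""
    return (31 * sum(prefix) + len(prefix)) % 50


def ar_decode(next_token: NextToken, prompt: Sequence[int], n: int) -> List[int]:
    """Conventional AR decoding: one model call per generated token."""
    out: List[int] = []
    for _ in range(n):
        out.append(next_token(list(prompt) + out))
    return out


def jacobi_decode(
    next_token: NextToken,
    prompt: Sequence[int],
    n: int,
    init: Optional[Sequence[int]] = None,
) -> Tuple[List[int], int]:
    """Refine an n-token guess until it reaches a fixed point.

    Returns the decoded tokens and the number of iterations used,
    including the final iteration that confirms convergence.
    """
    state = list(init) if init is not None else [0] * n
    iterations = 0
    while True:
        # Every position is recomputed from the PREVIOUS state only, so a real
        # LLM can update all n positions in a single parallel forward pass.
        new_state = [next_token(list(prompt) + state[:i]) for i in range(n)]
        iterations += 1
        if new_state == state:  # fixed point reached
            return state, iterations
        state = new_state


prompt = [3, 1, 4]
n = 8
rng = random.Random(0)
guess = [rng.randrange(50) for _ in range(n)]  # randomly initialized n-token sequence
tokens, iterations = jacobi_decode(toy_next_token, prompt, n, guess)
print(tokens == ar_decode(toy_next_token, prompt, n), iterations)
```

Each iteration fixes at least one more leading token, so the loop converges within `n` updates plus one confirming pass. The toy model has no reason to converge faster; the training objective of a CLLM is what pushes the iteration count down in practice.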
A demo of using a CLLM to achieve significant improvements in generation speed when solving a basic math problem is shown below:
News 🔥
- [2024/3] CLLMs are integrated in FastChat!
- [2024/2] The CLLM paper is now available on arXiv. CLLM model checkpoints are released on Huggingface Hub.
Introduction
Consistency Large Language Models (CLLMs) are a family of efficient parallel decoders refined from pre-trained LLMs.
Compared with existing fast decoding techniques, CLLMs achieve fast parallel decoding without the need for:
- Draft models
- Architectural modifications/auxiliary model components
This introduces a number of advantages for CLLMs:
- CLLMs don't have to deal with the complexity of obtaining 'good' draft models and managing two different models in a single system.
- CLLMs share the same architecture as the target LLMs and require no additional engineering effort when adapting the technique to different models.
- CLLMs can be integrated seamlessly with other techniques for efficient LLM inference (e.g. Lookahead Decoding) to achieve even more significant speedup.
Installation
- Environment setup:
conda create -n cllm python=3.10
conda activate cllm
- Clone this repository and build from source:
git clone git@github.com:hao-ai-lab/Consistency_LLM.git
cd Consistency_LLM
- Install dependencies:
pip install -r requirements.txt
pip install flash-attn==2.4.1
Model Weights
Target Pre-trained Models
| Size | Dataset | Huggingface Repo |
|---|---|---|
| 7B | ShareGPT | cllm/vicuna-7b-sharegpt-gpt4-48k |
| 7B | GSM8K (Math) | GAIR/Abel-7B-001 |
| 7B | Spider (Text-to-SQL) | cllm/deepseekcoder-7b-instruct-spider |
| 7B | Code-Search-Net Python | cllm/deepseekcoder_7b_codesearch_net_python |
CLLMs
| Size | Dataset | Huggingface Repo |
|---|---|---|
| 7B | ShareGPT | cllm/consistency-llm-7b-sharegpt48k |
| 7B | GSM8K (Math) | cllm/consistency-llm-7b-math |
| 7B | Spider (Text-to-SQL) | cllm/consistency-llm-7b-spider |
| 7B | Code-Search-Net Python | cllm/consistency-llm-7b-codesearchnet |
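As a convenience, the CLLM checkpoints in the table above can be looked up by task name and loaded with the Hugging Face `transformers` library. The sketch below is an assumption-laden illustration, not part of this repository: `CLLM_REPOS`, `resolve_repo`, and `load_cllm` are hypothetical names, and the pairing of task names with checkpoints is inferred from the table rows. Loading a model this way gives ordinary AR generation; Jacobi decoding itself is run through the repository's scripts.

```python
from typing import Dict

# Repo IDs copied from the CLLMs table; the task-name keys are an assumed pairing.
CLLM_REPOS: Dict[str, str] = {
    "sharegpt": "cllm/consistency-llm-7b-sharegpt48k",
    "gsm8k": "cllm/consistency-llm-7b-math",
    "spider": "cllm/consistency-llm-7b-spider",
    "python": "cllm/consistency-llm-7b-codesearchnet",
}


def resolve_repo(cllm_type: str) -> str:
    """Return the Huggingface repo ID for a task name, or raise ValueError."""
    try:
        return CLLM_REPOS[cllm_type]
    except KeyError:
        valid = ", ".join(sorted(CLLM_REPOS))
        raise ValueError(f"unknown cllm_type {cllm_type!r}; expected one of: {valid}") from None


def load_cllm(cllm_type: str):
    """Download and load a checkpoint (several GB for a 7B model)."""
    # Imported lazily so the lookup above works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo = resolve_repo(cllm_type)
    tokenizer = AutoTokenizer.from_pretrained(repo)
    model = AutoModelForCausalLM.from_pretrained(repo)
    return tokenizer, model


print(resolve_repo("gsm8k"))
```

Calling `load_cllm("gsm8k")` would fetch the weights from the Hub on first use, so it is left out of the example run.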
Usage
Inference
bash applications/run_chat_cllm.sh {model_path} {cllm_type}
cllm_type can be one of spider, python, gsm8k, or sharegpt.
Training
- Collect Jacobi trajectory:
- Method 1: Download the Jacobi trajectory directly from our Huggingface Hub page to data/collected_jacobi_trajectory/.
- Method 2 (generate a trajectory suited to your own target model and dataset): Some raw datasets contain additional information such as database dependencies, or cannot be loaded directly from Huggingface Hub. Spider and ShareGPT, for example, must be installed in data/raw_data. Then run scripts/generate_trajectory.sh, and the training dataset for a CLLM will be saved in data/collected_jacobi_trajectory/.
For example, for the gsm8k dataset, run:
# max_new_tokens corresponds to the size of n_token_sequence
CUDA_VISIBLE_DEVICES=0 bash scripts/generate_trajectory.sh {filename} {model_path} {n_token_seq_size} {max_new_seq_len}
Other command options
--filename: path to the raw dataset, currently supporting {data/raw_data/spider, code_search_net, data/raw_data/gsm8k_train.jsonl, data/raw_data/ShareGPT_V3_unfiltered_cleaned_split.json} \
--data_size: maximum number of prompts used to extract Jacobi trajectories \
--use_aug: use data augmentation technique \
--use_labels: add dataset's labels to the output file
- Train a CLLM:
bash scripts/train_cllm.sh {model_path} {trajectory_file} {output_path} {n_token_seq_size}
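For intuition about what a Jacobi trajectory is, the sketch below records every intermediate state of a Jacobi iteration on a toy problem and pairs each one with the final fixed point. It only illustrates the concept: `toy_next_token`, `collect_jacobi_trajectory`, and `make_training_pairs` are hypothetical names, and neither the data format nor the pairing is claimed to match what scripts/generate_trajectory.sh or scripts/train_cllm.sh actually produce or consume.

```python
from typing import Callable, List, Sequence, Tuple

NextToken = Callable[[Sequence[int]], int]


def toy_next_token(prefix: Sequence[int]) -> int:
    """Toy stand-in for a target model's greedy next-token prediction (an assumption)."""
    return (17 * sum(prefix) + 3 * len(prefix)) % 40


def collect_jacobi_trajectory(
    next_token: NextToken, prompt: Sequence[int], init: Sequence[int]
) -> List[List[int]]:
    """Return every state from the initial guess up to the fixed point."""
    trajectory = [list(init)]
    while True:
        state = trajectory[-1]
        # One Jacobi update: each position is recomputed from the previous state.
        new_state = [
            next_token(list(prompt) + state[:i]) for i in range(len(state))
        ]
        if new_state == state:  # last recorded state is the fixed point
            return trajectory
        trajectory.append(new_state)


def make_training_pairs(
    trajectory: List[List[int]],
) -> List[Tuple[List[int], List[int]]]:
    """Pair each non-converged state with the fixed point it should map to."""
    fixed_point = trajectory[-1]
    return [(state, fixed_point) for state in trajectory[:-1]]


trajectory = collect_jacobi_trajectory(toy_next_token, prompt=[2, 7], init=[0] * 6)
pairs = make_training_pairs(trajectory)
for state, target in pairs:
    print(state, "->", target)
```

Every pair shares the same target, which reflects the stated goal of mapping any point on the trajectory to the AR result in as few steps as possible.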
Evaluation
We follow the same settings as human-eval, Spider, MT-bench, and GSM8K to evaluate CLLMs' generation quality. Example code for measuring CLLMs' throughput (in tokens/s), fast-forwarded token count, and stationary token count can be found in the eval folder. Take the GSM8K dataset as an example:
To test the speedup, run:
CUDA_VISIBLE_DEVICES=0 bash eval/gsm8k/speedup.sh {model_path} {target_model_path} {max_new_tokens}
To test the accuracy, run:
CUDA_VISIBLE_DEVICES=0 python eval/gsm8k/acc.py --model_dir path_to_cllm --temperature 0.0 --top_p 1.0 --output_file_name 'cllm_generated_gsm8k.jsonl' \
--dev_set "gsm8k" --prompt_type math-single --max_new_tokens_for_consistency 16 --max_tokens 1024 --use_consistency_decoding
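The arithmetic behind the speed metrics can be sketched in a few lines. This is not the code in the eval folder. The function names are hypothetical, and the fast-forward count rests on an assumed reading of the term: tokens beyond the first that join the correct prefix within a single Jacobi iteration. Stationary tokens are not covered here.

```python
from typing import List, Sequence


def throughput(num_tokens: int, seconds: float) -> float:
    """Generation throughput in tokens/s."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return num_tokens / seconds


def speedup(cllm_tokens_per_s: float, baseline_tokens_per_s: float) -> float:
    """Ratio of CLLM throughput to the AR baseline's throughput."""
    return cllm_tokens_per_s / baseline_tokens_per_s


def converged_prefix_len(state: Sequence[int], fixed_point: Sequence[int]) -> int:
    """Number of leading tokens that already match the fixed point."""
    count = 0
    for got, want in zip(state, fixed_point):
        if got != want:
            break
        count += 1
    return count


def fast_forwarded_count(trajectory: List[List[int]]) -> int:
    """Count extra tokens fixed per iteration (assumed definition, see lead-in).

    Each Jacobi iteration extends the correct prefix by at least one token;
    anything beyond that one token is counted as fast-forwarded.
    """
    fixed_point = trajectory[-1]
    total = 0
    prev = converged_prefix_len(trajectory[0], fixed_point)
    for state in trajectory[1:]:
        cur = converged_prefix_len(state, fixed_point)
        total += max(0, cur - prev - 1)
        prev = cur
    return total


# Hand-written illustrative trajectory, not measured data.
example = [[9, 9, 9, 9], [5, 9, 9, 9], [5, 6, 7, 9], [5, 6, 7, 8]]
print(fast_forwarded_count(example))
print(speedup(throughput(100, 2.0), throughput(100, 5.0)))
```

In the example trajectory the correct prefix grows from 0 to 1 to 3 to 4 tokens, so only the second iteration contributes a fast-forwarded token under this definition.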
Citation
This is the official project repository for the following paper. If you find this repository helpful, please cite:
@misc{kou2024cllms,
title={CLLMs: Consistency Large Language Models},
author={Siqi Kou and Lanxiang Hu and Zhezhi He and Zhijie Deng and Hao Zhang},
year={2024},
eprint={2403.00835},
archivePrefix={arXiv},
primaryClass={cs.CL}
}