May 10, 2024

Consistency Large Language Models: A Family of Efficient Parallel Decoders

Consistency large language models (CLLMs) are a new family of models that reduce inference latency by efficiently decoding $n$ tokens in parallel. This decoding method, called Jacobi decoding, improves inference efficiency over conventional auto-regressive (AR) decoding. CLLMs are trained to perform efficient Jacobi decoding: they map any randomly initialized $n$-token sequence to the same result as AR decoding in as few steps as possible.
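To make the idea concrete, here is a minimal sketch of Jacobi decoding next to AR decoding. It is an illustration under stated assumptions, not code from this repository: `toy_next_token` is a hypothetical deterministic stand-in for a model's greedy prediction, and `ar_decode` and `jacobi_decode` are names chosen for this example. With such an untrained stand-in, Jacobi decoding may need up to $n$ iterations; CLLMs are trained so that the fixed point is reached in fewer.

```python
from typing import Callable, List, Sequence, Tuple

# Greedy "next token" function: given a prefix of token ids, return the next id.
NextToken = Callable[[Sequence[int]], int]


def toy_next_token(prefix: Sequence[int]) -> int:
    """Hypothetical stand-in for an LLM's greedy prediction (not a real model)."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def ar_decode(next_token: NextToken, prompt: List[int], n: int) -> List[int]:
    """Conventional AR decoding: one new token per step, n sequential steps."""
    out = list(prompt)
    for _ in range(n):
        out.append(next_token(out))
    return out[len(prompt):]


def jacobi_decode(
    next_token: NextToken, prompt: List[int], init_guess: List[int]
) -> Tuple[List[int], List[List[int]]]:
    """Refine all n tokens together until they stop changing (a fixed point).

    Returns the converged tokens and the trajectory of intermediate states.
    """
    n = len(init_guess)
    current = list(init_guess)
    trajectory = [list(current)]
    for _ in range(n):  # n iterations always suffice for greedy decoding
        # Position i is re-predicted from the prompt plus the previous guess for
        # positions < i. A real model would do this in one parallel forward pass.
        updated = [next_token(list(prompt) + current[:i]) for i in range(n)]
        if updated == current:
            break  # nothing changed: fixed point reached
        current = updated
        trajectory.append(list(current))
    return current, trajectory


if __name__ == "__main__":
    prompt = [3, 1, 4]
    tokens, trajectory = jacobi_decode(toy_next_token, prompt, [0] * 8)
    print("matches AR decoding:", tokens == ar_decode(toy_next_token, prompt, 8))
    print("Jacobi iterations:", len(trajectory) - 1)
```

The fixed point is unique for greedy decoding, because the first token depends only on the prompt, the second only on the prompt and the first token, and so on. That is why the converged result equals the AR output regardless of the initial guess.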

Experimental results demonstrate the effectiveness of CLLMs, showing $2.4\times$ to $3.4\times$ improvements in generation speed on a variety of tasks.

A demo of a CLLM achieving a significant ($\sim3\times$) improvement in generation speed when solving a basic math problem is shown below:

News
Introduction
Installation
Model Weights
Usage
Citation

News 🔥

[2024/3] CLLMs are integrated in FastChat!
[2024/2] The CLLM paper is now available on arXiv. CLLM model checkpoints are released on Huggingface Hub.

Introduction

Consistency Large Language Models (CLLMs) are a family of efficient parallel decoders refined from pre-trained LLMs.

Compared with existing fast decoding techniques, CLLMs achieve fast parallel decoding without the need for:

Draft models
Architectural modifications/auxiliary model components

This introduces a number of advantages for CLLMs:

CLLMs don't have to deal with the complexity of obtaining 'good' draft models and managing two different models in a single system.
CLLMs share the same architecture as their target LLMs and require no additional engineering effort when applying the technique to different models.
CLLMs can be integrated seamlessly with other techniques for efficient LLM inference (e.g. Lookahead Decoding) to achieve an even greater speedup.

Installation

Environment setup:

conda create -n cllm python=3.10
conda activate cllm

Clone this repository and build from source:

git clone git@github.com:hao-ai-lab/Consistency_LLM.git
cd Consistency_LLM

Install dependencies:

pip install -r requirements.txt
pip install flash-attn==2.4.1

Model Weights

Target Pre-trained Models

Size	Dataset	Huggingface Repo
7B	ShareGPT	cllm/vicuna-7b-sharegpt-gpt4-48k
7B	GSM8K (Math)	GAIR/Abel-7B-001
7B	Spider (Text-to-SQL)	cllm/deepseekcoder-7b-instruct-spider
7B	Code-Search-Net Python	cllm/deepseekcoder_7b_codesearch_net_python

CLLMs

Size	Dataset	Huggingface Repo
7B	ShareGPT	cllm/consistency-llm-7b-sharegpt48k
7B	GSM8K (Math)	cllm/consistency-llm-7b-math
7B	Spider (Text-to-SQL)	cllm/consistency-llm-7b-spider
7B	Code-Search-Net Python	cllm/consistency-llm-7b-codesearchnet

Usage

Inference

bash applications/run_chat_cllm.sh {model_path} {cllm_type}

cllm_type can be one of spider, python, gsm8k, or sharegpt.
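As a convenience, the checkpoints listed under Model Weights can be paired with these cllm_type values. The sketch below is an illustration, not part of this repository: the mapping (in particular python to the Code-Search-Net checkpoint) is an assumption inferred from the dataset names, and `repo_for` and `download_checkpoint` are hypothetical helper names. The download step uses `snapshot_download` from the `huggingface_hub` package, which must be installed separately.

```python
# Assumed pairing of cllm_type values with the CLLM repos listed under
# "Model Weights"; the README does not state this mapping explicitly.
CLLM_REPOS = {
    "sharegpt": "cllm/consistency-llm-7b-sharegpt48k",
    "gsm8k": "cllm/consistency-llm-7b-math",
    "spider": "cllm/consistency-llm-7b-spider",
    "python": "cllm/consistency-llm-7b-codesearchnet",
}


def repo_for(cllm_type: str) -> str:
    """Return the Huggingface repo id assumed to match a cllm_type."""
    try:
        return CLLM_REPOS[cllm_type]
    except KeyError:
        valid = ", ".join(sorted(CLLM_REPOS))
        raise ValueError(
            f"unknown cllm_type {cllm_type!r}; expected one of: {valid}"
        ) from None


def download_checkpoint(cllm_type: str) -> str:
    """Download the checkpoint and return the local directory it was saved to."""
    # Imported lazily so the mapping above is usable without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_for(cllm_type))


if __name__ == "__main__":
    print(repo_for("gsm8k"))
```

The directory returned by `download_checkpoint` could then be supplied as {model_path}, assuming the script accepts a local checkpoint directory.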

Training

Collect Jacobi trajectory:

Method 1: Download the Jacobi trajectory directly from our Huggingface Hub page to data/collected_jacobi_trajectory/.
Method 2 (generate a trajectory suited to your own target model and dataset): Some raw datasets contain additional information such as database dependencies, or cannot be loaded directly from Huggingface Hub (for example, Spider and ShareGPT). These must be installed in data/raw_data. Then run scripts/generate_trajectory.sh, and the training dataset for a CLLM will be saved in data/collected_jacobi_trajectory/.

For example, for the gsm8k dataset, run:

# max_new_tokens corresponds to the size of n_token_sequence
CUDA_VISIBLE_DEVICES=0 bash scripts/generate_trajectory.sh {filename} {model_path} {n_token_seq_size} {max_new_seq_len}

Other command options:

--filename: path to the raw dataset, currently supporting {data/raw_data/spider, code_search_net, data/raw_data/gsm8k_train.jsonl, data/raw_data/ShareGPT_V3_unfiltered_cleaned_split.json}
--data_size: maximum number of prompts used to extract Jacobi trajectories
--use_aug: use data augmentation technique
--use_labels: add dataset's labels to the output file

Train a CLLM:

bash scripts/train_cllm.sh {model_path} {trajectory_file} {output_path} {n_token_seq_size}

Evaluation

We follow the same settings as human-eval, Spider, MT-bench and GSM8K to evaluate CLLMs' generation quality. Example code for evaluating CLLMs' throughput (measured in tokens/s), fast-forwarded token count, and stationary token count can be found in the eval folder. Take the GSM8K dataset as an example:

To test the speedup, run:

CUDA_VISIBLE_DEVICES=0 bash eval/gsm8k/speedup.sh {model_path} {target_model_path} {max_new_tokens}

To test the accuracy, run:

CUDA_VISIBLE_DEVICES=0 python eval/gsm8k/acc.py --model_dir path_to_cllm --temperature 0.0 --top_p 1.0 --output_file_name 'cllm_generated_gsm8k.jsonl' \
--dev_set "gsm8k" --prompt_type math-single --max_new_tokens_for_consistency 16 --max_tokens 1024 --use_consistency_decoding
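For readers unfamiliar with the metrics named above, the following sketch shows one way to compute them from a recorded Jacobi trajectory (a list of intermediate $n$-token states ending at the converged one). The definitions here are simplified assumptions made for illustration, and the scripts in the eval folder may count differently: a token is treated as fast-forwarded when it joins the correct prefix in the same iteration as an earlier token, and as stationary when it reaches its final value and keeps it while some earlier token is still wrong. All function names are hypothetical.

```python
from typing import List


def correct_prefix_len(state: List[int], final: List[int]) -> int:
    """Number of leading tokens in `state` that already match `final`."""
    k = 0
    while k < len(final) and state[k] == final[k]:
        k += 1
    return k


def fast_forwarded_count(trajectory: List[List[int]]) -> int:
    """Extra tokens (beyond one per iteration) added to the correct prefix."""
    final = trajectory[-1]
    total = 0
    for prev, cur in zip(trajectory, trajectory[1:]):
        gained = correct_prefix_len(cur, final) - correct_prefix_len(prev, final)
        total += max(gained - 1, 0)
    return total


def stationary_count(trajectory: List[List[int]]) -> int:
    """Tokens that settled on their final value while an earlier token was wrong."""
    final = trajectory[-1]
    count = 0
    for pos in range(len(final)):
        # Walk back to the first state from which this position never changes.
        first_stable = len(trajectory) - 1
        while first_stable > 0 and trajectory[first_stable - 1][pos] == final[pos]:
            first_stable -= 1
        # If the correct prefix had not reached `pos` yet, an earlier token was wrong.
        if correct_prefix_len(trajectory[first_stable], final) <= pos:
            count += 1
    return count


def tokens_per_second(num_tokens: int, elapsed_seconds: float) -> float:
    """Throughput in tokens/s."""
    return num_tokens / elapsed_seconds


if __name__ == "__main__":
    # Hand-written toy trajectory: initial guess, one intermediate state, fixed point.
    trajectory = [
        [0, 0, 0, 0],
        [5, 0, 9, 0],  # position 2 is already right although position 1 is wrong
        [5, 7, 9, 2],
    ]
    print("fast-forwarded:", fast_forwarded_count(trajectory))
    print("stationary:", stationary_count(trajectory))
    # Speedup is the ratio of two throughputs (made-up timings for illustration).
    print("speedup:", tokens_per_second(120, 1.0) / tokens_per_second(120, 3.0))
```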

Citation

This is the official project repository for the following paper. If you find this repository helpful, please cite:

@misc{kou2024cllms,
      title={CLLMs: Consistency Large Language Models}, 
      author={Siqi Kou and Lanxiang Hu and Zhezhi He and Zhijie Deng and Hao Zhang},
      year={2024},
      eprint={2403.00835},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}