Quantization of GLM-130B

October 13, 2022

Usage

Please note that SwissArmyTransformer>=0.2.11 is required for quantization.

Set CHECKPOINT_PATH in configs/model_glm_130b_{int4/int8}.sh to your local checkpoint folder. The model is first initialized from the FP16 checkpoint in CPU memory, then dynamically quantized and transferred to GPU memory, so please make sure you have enough CPU memory (>260GB) to hold the FP16 model weights.
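The 260GB figure follows from simple arithmetic. The sketch below is a back-of-the-envelope estimate, assuming roughly 130e9 parameters (inferred from the model name) and counting only raw weight storage; the helper name `weight_gigabytes` is ours and is not part of the repository.

```python
# Rough weight-storage estimate per precision.
# Assumption: about 130e9 parameters, as the model name suggests.
NUM_PARAMS = 130e9


def weight_gigabytes(bits_per_param: int, num_params: float = NUM_PARAMS) -> float:
    """Return raw weight storage in GB (10**9 bytes) at the given bit width."""
    return num_params * bits_per_param / 8 / 1e9


for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: ~{weight_gigabytes(bits):.0f} GB")
```

The quantized estimates ignore the tensors that stay in higher precision and any stored scale factors, so real quantized checkpoints are somewhat larger than these numbers.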

Pay attention to the tensor parallel dimension of the model checkpoint. We only provide the checkpoint in 8-way tensor parallel, i.e. 8 GPUs together store one whole model. If you need to run inference on fewer GPUs, e.g. 4 * RTX 3090 GPUs with INT4 precision, first convert the checkpoint to 4-way tensor parallel using the following command, and modify MP_SIZE in the corresponding model config file.

python tools/convert_tp.py \
    --input-folder <SRC_CKPT_PATH>  \
    --output-folder <DST_CKPT_PATH> \
    --target-tp 4
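Conceptually, changing the tensor parallel degree means merging the existing shards of each split weight and cutting the result into the new number of pieces. The toy sketch below illustrates only that idea on a row-split matrix; it is not the logic of tools/convert_tp.py, which also has to know which axis each layer is split on and how fused weights are laid out. The function name `reshard_rows` is hypothetical.

```python
def reshard_rows(shards, target_tp):
    """Merge row-wise shards and split the rows into `target_tp` equal parts."""
    rows = [row for shard in shards for row in shard]
    if len(rows) % target_tp != 0:
        raise ValueError("row count must be divisible by target_tp")
    per_rank = len(rows) // target_tp
    return [rows[i * per_rank:(i + 1) * per_rank] for i in range(target_tp)]


# Toy weight with 16 rows of 2 values, stored as 8 shards of 2 rows each.
full_weight = [[float(r), float(r) + 0.5] for r in range(16)]
shards_8 = [full_weight[i * 2:(i + 1) * 2] for i in range(8)]

shards_4 = reshard_rows(shards_8, target_tp=4)
print(len(shards_4), len(shards_4[0]))  # 4 4
```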

Finally, change the model config file from configs/model_glm_130b.sh to configs/model_glm_130b_{int4/int8}.sh in your scripts (e.g. scripts/generate.sh), then run your scripts as usual.

By default, the full-precision checkpoint is expected to be loaded. Running the conversion script with --quantization-bit-width <4 or 8> produces quantized model weights. To load from a quantized checkpoint, add --from-quantized-checkpoint to your model config file.

Evaluation Results

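| Precision | MMLU (Accuracy↑) | LAMBADA (Accuracy↑) | WikiText-2 (PPL↓) | WikiText-103 (PPL↓) | PTB (PPL↓) |
|---|---|---|---|---|---|
| FP16 | 44.751 | 80.206 | 10.901 | 10.759 | 18.964 |
| INT8 | 44.709 | 80.206 | 10.904 | 10.763 | 18.994 |
| INT4 | 44.801 | 79.468 | 11.167 | 11.046 | 19.535 |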

Space and Speed Benchmark

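| Hardware | GPU Memory | Precision | 512 | 1024 | 2048 |
|---|---|---|---|---|---|
| 8 * A100 | 40 GB | FP16 | 45.21 s | 89.00 s | 179.22 s |
| 8 * V100 | 32 GB | INT8 | 106.35 s | 216.50 s | 449.17 s |
| 4 * RTX 3090 | 24 GB | INT4 | 138.66 s | 292.69 s | 649.64 s |
| 8 * RTX 2080 Ti | 11 GB | INT4 | 117.39 s | 240.96 s | 528.66 s |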

The results in the table above were measured with SAT (SwissArmyTransformer). Using FasterTransformer can speed up inference by more than 2X, as shown in the table below; detailed usage is described in Inference with FasterTransformer.

| Hardware | GPU Memory | Precision | 128 Encode / Decode | 512 Encode / Decode | 1024 Encode / Decode | 2048 Encode / Decode |
|---|---|---|---|---|---|---|
| 8 * A100 | 40 GB | INT4 | 145 ms / 4.29 s | 183 ms / 17.7 s | 313 ms / 37.8 s | 495 ms / 86.0 s |
| 4 * A100 | 80 GB | INT4 | 174 ms / 6.62 s | 272 ms / 27.1 s | 439 ms / 56.2 s | 810 ms / 123 s |
| 8 * V100 | 32 GB | INT4 | 309 ms / 6.97 s | 666 ms / 28.1 s | 1208 ms / 58.4 s | 2304 ms / 125 s |
| 4 * V100 | 32 GB | INT4 | 448 ms / 11.4 s | 843 ms / 45.87 s | 1488 ms / 93.5 s | 2803 ms / 196 s |
| 8 * RTX 3090 | 24 GB | INT4 | 283 ms / 5.07 s | 915 ms / 20.5 s | 1793 ms / 42.7 s | 3477 ms / 90.3 s |
| 4 * RTX 3090 | 24 GB | INT4 | 374 ms / 8.16 s | 1300 ms / 32.3 s | OOM / 66.5 s | OOM / 150 s |
| 8 * RTX 2080 Ti | 11 GB | INT4 | 392 ms / 6.77 s | 1044 ms / 27.29 s | OOM / 56.02 s | OOM / OOM |

Details

Typical methods quantize both model weights and activations to INT8, enabling the INT8 matrix multiplication kernel for efficiency. However, we found that there are outliers in GLM-130B's activations, making it hard to reduce the precision of activations.

Concurrently, researchers from Meta AI also found the emergent outlier issue in large-scale transformers (>6.8B), which is consistent with our observations on GLM-130B. They conducted an in-depth analysis and found that the outliers make up only about 0.1% of all feature dimensions, so the matrix multiplication can be decomposed so that only these particular dimensions are multiplied in high precision.
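To make the decomposition idea concrete, here is a minimal pure-Python sketch for a matrix-vector product. It is an illustration under our own assumptions, not the referenced researchers' implementation: the function names, the absmax INT8 scheme, and the outlier threshold of 6.0 are all chosen for the example.

```python
def absmax_quantize(vec, qmax=127):
    """Symmetric absmax quantization: integers in [-qmax, qmax] plus one scale."""
    absmax = max((abs(v) for v in vec), default=0.0)
    scale = absmax / qmax if absmax > 0 else 1.0
    return [round(v / scale) for v in vec], scale


def mixed_precision_matvec(weight, x, threshold=6.0):
    """Compute W @ x, keeping outlier input dimensions in full precision."""
    # Assumption: a dimension is an outlier if its activation magnitude exceeds `threshold`.
    outliers = [i for i, v in enumerate(x) if abs(v) > threshold]
    regular = [i for i in range(len(x)) if i not in outliers]
    qx, sx = absmax_quantize([x[i] for i in regular])
    out = []
    for row in weight:
        qw, sw = absmax_quantize([row[i] for i in regular])
        int_acc = sum(a * b for a, b in zip(qw, qx))      # integer multiply-accumulate
        high = sum(row[i] * x[i] for i in outliers)       # high-precision part
        out.append(int_acc * sw * sx + high)
    return out


weight = [[0.5, -0.25, 0.1, 0.3], [-0.2, 0.4, 0.05, -0.6]]
x = [0.8, -1.1, 40.0, 0.3]  # the third dimension is an outlier

exact = [sum(w * a for w, a in zip(row, x)) for row in weight]
approx = mixed_precision_matvec(weight, x)
print(exact, approx)
```

The approach pays off only when the outlier set is small, because every outlier dimension is handled by the slower high-precision path.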

Figure: Distribution of outliers (shown in white) in GLM-130B's activations

Unfortunately, the outliers in GLM-130B can make up as much as 30% of the feature dimensions, possibly because we used GLU as a variant of FFN. A mixed-precision decomposition for matmul can therefore be much less efficient than a single FP16 matmul. After a few weeks of experiments, we decided to keep activations in FP16 and quantize only the model weights. The quantized model parameters are then dynamically converted to FP16 at runtime, which introduces a small computational overhead but greatly reduces the GPU memory required to store model weights.
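The weight-only scheme can be sketched in a few lines. The example below is a simplified illustration in plain Python floats rather than FP16 GPU kernels, and it assumes one scale per weight row; the function names are ours and do not correspond to the repository's code.

```python
def quantize_row(row, bits=8):
    """Symmetric quantization of one weight row: integers in [-qmax, qmax] plus one scale."""
    qmax = 2 ** (bits - 1) - 1
    absmax = max(abs(w) for w in row)
    scale = absmax / qmax if absmax > 0 else 1.0
    q_row = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return q_row, scale


def dequantize_row(q_row, scale):
    """Convert stored integers back to floating point just before use."""
    return [q * scale for q in q_row]


def linear(x, q_weight, scales):
    """y = W x, with W dequantized row by row at runtime; x is never quantized."""
    out = []
    for q_row, scale in zip(q_weight, scales):
        w_row = dequantize_row(q_row, scale)
        out.append(sum(w * a for w, a in zip(w_row, x)))
    return out


weight = [[0.12, -0.50, 0.33], [1.20, 0.05, -0.70]]
x = [10.0, -200.0, 3.0]  # the large-magnitude activation stays unquantized

quantized = [quantize_row(row) for row in weight]
q_weight = [q for q, _ in quantized]
scales = [s for _, s in quantized]

exact = [sum(w * a for w, a in zip(row, x)) for row in weight]
approx = linear(x, q_weight, scales)
print(exact, approx)
```

Because only the weights are rounded, a large activation cannot saturate any quantization range; each weight is off by at most half a quantization step.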

We quantized all linear layers, as they account for most of the model parameters. All model weights except the input/output embeddings, layernorm, and bias terms are quantized using vector-wise symmetric quantization. At INT4 precision, two INT4 weights are packed into one INT8 value to save GPU memory, so only about 70GB of GPU memory is required for the INT4 model weights.
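Packing two INT4 weights into one byte can be sketched as follows. This is a minimal illustration; the nibble order (first value in the low four bits) and the function names are assumptions made for the example, not the repository's actual storage layout.

```python
def pack_int4(values):
    """Pack signed 4-bit integers (each in [-8, 7]) two per byte."""
    if len(values) % 2 != 0:
        raise ValueError("need an even number of values")
    if any(not -8 <= v <= 7 for v in values):
        raise ValueError("values must fit in signed 4 bits")
    packed = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        # Assumption: first value goes in the low nibble, second in the high nibble.
        packed.append((lo & 0x0F) | ((hi & 0x0F) << 4))
    return bytes(packed)


def unpack_int4(packed):
    """Inverse of pack_int4: recover the signed 4-bit integers."""
    values = []
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values


q = [-7, 3, 0, 7, -1, 5]
packed = pack_int4(q)
assert unpack_int4(packed) == q
print(len(q), "values ->", len(packed), "bytes")  # 6 values -> 3 bytes
```

Halving the byte count this way is what brings the weight storage down to roughly a quarter of the FP16 size.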