Performance

November 9, 2022

The following two tables compare the performance of LightSeq and Faster Transformer (FT), tested on a Tesla T4 with a Transformer-base model. We also provide a TensorFlow (TF) baseline whose code is taken from Faster Transformer. In the first table, speedups are relative to the TF baseline; in the second, they are relative to FT.

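| batch_size | beam_size | seq_len | TF (ms) | FT (ms) | lightseq (ms) | PyTorch (ms) | FT speedup | lightseq speedup | PyTorch speedup |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 4 | 32 | 419.53 | 26.25 | 29.66 | 385.23 | 15.98 | 14.14 | 1.09 |
| 1 | 4 | 64 | 806.38 | 54.02 | 63.04 | 760.77 | 14.93 | 12.79 | 1.06 |
| 8 | 4 | 32 | 439.64 | 35.99 | 34.77 | 416.06 | 12.22 | 12.64 | 1.06 |
| 8 | 4 | 64 | 891.54 | 79.82 | 79.43 | 835.79 | 11.17 | 11.22 | 1.07 |
| 32 | 4 | 32 | 536 | 82.82 | 59.49 | 429.78 | 6.47 | 9.01 | 1.25 |
| 32 | 4 | 64 | 1116.74 | 198.95 | 155.08 | 929.97 | 5.61 | 7.20 | 1.20 |
| 64 | 4 | 32 | 668.45 | 144.53 | 101.54 | 520.66 | 4.62 | 6.58 | 1.28 |
| 64 | 4 | 64 | 1476.17 | 351.14 | 277.4 | 1237.79 | 4.20 | 5.32 | 1.19 |
| 128 | 4 | 32 | 996.88 | 271.8 | 200.49 | 721.66 | 3.67 | 4.97 | 1.38 |
| 128 | 4 | 64 | 2157.85 | 671.76 | 502.91 | 2158.81 | 3.21 | 4.29 | 1.00 |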
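
The speedup columns above appear to be the TF latency divided by the latency of each implementation. The sketch below recomputes them for the first row (batch_size 1, beam_size 4, seq_len 32); the helper name `speedup` is hypothetical and not part of LightSeq.

```python
def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    return round(baseline_ms / candidate_ms, 2)


# Latencies in ms, copied from the row batch_size=1, beam_size=4, seq_len=32.
tf_ms, ft_ms, lightseq_ms, pytorch_ms = 419.53, 26.25, 29.66, 385.23

print("FT speedup:", speedup(tf_ms, ft_ms))
print("lightseq speedup:", speedup(tf_ms, lightseq_ms))
print("PyTorch speedup:", speedup(tf_ms, pytorch_ms))
```

A value above 1 means the implementation is faster than the TF baseline for that configuration.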

Sampling

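| batch_size | topk/topp | seq_len | FT (ms) | lightseq (ms) | lightseq speedup |
|---|---|---|---|---|---|
| 1 | 0.75 | 32 | 34.4 | 29.66 | 1.16 |
| 1 | 0.75 | 64 | 71.45 | 59.72 | 1.20 |
| 32 | 0.75 | 32 | 56.61 | 40.40 | 1.40 |
| 32 | 0.75 | 64 | 120.39 | 100.36 | 1.20 |
| 128 | 0.75 | 32 | 111.4 | 94.68 | 1.18 |
| 128 | 0.75 | 64 | 246.97 | 270.55 | 0.91 |
| 1 | 32 | 32 | 34.35 | 28.06 | 1.22 |
| 1 | 32 | 64 | 72.48 | 56.4 | 1.29 |
| 32 | 32 | 32 | 40.15 | 39.23 | 1.02 |
| 32 | 32 | 64 | 87.46 | 98.62 | 0.89 |
| 128 | 32 | 32 | 99 | 90.83 | 1.09 |
| 128 | 32 | 64 | 222.62 | 262 | 0.85 |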
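
In the sampling table the speedup is FT latency divided by LightSeq latency, so a value below 1 marks a configuration where LightSeq was slower. The following sketch lists those configurations from the rows above; `slower_configs` is a hypothetical helper written for this illustration.

```python
# (batch_size, topk/topp, seq_len, FT ms, LightSeq ms), copied from the table above.
rows = [
    (1, 0.75, 32, 34.4, 29.66),
    (1, 0.75, 64, 71.45, 59.72),
    (32, 0.75, 32, 56.61, 40.40),
    (32, 0.75, 64, 120.39, 100.36),
    (128, 0.75, 32, 111.4, 94.68),
    (128, 0.75, 64, 246.97, 270.55),
    (1, 32, 32, 34.35, 28.06),
    (1, 32, 64, 72.48, 56.4),
    (32, 32, 32, 40.15, 39.23),
    (32, 32, 64, 87.46, 98.62),
    (128, 32, 32, 99, 90.83),
    (128, 32, 64, 222.62, 262),
]


def slower_configs(rows):
    """Return (batch_size, topk/topp, seq_len) where LightSeq latency exceeds FT latency."""
    return [(b, k, s) for b, k, s, ft_ms, ls_ms in rows if ft_ms / ls_ms < 1.0]


for config in slower_configs(rows):
    print(config)
```

In these measurements the slower cases all have seq_len 64 and a batch size of 32 or 128.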

Machine Translation

The following table compares performance on an fr2en translation model, a Transformer-big with a beam size of 4 and a target vocabulary size of approximately 30k. FP32 models are tested on Tesla P4, and FP16 models are tested on Tesla T4.

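| batch_size | seq_len | tf-fp32 (ms) | lightseq-fp32 (ms) | lightseq-fp16 (ms) | lightseq-fp32/tf-fp32 speedup | lightseq-fp16/lightseq-fp32 speedup | lightseq-fp16/tf-fp32 speedup |
|---|---|---|---|---|---|---|---|
| 1 | 6 | 303 | 47 | 27 | 6.44 | 1.74 | 11.22 |
| 1 | 12 | 399 | 63 | 38 | 6.33 | 1.66 | 10.5 |
| 1 | 18 | 702 | 108 | 59 | 6.5 | 1.83 | 11.9 |
| 1 | 24 | 1071 | 167 | 82 | 6.41 | 2.04 | 13.06 |
| 1 | 36 | 1234 | 192 | 105 | 6.42 | 1.83 | 11.75 |
| 1 | 46 | 1445 | 227 | 110 | 6.36 | 2.06 | 13.14 |
| 1 | 58 | 1887 | 303 | 142 | 6.22 | 2.13 | 13.29 |
| 1 | 70 | 2771 | 428 | 197 | 6.47 | 2.17 | 14.07 |
| 2 | 6 | 317 | 57 | 32 | 5.56 | 1.78 | 9.91 |
| 2 | 12 | 418 | 73 | 39 | 5.72 | 1.87 | 10.72 |
| 2 | 18 | 723 | 131 | 66 | 5.51 | 1.98 | 10.95 |
| 2 | 24 | 1113 | 201 | 91 | 5.53 | 2.21 | 12.23 |
| 2 | 36 | 1276 | 234 | 104 | 5.45 | 2.25 | 12.27 |
| 2 | 46 | 1521 | 282 | 121 | 5.39 | 2.33 | 12.57 |
| 2 | 58 | 2004 | 371 | 159 | 5.4 | 2.33 | 12.6 |
| 2 | 70 | 2965 | 542 | 221 | 5.47 | 2.45 | 13.42 |
| 4 | 6 | 326 | 61 | 39 | 5.34 | 1.56 | 8.36 |
| 4 | 12 | 433 | 85 | 47 | 5.09 | 1.81 | 9.21 |
| 4 | 18 | 761 | 154 | 77 | 4.94 | 2 | 9.88 |
| 4 | 24 | 1195 | 245 | 113 | 4.87 | 2.17 | 10.58 |
| 4 | 36 | 1391 | 282 | 128 | 4.93 | 2.2 | 10.87 |
| 4 | 46 | 1679 | 339 | 153 | 4.95 | 2.22 | 10.97 |
| 4 | 58 | 2232 | 455 | 199 | 4.9 | 2.29 | 11.22 |
| 4 | 70 | 3406 | 673 | 285 | 5.06 | 2.36 | 11.95 |
| 8 | 6 | 364 | 76 | 43 | 4.78 | 1.77 | 8.47 |
| 8 | 12 | 470 | 110 | 56 | 4.27 | 1.96 | 8.39 |
| 8 | 18 | 854 | 205 | 91 | 4.16 | 2.25 | 9.38 |
| 8 | 24 | 1381 | 318 | 139 | 4.34 | 2.29 | 9.94 |
| 8 | 36 | 1628 | 378 | 156 | 4.3 | 2.42 | 10.44 |
| 8 | 46 | 1989 | 459 | 193 | 4.33 | 2.38 | 10.31 |
| 8 | 58 | 2683 | 617 | 254 | 4.34 | 2.43 | 10.56 |
| 8 | 70 | 4251 | 949 | 382 | 4.47 | 2.48 | 11.13 |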
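
The three speedup columns are related: the FP16 speedup over TF is the product of the FP32 speedup over TF and the FP16 speedup over FP32. Here is a small sketch that checks this for the row batch_size 1, seq_len 12; the function name `speedups` is hypothetical, and the published figures may differ in the last digit from values recomputed this way.

```python
import math


def speedups(tf_fp32_ms: float, ls_fp32_ms: float, ls_fp16_ms: float):
    """Return the three speedup ratios reported in the table, unrounded."""
    fp32_vs_tf = tf_fp32_ms / ls_fp32_ms    # lightseq-fp32 relative to tf-fp32
    fp16_vs_fp32 = ls_fp32_ms / ls_fp16_ms  # lightseq-fp16 relative to lightseq-fp32
    fp16_vs_tf = tf_fp32_ms / ls_fp16_ms    # lightseq-fp16 relative to tf-fp32
    return fp32_vs_tf, fp16_vs_fp32, fp16_vs_tf


# Latencies in ms for batch_size=1, seq_len=12, copied from the table above.
a, b, c = speedups(399, 63, 38)

# The end-to-end speedup factors into the two intermediate speedups.
assert math.isclose(a * b, c)
print(round(a, 2), round(b, 2), round(c, 2))
```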

The following table compares performance on an en2zh translation model, a Transformer-deep (compared with Transformer-big, it has 16 encoder layers and otherwise the same configuration) with a beam size of 4 and a target vocabulary size of approximately 30k. FP32 models are tested on Tesla P4, and FP16 models are tested on Tesla T4.

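| batch_size | seq_len | tf-fp32 (ms) | lightseq-fp32 (ms) | lightseq-fp16 (ms) | lightseq-fp32/tf-fp32 speedup | lightseq-fp16/lightseq-fp32 speedup | lightseq-fp16/tf-fp32 speedup |
|---|---|---|---|---|---|---|---|
| 1 | 12 | 544 | 86 | 43 | 6.32 | 2 | 12.65 |
| 1 | 24 | 914 | 131 | 66 | 6.97 | 1.98 | 13.85 |
| 1 | 36 | 1290 | 200 | 93 | 6.45 | 2.15 | 13.87 |
| 1 | 48 | 1836 | 233 | 106 | 7.89 | 2.2 | 17.32 |
| 1 | 72 | 3456 | 482 | 212 | 7.17 | 2.27 | 16.3 |
| 1 | 84 | 2626 | 431 | 193 | 6.09 | 2.23 | 13.61 |
| 2 | 12 | 566 | 100 | 50 | 5.66 | 2 | 11.32 |
| 2 | 24 | 842 | 158 | 70 | 5.32 | 2.26 | 12.03 |
| 2 | 36 | 1287 | 247 | 103 | 5.21 | 2.4 | 12.5 |
| 2 | 48 | 1504 | 288 | 118 | 5.22 | 2.44 | 12.75 |
| 2 | 72 | 3131 | 611 | 240 | 5.12 | 2.55 | 13.05 |
| 2 | 84 | 2789 | 546 | 217 | 5.1 | 2.52 | 12.85 |
| 4 | 12 | 590 | 118 | 58 | 5 | 2.03 | 10.17 |
| 4 | 24 | 885 | 187 | 89 | 4.73 | 2.1 | 9.94 |
| 4 | 36 | 1380 | 301 | 127 | 4.58 | 2.37 | 10.87 |
| 4 | 48 | 1622 | 352 | 149 | 4.6 | 2.36 | 10.89 |
| 4 | 72 | 3492 | 763 | 311 | 4.57 | 2.45 | 11.23 |
| 4 | 84 | 3145 | 687 | 282 | 4.57 | 2.44 | 11.15 |
| 8 | 12 | 631 | 150 | 66 | 4.2 | 2.27 | 9.56 |
| 8 | 24 | 979 | 248 | 103 | 3.94 | 2.41 | 9.5 |
| 8 | 36 | 1584 | 412 | 156 | 3.84 | 2.64 | 10.15 |
| 8 | 48 | 1880 | 477 | 186 | 3.94 | 2.56 | 10.11 |
| 8 | 72 | 4218 | 1069 | 404 | 3.94 | 2.65 | 10.44 |
| 8 | 84 | 3831 | 976 | 373 | 3.92 | 2.62 | 10.27 |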

BERT

The following table compares the Hugging Face BERT-base model with the LightSeq model on a Tesla T4 using FP16.

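| batch_size | seq_len | Hugging Face (ms) | lightseq (ms) | lightseq speedup |
|---|---|---|---|---|
| 1 | 16 | 15.23 | 2.19 | 6.95 |
| 1 | 32 | 16.24 | 1.99 | 8.16 |
| 1 | 64 | 19.32 | 2.35 | 8.22 |
| 1 | 128 | 16.57 | 2.98 | 5.56 |
| 1 | 256 | 23.99 | 4.60 | 5.22 |
| 8 | 16 | 13.06 | 3.47 | 3.76 |
| 8 | 32 | 13.27 | 4.46 | 2.98 |
| 8 | 64 | 23.02 | 7.43 | 3.10 |
| 8 | 128 | 59.35 | 17.27 | 3.44 |
| 8 | 256 | 117.06 | 40.74 | 2.87 |
| 32 | 16 | 29.27 | 12.38 | 2.36 |
| 32 | 32 | 54.90 | 17.68 | 3.11 |
| 32 | 64 | 109.13 | 36.20 | 3.01 |
| 32 | 128 | 260.13 | 66.03 | 3.94 |
| 32 | 256 | 498.84 | 145.57 | 3.43 |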
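
Latency alone does not show how throughput changes with batch size. As a rough sketch, and assuming each latency above is the time to process one whole batch (the source does not state this explicitly), sequences per second can be derived as follows; `throughput` is a hypothetical helper.

```python
def throughput(batch_size: int, latency_ms: float) -> float:
    """Sequences per second, ASSUMING latency_ms is the time for one whole batch."""
    return batch_size / (latency_ms / 1000.0)


# (batch_size, seq_len, Hugging Face ms, LightSeq ms) for seq_len=128, from the table above.
rows = [
    (1, 128, 16.57, 2.98),
    (8, 128, 59.35, 17.27),
    (32, 128, 260.13, 66.03),
]

for batch_size, seq_len, hf_ms, ls_ms in rows:
    print(
        f"batch={batch_size} seq_len={seq_len} "
        f"HF={throughput(batch_size, hf_ms):.0f} seq/s "
        f"LightSeq={throughput(batch_size, ls_ms):.0f} seq/s"
    )
```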