Performance

November 9, 2022

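The following two tables compare the performance of LightSeq and Faster Transformer (FT), tested on a Tesla T4 with a Transformer-base model. We also provide a TensorFlow (TF) baseline whose code comes from Faster Transformer. The beam search table additionally includes PyTorch timings. Speedups in the beam search table are relative to the TF baseline; the speedup in the sampling table is relative to FT.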

Beam search

batch_size	beam_size	seq_len	TF(ms)	FT(ms)	lightseq(ms)	PyTorch(ms)	FT speedup	lightseq speedup	PyTorch speedup
1	4	32	419.53	26.25	29.66	385.23	15.98	14.14	1.09
1	4	64	806.38	54.02	63.04	760.77	14.93	12.79	1.06
8	4	32	439.64	35.99	34.77	416.06	12.22	12.64	1.06
8	4	64	891.54	79.82	79.43	835.79	11.17	11.22	1.07
32	4	32	536	82.82	59.49	429.78	6.47	9.01	1.25
32	4	64	1116.74	198.95	155.08	929.97	5.61	7.20	1.20
64	4	32	668.45	144.53	101.54	520.66	4.62	6.58	1.28
64	4	64	1476.17	351.14	277.4	1237.79	4.20	5.32	1.19
128	4	32	996.88	271.8	200.49	721.66	3.67	4.97	1.38
128	4	64	2157.85	671.76	502.91	2158.81	3.21	4.29	1.00
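
As a quick check on how the speedup columns are derived, the following minimal sketch recomputes the first row of the beam search table. It assumes each speedup is the TF latency divided by the latency of the system being compared, which matches the published numbers; the helper name `speedup` is illustrative and not part of LightSeq.

```python
def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


# Latencies (ms) from the first row above: batch_size=1, beam_size=4, seq_len=32.
tf_ms = 419.53
latencies_ms = {"FT": 26.25, "lightseq": 29.66, "PyTorch": 385.23}

# Speedup of each system relative to the TF baseline, rounded to two decimals.
speedups = {name: round(speedup(tf_ms, ms), 2) for name, ms in latencies_ms.items()}
print(speedups)  # {'FT': 15.98, 'lightseq': 14.14, 'PyTorch': 1.09}
```

The same helper applies to the sampling table if the FT latency is passed as the baseline.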

Sampling

batch_size	topk/topp	seq_len	FT(ms)	lightseq(ms)	lightseq speedup
1	0.75	32	34.4	29.66	1.16
1	0.75	64	71.45	59.72	1.20
32	0.75	32	56.61	40.40	1.40
32	0.75	64	120.39	100.36	1.20
128	0.75	32	111.4	94.68	1.18
128	0.75	64	246.97	270.55	0.91
1	32	32	34.35	28.06	1.22
1	32	64	72.48	56.4	1.29
32	32	32	40.15	39.23	1.02
32	32	64	87.46	98.62	0.89
128	32	32	99	90.83	1.09
128	32	64	222.62	262	0.85

Machine Translation

The following table compares TensorFlow and LightSeq on a fr2en translation model, a Transformer-big with a beam size of 4 and a target vocabulary size of approximately 30k. FP32 models are tested on a Tesla P4, and FP16 models are tested on a Tesla T4.

batch_size	seq_len	tf-fp32, ms	lightseq-fp32, ms	lightseq-fp16, ms	lightseq-fp32/tf-fp32, speedup	lightseq-fp16/lightseq-fp32, speedup	lightseq-fp16/tf-fp32, speedup
1	6	303	47	27	6.44	1.74	11.22
1	12	399	63	38	6.33	1.66	10.5
1	18	702	108	59	6.5	1.83	11.9
1	24	1071	167	82	6.41	2.04	13.06
1	36	1234	192	105	6.42	1.83	11.75
1	46	1445	227	110	6.36	2.06	13.14
1	58	1887	303	142	6.22	2.13	13.29
1	70	2771	428	197	6.47	2.17	14.07
2	6	317	57	32	5.56	1.78	9.91
2	12	418	73	39	5.72	1.87	10.72
2	18	723	131	66	5.51	1.98	10.95
2	24	1113	201	91	5.53	2.21	12.23
2	36	1276	234	104	5.45	2.25	12.27
2	46	1521	282	121	5.39	2.33	12.57
2	58	2004	371	159	5.4	2.33	12.6
2	70	2965	542	221	5.47	2.45	13.42
4	6	326	61	39	5.34	1.56	8.36
4	12	433	85	47	5.09	1.81	9.21
4	18	761	154	77	4.94	2	9.88
4	24	1195	245	113	4.87	2.17	10.58
4	36	1391	282	128	4.93	2.2	10.87
4	46	1679	339	153	4.95	2.22	10.97
4	58	2232	455	199	4.9	2.29	11.22
4	70	3406	673	285	5.06	2.36	11.95
8	6	364	76	43	4.78	1.77	8.47
8	12	470	110	56	4.27	1.96	8.39
8	18	854	205	91	4.16	2.25	9.38
8	24	1381	318	139	4.34	2.29	9.94
8	36	1628	378	156	4.3	2.42	10.44
8	46	1989	459	193	4.33	2.38	10.31
8	58	2683	617	254	4.34	2.43	10.56
8	70	4251	949	382	4.47	2.48	11.13
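
The three speedup columns are related: the end-to-end FP16 speedup over TensorFlow is the product of the other two. The sketch below illustrates this with the first row of the table, assuming each speedup is the slower latency divided by the faster one; the function name `mt_speedups` is hypothetical. Recomputed values can differ from the table in the last digit, presumably because the published latencies are rounded to whole milliseconds.

```python
def mt_speedups(tf_fp32_ms: float, ls_fp32_ms: float, ls_fp16_ms: float):
    """Return the three speedup ratios reported in the translation tables."""
    fp32_vs_tf = tf_fp32_ms / ls_fp32_ms    # lightseq-fp32 over tf-fp32
    fp16_vs_fp32 = ls_fp32_ms / ls_fp16_ms  # lightseq-fp16 over lightseq-fp32
    fp16_vs_tf = tf_fp32_ms / ls_fp16_ms    # lightseq-fp16 over tf-fp32
    return fp32_vs_tf, fp16_vs_fp32, fp16_vs_tf


# First row above: batch_size=1, seq_len=6, latencies in ms.
fp32_vs_tf, fp16_vs_fp32, fp16_vs_tf = mt_speedups(303, 47, 27)

# The overall speedup factors into the two intermediate speedups.
assert abs(fp32_vs_tf * fp16_vs_fp32 - fp16_vs_tf) < 1e-9
print(f"{fp32_vs_tf:.2f} x {fp16_vs_fp32:.2f} = {fp16_vs_tf:.2f}")
```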

The following table compares TensorFlow and LightSeq on an en2zh translation model, a Transformer-deep with a beam size of 4 and a target vocabulary size of approximately 30k. Compared with Transformer-big, Transformer-deep has 16 encoder layers; all other configurations remain the same. FP32 models are tested on a Tesla P4, and FP16 models are tested on a Tesla T4.

batch_size	seq_len	tf-fp32, ms	lightseq-fp32, ms	lightseq-fp16, ms	lightseq-fp32/tf-fp32, speedup	lightseq-fp16/lightseq-fp32, speedup	lightseq-fp16/tf-fp32, speedup
1	12	544	86	43	6.32	2	12.65
1	24	914	131	66	6.97	1.98	13.85
1	36	1290	200	93	6.45	2.15	13.87
1	48	1836	233	106	7.89	2.2	17.32
1	72	3456	482	212	7.17	2.27	16.3
1	84	2626	431	193	6.09	2.23	13.61
2	12	566	100	50	5.66	2	11.32
2	24	842	158	70	5.32	2.26	12.03
2	36	1287	247	103	5.21	2.4	12.5
2	48	1504	288	118	5.22	2.44	12.75
2	72	3131	611	240	5.12	2.55	13.05
2	84	2789	546	217	5.1	2.52	12.85
4	12	590	118	58	5	2.03	10.17
4	24	885	187	89	4.73	2.1	9.94
4	36	1380	301	127	4.58	2.37	10.87
4	48	1622	352	149	4.6	2.36	10.89
4	72	3492	763	311	4.57	2.45	11.23
4	84	3145	687	282	4.57	2.44	11.15
8	12	631	150	66	4.2	2.27	9.56
8	24	979	248	103	3.94	2.41	9.5
8	36	1584	412	156	3.84	2.64	10.15
8	48	1880	477	186	3.94	2.56	10.11
8	72	4218	1069	404	3.94	2.65	10.44
8	84	3831	976	373	3.92	2.62	10.27

BERT

The following table compares the Hugging Face BERT-base model with the LightSeq model on a Tesla T4 using FP16.

batch_size	seq_len	Hugging Face(ms)	lightseq(ms)	lightseq speedup
1	16	15.23	2.19	6.95
1	32	16.24	1.99	8.16
1	64	19.32	2.35	8.22
1	128	16.57	2.98	5.56
1	256	23.99	4.60	5.22
8	16	13.06	3.47	3.76
8	32	13.27	4.46	2.98
8	64	23.02	7.43	3.10
8	128	59.35	17.27	3.44
8	256	117.06	40.74	2.87
32	16	29.27	12.38	2.36
32	32	54.90	17.68	3.11
32	64	109.13	36.20	3.01
32	128	260.13	66.03	3.94
32	256	498.84	145.57	3.43