Validated Model Performance

March 19, 2024

  1. LLM Quantization

  2. LLM Runtime Inference based on Pytorch Mode

    2.1 LLMs

    2.2 Stable Diffusion

    2.3 Electra

  3. LLM Runtime (GGML-Compatible)

    3.1 MPT-7B

    3.2 GPT-j-6B

    3.3 Falcon-7B

    3.4 GPT-NEOX-20B

    3.5 Dolly-V2-3B

    3.6 OPT-1.3B

    3.7 StarCoder-3B

  4. LLM Finetuning

System summary: Tested by Intel on 09/19/2023. 1-node, 1x Intel(R) Xeon(R) Platinum 8480+ @3.8GHz, 56 cores/socket, HT On, Turbo On, Total Memory 256GB (16x16GB DDR5 4800 MT/s [4800 MT/s]), BIOS 3A14.TEL2P1, microcode 0x2b0001b0, CentOS Stream 8, gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-10), DL Models, Frameworks/Backends: PyTorch/ONNXRT/LLM Runtime/GGML, Datatype: FP32/INT8/BF16/FP8. Using 1 socket, 56 cores/instance, 1 instance and batch size 1.

Performance varies by use, configuration and other factors. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks

LLM Quantization

Environment:

PyTorch: 2.0.1+cpu

Intel Extension for PyTorch: 2.0.100+cpu

Intel Neural Compressor: 2.3

| Framework | Model | Datasets | INT8 Throughput (samples/sec) | INT8 Accuracy | FP32 Throughput (samples/sec) | FP32 Accuracy | Throughput Gain (INT8/FP32) | Relative Accuracy (INT8 - FP32)/FP32 |
|---|---|---|---|---|---|---|---|---|
| pytorch | opt_1.3b | NeelNanda/pile-10k | 34.07 | 57.05% | 19.92 | 57.89% | 1.71 | -1.44% |
| pytorch | bloom_1b7 | NeelNanda/pile-10k | 29.74 | 49.95% | 13.3 | 46.34% | 2.24 | 7.79% |
| pytorch | bloom_7b1 | NeelNanda/pile-10k | 12.36 | 60.14% | 3.22 | 57.64% | 3.83 | 4.34% |
| pytorch | opt_2.7b | NeelNanda/pile-10k | 23.19 | 63.67% | 12.24 | 63.65% | 1.89 | 0.03% |
| pytorch | opt_6.7b | NeelNanda/pile-10k | 13.5 | 67.01% | 4.1 | 67.69% | 3.29 | -1.00% |
| pytorch | gpt_j_6b | NeelNanda/pile-10k | 10.76 | 67.59% | 4.38 | 68.31% | 2.46 | -1.05% |
| pytorch | flan_t5_large | samsum | 69.75 | 46.25 (rougeLsum) | 33.16 | 47.67 (rougeLsum) | 2.1 | -2.99% |
| pytorch | gpt_neox_clm | wikitext | 1.47 | 4.04 (eval_loss) | 0.65 | 3.52 (eval_loss) | 2.27 | -14.78% |
| pytorch | gpt_j_6b_clm | wikitext | 0.86 | 3 (eval_loss) | 0.28 | 2.34 (eval_loss) | 3.1 | -28.67% |
| onnx | whisper_large | lambda-openai | 2.11 | 97.07% | 1.13 | 96.96% | 1.87 | 0.12% |
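
The two derived columns follow directly from the measured ones. As a minimal sketch (the function names are illustrative and not part of any Intel tooling), the throughput gain is the INT8 throughput divided by the FP32 throughput, and the relative accuracy is (INT8 - FP32)/FP32 as stated in the column header. The example uses the opt_2.7b row:

```python
def throughput_gain(int8_throughput, fp32_throughput):
    """INT8 throughput divided by FP32 throughput (samples/sec)."""
    return int8_throughput / fp32_throughput


def relative_accuracy(int8_metric, fp32_metric):
    """(INT8 - FP32) / FP32, following the table's column header."""
    return (int8_metric - fp32_metric) / fp32_metric


# Values taken from the opt_2.7b row of the table.
gain = throughput_gain(23.19, 12.24)
rel = relative_accuracy(63.67, 63.65)
print(f"throughput gain: {gain:.2f}x, relative accuracy: {rel:+.2%}")
```

For the rows reported as eval_loss, the table lists a negative value even though the INT8 loss is numerically higher, presumably because a lower loss is better. The sketch above does not apply that sign convention.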

LLM Runtime Inference based on Pytorch Mode

Environment:

PyTorch: 2.0.1+cpu

LLMs

FrameworkModelInputOutputINT8FP32BF16FP8INT8/FP32BF16/FP32FP8/FP32
pytorchgpt-neox-20b32329283 (ms)
pytorchdolly-v2-3b32323191 (ms)3798 (ms)2689 (ms)1.19x1.41x
pytorchgpt-j-6b-pruned32324523 (ms)2421 (ms)1758 (ms)1.87x2.57x
pytorchgpt-j-6b32321658 (ms)4561 (ms)2429 (ms)1793 (ms)2.75x1.88x2.54x

Stable Diffusion

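| Model | Steps | Output | INT8 | FP32 | BF16 | INT8+BF16* | INT8/FP32 | BF16/FP32 |
|---|---|---|---|---|---|---|---|---|
| stable_diffusion_v2_1 | 20 | 512*512 | | 16.98 (s) | 2.83 (s) | | | 6.00x |
| stable_diffusion_v1_5 | 20 | 512*512 | 2.18 (s) | 10.94 (s) | 2.74 (s) | | 5.01x | 3.99x |
| stable_diffusion_v1_5 | 50 | 512*512 | 5.2 (s) / FID = 35.46 | | 6.3 (s) / FID = 31.07 | 5.5 (s) / FID = 30.58 | | |
| stable_diffusion_v1_4 | 20 | 512*512 | | 11.39 (s) | 2.83 (s) | | | 4.02x |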

Note: *Only works when steps = 50. BF16 is used for inference in steps 1 to 5 and steps 46 to 50, and INT8 in steps 6 to 45. This mode gives a good balance between accuracy and speed.
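
The step schedule in the note can be sketched as a small lookup. This is only an illustration of the rule as stated (BF16 for the first and last five of 50 steps, INT8 in between). The function name and its parameters are hypothetical and do not correspond to an actual pipeline API:

```python
def precision_for_step(step, total_steps=50, bf16_edge_steps=5):
    """Return the precision used at a 1-based denoising step.

    Hypothetical helper: BF16 for the first and last `bf16_edge_steps`
    steps, INT8 for the steps in between. The note only describes the
    50-step case, so other step counts are rejected.
    """
    if total_steps != 50:
        raise ValueError("the mixed INT8+BF16 mode is described for 50 steps only")
    if not 1 <= step <= total_steps:
        raise ValueError("step must be between 1 and total_steps")
    if step <= bf16_edge_steps or step > total_steps - bf16_edge_steps:
        return "bf16"
    return "int8"


# Precision for each of the 50 steps, in order.
schedule = [precision_for_step(s) for s in range(1, 51)]
print(schedule.count("bf16"), "BF16 steps,", schedule.count("int8"), "INT8 steps")
```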

Electra

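| Model | Batch Size | Seq Length | FP32 Latency (ms) | BF16 Latency (ms) | BF16/FP32 Latency |
|---|---|---|---|---|---|
| electra_base_chinese_discriminator | 1 | 16 | 11.50 | 4.30 | 2.67x |
| electra_base_chinese_discriminator | 4 | 16 | 5.50 | 1.80 | 3.06x |
| electra_base_chinese_discriminator | 8 | 16 | 6.20 | 1.70 | 3.65x |
| electra_base_chinese_discriminator | 16 | 16 | 5.60 | 1.30 | 4.31x |
| electra_base_chinese_discriminator | 32 | 16 | 5.70 | 1.20 | 4.75x |
| electra_base_chinese_discriminator | 64 | 16 | 5.20 | 1.10 | 4.73x |
| electra_base_chinese_generator | 1 | 128 | 13.72 | 3.89 | 3.53x |
| electra_base_chinese_generator | 4 | 128 | 11.60 | 2.83 | 4.10x |
| electra_base_chinese_generator | 8 | 128 | 11.44 | 2.85 | 4.01x |
| electra_base_chinese_generator | 16 | 128 | 12.04 | 2.70 | 4.46x |
| electra_base_chinese_generator | 32 | 128 | 11.29 | 2.52 | 4.48x |
| electra_base_chinese_generator | 64 | 128 | 11.75 | 2.54 | 4.63x |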

LLM Runtime (GGML-Compatible)

Environment:

GCC / G++: 12.1.0

Transformers version: 4.35.2

MPT-7B

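| Backend | Input | Output | Cores/Instance | Precision | Compute Type | Group Size | Next Token (ms) | Memory mean used (Top 50%) (MB) | First Token (ms) | Total Latency (ms) | P90 Latency (ms) | P99 Latency (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLM Runtime | 32 | 32 | 32 | INT4 | INT8 | 128 | 36.95 | 3522 | 108.74 | 958.5 | 37.24 | 92.32 |
| LLM Runtime | 1024 | 32 | 32 | INT4 | INT8 | 128 | 46.69 | 4913 | 15834 | 17281 | 46.83 | 10940 |
| LLM Runtime | 32 | 32 | 48 | INT4 | INT8 | 128 | 34.76 | 5206 | 100.94 | 900 | 34.9 | 85.92 |
| LLM Runtime | 1024 | 32 | 48 | INT4 | INT8 | 128 | 44.98 | 5147 | 15506 | 16901 | 45.38 | 10713 |
| LLM Runtime | 32 | 32 | 56 | INT4 | INT8 | 128 | 35.84 | 5230 | 98.71 | 922 | 36.07 | 84.33 |
| LLM Runtime | 1024 | 32 | 56 | INT4 | INT8 | 128 | 45.54 | 5197 | 15180 | 16591 | 45.73 | 10488 |
| LLM Runtime | 32 | 32 | 32 | INT4 | INT8 | 32 | 38.33 | 4101 | 157.31 | 1345 | 38.59 | 120.53 |
| LLM Runtime | 1024 | 32 | 32 | INT4 | INT8 | 32 | 48.19 | 5346 | 17178 | 18672 | 48.35 | 11868 |
| LLM Runtime | 32 | 32 | 48 | INT4 | INT8 | 32 | 37.75 | 5199 | 140.79 | 1310 | 37.94 | 108.99 |
| LLM Runtime | 1024 | 32 | 48 | INT4 | INT8 | 32 | 47.21 | 5282 | 17245 | 18708 | 47.36 | 11914 |
| LLM Runtime | 32 | 32 | 56 | INT4 | INT8 | 32 | 38.04 | 5227 | 137.21 | 1316 | 38.19 | 106.53 |
| LLM Runtime | 1024 | 32 | 56 | INT4 | INT8 | 32 | 47.88 | 5274 | 17454 | 18939 | 48.15 | 12058 |
| GGML | 32 | 32 | 32 | INT4 | INT8 | 32 | 37.92 | 4047 | 447.6 | 1622 | 38.26 | 320.8 |
| GGML | 1024 | 32 | 32 | INT4 | INT8 | 32 | 47.74 | 5207 | 26552 | 28032 | 48.03 | 18336 |
| GGML | 32 | 32 | 48 | INT4 | INT8 | 32 | 34.78 | 5192 | 330.06 | 1408 | 35.02 | 238.66 |
| GGML | 1024 | 32 | 48 | INT4 | INT8 | 32 | 44.64 | 5231 | 22389 | 23772 | 44.81 | 15462 |
| GGML | 32 | 32 | 56 | INT4 | INT8 | 32 | 34.53 | 5225 | 313.45 | 1383 | 34.79 | 227.08 |
| GGML | 1024 | 32 | 56 | INT4 | INT8 | 32 | 44.64 | 5242 | 21568 | 22951 | 44.86 | 14896 |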
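
The benchmark script that produced the latency columns in this and the following tables is not shown on this page. As a rough sketch of how such metrics could be derived, the example below assumes that one latency is recorded per generated token, that "Next Token" is the mean over all tokens after the first, and that the percentiles are taken over all tokens with linear interpolation. These are assumptions for illustration; the actual harness may compute them differently, and the sample latencies are made up.

```python
def percentile(values, q):
    """Linear-interpolation percentile of a non-empty list, q in [0, 100]."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    frac = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


def summarize(token_latencies_ms):
    """Summarize per-token latencies (ms); needs at least two tokens."""
    if len(token_latencies_ms) < 2:
        raise ValueError("need the first token and at least one next token")
    rest = token_latencies_ms[1:]
    return {
        "first_token_ms": token_latencies_ms[0],
        "next_token_ms": sum(rest) / len(rest),
        "total_ms": sum(token_latencies_ms),
        "p90_ms": percentile(token_latencies_ms, 90),
        "p99_ms": percentile(token_latencies_ms, 99),
    }


# Made-up run: a slow first token followed by 31 equally fast tokens.
latencies = [100.0] + [20.0] * 31
stats = summarize(latencies)
print(stats)
```

With 32 output tokens, a P99 computed this way is dominated by the first-token latency, while P90 stays close to the next-token latency.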

GPT-j-6B

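| Backend | Input | Output | Cores/Instance | Precision | Compute Type | Group Size | Next Token (ms) | Memory mean used (Top 50%) (MB) | First Token (ms) | Total Latency (ms) | P90 Latency (ms) | P99 Latency (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLM Runtime | 32 | 32 | 32 | INT4 | INT8 | 128 | 23.59 | 4018 | 62.48 | 793.86 | 23.82 | 50.55 |
| LLM Runtime | 1024 | 32 | 32 | INT4 | INT8 | 128 | 26.2 | 4036 | 2055 | 2867 | 26.43 | 1426 |
| LLM Runtime | 2012 | 32 | 32 | INT4 | INT8 | 128 | 29.21 | 4553 | 6114 | 7019 | 29.33 | 4228 |
| LLM Runtime | 32 | 32 | 48 | INT4 | INT8 | 128 | 21.56 | 5230 | 60.56 | 729 | 21.75 | 48.68 |
| LLM Runtime | 1024 | 32 | 48 | INT4 | INT8 | 128 | 23.92 | 5212 | 1763 | 2504 | 24.17 | 1224 |
| LLM Runtime | 2012 | 32 | 48 | INT4 | INT8 | 128 | 26.62 | 5119 | 5230 | 6055 | 26.81 | 3617 |
| LLM Runtime | 32 | 32 | 56 | INT4 | INT8 | 128 | 21.98 | 5244 | 60.85 | 742.08 | 22.28 | 49.05 |
| LLM Runtime | 1024 | 32 | 56 | INT4 | INT8 | 128 | 24.54 | 5234 | 2007 | 2768 | 24.7 | 1393 |
| LLM Runtime | 2012 | 32 | 56 | INT4 | INT8 | 32 | 27.16 | 5184 | 5151 | 5993 | 27.4 | 3563 |
| LLM Runtime | 32 | 32 | 32 | INT4 | INT8 | 32 | 25.35 | 3739 | 107.52 | 893.52 | 25.42 | 82.17 |
| LLM Runtime | 1024 | 32 | 32 | INT4 | INT8 | 32 | 28.04 | 4435 | 3405 | 4275.2 | 28.07 | 2359 |
| LLM Runtime | 2012 | 32 | 32 | INT4 | INT8 | 32 | 30.36 | 4914 | 8916 | 9857 | 30.42 | 6161 |
| LLM Runtime | 32 | 32 | 48 | INT4 | INT8 | 32 | 24.09 | 5228 | 95.24 | 842.1 | 24.13 | 74.4 |
| LLM Runtime | 1024 | 32 | 48 | INT4 | INT8 | 32 | 26.65 | 5190 | 3307 | 4133 | 26.89 | 2290 |
| LLM Runtime | 2012 | 32 | 48 | INT4 | INT8 | 32 | 29.09 | 5164 | 8021 | 8923 | 29.18 | 5544 |
| LLM Runtime | 32 | 32 | 56 | INT4 | INT8 | 32 | 24.66 | 5243 | 98.16 | 862.7 | 24.93 | 75.54 |
| LLM Runtime | 1024 | 32 | 56 | INT4 | INT8 | 32 | 27.07 | 5222 | 3060 | 3899 | 27.38 | 2120 |
| LLM Runtime | 2012 | 32 | 56 | INT4 | INT8 | 32 | 29.56 | 5210 | 7599 | 8515 | 29.85 | 5253 |
| GGML | 32 | 32 | 32 | INT4 | INT8 | 32 | 33.69 | 3585 | 393.24 | 1437 | 33.9 | 281.6 |
| GGML | 1024 | 32 | 32 | INT4 | INT8 | 32 | 36.24 | 4389 | 12702 | 13825 | 36.39 | 8775 |
| GGML | 2012 | 32 | 32 | INT4 | INT8 | 32 | 39.19 | 5232 | 27264 | 28479 | 39.44 | 18824 |
| GGML | 32 | 32 | 48 | INT4 | INT8 | 32 | 30.34 | 5223 | 291.84 | 1232 | 30.57 | 210 |
| GGML | 1024 | 32 | 48 | INT4 | INT8 | 32 | 33.09 | 5137 | 9206 | 10231 | 33.21 | 6362 |
| GGML | 2012 | 32 | 48 | INT4 | INT8 | 32 | 37.34 | 5245 | 21341 | 22499 | 37.66 | 14737 |
| GGML | 32 | 32 | 56 | INT4 | INT8 | 32 | 31.62 | 5241 | 262.3 | 1242 | 32 | 192.2 |
| GGML | 1024 | 32 | 56 | INT4 | INT8 | 32 | 34.03 | 5193 | 8363 | 9418 | 34.3 | 5781 |
| GGML | 2012 | 32 | 56 | INT4 | INT8 | 32 | 36.94 | 5257 | 18868 | 20013 | 37.66 | 13031 |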

Falcon-7B

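| Backend | Input | Output | Cores/Instance | Precision | Compute Type | Group Size | Next Token (ms) | Memory mean used (Top 50%) (MB) | First Token (ms) | Total Latency (ms) | P90 Latency (ms) | P99 Latency (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLM Runtime | 32 | 32 | 32 | INT4 | INT8 | 128 | 37.36 | 3797 | 92.94 | 1251 | 37.69 | 75.88 |
| LLM Runtime | 1024 | 32 | 32 | INT4 | INT8 | 128 | 40.33 | 4707 | 5507 | 6757 | 40.63 | 3813 |
| LLM Runtime | 32 | 32 | 48 | INT4 | INT8 | 128 | 35.84 | 4990 | 88.29 | 1199 | 36.32 | 72.68 |
| LLM Runtime | 1024 | 32 | 48 | INT4 | INT8 | 128 | 37.95 | 4951 | 5025 | 6201 | 38.14 | 3479 |
| LLM Runtime | 32 | 32 | 56 | INT4 | INT8 | 128 | 36.1 | 5019 | 83.89 | 1202 | 36.36 | 69.19 |
| LLM Runtime | 1024 | 32 | 56 | INT4 | INT8 | 128 | 38.88 | 4993 | 5432 | 6637 | 39.41 | 3761 |
| LLM Runtime | 32 | 32 | 32 | INT4 | INT8 | 32 | 39.15 | 4395 | 146.7 | 1359 | 39.43 | 113.16 |
| LLM Runtime | 1024 | 32 | 32 | INT4 | INT8 | 32 | 41.61 | 5213 | 6947 | 8237 | 42.54 | 4807 |
| LLM Runtime | 32 | 32 | 48 | INT4 | INT8 | 32 | 38.08 | 4980 | 134.9 | 1315 | 38.23 | 105.1 |
| LLM Runtime | 1024 | 32 | 48 | INT4 | INT8 | 32 | 40.58 | 5085 | 6847 | 8105 | 40.82 | 4737 |
| LLM Runtime | 32 | 32 | 56 | INT4 | INT8 | 32 | 38.33 | 5011 | 142.4 | 1330 | 38.55 | 110.8 |
| LLM Runtime | 1024 | 32 | 56 | INT4 | INT8 | 32 | 40.87 | 5084 | 6860 | 8127 | 41.18 | 4746 |
| GGML | 32 | 32 | 32 | INT4 | INT8 | 32 | 38.44 | 4269 | 458.3 | 1650 | 38.55 | 328.4 |
| GGML | 1024 | 32 | 32 | INT4 | INT8 | 32 | 41.64 | 4997 | 17585 | 18876 | 41.94 | 12147 |
| GGML | 32 | 32 | 48 | INT4 | INT8 | 32 | 35.87 | 4971 | 338.3 | 1450 | 36 | 244.7 |
| GGML | 1024 | 32 | 48 | INT4 | INT8 | 32 | 38.68 | 5024 | 13064 | 14263 | 39.06 | 9026 |
| GGML | 32 | 32 | 56 | INT4 | INT8 | 32 | 36.22 | 5005 | 318.9 | 1441 | 36.43 | 231.2 |
| GGML | 1024 | 32 | 56 | INT4 | INT8 | 32 | 38.65 | 5045 | 11943 | 13142 | 38.83 | 8253 |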

GPT-NEOX-20B

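| Backend | Input | Output | Cores/Instance | Precision | Compute Type | Group Size | Next Token (ms) | Memory mean used (Top 50%) (MB) | First Token (ms) | Total Latency (ms) | P90 Latency (ms) | P99 Latency (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLM Runtime | 32 | 32 | 32 | INT4 | INT8 | 128 | 68.77 | 10621 | 234.18 | 2365 | 69.11 | 183.16 |
| LLM Runtime | 1024 | 32 | 32 | INT4 | INT8 | 128 | 76.55 | 12537 | 9817 | 12190 | 77.06 | 6798 |
| LLM Runtime | 32 | 32 | 48 | INT4 | INT8 | 128 | 60.35 | 13639 | 214.2 | 2085 | 60.59 | 167.34 |
| LLM Runtime | 1024 | 32 | 48 | INT4 | INT8 | 128 | 68.19 | 13524 | 9213 | 11327 | 68.48 | 6378 |
| LLM Runtime | 32 | 32 | 56 | INT4 | INT8 | 128 | 80.16 | 13650 | 221.5 | 2706 | 107.23 | 186.9 |
| LLM Runtime | 1024 | 32 | 56 | INT4 | INT8 | 128 | 88.48 | 13586 | 10045 | 12788 | 111.93 | 6968 |
| LLM Runtime | 32 | 32 | 32 | INT4 | INT8 | 32 | 73.78 | 11970 | 390.1 | 2308 | 74.13 | 308.2 |
| LLM Runtime | 1024 | 32 | 32 | INT4 | INT8 | 32 | 80.75 | 13871 | 14993 | 17496 | 81.07 | 10370 |
| LLM Runtime | 32 | 32 | 48 | INT4 | INT8 | 32 | 68.17 | 13616 | 348.6 | 2121 | 68.54 | 275.9 |
| LLM Runtime | 1024 | 32 | 48 | INT4 | INT8 | 32 | 74.84 | 13717 | 15278 | 17598 | 75.24 | 10566 |
| LLM Runtime | 32 | 32 | 56 | INT4 | INT8 | 32 | 79.57 | 13638 | 398.2 | 2467 | 103.79 | 324.2 |
| LLM Runtime | 1024 | 32 | 56 | INT4 | INT8 | 32 | 86.06 | 13703 | 18119 | 20787 | 118.86 | 12541 |
| GGML | 32 | 32 | 32 | INT4 | INT8 | 32 | 98.23 | 11660 | 1403 | 4448 | 99.04 | 998.9 |
| GGML | 1024 | 32 | 32 | INT4 | INT8 | 32 | 105.33 | 13686 | 45434 | 48699 | 105.79 | 31382 |
| GGML | 32 | 32 | 48 | INT4 | INT8 | 32 | 86.19 | 13582 | 980 | 3651 | 86.74 | 703.1 |
| GGML | 1024 | 32 | 48 | INT4 | INT8 | 32 | 93.79 | 13674 | 32966 | 35873 | 94.4 | 22776 |
| GGML | 32 | 32 | 56 | INT4 | INT8 | 32 | 92.36 | 13621 | 1136 | 3999 | 119.08 | 823.6 |
| GGML | 1024 | 32 | 56 | INT4 | INT8 | 32 | 95.49 | 13675 | 39914 | 42874 | 115.03 | 27579 |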

Dolly-V2-3B

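| Backend | Input | Output | Cores/Instance | Precision | Compute Type | Group Size | Next Token (ms) | Memory mean used (Top 50%) (MB) | First Token (ms) | Total Latency (ms) | P90 Latency (ms) | P99 Latency (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLM Runtime | 32 | 32 | 32 | INT4 | INT8 | 128 | 21.84 | 2653 | 78.37 | 755.43 | 22.29 | 61.18 |
| LLM Runtime | 1024 | 32 | 32 | INT4 | INT8 | 128 | 24.46 | 2653 | 3725 | 4483 | 24.69 | 2578 |
| LLM Runtime | 32 | 32 | 48 | INT4 | INT8 | 128 | 22.76 | 2665 | 81.26 | 786.95 | 23.06 | 63.31 |
| LLM Runtime | 1024 | 32 | 48 | INT4 | INT8 | 128 | 25.54 | 2677 | 3399 | 4191 | 25.73 | 2354 |
| LLM Runtime | 32 | 32 | 56 | INT4 | INT8 | 128 | 22.02 | 2693 | 78.17 | 760.6 | 22.14 | 61 |
| LLM Runtime | 1024 | 32 | 56 | INT4 | INT8 | 128 | 33.41 | 2693 | 3799 | 4834 | 66.96 | 2643 |
| LLM Runtime | 32 | 32 | 32 | INT4 | INT8 | 32 | 22.5 | 2653 | 95.91 | 793.2 | 22.78 | 73.27 |
| LLM Runtime | 1024 | 32 | 32 | INT4 | INT8 | 32 | 25.77 | 2653 | 4374 | 5173 | 25.88 | 3026 |
| LLM Runtime | 32 | 32 | 48 | INT4 | INT8 | 32 | 23.77 | 2665 | 97.84 | 834.5 | 23.84 | 75.06 |
| LLM Runtime | 1024 | 32 | 48 | INT4 | INT8 | 32 | 26.29 | 2728 | 4361 | 5176 | 26.56 | 3018 |
| LLM Runtime | 32 | 32 | 56 | INT4 | INT8 | 32 | 23.79 | 2693 | 88.55 | 826.7 | 23.91 | 68.58 |
| LLM Runtime | 1024 | 32 | 56 | INT4 | INT8 | 32 | 29.4 | 2725 | 4822 | 5733 | 31.21 | 3348 |
| GGML | 32 | 32 | 32 | INT4 | INT8 | 32 | 21.53 | 2653 | 219.81 | 887.6 | 21.68 | 158.3 |
| GGML | 1024 | 32 | 32 | INT4 | INT8 | 32 | 24.43 | 2653 | 8011 | 8768 | 24.6 | 5535 |
| GGML | 32 | 32 | 48 | INT4 | INT8 | 32 | 22.04 | 2665 | 178.7 | 861.6 | 22.12 | 129.6 |
| GGML | 1024 | 32 | 48 | INT4 | INT8 | 32 | 23.85 | 2693 | 6342 | 7081 | 24.05 | 4384 |
| GGML | 32 | 32 | 56 | INT4 | INT8 | 32 | 22.18 | 2693 | 166.6 | 853.7 | 22.26 | 121.6 |
| GGML | 1024 | 32 | 56 | INT4 | INT8 | 32 | 28.84 | 2703 | 8715 | 9609 | 56.39 | 6034 |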

OPT-1.3B

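| Backend | Input | Output | Cores/Instance | Precision | Compute Type | Group Size | Next Token (ms) | Memory mean used (Top 50%) (MB) | First Token (ms) | Total Latency (ms) | P90 Latency (ms) | P99 Latency (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLM Runtime | 32 | 32 | 32 | INT4 | INT8 | 128 | 9.85 | 1680 | 104.88 | 410.2 | 9.95 | 75.58 |
| LLM Runtime | 1024 | 32 | 32 | INT4 | INT8 | 128 | 11.38 | 1702 | 3080 | 3433 | 11.83 | 2129 |
| LLM Runtime | 2012 | 32 | 32 | INT4 | INT8 | 128 | 13.15 | 2513 | 7516 | 7924 | 13.41 | 5190 |
| LLM Runtime | 32 | 32 | 48 | INT4 | INT8 | 128 | 9.25 | 2709 | 110.7 | 397.3 | 9.3 | 79.38 |
| LLM Runtime | 1024 | 32 | 48 | INT4 | INT8 | 128 | 11.1 | 2698 | 3064 | 3408 | 11.15 | 2118 |
| LLM Runtime | 2012 | 32 | 48 | INT4 | INT8 | 128 | 12.77 | 2701 | 8045 | 8441 | 13.02 | 5555 |
| LLM Runtime | 32 | 32 | 56 | INT4 | INT8 | 128 | 9.78 | 2742 | 112.7 | 415.89 | 9.84 | 80.95 |
| LLM Runtime | 1024 | 32 | 56 | INT4 | INT8 | 128 | 16.96 | 2737 | 3125 | 3650 | 54.16 | 2174 |
| LLM Runtime | 2012 | 32 | 56 | INT4 | INT8 | 32 | 16.69 | 2729 | 7929 | 8447 | 24.51 | 5488 |
| LLM Runtime | 32 | 32 | 32 | INT4 | INT8 | 32 | 10.01 | 1703 | 109.6 | 419.9 | 10.1 | 78.87 |
| LLM Runtime | 1024 | 32 | 32 | INT4 | INT8 | 32 | 11.71 | 1760 | 3389 | 3752 | 11.8 | 2342 |
| LLM Runtime | 2012 | 32 | 32 | INT4 | INT8 | 32 | 13.58 | 2720 | 8061 | 8482 | 13.63 | 5566 |
| LLM Runtime | 32 | 32 | 48 | INT4 | INT8 | 32 | 9.69 | 2709 | 116.5 | 416.9 | 9.81 | 83.67 |
| LLM Runtime | 1024 | 32 | 48 | INT4 | INT8 | 32 | 11.51 | 2686 | 3290 | 3647 | 11.55 | 2274 |
| LLM Runtime | 2012 | 32 | 48 | INT4 | INT8 | 32 | 13.09 | 2753 | 8101 | 8507 | 13.14 | 5594 |
| LLM Runtime | 32 | 32 | 56 | INT4 | INT8 | 32 | 10.4 | 2742 | 117.3 | 439.8 | 10.48 | 84.37 |
| LLM Runtime | 1024 | 32 | 56 | INT4 | INT8 | 32 | 15.65 | 2730 | 3494 | 3979 | 37.89 | 2427 |
| LLM Runtime | 2012 | 32 | 56 | INT4 | INT8 | 32 | 20.52 | 2758 | 8395 | 9031 | 55.67 | 5811 |
| GGML | 32 | 32 | 32 | INT4 | INT8 | 32 | 8.47 | 1699 | 170 | 432.6 | 8.88 | 120.12 |
| GGML | 1024 | 32 | 32 | INT4 | INT8 | 32 | 10.07 | 1702 | 4940 | 5252 | 10.13 | 3412 |
| GGML | 2012 | 32 | 32 | INT4 | INT8 | 32 | 11.71 | 2709 | 11741 | 12104 | 11.75 | 8105 |
| GGML | 32 | 32 | 48 | INT4 | INT8 | 32 | 8.9 | 2709 | 154.83 | 430.6 | 9.05 | 109.7 |
| GGML | 1024 | 32 | 48 | INT4 | INT8 | 32 | 10.12 | 2669 | 4409 | 4723 | 10.2 | 3046 |
| GGML | 2012 | 32 | 48 | INT4 | INT8 | 32 | 12.16 | 2742 | 11009 | 11386 | 12.19 | 7600 |
| GGML | 32 | 32 | 56 | INT4 | INT8 | 32 | 9.48 | 2742 | 152.31 | 446.04 | 9.56 | 108.14 |
| GGML | 1024 | 32 | 56 | INT4 | INT8 | 32 | 14.39 | 2721 | 5843 | 6289 | 27.99 | 4049 |
| GGML | 2012 | 32 | 56 | INT4 | INT8 | 32 | 17.01 | 2751 | 13001 | 13529 | 51.84 | 8989 |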

StarCoder-3B

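| Backend | Input | Output | Cores/Instance | Precision | Compute Type | Group Size | Next Token (ms) | Memory mean used (Top 50%) (MB) | First Token (ms) | Total Latency (ms) | P90 Latency (ms) | P99 Latency (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLM Runtime | 32 | 32 | 32 | INT4 | INT8 | 128 | 26.85 | 2868 | 175.2 | 1007 | 27.12 | 129.3 |
| LLM Runtime | 32 | 32 | 48 | INT4 | INT8 | 128 | 26.78 | 2868 | 172.1 | 1002 | 26.95 | 127.2 |
| LLM Runtime | 32 | 32 | 56 | INT4 | INT8 | 128 | 28.31 | 2763 | 173.05 | 1050 | 28.53 | 128.7 |
| LLM Runtime | 32 | 32 | 32 | INT4 | INT8 | 32 | 27.8 | 2868 | 200.74 | 1062 | 28.2 | 147.4 |
| LLM Runtime | 32 | 32 | 48 | INT4 | INT8 | 32 | 27.97 | 2896 | 193.84 | 1060 | 28.12 | 142.9 |
| LLM Runtime | 32 | 32 | 56 | INT4 | INT8 | 32 | 29.16 | 2876 | 195.67 | 1099 | 29.31 | 144.7 |
| GGML | 32 | 32 | 32 | INT4 | INT8 | 32 | 26.57 | 2868 | 368.5 | 1192 | 26.74 | 262.1 |
| GGML | 32 | 32 | 48 | INT4 | INT8 | 32 | 26.5 | 2842 | 310.5 | 1132 | 26.67 | 222.3 |
| GGML | 32 | 32 | 56 | INT4 | INT8 | 32 | 27.17 | 2825 | 293.92 | 1136 | 27.28 | 211.2 |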

LLM Finetuning

Environments:

PyTorch: 2.0.1+cpu

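| Framework | Hidden Size | Dataset (Alpaca) | Concatenation | Nodes | PPN | Precision | LoRA | LoRA rank/alpha | Epochs | Time/Epoch | Total Time | TruthfulQA (mc1/mc2) | Global Batch Size | Learning Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 4096 | 13K | Yes | 1 | 1 | BF16 | Yes | 8/16 | 3 | 3.2 hours | 9.6 hours | 0.30/0.45 | 128 | 1.00E-04 |
| PyTorch | 4096 | 13K | Yes | 2 | 2 | BF16 | Yes | 8/16 | 3 | 1.2 hours | 3.6 hours | 0.30/0.45 | 128 | 1.00E-04 |
| PyTorch | 4096 | 13K | Yes | 4 | 2 | BF16 | Yes | 8/16 | 3 | 0.67 hours | 2 hours | 0.30/0.45 | 128 | 1.00E-04 |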
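
The total time in each row is the number of epochs multiplied by the time per epoch, which also makes it easy to estimate how well the multi-node runs scale. The sketch below is a back-of-the-envelope calculation and not part of the benchmark itself. It assumes that PPN means processes per node, so that the process count is Nodes × PPN, and it uses the single-process run as the baseline. The function name is illustrative.

```python
def scaling_summary(runs, epochs):
    """Compute total time, speedup and per-process efficiency.

    `runs` is a list of (nodes, ppn, hours_per_epoch) tuples; the first
    entry is used as the baseline. PPN is assumed to mean processes per node.
    """
    baseline_hours = runs[0][2]
    summary = []
    for nodes, ppn, hours_per_epoch in runs:
        processes = nodes * ppn
        speedup = baseline_hours / hours_per_epoch
        summary.append({
            "processes": processes,
            "total_hours": epochs * hours_per_epoch,
            "speedup": speedup,
            "efficiency": speedup / processes,
        })
    return summary


# (nodes, PPN, time per epoch in hours) from the three rows of the table.
runs = [(1, 1, 3.2), (2, 2, 1.2), (4, 2, 0.67)]
summary = scaling_summary(runs, epochs=3)
for row in summary:
    print(row)
```

Under these assumptions the 2-node run is about 2.67x faster than the baseline with 4 processes, and the 4-node run about 4.78x faster with 8 processes.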

Intel Gaudi2 Environments:

Driver version 1.13.0-ee32e42, synapse AI v1.13.0

We will release data soon.