Supported Models

March 18, 2026

The following tables detail the models supported by LMDeploy's TurboMind engine and PyTorch engine across different platforms.
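
For orientation, the engine is chosen through the `backend_config` argument of LMDeploy's `pipeline` API. A minimal sketch (the model ID is illustrative; any model from the tables below works with its matching engine):

```python
from lmdeploy import pipeline, TurbomindEngineConfig, PytorchEngineConfig

# Run a model on the TurboMind engine.
pipe = pipeline("internlm/internlm2_5-7b-chat",
                backend_config=TurbomindEngineConfig())
print(pipe(["Hello, please introduce yourself."]))

# Run the same model on the PyTorch engine instead.
pipe = pipeline("internlm/internlm2_5-7b-chat",
                backend_config=PytorchEngineConfig())
```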

TurboMind on CUDA Platform

| Model | Size | Type | FP16/BF16 | KV INT8 | KV INT4 | W4A16 |
| --- | --- | --- | --- | --- | --- | --- |
| Llama | 7B - 65B | LLM | Yes | Yes | Yes | Yes |
| Llama2 | 7B - 70B | LLM | Yes | Yes | Yes | Yes |
| Llama3 | 8B, 70B | LLM | Yes | Yes | Yes | Yes |
| Llama3.1 | 8B, 70B | LLM | Yes | Yes | Yes | Yes |
| Llama3.2[2] | 1B, 3B | LLM | Yes | Yes* | Yes* | Yes |
| InternLM | 7B - 20B | LLM | Yes | Yes | Yes | Yes |
| InternLM2 | 7B - 20B | LLM | Yes | Yes | Yes | Yes |
| InternLM2.5 | 7B | LLM | Yes | Yes | Yes | Yes |
| InternLM3 | 8B | LLM | Yes | Yes | Yes | Yes |
| InternLM-XComposer2 | 7B, 4khd-7B | MLLM | Yes | Yes | Yes | Yes |
| InternLM-XComposer2.5 | 7B | MLLM | Yes | Yes | Yes | Yes |
| Intern-S1 | 241B | MLLM | Yes | Yes | Yes | No |
| Intern-S1-mini | 8.3B | MLLM | Yes | Yes | Yes | No |
| Qwen | 1.8B - 72B | LLM | Yes | Yes | Yes | Yes |
| Qwen1.5[1] | 1.8B - 110B | LLM | Yes | Yes | Yes | Yes |
| Qwen2[2] | 0.5B - 72B | LLM | Yes | Yes* | Yes* | Yes |
| Qwen2-MoE | 57BA14B | LLM | Yes | Yes | Yes | Yes |
| Qwen2.5[2] | 0.5B - 72B | LLM | Yes | Yes* | Yes* | Yes |
| Qwen3 | 0.6B - 235B | LLM | Yes | Yes | Yes* | Yes* |
| Qwen3.5[3] | 0.8B - 397B | MLLM | Yes | Yes | No | Yes |
| Mistral[1] | 7B | LLM | Yes | Yes | Yes | No |
| Mixtral | 8x7B, 8x22B | LLM | Yes | Yes | Yes | Yes |
| DeepSeek-V2 | 16B, 236B | LLM | Yes | Yes | Yes | No |
| DeepSeek-V2.5 | 236B | LLM | Yes | Yes | Yes | No |
| Qwen-VL | 7B | MLLM | Yes | Yes | Yes | Yes |
| DeepSeek-VL | 7B | MLLM | Yes | Yes | Yes | Yes |
| Baichuan | 7B | LLM | Yes | Yes | Yes | Yes |
| Baichuan2 | 7B | LLM | Yes | Yes | Yes | Yes |
| Code Llama | 7B - 34B | LLM | Yes | Yes | Yes | No |
| YI | 6B - 34B | LLM | Yes | Yes | Yes | Yes |
| LLaVA(1.5, 1.6) | 7B - 34B | MLLM | Yes | Yes | Yes | Yes |
| InternVL | v1.1 - v1.5 | MLLM | Yes | Yes | Yes | Yes |
| InternVL2[2] | 1 - 2B, 8B - 76B | MLLM | Yes | Yes* | Yes* | Yes |
| InternVL2.5(MPO)[2] | 1 - 78B | MLLM | Yes | Yes* | Yes* | Yes |
| InternVL3[2] | 1 - 78B | MLLM | Yes | Yes* | Yes* | Yes |
| InternVL3.5[3] | 1 - 241BA28B | MLLM | Yes | Yes* | Yes* | No |
| ChemVLM | 8B - 26B | MLLM | Yes | Yes | Yes | Yes |
| MiniCPM-Llama3-V-2_5 | - | MLLM | Yes | Yes | Yes | Yes |
| MiniCPM-V-2_6 | - | MLLM | Yes | Yes | Yes | Yes |
| GLM4 | 9B | LLM | Yes | Yes | Yes | Yes |
| CodeGeeX4 | 9B | LLM | Yes | Yes | Yes | - |
| Molmo | 7B-D, 72B | MLLM | Yes | Yes | Yes | No |
| gpt-oss | 20B, 120B | LLM | Yes | Yes | Yes | Yes |

"-" means not verified yet.

* [1] The TurboMind engine does not support window attention. For models that apply window attention and have the corresponding `use_sliding_window` switch enabled, such as Mistral and Qwen1.5, please choose the PyTorch engine for inference, as shown in the sketch below.
* [2] When a model's head_dim is not 128, as with llama3.2-1B, qwen2-0.5B and internvl2-1B, TurboMind does not support 4-bit/8-bit KV cache quantization and inference for it.
* [3] TurboMind does not currently support the vision encoder for the Qwen3.5 series.
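
To make the quantization columns concrete, here is a minimal sketch using TurboMind's `quant_policy` convention (8 for KV INT8, 4 for KV INT4) and an AWQ checkpoint for the W4A16 column; the model IDs are illustrative:

```python
from lmdeploy import pipeline, TurbomindEngineConfig, PytorchEngineConfig

# "KV INT8" column: enable online 8-bit KV cache quantization.
# Use quant_policy=4 for the "KV INT4" column.
pipe = pipeline("internlm/internlm2_5-7b-chat",
                backend_config=TurbomindEngineConfig(quant_policy=8))

# "W4A16" column: serve a pre-quantized AWQ checkpoint.
pipe = pipeline("internlm/internlm2_5-7b-chat-4bit",
                backend_config=TurbomindEngineConfig(model_format="awq"))

# Footnote [1]: sliding-window models such as Mistral must use the
# PyTorch engine instead of TurboMind.
pipe = pipeline("mistralai/Mistral-7B-Instruct-v0.3",
                backend_config=PytorchEngineConfig())
```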

PyTorchEngine on CUDA Platform

| Model | Size | Type | FP16/BF16 | KV INT8 | KV INT4 | W8A8 | W4A16 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Llama | 7B - 65B | LLM | Yes | Yes | Yes | Yes | Yes |
| Llama2 | 7B - 70B | LLM | Yes | Yes | Yes | Yes | Yes |
| Llama3 | 8B, 70B | LLM | Yes | Yes | Yes | Yes | Yes |
| Llama3.1 | 8B, 70B | LLM | Yes | Yes | Yes | Yes | Yes |
| Llama3.2 | 1B, 3B | LLM | Yes | Yes | Yes | Yes | Yes |
| Llama4 | Scout, Maverick | MLLM | Yes | Yes | Yes | - | - |
| InternLM | 7B - 20B | LLM | Yes | Yes | Yes | Yes | Yes |
| InternLM2 | 7B - 20B | LLM | Yes | Yes | Yes | Yes | Yes |
| InternLM2.5 | 7B | LLM | Yes | Yes | Yes | Yes | Yes |
| InternLM3 | 8B | LLM | Yes | Yes | Yes | Yes | Yes |
| Intern-S1 | 241B | MLLM | Yes | Yes | Yes | Yes | - |
| Intern-S1-mini | 8.3B | MLLM | Yes | Yes | Yes | Yes | - |
| Intern-S1-Pro | 1TB | MLLM | Yes | - | - | - | No |
| Baichuan2 | 7B | LLM | Yes | Yes | Yes | Yes | No |
| Baichuan2 | 13B | LLM | Yes | Yes | Yes | No | No |
| ChatGLM2 | 6B | LLM | Yes | Yes | Yes | No | No |
| YI | 6B - 34B | LLM | Yes | Yes | Yes | Yes | Yes |
| Mistral | 7B | LLM | Yes | Yes | Yes | Yes | Yes |
| Mixtral | 8x7B, 8x22B | LLM | Yes | Yes | Yes | No | No |
| QWen | 1.8B - 72B | LLM | Yes | Yes | Yes | Yes | Yes |
| QWen1.5 | 0.5B - 110B | LLM | Yes | Yes | Yes | Yes | Yes |
| QWen1.5-MoE | A2.7B | LLM | Yes | Yes | Yes | No | No |
| QWen2 | 0.5B - 72B | LLM | Yes | Yes | No | Yes | Yes |
| Qwen2.5 | 0.5B - 72B | LLM | Yes | Yes | No | Yes | Yes |
| Qwen3 | 0.6B - 235B | LLM | Yes | Yes | Yes* | - | Yes* |
| QWen3-Next | 80B | LLM | Yes | No | No | No | No |
| QWen2-VL | 2B, 7B | MLLM | Yes | Yes | No | No | Yes |
| QWen2.5-VL | 3B - 72B | MLLM | Yes | No | No | No | No |
| QWen3-VL | 2B - 235B | MLLM | Yes | No | No | No | No |
| QWen3.5 | 0.8B - 397B | MLLM | Yes | No | No | No | No |
| DeepSeek-MoE | 16B | LLM | Yes | No | No | No | No |
| DeepSeek-V2 | 16B, 236B | LLM | Yes | No | No | No | No |
| DeepSeek-V2.5 | 236B | LLM | Yes | No | No | No | No |
| DeepSeek-V3 | 685B | LLM | Yes | No | No | No | No |
| DeepSeek-V3.2 | 685B | LLM | Yes | No | No | No | No |
| DeepSeek-VL2 | 3B - 27B | MLLM | Yes | No | No | No | No |
| MiniCPM3 | 4B | LLM | Yes | Yes | Yes | No | No |
| MiniCPM-V-2_6 | 8B | MLLM | Yes | No | No | No | Yes |
| Gemma | 2B - 7B | LLM | Yes | Yes | Yes | No | No |
| StarCoder2 | 3B - 15B | LLM | Yes | Yes | Yes | No | No |
| Phi-3-mini | 3.8B | LLM | Yes | Yes | Yes | Yes | Yes |
| Phi-3-vision | 4.2B | MLLM | Yes | Yes | Yes | - | - |
| Phi-4-mini | 3.8B | LLM | Yes | Yes | Yes | Yes | Yes |
| CogVLM-Chat | 17B | MLLM | Yes | Yes | Yes | - | - |
| CogVLM2-Chat | 19B | MLLM | Yes | Yes | Yes | - | - |
| LLaVA(1.5, 1.6)[2] | 7B - 34B | MLLM | No | No | No | No | No |
| InternVL(v1.5) | 2B - 26B | MLLM | Yes | Yes | Yes | No | Yes |
| InternVL2 | 1B - 76B | MLLM | Yes | Yes | Yes | - | - |
| InternVL2.5(MPO) | 1B - 78B | MLLM | Yes | Yes | Yes | - | - |
| InternVL3 | 1B - 78B | MLLM | Yes | Yes | Yes | - | - |
| InternVL3.5 | 1B - 241BA28B | MLLM | Yes | Yes | Yes | No | No |
| Mono-InternVL[1] | 2B | MLLM | Yes | Yes | Yes | - | - |
| ChemVLM | 8B - 26B | MLLM | Yes | Yes | No | - | - |
| Gemma2 | 9B - 27B | LLM | Yes | Yes | Yes | - | - |
| Gemma3 | 1B - 27B | MLLM | Yes | Yes | Yes | - | - |
| GLM-4 | 9B | LLM | Yes | Yes | Yes | No | No |
| GLM-4-0414 | 9B | LLM | Yes | Yes | Yes | - | - |
| GLM-4V | 9B | MLLM | Yes | Yes | Yes | No | Yes |
| GLM-4.1V-Thinking | 9B | MLLM | Yes | Yes | Yes | - | - |
| GLM-4.5 | 355B | LLM | Yes | Yes | Yes | - | - |
| GLM-4.5-Air | 106B | LLM | Yes | Yes | Yes | - | - |
| CodeGeeX4 | 9B | LLM | Yes | Yes | Yes | - | - |
| Phi-3.5-mini | 3.8B | LLM | Yes | Yes | No | - | - |
| Phi-3.5-MoE | 16x3.8B | LLM | Yes | Yes | No | - | - |
| Phi-3.5-vision | 4.2B | MLLM | Yes | Yes | No | - | - |
| SDAR | 1.7B - 30B | LLM | Yes | Yes | No | - | - |
| GLM-4.7-Flash | 30B | LLM | Yes | No | No | No | No |
| GLM-5 | 754B | LLM | Yes | No | No | No | No |
* [1] Currently, Mono-InternVL does not support FP16 due to numerical instability. Please use BF16 instead, as shown in the sketch below.
* [2] The PyTorch engine removed support for the original llava models after v0.6.4. Please use the corresponding transformers-format models instead, which can be found at https://huggingface.co/llava-hf.

Starting from version 0.11.1, PyTorchEngine no longer provides support for mllama.
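
A minimal sketch of footnotes [1] and [2] in code, assuming the engine config's `dtype` field accepts "bfloat16"; the model IDs are illustrative:

```python
from lmdeploy import pipeline, PytorchEngineConfig

# Footnote [1]: force BF16 for Mono-InternVL, since FP16 is numerically unstable.
pipe = pipeline("OpenGVLab/Mono-InternVL-2B",
                backend_config=PytorchEngineConfig(dtype="bfloat16"))

# Footnote [2]: use a transformers-format llava checkpoint from llava-hf.
pipe = pipeline("llava-hf/llava-1.5-7b-hf",
                backend_config=PytorchEngineConfig())
```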

PyTorchEngine on Other Platforms

| Model | Size | Type | Atlas 800T A2<br>FP16/BF16 (eager) | Atlas 800T A2<br>FP16/BF16 (graph) | Atlas 800T A2<br>W8A8 (graph) | Atlas 800T A2<br>W4A16 (eager) | Atlas 300I Duo<br>FP16 (graph) | Atlas 800T A3<br>FP16/BF16 (eager) | Maca C500<br>BF/FP16 | Cambricon<br>BF/FP16 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama2 | 7B - 70B | LLM | Yes | Yes | Yes | Yes | - | Yes | Yes | Yes |
| Llama3 | 8B | LLM | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Llama3.1 | 8B | LLM | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| InternLM2 | 7B - 20B | LLM | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| InternLM2.5 | 7B - 20B | LLM | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| InternLM3 | 8B | LLM | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Mixtral | 8x7B | LLM | Yes | Yes | No | No | Yes | - | Yes | Yes |
| QWen1.5-MoE | A2.7B | LLM | Yes | - | No | No | - | - | Yes | - |
| QWen2(.5) | 7B | LLM | Yes | Yes | Yes | Yes | Yes | - | Yes | Yes |
| QWen2-VL | 2B, 7B | MLLM | Yes | Yes | - | - | - | - | Yes | No |
| QWen2.5-VL | 3B - 72B | MLLM | Yes | Yes | - | - | Yes | - | Yes | No |
| QWen2-MoE | A14.57B | LLM | Yes | - | No | No | - | - | Yes | - |
| QWen3 | 0.6B - 235B | LLM | Yes | Yes | No | No | Yes | Yes | Yes | Yes |
| DeepSeek-V2 | 16B | LLM | No | Yes | No | No | - | - | - | - |
| InternVL(v1.5) | 2B - 26B | MLLM | Yes | - | Yes | Yes | - | - | Yes | - |
| InternVL2 | 1B - 40B | MLLM | Yes | Yes | Yes | Yes | Yes | - | Yes | Yes |
| InternVL2.5 | 1B - 78B | MLLM | Yes | Yes | Yes | Yes | Yes | - | Yes | Yes |
| InternVL3 | 1B - 78B | MLLM | Yes | Yes | Yes | Yes | Yes | - | Yes | Yes |
| CogVLM2-chat | 19B | MLLM | Yes | No | - | - | - | - | Yes | - |
| GLM4V | 9B | MLLM | Yes | No | - | - | - | - | - | - |
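
On these platforms the PyTorch engine is configured the same way as on CUDA. A minimal sketch, assuming the `device_type="ascend"` backend name and the `eager_mode` switch that distinguishes the (eager) and (graph) columns above; the model ID is illustrative:

```python
from lmdeploy import pipeline, PytorchEngineConfig

# Atlas 800T A2 in graph mode (the "FP16/BF16 (graph)" column);
# set eager_mode=True for the "(eager)" columns.
pipe = pipeline("internlm/internlm2_5-7b-chat",
                backend_config=PytorchEngineConfig(device_type="ascend",
                                                   eager_mode=False))
```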