Get Started with Huawei Ascend
September 8, 2025
We currently support running lmdeploy on Atlas 800T A3, Atlas 800T A2, and Atlas 300I Duo. Using lmdeploy on a Huawei Ascend device is almost identical to using it on CUDA with the PytorchEngine. Please read the original Get Started guide before reading this tutorial.
Here is the supported model list.
Important
We have uploaded Docker images built for Kunpeng CPUs to Aliyun. Pull the image for your device with the corresponding command:
Atlas 800T A3:
docker pull crpi-4crprmm5baj1v8iv.cn-hangzhou.personal.cr.aliyuncs.com/lmdeploy_dlinfer/ascend:a3-latest
(Atlas 800T A3 currently supports only the Qwen series, in eager mode.)
Atlas 800T A2:
docker pull crpi-4crprmm5baj1v8iv.cn-hangzhou.personal.cr.aliyuncs.com/lmdeploy_dlinfer/ascend:a2-latest
Atlas 300I Duo:
docker pull crpi-4crprmm5baj1v8iv.cn-hangzhou.personal.cr.aliyuncs.com/lmdeploy_dlinfer/ascend:300i-duo-latest
(Atlas 300I Duo currently works only in graph mode.)
To build the environment yourself, refer to the Dockerfiles here.
Offline batch inference
LLM inference
Set device_type="ascend" in the PytorchEngineConfig:
from lmdeploy import pipeline, PytorchEngineConfig

# device_type="ascend" selects the Ascend backend of PytorchEngine.
pipe = pipeline("internlm/internlm2_5-7b-chat",
                backend_config=PytorchEngineConfig(tp=1, device_type="ascend"))
questions = ["Shanghai is", "Please introduce China", "How are you?"]
response = pipe(questions)
print(response)
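Sampling is controlled the same way as on CUDA, by passing a GenerationConfig to the pipeline call; nothing about it is Ascend-specific. A minimal sketch (the sampling values are illustrative, not tuned recommendations):

from lmdeploy import pipeline, GenerationConfig, PytorchEngineConfig

pipe = pipeline("internlm/internlm2_5-7b-chat",
                backend_config=PytorchEngineConfig(tp=1, device_type="ascend"))
# Illustrative sampling parameters; adjust them for your workload.
gen_config = GenerationConfig(max_new_tokens=256, top_p=0.8, temperature=0.7)
response = pipe(["Please introduce China"], gen_config=gen_config)
print(response)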
VLM inference
Set device_type="ascend" in the PytorchEngineConfig:
from lmdeploy import pipeline, PytorchEngineConfig
from lmdeploy.vl import load_image

# device_type='ascend' selects the Ascend backend of PytorchEngine.
pipe = pipeline('OpenGVLab/InternVL2-2B',
                backend_config=PytorchEngineConfig(tp=1, device_type='ascend'))
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)
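The pipeline also accepts a batch of (prompt, image) pairs in a single call. A short sketch, assuming a second example image URL from the same mmdeploy test-data directory:

from lmdeploy import pipeline, PytorchEngineConfig
from lmdeploy.vl import load_image

pipe = pipeline('OpenGVLab/InternVL2-2B',
                backend_config=PytorchEngineConfig(tp=1, device_type='ascend'))
# Batch several (prompt, image) pairs in one call.
image_urls = [
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg',
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/human-pose.jpg',
]
prompts = [('describe this image', load_image(url)) for url in image_urls]
responses = pipe(prompts)
print(responses)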
Online serving
Serve an LLM model
Add --device ascend to the serve command.
lmdeploy serve api_server --backend pytorch --device ascend internlm/internlm2_5-7b-chat
Run the following command to launch a Docker container for lmdeploy LLM serving:
docker run -it --net=host crpi-4crprmm5baj1v8iv.cn-hangzhou.personal.cr.aliyuncs.com/lmdeploy_dlinfer/ascend:a2-latest \
bash -i -c "lmdeploy serve api_server --backend pytorch --device ascend internlm/internlm2_5-7b-chat"
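The server exposes an OpenAI-compatible API, listening on port 23333 by default (change it with --server-port). A minimal client sketch using the openai Python package; the model name must match the served model, and the api_key is a placeholder since the server requires none by default:

from openai import OpenAI

# lmdeploy's api_server listens on port 23333 unless --server-port is set.
client = OpenAI(base_url='http://0.0.0.0:23333/v1', api_key='dummy')
resp = client.chat.completions.create(
    model='internlm/internlm2_5-7b-chat',
    messages=[{'role': 'user', 'content': 'How are you?'}],
)
print(resp.choices[0].message.content)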
Serve a VLM model
Add --device ascend to the serve command.
lmdeploy serve api_server --backend pytorch --device ascend OpenGVLab/InternVL2-2B
Run the following command to launch a Docker container for lmdeploy VLM serving:
docker run -it --net=host crpi-4crprmm5baj1v8iv.cn-hangzhou.personal.cr.aliyuncs.com/lmdeploy_dlinfer/ascend:a2-latest \
bash -i -c "lmdeploy serve api_server --backend pytorch --device ascend OpenGVLab/InternVL2-2B"
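Querying the VLM server goes through the same OpenAI-compatible endpoint; an image can be attached as an image_url content part. A hedged sketch (the image URL is just the example used above):

from openai import OpenAI

client = OpenAI(base_url='http://0.0.0.0:23333/v1', api_key='dummy')
resp = client.chat.completions.create(
    model='OpenGVLab/InternVL2-2B',
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'describe this image'},
            {'type': 'image_url', 'image_url': {
                'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg'}},
        ],
    }],
)
print(resp.choices[0].message.content)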
Inference with the command-line interface
Add --device ascend to the chat command.
lmdeploy chat internlm/internlm2_5-7b-chat --backend pytorch --device ascend
Run the following command to start the container and launch an lmdeploy chat session:
docker run -it crpi-4crprmm5baj1v8iv.cn-hangzhou.personal.cr.aliyuncs.com/lmdeploy_dlinfer/ascend:a2-latest \
bash -i -c "lmdeploy chat --backend pytorch --device ascend internlm/internlm2_5-7b-chat"
Quantization
w4a16 AWQ
Run the following command to quantize weights on Atlas 800T A2.
lmdeploy lite auto_awq $HF_MODEL --work-dir $WORK_DIR --device npu
Please check supported_models before using this feature.
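The quantized model written to $WORK_DIR can then be loaded like any other local model. A sketch under the assumption that PytorchEngine picks up the quantization settings saved in the output directory's config ('/path/to/work_dir' stands in for your $WORK_DIR):

from lmdeploy import pipeline, PytorchEngineConfig

# Load the AWQ-quantized output directory produced by `lmdeploy lite auto_awq`.
pipe = pipeline('/path/to/work_dir',
                backend_config=PytorchEngineConfig(tp=1, device_type='ascend'))
print(pipe(['Please introduce China']))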
w8a8 SMOOTH_QUANT
Run the following command to quantize weights on Atlas 800T A2.
lmdeploy lite smooth_quant $HF_MODEL --work-dir $WORK_DIR --device npu
Please check supported_models before using this feature.
int8 KV-cache Quantization
The Ascend backend supports offline int8 KV-cache quantization in eager mode.
Please refer to this doc for details.
Limitations on 300I Duo
- Only dtype=float16 is supported.
- Only graph mode is supported; do not add --eager-mode. A combined configuration sketch follows this list.
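Putting both constraints together, a pipeline on the 300I Duo could be configured as below (a minimal sketch; eager_mode already defaults to False and is spelled out only for emphasis):

from lmdeploy import pipeline, PytorchEngineConfig

# 300I Duo: float16 only, graph mode only (keep eager_mode=False).
pipe = pipeline('internlm/internlm2_5-7b-chat',
                backend_config=PytorchEngineConfig(tp=1,
                                                   device_type='ascend',
                                                   dtype='float16',
                                                   eager_mode=False))
print(pipe(['How are you?']))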