InternVL
March 19, 2026 ยท View on GitHub
LMDeploy supports the following InternVL series of models, which are detailed in the table below:
| Model | Size | Supported Inference Engine |
|---|---|---|
| InternVL | 13B-19B | TurboMind |
| InternVL1.5 | 2B-26B | TurboMind, PyTorch |
| InternVL2 | 4B | PyTorch |
| InternVL2 | 1B-2B, 8B-76B | TurboMind, PyTorch |
| InternVL2.5/2.5-MPO/3 | 1B-78B | TurboMind, PyTorch |
| Mono-InternVL | 2B | PyTorch |
The next chapter demonstrates how to deploy an InternVL model using LMDeploy, with InternVL2-8B as an example.
Installation
Please install LMDeploy by following the installation guide, and install other packages that InternVL2 needs
pip install timm
# It is recommended to find the whl package that matches the environment from the releases on https://github.com/Dao-AILab/flash-attention.
pip install flash-attn
Or, you can build a docker image to set up the inference environment. If the CUDA version on your host machine is >=12.4, you can run:
docker build --build-arg CUDA_VERSION=cu12 -t openmmlab/lmdeploy:internvl . -f ./docker/InternVL_Dockerfile
Otherwise, you can go with:
git clone https://github.com/InternLM/lmdeploy.git
cd lmdeploy
docker build --build-arg CUDA_VERSION=cu11 -t openmmlab/lmdeploy:internvl . -f ./docker/InternVL_Dockerfile
Offline inference
The following sample code shows the basic usage of VLM pipeline. For detailed information, please refer to VLM Offline Inference Pipeline
from lmdeploy import pipeline
from lmdeploy.vl import load_image
pipe = pipeline('OpenGVLab/InternVL2-8B')
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe((f'describe this image', image))
print(response)
More examples are listed below:
multi-image multi-round conversation, combined images
from lmdeploy import pipeline, GenerationConfig
from lmdeploy.vl.constants import IMAGE_TOKEN
pipe = pipeline('OpenGVLab/InternVL2-8B', log_level='INFO')
messages = [
dict(role='user', content=[
dict(type='text', text=f'{IMAGE_TOKEN}{IMAGE_TOKEN}\nDescribe the two images in detail.'),
dict(type='image_url', image_url=dict(max_dynamic_patch=12, url='https://raw.githubusercontent.com/OpenGVLab/InternVL/main/internvl_chat/examples/image1.jpg')),
dict(type='image_url', image_url=dict(max_dynamic_patch=12, url='https://raw.githubusercontent.com/OpenGVLab/InternVL/main/internvl_chat/examples/image2.jpg'))
])
]
out = pipe(messages, gen_config=GenerationConfig(top_k=1))
messages.append(dict(role='assistant', content=out.text))
messages.append(dict(role='user', content='What are the similarities and differences between these two images.'))
out = pipe(messages, gen_config=GenerationConfig(top_k=1))
multi-image multi-round conversation, separate images
from lmdeploy import pipeline, GenerationConfig
from lmdeploy.vl.constants import IMAGE_TOKEN
pipe = pipeline('OpenGVLab/InternVL2-8B', log_level='INFO')
messages = [
dict(role='user', content=[
dict(type='text', text=f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\nDescribe the two images in detail.'),
dict(type='image_url', image_url=dict(max_dynamic_patch=12, url='https://raw.githubusercontent.com/OpenGVLab/InternVL/main/internvl_chat/examples/image1.jpg')),
dict(type='image_url', image_url=dict(max_dynamic_patch=12, url='https://raw.githubusercontent.com/OpenGVLab/InternVL/main/internvl_chat/examples/image2.jpg'))
])
]
out = pipe(messages, gen_config=GenerationConfig(top_k=1))
messages.append(dict(role='assistant', content=out.text))
messages.append(dict(role='user', content='What are the similarities and differences between these two images.'))
out = pipe(messages, gen_config=GenerationConfig(top_k=1))
video multi-round conversation
import numpy as np
from lmdeploy import pipeline, GenerationConfig
from decord import VideoReader, cpu
from lmdeploy.vl.constants import IMAGE_TOKEN
from lmdeploy.vl import encode_image_base64
from PIL import Image
pipe = pipeline('OpenGVLab/InternVL2-8B', log_level='INFO')
def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
if bound:
start, end = bound[0], bound[1]
else:
start, end = -100000, 100000
start_idx = max(first_idx, round(start * fps))
end_idx = min(round(end * fps), max_frame)
seg_size = float(end_idx - start_idx) / num_segments
frame_indices = np.array([
int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
for idx in range(num_segments)
])
return frame_indices
def load_video(video_path, bound=None, num_segments=32):
vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
max_frame = len(vr) - 1
fps = float(vr.get_avg_fps())
pixel_values_list, num_patches_list = [], []
frame_indices = get_index(bound, fps, max_frame, first_idx=0, num_segments=num_segments)
imgs = []
for frame_index in frame_indices:
img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB')
imgs.append(img)
return imgs
video_path = 'red-panda.mp4'
imgs = load_video(video_path, num_segments=8)
question = ''
for i in range(len(imgs)):
question = question + f'Frame{i+1}: {IMAGE_TOKEN}\n'
question += 'What is the red panda doing?'
content = [{'type': 'text', 'text': question}]
for img in imgs:
content.append({'type': 'image_url', 'image_url': {'max_dynamic_patch': 1, 'url': f'data:image/jpeg;base64,{encode_image_base64(img)}'}})
messages = [dict(role='user', content=content)]
out = pipe(messages, gen_config=GenerationConfig(top_k=1))
messages.append(dict(role='assistant', content=out.text))
messages.append(dict(role='user', content='Describe this video in detail. Don\'t repeat.'))
out = pipe(messages, gen_config=GenerationConfig(top_k=1))
Online serving
You can launch the server by the lmdeploy serve api_server CLI:
lmdeploy serve api_server OpenGVLab/InternVL2-8B
You can also start the service using the aforementioned built docker image:
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=<secret>" \
-p 23333:23333 \
--ipc=host \
openmmlab/lmdeploy:internvl \
lmdeploy serve api_server OpenGVLab/InternVL2-8B
The docker compose is another option. Create a docker-compose.yml configuration file in the root directory of the lmdeploy project as follows:
version: '3.5'
services:
lmdeploy:
container_name: lmdeploy
image: openmmlab/lmdeploy:internvl
ports:
- "23333:23333"
environment:
HUGGING_FACE_HUB_TOKEN: <secret>
volumes:
- ~/.cache/huggingface:/root/.cache/huggingface
stdin_open: true
tty: true
ipc: host
command: lmdeploy serve api_server OpenGVLab/InternVL2-8B
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: "all"
capabilities: [gpu]
Then, you can execute the startup command as below:
docker-compose up -d
If you find the following logs after running docker logs -f lmdeploy, it means the service launches successfully.
HINT: Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
HINT: Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
HINT: Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
INFO: Started server process [2439]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:23333 (Press CTRL+C to quit)
The arguments of lmdeploy serve api_server can be reviewed in detail by lmdeploy serve api_server -h.
More information about api_server as well as how to access the service can be found from here