MiniCPM-V 4.6 - SGLang Documentation
May 22, 2026 · View on GitHub
Note
MiniCPM-V 4.6 is supported on the official SGLang main branch since PR #24998 (merged 2026-05-12). No fork is required — install directly from upstream, making sure your checkout is at or after that commit.
MiniCPM-V 4.6 is registered in transformers>=5.7.0 as a standalone architecture (MiniCPMV4_6ForConditionalGeneration); the SGLang adapter follows that layout.
MiniCPM-V 4.6 ships as two checkpoints:
- Instruct —
openbmb/MiniCPM-V-4.6 - Thinking —
openbmb/MiniCPM-V-4.6-Thinking
1. Installing SGLang
Install SGLang from upstream main
Until a tagged SGLang release ships with #24998, install from upstream main (watch SGLang Releases; once a release includes that PR, pip install -U "sglang[all]" is enough):
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install --upgrade pip
pip install -e "python[all]"
transformers>=5.7.0 is installed automatically — this in turn requires a
recent PyTorch (≥ 2.6 at the time of writing). Verify the resolved versions
match what FlashInfer needs before installing FlashInfer below:
python -c "import torch, transformers; print('torch', torch.__version__, '| cuda', torch.version.cuda, '| transformers', transformers.__version__)"
Installing FlashInfer (optional but recommended)
Important
FlashInfer wheels are pinned to a specific (torch, cuda) combo. Pick the
wheel index that matches the torch + CUDA you just verified above —
don't blindly copy a cu121/torch2.4 URL, that will silently downgrade
torch and break the SGLang / transformers install.
The general index lives at https://flashinfer.ai/whl/. Pick the directory matching your environment, for example:
| Your torch / CUDA | Index URL |
|---|---|
| torch 2.6 + CUDA 12.4 | https://flashinfer.ai/whl/cu124/torch2.6/ |
| torch 2.6 + CUDA 12.6 | https://flashinfer.ai/whl/cu126/torch2.6/ |
| torch 2.7 + CUDA 12.8 | https://flashinfer.ai/whl/cu128/torch2.7/ |
Then install via either:
# Method 1 — pip from the right index (slow / blocked in CN)
pip install flashinfer-python -i <index URL from table above>
# Method 2 — download the matching wheel manually
# 1) Open the index URL in a browser, find a wheel that matches your
# python version (cp310 / cp311 / ...) and platform (linux_x86_64 / win_amd64)
# 2) pip install <downloaded-wheel.whl>
For everything else (Docker images, CPU-only fallback, etc.) see the official SGLang installation docs.
2. Launching the Inference Server
By default the server downloads weights from the HuggingFace Hub:
python -m sglang.launch_server --model-path openbmb/MiniCPM-V-4.6 --port 30000 --trust-remote-code --dtype bfloat16
Or specify a local path:
python -m sglang.launch_server --model-path /your/local/MiniCPM-V-4.6 --port 30000 --trust-remote-code --dtype bfloat16
To serve the Thinking variant, swap the model id:
python -m sglang.launch_server --model-path openbmb/MiniCPM-V-4.6-Thinking --port 30000 --trust-remote-code --dtype bfloat16
3. Calling the Service
-
Bash / curl:
curl -s http://localhost:30000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "MiniCPM-V-4.6", "messages": [ { "role": "user", "content": [ {"type": "text", "text": "What is in this image?"}, { "type": "image_url", "image_url": { "url": "https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/inference/assets/airplane.jpeg?raw=true" } } ] } ], "max_tokens": 300, "top_k": -1 }' -
Python (OpenAI client):
from openai import OpenAI client = OpenAI(base_url="http://localhost:30000/v1", api_key="None") response = client.chat.completions.create( model="MiniCPM-V-4.6", messages=[ { "role": "user", "content": [ {"type": "text", "text": "What is in this image?"}, { "type": "image_url", "image_url": { "url": "https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/inference/assets/airplane.jpeg?raw=true", }, }, ], } ], max_tokens=300, extra_body={"top_k": -1, "stop_token_ids": [248044, 248046]}, ) print(response.choices[0].message.content)
If
image_urlis not reachable from your machine, replace it with a local path / base64 data URL.v4.6 uses the Qwen3.5 vocabulary — pass
stop_token_ids = [248044, 248046]if you observe the model continuing past the answer.For more invocation patterns, see the SGLang documentation.