MiniCPM-V 4.6 - SGLang Documentation

May 22, 2026

Note

MiniCPM-V 4.6 has been supported on the official SGLang main branch since PR #24998 (merged 2026-05-12). No fork is required: install directly from upstream and make sure your checkout is at or after that merge commit.

MiniCPM-V 4.6 is registered in transformers>=5.7.0 as a standalone architecture (MiniCPMV4_6ForConditionalGeneration); the SGLang adapter follows that layout.

MiniCPM-V 4.6 ships as two checkpoints: openbmb/MiniCPM-V-4.6 and openbmb/MiniCPM-V-4.6-Thinking (the Thinking variant).

1. Installing SGLang

Install SGLang from upstream main

Until a tagged SGLang release includes PR #24998, install from upstream main with the commands below. Watch SGLang Releases; once a release includes that PR, pip install -U "sglang[all]" is enough.

git clone https://github.com/sgl-project/sglang.git
cd sglang

pip install --upgrade pip
pip install -e "python[all]"

This installs transformers>=5.7.0 automatically, which in turn requires a recent PyTorch (≥ 2.6 at the time of writing). Before installing FlashInfer, check that the resolved torch, CUDA, and transformers versions are the ones you expect:

python -c "import torch, transformers; print('torch', torch.__version__, '| cuda', torch.version.cuda, '| transformers', transformers.__version__)"

Important

FlashInfer wheels are built for a specific (torch, CUDA) combination. Pick the wheel index that matches the torch and CUDA versions you just verified. Do not blindly copy a cu121/torch2.4 URL: that will silently downgrade torch and break the SGLang / transformers install.

The general index lives at https://flashinfer.ai/whl/. Pick the directory matching your environment, for example:

Your torch / CUDA        Index URL
torch 2.6 + CUDA 12.4    https://flashinfer.ai/whl/cu124/torch2.6/
torch 2.6 + CUDA 12.6    https://flashinfer.ai/whl/cu126/torch2.6/
torch 2.7 + CUDA 12.8    https://flashinfer.ai/whl/cu128/torch2.7/
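
To avoid picking the directory by eye, the mapping in the table can be sketched in Python. This is only an illustration: flashinfer_index_url is a hypothetical helper, and the URL layout (cu<major><minor>/torch<major>.<minor>/) is an assumption inferred from the three example rows above. Not every combination is guaranteed to exist on the index, so confirm the resulting URL in a browser before using it.

```python
import re
from typing import Optional


def flashinfer_index_url(torch_version: str, cuda_version: Optional[str]) -> str:
    """Build a FlashInfer wheel index URL from torch / CUDA version strings.

    Assumption: the layout https://flashinfer.ai/whl/cu<XY>/torch<A.B>/ is
    inferred from the example table; it is not an official API.
    """
    torch_match = re.match(r"(\d+)\.(\d+)", torch_version)
    # torch.version.cuda is None on CPU-only builds, so guard against that.
    cuda_match = re.match(r"(\d+)\.(\d+)", cuda_version or "")
    if torch_match is None or cuda_match is None:
        raise ValueError("could not parse torch or CUDA version (CPU-only build?)")
    torch_tag = f"torch{torch_match.group(1)}.{torch_match.group(2)}"
    cuda_tag = f"cu{cuda_match.group(1)}{cuda_match.group(2)}"
    return f"https://flashinfer.ai/whl/{cuda_tag}/{torch_tag}/"


# Literal inputs keep the sketch runnable without torch installed.
print(flashinfer_index_url("2.6.0+cu124", "12.4"))
# → https://flashinfer.ai/whl/cu124/torch2.6/
```

In a real environment you would call it as flashinfer_index_url(torch.__version__, torch.version.cuda) after importing torch.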

Then install via either:

# Method 1: pip from the matching index (may be slow or blocked in mainland China)
pip install flashinfer-python -i <index URL from table above>

# Method 2 — download the matching wheel manually
#   1) Open the index URL in a browser, find a wheel that matches your
#      python version (cp310 / cp311 / ...) and platform (linux_x86_64 / win_amd64)
#   2) pip install <downloaded-wheel.whl>

For everything else (Docker images, CPU-only fallback, etc.) see the official SGLang installation docs.

2. Launching the Inference Server

By default the server downloads weights from the HuggingFace Hub:

python -m sglang.launch_server --model-path openbmb/MiniCPM-V-4.6 --port 30000 --trust-remote-code --dtype bfloat16

Or specify a local path:

python -m sglang.launch_server --model-path /your/local/MiniCPM-V-4.6 --port 30000 --trust-remote-code --dtype bfloat16

To serve the Thinking variant, swap the model id:

python -m sglang.launch_server --model-path openbmb/MiniCPM-V-4.6-Thinking --port 30000 --trust-remote-code --dtype bfloat16
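
The server needs some time to download and load the weights before it accepts requests. A minimal sketch for waiting until it is ready, assuming the server answers GET requests on the OpenAI-compatible /v1/models route at the port used above. wait_for_server is a hypothetical helper and not part of SGLang:

```python
import http.client
import time
import urllib.error
import urllib.request


def wait_for_server(base_url: str, timeout_s: float = 300.0, interval_s: float = 2.0) -> bool:
    """Poll <base_url>/v1/models until it returns HTTP 200 or the timeout expires.

    Assumption: /v1/models is served once the model has finished loading.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, http.client.HTTPException, OSError):
            # Connection refused or reset: the server is not up yet, retry.
            pass
        time.sleep(interval_s)
    return False


# Example (requires a running server):
#   ready = wait_for_server("http://localhost:30000")
```

The function returns False instead of raising, so a launch script can decide whether to retry or abort.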

3. Calling the Service

  • Bash / curl:

    curl -s http://localhost:30000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "MiniCPM-V-4.6",
        "messages": [
          {
            "role": "user",
            "content": [
              {"type": "text", "text": "What is in this image?"},
              {
                "type": "image_url",
                "image_url": {
                  "url": "https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/inference/assets/airplane.jpeg?raw=true"
                }
              }
            ]
          }
        ],
        "max_tokens": 300,
        "top_k": -1
      }'
    
  • Python (OpenAI client):

    from openai import OpenAI
    
    client = OpenAI(base_url="http://localhost:30000/v1", api_key="None")
    
    response = client.chat.completions.create(
        model="MiniCPM-V-4.6",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What is in this image?"},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/inference/assets/airplane.jpeg?raw=true",
                        },
                    },
                ],
            }
        ],
        max_tokens=300,
        extra_body={"top_k": -1, "stop_token_ids": [248044, 248046]},
    )
    
    print(response.choices[0].message.content)
    

If the image URL is not reachable from the machine running the server, replace it with a local file path or a base64 data URL.
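
A minimal sketch of building such a base64 data URL from a local file, using only the standard library. image_to_data_url is a hypothetical helper name, and the image/jpeg fallback for unknown file extensions is an assumption:

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Encode a local image file as a data URL for the image_url field."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        mime_type = "image/jpeg"  # assumption: fall back to JPEG for unknown extensions
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Example (hypothetical local file):
#   {"type": "image_url", "image_url": {"url": image_to_data_url("airplane.jpeg")}}
```

The returned string goes in the same "url" field as the HTTP URL in the requests above. Base64 grows the payload by roughly a third, so very large images are better resized first.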

MiniCPM-V 4.6 uses the Qwen3.5 vocabulary. If the model keeps generating past the end of the answer, pass stop_token_ids = [248044, 248046], as in the Python example above.

For more invocation patterns, see the SGLang documentation.