MiMo-V2-Flash Usage Guide
April 23, 2026
MiMo-V2-Flash is a high-performance Mixture-of-Experts (MoE) large language model developed by Xiaomi. It features several key architectural innovations:
- A hybrid attention design mixing Full Attention and Sliding Window Attention (SWA) with a 1:5 ratio
- A highly sparse MoE structure with 256 routed experts and sigmoid top-8 routing, giving 309B total parameters of which 15B are active
- Natively trained Multi-Token Prediction (MTP) with 3 independent draft layers for speculative decoding
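To make the sparsity figures in the list above concrete, here is a minimal sketch that derives a few ratios from the quoted numbers only. It assumes that the 1:5 ratio means one Full Attention layer for every five SWA layers; that is a reading of the list, not something checked against the model configuration.

```python
# Back-of-the-envelope ratios, using only the numbers quoted in the list above.
ROUTED_EXPERTS = 256
EXPERTS_PER_TOKEN = 8      # top-8 routing
TOTAL_PARAMS_B = 309       # billions of parameters
ACTIVE_PARAMS_B = 15       # billions of parameters

# Assumption: "1:5" is read as 1 Full Attention layer per 5 SWA layers.
FULL_LAYERS, SWA_LAYERS = 1, 5

expert_fraction = EXPERTS_PER_TOKEN / ROUTED_EXPERTS
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
full_attention_fraction = FULL_LAYERS / (FULL_LAYERS + SWA_LAYERS)

print(f"routed experts selected per token: {expert_fraction:.4f}")
print(f"parameters active:                 {active_fraction:.4f}")
print(f"layers using Full Attention:       {full_attention_fraction:.4f}")
```

Under these assumptions, roughly one parameter in twenty is active and one layer in six attends over the full context.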
Preparing environment
Pull the latest Docker image from https://hub.docker.com/r/rocm/atom/:
docker pull rocm/atom:latest
All operations below are executed inside the container.
Launching server
Serving on 4xMI355X GPUs (TP4, FP8 KV Cache)
python -m atom.entrypoints.openai_server \
--model XiaomiMiMo/MiMo-V2-Flash \
--kv_cache_dtype fp8 -tp 4 --trust-remote-code
Serving on 4xMI355X GPUs (TP4, BF16 KV Cache)
python -m atom.entrypoints.openai_server \
--model XiaomiMiMo/MiMo-V2-Flash \
-tp 4 --trust-remote-code
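Once a server started with one of these commands is running, a quick request confirms that it responds. The sketch below uses only the Python standard library. It assumes the server listens on `localhost:8000` and exposes an OpenAI-style `/v1/completions` route with the usual `choices[0].text` response shape, as the URLs used elsewhere in this guide suggest; the helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: host, port and route follow the URLs used elsewhere in this guide.
BASE_URL = "http://localhost:8000"
MODEL = "XiaomiMiMo/MiMo-V2-Flash"


def build_completion_request(prompt, max_tokens=64):
    """Build (but do not send) a POST request for the completions route."""
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, max_tokens=64, timeout=60):
    """Send the request and return the text of the first completion."""
    request = build_completion_request(prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumption: the response follows the OpenAI completions schema.
    return body["choices"][0]["text"]


# Usage, with a server running:
#   print(complete("The capital of France is"))
```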
Serving with MTP Speculative Decoding
# Only num-speculative-tokens=1 is currently supported
python -m atom.entrypoints.openai_server \
--model XiaomiMiMo/MiMo-V2-Flash \
--kv_cache_dtype fp8 -tp 4 --trust-remote-code \
--method mtp
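With a single speculative token, each target-model step emits one token for certain plus the draft token when it is accepted. The sketch below turns that into an expected tokens-per-step figure. The acceptance rates are hypothetical, and the result is only an upper bound on decode speedup, since drafting and verification add their own cost.

```python
def expected_tokens_per_step(acceptance_rate):
    """Expected tokens emitted per target-model step with one draft token.

    Every step yields one token from the target model; the single draft
    token contributes a second one only when it is accepted.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    return 1.0 + acceptance_rate


# Hypothetical acceptance rates, for illustration only.
for rate in (0.5, 0.7, 0.9):
    tokens = expected_tokens_per_step(rate)
    print(f"acceptance={rate:.1f} -> {tokens:.1f} tokens/step")
```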
Performance Metrics
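The following command benchmarks serving performance. Set the shell variables ISL (input sequence length), OSL (output sequence length), and CONC (maximum concurrency) before running it: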
python -m atom.benchmarks.benchmark_serving \
--model=XiaomiMiMo/MiMo-V2-Flash --backend=vllm --base-url=http://localhost:8000 \
--dataset-name=random \
--random-input-len=${ISL} --random-output-len=${OSL} \
--random-range-ratio=0.8 \
--num-prompts=$(( $CONC * 10 )) \
--max-concurrency=$CONC \
--request-rate=inf --ignore-eos \
--save-result --percentile-metrics="ttft,tpot,itl,e2el"
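To run the benchmark over several input/output lengths and concurrency levels, the command can be generated in a loop. This is a minimal sketch that reuses the flags shown above unchanged; the sweep values and the `benchmark_args` helper are hypothetical. It prints the commands instead of executing them.

```python
import itertools
import shlex

MODEL = "XiaomiMiMo/MiMo-V2-Flash"
BASE_URL = "http://localhost:8000"


def benchmark_args(isl, osl, conc):
    """Return the argument list for one benchmark run (flags as in this guide)."""
    return [
        "python", "-m", "atom.benchmarks.benchmark_serving",
        f"--model={MODEL}", "--backend=vllm", f"--base-url={BASE_URL}",
        "--dataset-name=random",
        f"--random-input-len={isl}", f"--random-output-len={osl}",
        "--random-range-ratio=0.8",
        f"--num-prompts={conc * 10}",   # ten prompts per concurrent slot
        f"--max-concurrency={conc}",
        "--request-rate=inf", "--ignore-eos",
        "--save-result", "--percentile-metrics=ttft,tpot,itl,e2el",
    ]


# Hypothetical sweep values; choose ones that match your workload.
SHAPES = [(1024, 1024), (8192, 1024)]   # (ISL, OSL) pairs
CONCURRENCIES = [4, 16, 64]

for (isl, osl), conc in itertools.product(SHAPES, CONCURRENCIES):
    # Dry run: print each command. To execute it, pass the list to
    # subprocess.run(..., check=True) instead.
    print(shlex.join(benchmark_args(isl, osl, conc)))
```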
Accuracy test
We use the GSM8K dataset for the accuracy test. Install lm-eval first (the quotes keep shells such as zsh from interpreting the brackets):
pip install "lm-eval[api]"
Run the evaluation:
lm_eval \
--model local-completions \
--model_args model=XiaomiMiMo/MiMo-V2-Flash,base_url=http://localhost:8000/v1/completions,num_concurrent=64,max_retries=3,tokenized_requests=False \
--tasks gsm8k \
--num_fewshot 5
Reference values when deploying with TP4 and FP8 KV cache:
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.8279|± |0.0104|
| | |strict-match | 5|exact_match|↑ |0.8211|± |0.0106|
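When comparing your own run against the table above, the reported standard error gives a rough sense of how much variation to expect. The sketch below checks whether a measured score falls within a chosen number of standard errors of the reference. The two-standard-error threshold and the measured values are arbitrary choices for illustration, not an official acceptance criterion.

```python
# Reference values copied from the table above (gsm8k, 5-shot, exact_match).
REFERENCE = {
    "flexible-extract": (0.8279, 0.0104),  # (value, stderr)
    "strict-match": (0.8211, 0.0106),
}


def within_reference(filter_name, measured, num_stderr=2.0):
    """True if `measured` is within `num_stderr` standard errors of the reference."""
    value, stderr = REFERENCE[filter_name]
    return abs(measured - value) <= num_stderr * stderr


# Hypothetical measured scores, for illustration only.
for name, score in (("flexible-extract", 0.8150), ("strict-match", 0.7900)):
    status = "ok" if within_reference(name, score) else "investigate"
    print(f"{name}: {score:.4f} -> {status}")
```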