vllm-plugin-FL
April 20, 2026 · View on GitHub
vllm-plugin-FL is a plugin for the vLLM inference/serving framework, built on FlagOS's unified multi-chip backend — including the unified operator library FlagGems and the unified communication library FlagCX. It extends vLLM's capabilities and performance across diverse hardware environments. Without changing vLLM's original interfaces or usage patterns, the same command can run model inference/serving on different chips.
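As an illustration of the unchanged interface, the usual vLLM serving command works as-is once the plugin is installed. This is a sketch: the model name is taken from the supported-model table below, and the port is an arbitrary example.

```shell
# The same command on any supported chip; the plugin swaps in the
# FlagGems operators and FlagCX communication underneath.
vllm serve Qwen/Qwen3-4B --port 8000
```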
Supported Models and Chips
In principle, vllm-plugin-FL can support every model available in vLLM, provided the model does not rely on operators that are not yet implemented. The tables below summarize the current support status of end-to-end verified models and chips, including both fully supported and in-progress ("Merging") entries.
Supported Models
| Model | Status | Reference |
|---|---|---|
| Qwen3.5-397B-A17B | Supported | example |
| Qwen3-Next-80B-A3B | Supported | example |
| Qwen3-4B | Supported | example |
| MiniCPM-o 4.5 | Supported | example |
| GLM-5 | Supported | example |
| Qwen3.5-35B-A3B | Supported | example |
| BAAI/bge-m3 | Supported | implementation |
| MiniMax-M2.7 | Supported | implementation |
Supported Chips
| Chip Vendor | Status | Reference |
|---|---|---|
| NVIDIA | Supported | - |
| Ascend | Supported | - |
| MetaX | Supported | - |
| Pingtouge-Zhenwu | Supported | - |
| Iluvatar | Supported | - |
| Tsingmicro | Merging | PR #52 |
| Moore Threads | Supported | - |
| Hygon | Merging | PR #78 |
Quick Start
Setup
1. Install vLLM from the official v0.19.0 release (optional if the correct version is already installed), or from the fork vllm-FL.

2. Install vllm-plugin-FL

   2.1 Clone the repository:

   ```shell
   git clone https://github.com/flagos-ai/vllm-plugin-FL
   ```

   2.2 Install:

   ```shell
   cd vllm-plugin-FL
   pip install --no-build-isolation .
   # or editable install
   pip install --no-build-isolation -e .
   ```

3. Install FlagGems

   3.1 Install build dependencies:

   ```shell
   pip install -U scikit-build-core==0.11 pybind11 ninja cmake
   ```

   3.2 Install FlagGems:

   ```shell
   git clone https://github.com/flagos-ai/FlagGems
   cd FlagGems
   git checkout v5.0.0
   pip install --no-build-isolation .
   # or editable install
   pip install --no-build-isolation -e .
   ```

4. (Optional) Install FlagCX

   4.1 Clone the repository:

   ```shell
   git clone https://github.com/flagos-ai/FlagCX.git
   cd FlagCX
   git checkout v0.9.0
   git submodule update --init --recursive
   ```

   4.2 Build the library, passing the flag for your platform (e.g., `USE_NVIDIA=1` for NVIDIA):

   ```shell
   make USE_NVIDIA=1
   ```

   4.3 Set the environment variable:

   ```shell
   export FLAGCX_PATH="$PWD"
   ```

   4.4 Install FlagCX:

   ```shell
   cd plugin/torch/
   FLAGCX_ADAPTOR=[xxx] pip install . --no-build-isolation
   # or editable install
   FLAGCX_ADAPTOR=[xxx] pip install -e . --no-build-isolation
   ```

   Note: `[xxx]` should be selected according to the current platform, e.g., `nvidia`, `ascend`, etc.
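After the steps above, a quick sanity check can confirm that the packages are importable. This is a hedged sketch: the import name `flag_gems` for FlagGems is an assumption based on the upstream project, and `vllm` is the standard import name for vLLM.

```python
import importlib.util

def is_installed(module: str) -> bool:
    """Return True if `module` can be found, without importing it."""
    return importlib.util.find_spec(module) is not None

# Import names assumed: "vllm" for vLLM, "flag_gems" for FlagGems.
for mod in ("vllm", "flag_gems"):
    print(f"{mod}: {'installed' if is_installed(mod) else 'missing'}")
```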
If multiple plugins are installed in the current environment, you can select vllm-plugin-FL explicitly by setting `VLLM_PLUGINS='fl'`.
Additional Steps for Ascend
1. Install FlagTree

   ```shell
   RES="--index-url=https://resource.flagos.net/repository/flagos-pypi-hosted/simple --trusted-host=https://resource.flagos.net"
   python3 -m pip install flagtree==0.4.0+ascend3.2 $RES
   ```

2. Set the required environment variable

   ```shell
   export TRITON_ALL_BLOCKS_PARALLEL=1
   ```

3. Enable eager execution

   Ascend requires eager execution. Add `enforce_eager=True` to the `LLM` constructor or pass `--enforce-eager` on the command line.
Run a Task
Offline Batched Inference
With vLLM and vllm-plugin-FL installed, you can generate text for a list of input prompts (i.e., offline batched inference). See the example script: offline_inference, or use the Python script below directly.
```python
from vllm import LLM, SamplingParams

if __name__ == "__main__":
    prompts = [
        "Hello, my name is",
    ]
    # Create a sampling params object.
    sampling_params = SamplingParams(max_tokens=10, temperature=0.0)
    # Create an LLM.
    llm = LLM(model="Qwen/Qwen3-4B", max_num_batched_tokens=16384, max_num_seqs=2048)
    # Generate texts from the prompts.
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
Advanced Usage
For the environment variables that control dispatch, see environment variables usage.
Using the Native CUDA Communication Library

If you want to use the original CUDA communication library, unset the following environment variable:

```shell
unset FLAGCX_PATH
```
Using Native CUDA Operators

If you want to use the original CUDA operators, set the following environment variable:

```shell
export USE_FLAGGEMS=0
```
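The two switches above follow the same pattern: an environment variable gates dispatch between the FlagOS backend and the native CUDA path. The following is a hedged sketch of that convention; the exact parsing inside the plugin may differ.

```python
import os

def flaggems_enabled() -> bool:
    # Assumption: unset or any value other than "0" means FlagGems is used.
    return os.environ.get("USE_FLAGGEMS", "1") != "0"

def flagcx_enabled() -> bool:
    # Assumption: FlagCX is used only when FLAGCX_PATH points at a build.
    return bool(os.environ.get("FLAGCX_PATH"))
```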