vllm-plugin-FL

June 25, 2026

vllm-plugin-FL is a plugin for the vLLM inference/serving framework, built on FlagOS's unified multi-chip backend — including the unified operator library FlagGems and the unified communication library FlagCX. It extends vLLM's capabilities and performance across diverse hardware environments. Without changing vLLM's original interfaces or usage patterns, the same command can run model inference/serving on different chips.

Supported Models and Chips

In theory, vllm-plugin-FL can support all models available in vLLM, as long as no unsupported operators are involved. The tables below summarize the current support status of end-to-end verified models and chips, including both fully supported and in-progress ("Merging") entries.

Supported Models

| Model | Status | Reference |
| --- | --- | --- |
| Qwen3.5-397B-A17B | Supported | example |
| Qwen3-Next-80B-A3B | Supported | example |
| Qwen3-4B | Supported | example |
| MiniCPM-o 4.5 | Supported | example |
| GLM-5 | Supported | example |
| Qwen3.5-35B-A3B | Supported | example |
| BAAI/bge-m3 | Supported | implementation |
| MiniMax-M2.7 | Supported | implementation |

Supported Chips

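| Chip Vendor | Status | Reference |
| --- | --- | --- |
| NVIDIA | Supported | - |
| Ascend | Supported | - |
| MetaX | Supported | - |
| T-Head | Supported | - |
| Iluvatar | Supported | - |
| Tsingmicro | Supported | - |
| Moore Threads | Supported | - |
| Hygon | Supported | - |
| Sunrise | Supported | - |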

Quick Start

Setup

  1. Install vLLM v0.20.2 from the official release (skip this step if that version is already installed)

  2. Install vllm-plugin-FL

    2.1 Clone the repository:

    git clone https://github.com/flagos-ai/vllm-plugin-FL
    

    2.2 Install the plugin:

    cd vllm-plugin-FL
    pip install --no-build-isolation .
    # or editable install
    pip install --no-build-isolation -e .
    

    For CUDA-like devices, including CUDA and HIP/ROCm environments that use PyTorch's CUDA dispatch key, build the plugin native extension by setting VLLM_VENDOR=cuda during installation:

    cd vllm-plugin-FL
    VLLM_VENDOR=cuda pip install --no-build-isolation .
    # or editable install
    VLLM_VENDOR=cuda pip install --no-build-isolation -e .
    

    This builds and installs vllm_fl._C, which provides native C++ support required by some graph/custom-op paths, especially when vLLM is installed with VLLM_TARGET_DEVICE=empty.

    If VLLM_VENDOR is not set, vllm-plugin-FL is installed as a Python-only plugin and the native extension is skipped.

  3. Install FlagGems

    3.1 Install Build Dependencies

    pip install -U scikit-build-core==0.11 pybind11 ninja cmake
    

    3.2 Install FlagGems

    git clone https://github.com/flagos-ai/FlagGems
    cd FlagGems
    git checkout 3b2b55c8eda5de44ba3476d26566ecf134db0662
    pip install --no-build-isolation .
    # or editable install
    pip install --no-build-isolation -e .
    
  4. (Optional) Install FlagCX

    4.1 Clone the repository:

    git clone https://github.com/flagos-ai/FlagCX.git
    cd FlagCX
    git checkout v0.9.0
    git submodule update --init --recursive
    

    4.2 Build the library with the flag that matches the target platform (NVIDIA shown here):

    make USE_NVIDIA=1
    

    4.3 Set the environment variable:

    export FLAGCX_PATH="$PWD"
    

    4.4 Install FlagCX

    cd plugin/torch/
    FLAGCX_ADAPTOR=[xxx] pip install . --no-build-isolation
    # or editable install
    FLAGCX_ADAPTOR=[xxx] pip install -e . --no-build-isolation
    

    Note: Replace [xxx] with the adaptor for the current platform, e.g., nvidia or ascend.
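After the steps above, a quick check can confirm what actually ended up in the environment. The following is a minimal sketch, not part of the plugin. The module name `vllm_fl` is taken from `vllm_fl._C` above; `flag_gems` as the import name of FlagGems is an assumption, and the helper names are hypothetical.

```python
from importlib import metadata, util


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def has_module(module_name):
    """Report whether a module can be located.

    For a dotted name such as "vllm_fl._C", the parent package is imported
    as a side effect of the lookup.
    """
    try:
        return util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        return False


def report():
    # "flag_gems" is an assumed import name for FlagGems.
    print("vllm version:", installed_version("vllm"))
    for name in ("vllm_fl", "vllm_fl._C", "flag_gems"):
        print(f"{name}: {'found' if has_module(name) else 'missing'}")


if __name__ == "__main__":
    report()
```

If `vllm_fl` is found but `vllm_fl._C` is missing, the plugin was most likely installed without `VLLM_VENDOR` set, i.e. as a Python-only plugin.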

If multiple plugins are installed in the current environment, you can select vllm-plugin-FL by setting VLLM_PLUGINS='fl'.

Additional Steps for Ascend

  1. Install FlagTree

    RES="--index-url=https://resource.flagos.net/repository/flagos-pypi-hosted/simple --trusted-host=resource.flagos.net"
    python3 -m pip install flagtree==0.4.0+ascend3.2 $RES
    
  2. Set the required environment variable

    export TRITON_ALL_BLOCKS_PARALLEL=1
    
  3. Enable eager execution

    Ascend requires eager execution. Add enforce_eager=True to the LLM constructor or pass --enforce-eager on the command line.
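Steps 2 and 3 can also be applied from a Python script. The sketch below is illustrative only: `ascend_llm_kwargs` and `build_llm` are hypothetical helper names, and it assumes the environment variable is read after it is set, so vLLM is imported lazily inside `build_llm`.

```python
import os


def ascend_llm_kwargs(model, **extra):
    """Build LLM constructor arguments with the Ascend requirements applied."""
    # Step 2: required environment variable.
    os.environ["TRITON_ALL_BLOCKS_PARALLEL"] = "1"
    # Step 3: eager execution is forced, even if the caller passed enforce_eager=False.
    return {**extra, "model": model, "enforce_eager": True}


def build_llm(model="Qwen/Qwen3-4B", **extra):
    """Create an LLM for Ascend; vLLM is imported only after the variable is set."""
    from vllm import LLM

    return LLM(**ascend_llm_kwargs(model, **extra))
```

Calling `build_llm(max_num_seqs=64)` would then behave like passing `enforce_eager=True` to the constructor by hand.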

Run a Task

Offline Batched Inference

With vLLM and vllm-plugin-FL installed, you can generate text for a list of input prompts (i.e., offline batched inference). See the example script offline_inference, or use the Python script below directly.

from vllm import LLM, SamplingParams
import torch
from vllm.config.compilation import CompilationConfig


if __name__ == '__main__':
    prompts = [
        "Hello, my name is",
    ]
    # Create a sampling params object.
    sampling_params = SamplingParams(max_tokens=10, temperature=0.0)
    # Create an LLM.
    llm = LLM(model="Qwen/Qwen3-4B", max_num_batched_tokens=16384, max_num_seqs=2048)
    # Generate texts from the prompts.
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
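When comparing runs across chips, it can help to collect the results in a plain structure instead of printing them. The following is a small sketch with a hypothetical helper, `summarize`, that reads the same fields as the script above plus `token_ids` and `finish_reason` from the first completion. The stand-in object only mimics the shape of a vLLM result so the helper can be tried without loading a model.

```python
from types import SimpleNamespace


def summarize(outputs):
    """Turn generation results into a list of plain dictionaries."""
    rows = []
    for output in outputs:
        first = output.outputs[0]  # first completion for this prompt
        rows.append(
            {
                "prompt": output.prompt,
                "text": first.text,
                "num_tokens": len(first.token_ids),
                "finish_reason": first.finish_reason,
            }
        )
    return rows


# Stand-in for one result of llm.generate(); the values are made up.
fake = SimpleNamespace(
    prompt="Hello, my name is",
    outputs=[SimpleNamespace(text=" Alice", token_ids=[1, 2], finish_reason="length")],
)
print(summarize([fake]))
```

In the script above, `summarize(outputs)` could replace the printing loop.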

Advanced use

For the environment variables that control operator dispatch, see environment variables usage.

Using the CUDA communication library

If you want to use the original CUDA communication library instead of FlagCX, unset the following environment variable.

unset FLAGCX_PATH

Using native CUDA operators

If you want to use the original CUDA operators instead of FlagGems, set the following environment variable.

export USE_FLAGGEMS=0
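Both switches can be applied from Python as well. This is a minimal sketch with a hypothetical helper name; it assumes the variables are read when vLLM and the plugin are initialized, so it should be called before importing vllm.

```python
import os


def use_native_backends(native_comm=True, native_ops=True, env=os.environ):
    """Fall back to the native CUDA communication library and/or operators."""
    if native_comm:
        # Equivalent to `unset FLAGCX_PATH`; harmless if it was never set.
        env.pop("FLAGCX_PATH", None)
    if native_ops:
        # Equivalent to `export USE_FLAGGEMS=0`.
        env["USE_FLAGGEMS"] = "0"
    return env
```

For example, `use_native_backends(native_comm=False)` keeps FlagCX in place and only switches the operators back to the native CUDA ones.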