FastServe

September 5, 2025 · View on GitHub

FastServe is a fast, efficient and scalable inference serving system for Large Language Models (LLMs). It serves as an easy-to-use Python front-end and utilizes a high performance LLM inference C++ library SwiftTransformer.

It is fast with:

preemptive scheduling
continuous batching
custom attention kernels
C++ model implementation

It is memory efficient with:

proactive memory swapping
paged attention kernels

It is scalable with:

megatron-LM tensor parallelism
streaming pipeline parallelism

It currently supports:

OPT (facebook/opt-1.3b, facebook/opt-6.7b, ...)
LLaMA2 (meta-llama/Llama-2-7b, meta-llama/Llama-2-13b, ...)

Build && Install

# git clone the project
git clone git@github.com:LLMServe/FastServe.git && cd FastServe

# setup the fastserve conda environment
conda env create -f environment.yml && conda activate fastserve

# clone and build the SwiftTransformer library  
git clone https://github.com/LLMServe/SwiftTransformer.git && cd SwiftTransformer && git submodule update --init --recursive && cmake -B build && cmake --build build -j$(nproc) && cd ..

# install fastserve
pip install -e .

Artifact Evaluation

See benchmarks/artifact-evaluation/README.md for detailed instructions.

Run

Offline case

python fastserve/examples/offline.py

Online case

# launch api server
python -m fastserve.api_server.fastserve_api_server

# launch client
python fastserve/examples/online.py

Contribution

If you want to contribute to the project, please read contribution.md.

Acknowledgement

The architecture design of FastServe is greatly inspired by vLLM.

Citation

If you use FastServe for your research, please cite our paper:

@misc{wu2023fast,
      title={Fast Distributed Inference Serving for Large Language Models}, 
      author={Bingyang Wu and Yinmin Zhong and Zili Zhang and Gang Huang and Xuanzhe Liu and Xin Jin},
      year={2023},
      eprint={2305.05920},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}