README.md

January 23, 2026

FLOOD

FLOOD is a throughput-oriented inference framework with pipeline parallelism and a segmentable cache.

News and Updates 🔥

  • [2025/10] We support Lookahead in hybrid linear models, including Ring-mini-linear-2.0 and Ring-flash-linear-2.0.
  • [2025/09] We release segment linear attention for better performance.
  • [2025/05] We integrate Lookahead into FLOOD.
  • [2025/03] We release the code of our inference framework FLOOD.

Introduction

Flood is an inference framework designed for offline applications. It uses pipeline parallelism (PP) to avoid the communication costs associated with tensor parallelism (TP), and it applies scheduling strategies tailored to offline inference to maximize GPU utilization.
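To make the communication argument concrete, here is a rough back-of-the-envelope sketch. It assumes Megatron-style TP (two ring all-reduces per transformer layer in the forward pass) and a PP layout in which each stage forwards one hidden-state vector per token to the next stage. The function names and the model sizes (32 layers, hidden size 4096, 8 GPUs, 2-byte activations) are hypothetical and chosen only for illustration; they are not taken from Flood.

```python
def tp_bytes_per_gpu_per_token(num_layers, hidden_size, tp_degree, bytes_per_elem=2):
    """Bytes each GPU sends per token under the assumed Megatron-style TP.

    Each layer runs two all-reduces; a ring all-reduce makes every GPU
    send 2 * (n - 1) / n times the message size.
    """
    message = hidden_size * bytes_per_elem
    per_allreduce = 2 * (tp_degree - 1) / tp_degree * message
    return 2 * num_layers * per_allreduce


def pp_bytes_per_gpu_per_token(hidden_size, bytes_per_elem=2):
    """Bytes a pipeline stage sends per token: one hidden-state handoff."""
    return hidden_size * bytes_per_elem


# Hypothetical model and cluster sizes, for illustration only.
tp = tp_bytes_per_gpu_per_token(num_layers=32, hidden_size=4096, tp_degree=8)
pp = pp_bytes_per_gpu_per_token(hidden_size=4096)
print(f"TP: {tp:.0f} bytes/token/GPU, PP: {pp} bytes/token/GPU, ratio: {tp / pp:.0f}x")
```

The estimate ignores latency, overlap with compute, and pipeline bubbles, so it only indicates why PP can be attractive when throughput matters more than per-request latency.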

Furthermore, Flood manages the kvcache with segmentable blocks instead of paged blocks, so that the kvcache of each request stays more contiguous in memory.
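The idea of handing each request one contiguous range of cache slots, rather than a set of scattered fixed-size pages, can be illustrated with a toy first-fit allocator. This is only a sketch under our own assumptions: the `SegmentCache` class and its methods are hypothetical and do not reflect Flood's actual data structures or API.

```python
class SegmentCache:
    """Toy first-fit allocator over a flat range of token slots (illustrative only)."""

    def __init__(self, capacity):
        # Free space is tracked as a sorted list of (start, length) segments.
        self.free = [(0, capacity)]

    def allocate(self, length):
        """Return a contiguous (start, length) segment, or None if nothing fits."""
        for i, (start, size) in enumerate(self.free):
            if size >= length:
                if size == length:
                    self.free.pop(i)
                else:
                    # Split the free segment and keep the remainder.
                    self.free[i] = (start + length, size - length)
                return (start, length)
        return None

    def release(self, segment):
        """Return a segment to the free list and merge adjacent free segments."""
        self.free.append(segment)
        self.free.sort()
        merged = []
        for start, size in self.free:
            if merged and merged[-1][0] + merged[-1][1] == start:
                prev_start, prev_size = merged[-1]
                merged[-1] = (prev_start, prev_size + size)
            else:
                merged.append((start, size))
        self.free = merged


cache = SegmentCache(capacity=1024)
a = cache.allocate(300)   # one contiguous range for request A
b = cache.allocate(200)   # one contiguous range for request B
cache.release(a)          # A finishes; its slots become a free segment
print(a, b, cache.free)
```

Because every request owns a single range, an attention kernel can read its keys and values with simple offset arithmetic instead of following a page table.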

Additionally, we have developed an attention kernel, termed SegmentAttention, to function with the segmentable kvcache. Flood currently supports a range of features, including:

  • Zero-overhead continuous batching
  • Chunked prefill
  • Inference of quantized (FP8/INT8) models
  • Inference of multi-modal models
  • Streaming inference
  • PPL (Perplexity) evaluation
  • Sampling methods
  • Multi-node inference (experimental)

Our framework is undergoing rapid iteration, which may result in some features having bugs. If you encounter any issues, please feel free to report them.

Models we support

  • Ling MoE Linear V1, V2
  • Ling MoE V1, V2
  • Ling
  • Llama
  • Qwen
  • Qwen3
  • Deepseek V1, V2, V3

Roadmap

  • Improve prefill performance with Prefix caching.

  • Improve performance with CUDA-Graph.

  • Implement segment attention with CUTE for better performance, especially with FP8 kvcache.

  • Reduce pickle/unpickle overhead in multiprocessing.Queue.

Performance Comparison

Throughput

Performance is measured in tokens per second (token/s) of generated tokens. The vLLM version is 0.6.6.post2; we enable chunked prefill with a chunk size of 2048 and leave all other parameters at their defaults. The model architecture of Ling can be found in the Ling technical report.

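| model | dataset | GPU | vLLM (token/s) | flood (token/s) | speedup |
|---|---|---|---|---|---|
| Llama3-8B | shareGPT | 1 * A100 | 3201 | 4529 | 1.41 |
| Ling-Lite | shareGPT | 1 * H20 | 4355 | 5869 | 1.35 |
| Ling-Lite | shareGPT | 1 * A100 | 3576 | 5451 | 1.52 |
| Ling-Plus(FP8) | shareGPT | 8 * H20 | 2742 | 6569 | 2.40 |
| Ring-Mini-Linear-V2 | shareGPT | 1 * A100 | 4992.03 | 6777.64 | 1.36 |
| Ring-Mini-Linear-V2 | shareGPT | 1 * H20 | 6016.04 | 9117.56 | 1.52 |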
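The speedup column is the ratio of flood throughput to vLLM throughput. As a quick sanity check, the short script below recomputes it from the numbers reported above; the `RESULTS` list and the `speedup` helper are ours, written only for this check.

```python
# (model, GPU, vLLM token/s, flood token/s, reported speedup), copied from the table above.
RESULTS = [
    ("Llama3-8B", "1 * A100", 3201, 4529, 1.41),
    ("Ling-Lite", "1 * H20", 4355, 5869, 1.35),
    ("Ling-Lite", "1 * A100", 3576, 5451, 1.52),
    ("Ling-Plus(FP8)", "8 * H20", 2742, 6569, 2.40),
    ("Ring-Mini-Linear-V2", "1 * A100", 4992.03, 6777.64, 1.36),
    ("Ring-Mini-Linear-V2", "1 * H20", 6016.04, 9117.56, 1.52),
]


def speedup(baseline_tps, flood_tps):
    """Throughput ratio: how many times faster flood is than the baseline."""
    return flood_tps / baseline_tps


for model, gpu, vllm_tps, flood_tps, reported in RESULTS:
    computed = speedup(vllm_tps, flood_tps)
    print(f"{model:20s} {gpu:9s} computed={computed:.2f} reported={reported:.2f}")
```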

Kernels

Seg-attn

Performance of Seg-attn is measured in TFLOPS (tera floating-point operations per second). The number of attention heads is 64, the number of KV heads is 8, and the KV head dimension is 128. As baselines, we use flash_attn_2_cuda.varlen_fwd from flash-attn-2 on A100 and flash_attn_3_cuda.fwd from flash-attn-3 on H20. More details can be found in benchmark/ops/bench_seg_attn.py.

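| Device | BatchSize | Q_len | K_len | flash-attn (TFLOPS) | seg-attn (TFLOPS) | speedup |
|---|---|---|---|---|---|---|
| A100 | 1 | 1024 | 1024 | 99.19 | 107.35 | 1.08 |
| A100 | 128 | 1 | 1024 | 10.65 | 13.56 | 1.27 |
| H20 | 1 | 1024 | 1024 | 90.28 | 96.05 | 1.06 |
| H20 | 128 | 1 | 1024 | 7.16 | 22.63 | 3.16 |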
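For readers unfamiliar with how a TFLOPS figure is derived from a kernel timing, the sketch below uses the common accounting for a non-causal attention forward pass: two matrix multiplications (QK^T and PV), each costing 2 * q_len * k_len * head_dim operations per query head. This is an assumption on our part; benchmark/ops/bench_seg_attn.py may count differently (for example, halving the work for causal masking). The helper names and the elapsed time are hypothetical.

```python
def attention_forward_flops(batch, num_heads, q_len, k_len, head_dim):
    """FLOPs of a non-causal attention forward pass under the usual accounting.

    QK^T and PV each cost 2 * q_len * k_len * head_dim per query head.
    With grouped KV heads the count is unchanged, since every query head
    still performs both matmuls.
    """
    return 4 * batch * num_heads * q_len * k_len * head_dim


def tflops(flops, seconds):
    """Convert an operation count and an elapsed time into TFLOPS."""
    return flops / seconds / 1e12


# Shape from the first table row: batch 1, 64 query heads, Q_len = K_len = 1024, dim 128.
flops = attention_forward_flops(batch=1, num_heads=64, q_len=1024, k_len=1024, head_dim=128)
elapsed_seconds = 0.35e-3  # hypothetical kernel time, not a measured value
print(f"{flops} FLOPs -> {tflops(flops, elapsed_seconds):.1f} TFLOPS")
```

With Q_len = 1 (decoding) the operation count per key is small, so the kernel is typically limited by memory bandwidth rather than compute, which is consistent with the much lower TFLOPS in the batch-128 rows.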

Seg-linear-attn

Performance of Seg-linear-attn is measured in microseconds (µs); lower is better. The number of attention heads is 16, the number of KV heads is 16, and the KV head dimension is 128. As baselines, we use fla.ops.simple_gla.chunk_simple_gla from flash-linear-attention for prefilling and fla.ops.simple_gla.fused_recurrent.fused_recurrent_simple_gla from flash-linear-attention for decoding. The test device is H20. More details can be found in benchmark/ops/bench_seg_la.py.

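| BatchSize | Seq_len | flash-linear-attention (µs) | seg-linear-attn (µs) | speedup |
|---|---|---|---|---|
| 1 | 1024 | 245.5 | 180.1 | 1.36 |
| 2 | 1024 | 227.4 | 132.5 | 1.72 |
| 64 | 1 | 129.5 | 51.8 | 2.50 |
| 256 | 1 | 190.4 | 132.0 | 1.44 |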

Installation

  1. Clone this repository and navigate to the flood directory
     git clone https://github.com/alipay/PainlessInferenceAcceleration.git
     cd PainlessInferenceAcceleration/flood
  2. Install the package
     python setup.py install

Requirements

We mainly develop and benchmark in the environment below; lower versions may also work.

  • cuda >= 12.4 (higher is better)
  • torch >= 2.5.0 (higher is better)
  • triton >= 3.1.0 (higher is better)
  • accelerate >= 1.4.0
  • transformers >= 4.54.0
  • flash-attn >= 2.6.3 is required if you use the fa2 kernel
  • flash-attn-3 >= 3.0.0 is required if you use the fa3 kernel
  • vLLM >= 0.6.2 is required if you use INT8 quantization
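If you want to check an existing environment against these minimums before installing, a small script along the following lines may help. It is a sketch, not part of Flood: the `MINIMUMS` table, the helper names, and the simplified version parsing (only the leading numeric release, so suffixes such as `+cu124` or `.post2` are ignored) are our own assumptions. It covers only the unconditional Python packages; the CUDA version and the optional flash-attn and vLLM packages are not checked.

```python
import re
from importlib import metadata

# Minimum versions of the unconditional Python packages listed above.
MINIMUMS = {
    "torch": "2.5.0",
    "triton": "3.1.0",
    "accelerate": "1.4.0",
    "transformers": "4.54.0",
}


def release_tuple(version):
    """Parse the leading numeric release, e.g. '2.5.0+cu124' -> (2, 5, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"cannot parse version: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed, minimum):
    """Compare two versions by their numeric release, padding with zeros."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment(minimums=MINIMUMS):
    """Return a {package: status} report for the given minimum versions."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        status = "ok" if meets_minimum(installed, minimum) else f"below {minimum}"
        report[name] = f"{installed} ({status})"
    return report


if __name__ == "__main__":
    for name, status in check_environment().items():
        print(f"{name}: {status}")
```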

Quick Start

A simple example can be found in example/simple_example.py.

To reproduce the reported performance, run benchmark/bench_flood.py.

Acknowledgements

Flood is inspired by the FlashAttention 2 & 3, FasterTransformer, vLLM, and flashinfer projects.

Citations

[TBD]

@misc{zhao2025flood,
title={Flood: A throughput-oriented Inference Framework for Large Language Model with pipeline parallelism and segmentable cache},
author={Yao Zhao and Chen Liang and Jingyu Hu and Zixuan Cheng and Zhen Wang and Longfei Li}
}

Contact Us

For technical questions and feature requests, please use GitHub issues or discussions.