Changelog

November 27, 2025

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[Unreleased]

[1.2.0] - 2025-11-25

Feature

Universal:

  • Add support for distributed sharing of the KV Cache, to support KV Cache sharing between CPU and SSD, as well as distributed sharing of PCFS (#17)
  • Add GDS (GPU Direct Storage) Support (#25)
  • TP16 support (#26)
  • Support more KV cache layouts. Now includes: vLLM, SGLang, TensorRT-LLM (#27)
  • GDS refactor & gtensor support (#42)
  • Support constructing TensorSharedHandle directly from a CUDA IPC handle (#44)

Targeting vLLM:

  • Support dp > 1 when integrated with vLLM (#18)
  • Add launch scripts for vLLM adaptation (#47)
  • Support TP16 for vLLM+FlexKV (#59)

Targeting TensorRT-LLM:

  • Support using FlexKV on TensorRT-LLM (#48)
  • Support TP16 for TensorRT-LLM+FlexKV (#53)

Optimization

  • MLA D2H transfer optimization (#19)
  • Optimize SSD I/O (#33)
  • Enhance cache eviction with frequency-aware grace time mechanism (#38)
  • Replace std::map with std::unordered_map in RadixTree (#41)

Bugfix

  • Fix wrong head number for DeepSeek in the vLLM integration (#23)
  • Fix an error on put when the CPU match length is greater than the SSD match length (#24)
  • Fix benchmark_worker (#31)
  • Fix segfault caused by radix tree array out-of-bounds access (#39)
  • Fix cache_info (#40)
  • Fix port for GPU registration (#45)
  • Fix SSD allocator (#46)
  • Fix num_kv_heads bug in vLLM init (#67)
  • Fix model_config for non-MLA models (#68)

Misc

  • Add doc for: FlexKV + TensorRT-LLM (#52)
  • Config: simplify user configuration (#37) and other minor updates (#43)

[1.1.0] - 2025-09-15

  • Add op-level callback for local get/put (#13)
  • Add doc for: FlexKV + Dynamo (#14), flexkv_config.json (#15)

[1.0.0] - 2025-09-11

Added

  • C++ radix tree for fast matching; requires setting "index_accel": true in cache_config
  • Sync kernel launch
  • A major change that moves the cache engine into a library for accelerators (e.g., vLLM) to use directly, instead of server-client mode. This accelerates get and put when no KVCache is matched. This version includes breaking API changes and is not backward compatible.
  • Add evict_ratio; requires setting "evict_ratio": 0.05 in cache_config
  • Reduce the bubble inside the launch kernel
  • Add vLLM 0.10.1.1 adapter
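
The two cache_config settings named above can be sketched as a config fragment. This is only an illustration: the keys "index_accel" and "evict_ratio" and their values come from the entries above, while the enclosing "cache_config" object and its JSON serialization are assumptions, not a confirmed FlexKV file layout.

```python
import json

# Hypothetical cache_config fragment. Only the two keys and their values
# are taken from this changelog; the enclosing structure is an assumption.
cache_config = {
    "index_accel": True,  # setting quoted for the C++ radix tree entry
    "evict_ratio": 0.05,  # value quoted for the evict_ratio entry
}

# Serialize the fragment as JSON, e.g. for merging into a larger config file.
config_json = json.dumps({"cache_config": cache_config}, indent=2)
print(config_json)
```

Check the project's own configuration documentation for where these keys actually live before copying this fragment.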

Fixed

  • Cython release package

[0.1.0] - 2025-08-29

Init

  • Initial version
  • Add license