Changelog

November 27, 2025

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[Unreleased]

Universal:

Add support for distributed sharing of the KV cache, to support KV cache sharing between CPU and SSD, as well as distributed sharing of PCFS (#17)
Add GDS (GPU Direct Storage) Support (#25)
TP16 support (#26)
Support more KV cache layouts; now includes vLLM, SGLang, and TensorRT-LLM (#27)
GDS refactor & gtensor support (#42)
Support constructing TensorSharedHandle directly from a CUDA IPC handle (#44)

Targeting vLLM:

Targeting TensorRT-LLM:

Fix wrong head number for DeepSeek in the vLLM integration (#23)
Fix error on put when the CPU match length is larger than the SSD match length (#24)
Fix benchmark_worker (#31)
Fix segfault caused by radix tree array out-of-bounds access (#39)
Fix cache_info (#40)
Fix port for GPU registration (#45)
Fix SSD allocator (#46)
Fix num_kv_heads bug in vLLM init (#67)
Fix model_config for non-MLA models (#68)

Add C++ radix tree for fast matching; requires setting "index_accel": true in cache_config
Sync kernel launch
Major change: move the cache engine into a library for accelerators (e.g., vLLM) to use instead of server-client mode. This accelerates get and put when no KV cache is matched. This version includes breaking API changes and is not backward compatible.
Add evict_ratio; requires setting "evict_ratio": 0.05 in cache_config
Reduce the bubble inside the launch kernel
Add vLLM 0.10.1.1 adapter
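
The two cache_config settings named in the entries above could be combined roughly as follows. This is a minimal sketch, not the project's documented configuration: only the keys "index_accel" and "evict_ratio" and their values come from this changelog, while the variable names and the way the dictionary is passed to the library are assumptions.

```python
import json

# Only these two keys appear in the changelog. Any other cache_config
# fields, and how the config is handed to the library, are not shown
# here and are left out on purpose.
cache_config = {
    "index_accel": True,   # opt in to the C++ radix tree for fast matching
    "evict_ratio": 0.05,   # value quoted in the changelog entry
}

# JSON form, matching the "key": value notation used in the entries above.
cache_config_json = json.dumps(cache_config, indent=2)
print(cache_config_json)
```

Both keys are opt-in according to the entries, so a config that omits them presumably keeps the previous behavior; check the project's own documentation before relying on that.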