Prefix Cache (KV Reuse)
May 19, 2026 ยท View on GitHub
Prefix cache lets candle-vllm reuse KV cache blocks from prior requests when new prompts share a prefix.
How it works
- Finished sequences contribute reusable full KV blocks.
- New requests find the longest cached block-aligned prefix.
- Remaining tokens are prefetched normally after the reused prefix.
Prefix cache is block-granular. If a shared prefix ends in the middle of a block, the remainder of that block is recomputed. When a prompt is fully cached at block boundaries, the last block is recomputed so the request still has a non-empty prefill step.
Flags
- Prefix cache is enabled by default. Use
--disable-prefix-cacheto turn it off. --prefix-cache-max-tokens <N>: cap cache size in tokens, rounded down to block size
If --prefix-cache-max-tokens is omitted, the cache defaults to roughly 50% of GPU KV blocks in this project.
Usage Reporting
OpenAI-compatible chat responses include prefix-cache and reasoning token details when they are non-zero:
{
"usage": {
"prompt_tokens": 128,
"completion_tokens": 64,
"total_tokens": 192,
"prompt_time_costs": 4,
"completion_time_costs": 250,
"prompt_tokens_details": {
"cached_tokens": 64
},
"completion_tokens_details": {
"reasoning_tokens": 32
}
}
}
prompt_tokens_details.cached_tokens reports the number of prompt tokens reused from the prefix cache. completion_tokens_details.reasoning_tokens reports generated tokens inside reasoning blocks such as <think>...</think>. Both detail objects are omitted when their count is zero.
Hybrid Mamba snapshot stride
For hybrid Mamba models, prefix reuse also needs compatible snapshot boundaries.
Use CANDLE_VLLM_MAMBA_SNAPSHOT_STRIDE_BLOCKS to control sparse decode-time snapshot capture (larger stride size useful for limited GPU memory).
export CANDLE_VLLM_MAMBA_SNAPSHOT_STRIDE_BLOCKS=1
Behavior:
- Default:
1 - Minimum valid value:
1 - Effective token stride:
block_size * stride
Example with block_size=64 and stride 8:
- snapshot boundary every
512tokens - hybrid prefix reuse aligns to the nearest captured boundary
This setting only affects decode-time sparse snapshot capture. Prompt/prefill snapshot capture remains dense.
Notes
- Prefix cache shares the same KV memory pool as active sequences.
- Larger caches reduce maximum concurrent live tokens.
- Sliding-window attention limits how much cached context is effectively reused.