FastDeploy Parameter Documentation

June 2, 2026

Parameter Description

When deploying models with FastDeploy (offline inference or service deployment), the following parameters apply. For offline inference, use the parameter names exactly as listed below. When starting the service from the command line, replace each underscore (_) in the parameter name with a hyphen (-) and add the -- prefix; for example, max_model_len becomes --max-model-len.
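
The naming rule above can be sketched as a small helper. This is only an illustration of the underscore-to-hyphen convention; to_cli_flag is a hypothetical name and is not part of FastDeploy.

```python
def to_cli_flag(param_name: str) -> str:
    """Convert an offline-inference parameter name to its command-line flag.

    Hypothetical helper: it only applies the documented rule
    (underscores become hyphens, with a leading "--").
    """
    return "--" + param_name.replace("_", "-")


print(to_cli_flag("max_model_len"))           # --max-model-len
print(to_cli_flag("gpu_memory_utilization"))  # --gpu-memory-utilization
```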

| Parameter Name | Type | Description |
|---|---|---|
| port | int | Only required for service deployment. HTTP service port number, default: 8000 |
| metrics_port | int | Only required for service deployment. Metrics monitoring port number, default: None (shares the port with the main service) |
| max_waiting_time | int | Only required for service deployment. Maximum wait time for establishing a connection upon a service request, default: -1 (no wait time limit) |
| max_concurrency | int | Only required for service deployment. The actual number of connections established by the service, default: 512 |
| engine_worker_queue_port | list[int] | FastDeploy internal engine communication port list, auto-allocated based on data_parallel_size |
| cache_queue_port | list[int] | FastDeploy internal KVCache process communication port list, auto-allocated based on data_parallel_size |
| max_model_len | int | Maximum supported context length for inference, default: 2048 |
| max_completion_tokens | int | Server-level maximum allowed completion token length (hard cap). The per-request max_tokens is clamped to this value. Default: None (bounded by max_model_len - input_len) |
| reasoning_max_tokens | int | Server-level maximum allowed reasoning/thinking token length (hard cap). The per-request value is clamped to this value. Default: None (no cap) |
| response_max_tokens | int | Server-level maximum allowed response token length (hard cap). The per-request value is clamped to this value. Default: None (no cap) |
| min_completion_tokens | int | Server-level minimum generation length floor. Effective min_tokens = max(server_value, per-request value). Default: None (no floor) |
| input_max_tokens | int | Server-level maximum input token length. Requests with a prompt longer than this are rejected. Default: None (no limit, bounded by max_model_len) |
| tensor_parallel_size | int | Tensor parallelism degree for the model, default: 1 |
| data_parallel_size | int | Data parallelism degree for the model, default: 1 |
| block_size | int | KVCache management granularity (token count), recommended default: 64 |
| max_num_seqs | int | Maximum concurrency in the Decode phase, default: 8 |
| mm_processor_kwargs | dict[str] | Multimodal processor parameter configuration, e.g.: {"image_min_pixels": 3136, "video_fps": 2} |
| tokenizer | str | Tokenizer name or path, defaults to the model path |
| use_warmup | int | Whether to perform warmup at startup; maximum-length data is generated automatically for warmup. Default: 0 (disabled) |
| limit_mm_per_prompt | dict[str] | Limit on the amount of multimodal data per prompt, e.g.: {"image": 10, "video": 3}, default: 1 for all |
| enable_mm | bool | [DEPRECATED] Whether to support multimodal data (for multimodal models only). Multimodal models are detected automatically from the model architecture, so no manual setting is needed |
| quantization | str | Model quantization strategy. When loading a BF16 checkpoint (CKPT), specifying wint4, wint8, block_wise_fp8 or wfp8afp8 enables lossless online 4-bit/8-bit quantization of weights; KVCache is not quantized by default. If this parameter is parsed as a dictionary (dict), mix_quant (mixed quantization) can be specified, where dense_quant_type, moe_quant_type and kv_cache_quant_type specify the quantization types for DenseGEMM, MoE and KVCache respectively; no quantization is applied to a module whose parameter is not specified (e.g., '{"quantization":"mix_quant","dense_quant_type":"wint8","moe_quant_type":"wint4","kv_cache_quant_type":"block_wise_fp8"}'). Note: online quantization of KVCache to block_wise_fp8 is only supported by the AppendAttn backend |
| gpu_memory_utilization | float | GPU memory utilization, default: 0.9 |
| num_gpu_blocks_override | int | Number of preallocated KVCache blocks. FastDeploy can calculate this automatically based on available memory, so no user configuration is needed. Default: None |
| max_num_batched_tokens | int | Maximum batch token count in the Prefill phase, default: None (same as max_model_len) |
| kv_cache_ratio | float | KVCache blocks are divided between the Prefill phase and the Decode phase according to kv_cache_ratio, default: 0.75 |
| enable_prefix_caching | bool | Whether to enable Prefix Caching, default: True (on GPU/XPU/HPU platforms), False on other platforms |
| swap_space | float | When Prefix Caching is enabled, CPU memory size for KVCache swapping, unit: GB, default: None |
| enable_chunked_prefill | bool | Enable Chunked Prefill, default: False |
| max_num_partial_prefills | int | When Chunked Prefill is enabled, maximum number of concurrent partial prefill batches, default: 1 |
| max_long_partial_prefills | int | When Chunked Prefill is enabled, maximum number of long requests in concurrent partial prefill batches, default: 1 |
| long_prefill_token_threshold | int | When Chunked Prefill is enabled, requests with a token count exceeding this value are considered long requests, default: max_model_len*0.04 |
| static_decode_blocks | int | During inference, each request is forced to allocate this number of blocks from Prefill's KVCache for Decode use, default: 2 |
| reasoning_parser | str | Specifies the reasoning parser used to extract reasoning content from model output |
| use_cudagraph | bool | [DEPRECATED since version 2.3] CUDAGraph is enabled by default. It is now controlled via the use_cudagraph parameter in graph_optimization_config; see graph_optimization.md for details |
| graph_optimization_config | dict[str] | Configures parameters related to computation graph optimization, default: '{"use_cudagraph":true, "graph_opt_level":0}'. See graph_optimization.md for details |
| disable_custom_all_reduce | bool | Disable custom all-reduce, default: False |
| use_internode_ll_two_stage | bool | Use two-stage communication in DeepEP MoE, default: False |
| disable_sequence_parallel_moe | bool | Disable sequence-parallel MoE, default: False |
| splitwise_role | str | Whether to enable splitwise inference, default: mixed, supported values: ["mixed", "decode", "prefill"] |
| innode_prefill_ports | str | Internal engine startup ports for prefill instances (only required for single-machine PD separation), default: None |
| guided_decoding_backend | str | Specifies the guided decoding backend to use; supports auto, xgrammar, guidance, off. Default: off |
| guided_decoding_disable_any_whitespace | bool | Whether to disable whitespace generation during guided decoding, default: False |
| speculative_config | dict[str] | Speculative decoding configuration; only a standard-format JSON string is supported. Default: None |
| dynamic_load_weight | bool | Whether to enable dynamic weight loading, default: False |
| enable_expert_parallel | bool | Whether to enable expert parallelism, default: False |
| enable_logprob | bool | Whether to return log probabilities of the output tokens, default: False. If logprob is not used, this parameter can be omitted at startup |
| logprobs_mode | str | Specifies the content returned in logprobs, default: raw_logprobs. Supported modes: raw_logprobs, processed_logprobs, raw_logits, processed_logits. "Processed" means values after applying logit processors (temperature, penalties, bad words) |
| max_logprobs | int | Maximum number of log probabilities to return, default: 20. -1 means vocab_size |
| served_model_name | str | The model name used in the API. If not specified, the model name is the same as the --model argument |
| revision | str | The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, the default version is used |
| chat_template | str | Specifies the chat template used to assemble the model input. Both a template string and a file path are supported. Default: None (the model's default template is used) |
| tool_call_parser | str | Specifies the function call parser used to extract function call content from the model's output |
| tool_parser_plugin | str | Specifies the file path of a tool parser to register, so that parsers outside the code repository can be used. The code in these parsers must follow the format used in the code repository |
| load_choices | str | Weight loader selection, default: "default_v1". Supports "default", "default_v1", and "dummy". "default_v1" is used for loading torch weights and weight acceleration. "dummy" quickly initializes random weights for testing |
| model_loader_extra_config | dict[str] | Additional configuration options for the model loader. Supports: enable_multithread_load (bool): enable multi-threaded weight loading; num_threads (int): number of threads for loading, defaults to 8; disable_mmap (bool): disable memory-mapped file access, useful when mmap is not supported. Example: '{"enable_multithread_load": true, "num_threads": 8}' |
| max_encoder_cache | int | Maximum number of tokens in the encoder cache (use 0 to disable), default: -1 (auto-calculated) |
| max_processor_cache | float | Maximum size of the processor cache in GiB (use 0 to disable), default: -1 (auto-calculated) |
| api_key | list[str] | Validates API keys in the service request headers; multiple keys are supported. Same effect as the environment variable FD_API_KEY, with higher priority |
| enable_output_caching | bool | Whether to enable KV cache for output tokens; only valid with the V1 scheduler (ENABLE_V1_KVCACHE_SCHEDULER=1). Default: True |
| workers | int | Only required for service deployment. Number of API server worker processes, default: 1 |
| timeout | int | Only required for service deployment. Worker silent timeout (seconds); set to 0 to disable the timeout. Default: 0 |
| timeout_graceful_shutdown | int | Only required for service deployment. Graceful shutdown timeout (seconds); set to 0 for an infinite timeout. Default: 0 |
| router | str | Router server URL for request routing in splitwise deployment, e.g., http://127.0.0.1:8000 |
| disable_chunked_mm_input | bool | Disable chunked processing for multimodal inputs, default: False |
| logits_processors | list[str] | List of fully qualified class names (FQCN) of logits processors supported by the service, e.g., fastdeploy.model_executor.logits_processor:LogitBiasLogitsProcessor |
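
The descriptions of max_completion_tokens and min_completion_tokens above describe how server-level limits combine with per-request values. The following is a minimal sketch of that arithmetic, not FastDeploy's implementation: both function names are hypothetical, and treating "no floor" as 0 is an assumption.

```python
from typing import Optional


def effective_max_tokens(request_max_tokens: Optional[int],
                         max_completion_tokens: Optional[int],
                         max_model_len: int,
                         input_len: int) -> int:
    """Clamp the per-request max_tokens to the server-level hard cap."""
    # The context window always bounds generation: max_model_len - input_len.
    limit = max_model_len - input_len
    if max_completion_tokens is not None:
        limit = min(limit, max_completion_tokens)
    if request_max_tokens is not None:
        limit = min(limit, request_max_tokens)
    return limit


def effective_min_tokens(request_min_tokens: Optional[int],
                         min_completion_tokens: Optional[int]) -> int:
    """Effective min_tokens = max(server_value, per-request value)."""
    values = [v for v in (request_min_tokens, min_completion_tokens) if v is not None]
    # Assumption: with neither value set, there is no floor (0).
    return max(values) if values else 0


print(effective_max_tokens(4096, 1024, max_model_len=2048, input_len=500))  # 1024
print(effective_max_tokens(None, None, max_model_len=2048, input_len=500))  # 1548
print(effective_min_tokens(10, 32))                                         # 32
```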

1. Relationship between KVCache allocation, num_gpu_blocks_override and block_size?

During FastDeploy inference, GPU memory is occupied by model weights, preallocated KVCache blocks and model computation intermediate activation values. The preallocated KVCache blocks are determined by num_gpu_blocks_override, with block_size (default: 64) as its unit, meaning one block can store KVCache for 64 Tokens.

In actual inference, it's difficult for users to know how to properly configure num_gpu_blocks_override, so FastDeploy uses the following method to automatically derive and configure this value:

  • Load the model. Once loading completes, record the current memory usage total_memory_after_load and the FastDeploy framework memory usage fd_memory_after_load. The former is the actual GPU memory usage (which may include other processes); the latter is the memory used by the FD framework itself.

  • According to the user-configured max_num_batched_tokens (default: max_model_len), run a fake prefill computation on input data of that length and record the peak FastDeploy framework memory allocation fd_memory_after_prefill. The memory taken by intermediate activation values can then be estimated as fd_memory_after_prefill - fd_memory_after_load.

    • At this point, the GPU memory available for KVCache allocation (taking an A800 80G as an example) is 80GB * gpu_memory_utilization - total_memory_after_load - (fd_memory_after_prefill - fd_memory_after_load).
    • Based on the model's KVCache precision (e.g. 8-bit/16-bit), calculate the memory size per block, then calculate the total number of allocatable blocks and assign it to num_gpu_blocks_override.
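
The derivation above can be sketched in a few lines. This is an illustration under stated assumptions, not FastDeploy's actual code: the function names are hypothetical, the per-block size formula (one K and one V tensor per layer) is a common layout assumed here, the model shape and memory readings are made-up example numbers, and 80 GiB is used for the "80GB" card.

```python
GIB = 1024 ** 3


def kv_bytes_per_block(num_layers: int, num_kv_heads: int, head_dim: int,
                       block_size: int, bytes_per_value: int) -> int:
    """Memory for one KVCache block.

    Assumption: each layer stores one K and one V tensor, hence the factor 2.
    """
    return 2 * num_layers * num_kv_heads * head_dim * block_size * bytes_per_value


def derive_num_gpu_blocks(total_gpu_memory: float,
                          gpu_memory_utilization: float,
                          total_memory_after_load: float,
                          fd_memory_after_load: float,
                          fd_memory_after_prefill: float,
                          bytes_per_block: int) -> int:
    """Apply the formula from the text to get the allocatable block count."""
    # Intermediate activations measured during the fake prefill.
    activation = fd_memory_after_prefill - fd_memory_after_load
    available = (total_gpu_memory * gpu_memory_utilization
                 - total_memory_after_load - activation)
    return max(int(available // bytes_per_block), 0)


# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, 16-bit KVCache.
per_block = kv_bytes_per_block(32, 8, 128, block_size=64, bytes_per_value=2)
blocks = derive_num_gpu_blocks(
    total_gpu_memory=80 * GIB,
    gpu_memory_utilization=0.9,
    total_memory_after_load=16 * GIB,   # made-up reading
    fd_memory_after_load=15 * GIB,      # made-up reading
    fd_memory_after_prefill=19 * GIB,   # made-up reading
    bytes_per_block=per_block,
)
print(per_block, blocks)
```

With these example numbers, each block takes 8 MiB and 52 GiB remains for KVCache after weights and activations.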

In the service startup log log/fastdeploy.log, you can find a line such as Reset block num, the total_block_num:17220, prefill_kvcache_block_num:12915, where total_block_num is the automatically calculated KVCache block count. Multiply it by block_size to get the total number of cacheable tokens.

2. Relationship between kv_cache_ratio, block_size and max_num_seqs?

  • FastDeploy divides the KVCache between the Prefill and Decode phases according to kv_cache_ratio. A reasonable starting point is kv_cache_ratio = average input tokens / (average input tokens + average output tokens). Input is typically about 3x the output, which gives 0.75.
  • max_num_seqs is the maximum concurrency in the Decode phase. It can generally be set as high as 128, but it can also be derived from the available KVCache. The number of tokens the Decode phase can cache is decode_token_cache = total_block_num * (1 - kv_cache_ratio) * block_size. To avoid OOM in extreme cases, set max_num_seqs = decode_token_cache / average output tokens, without exceeding 128.
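
As a worked example of the two formulas above, the sketch below plugs in the total_block_num value from the sample log line (17220) and the default block_size of 64. The function names and the average token counts are hypothetical; only the formulas come from the text.

```python
def suggest_kv_cache_ratio(avg_input_tokens: float, avg_output_tokens: float) -> float:
    """kv_cache_ratio = average input / (average input + average output)."""
    return avg_input_tokens / (avg_input_tokens + avg_output_tokens)


def suggest_max_num_seqs(total_block_num: int, kv_cache_ratio: float,
                         block_size: int, avg_output_tokens: int,
                         upper_bound: int = 128) -> int:
    """max_num_seqs = decode_token_cache / average output tokens, capped at 128."""
    decode_token_cache = total_block_num * (1 - kv_cache_ratio) * block_size
    return max(1, min(upper_bound, int(decode_token_cache // avg_output_tokens)))


ratio = suggest_kv_cache_ratio(avg_input_tokens=3000, avg_output_tokens=1000)
print(ratio)                                               # 0.75
print(suggest_max_num_seqs(17220, ratio, 64, 1000))        # 128 (capped)
print(suggest_max_num_seqs(17220, ratio, 64, 4096))        # 67
```

With 17220 blocks, the Decode phase can cache 17220 * 0.25 * 64 = 275,520 tokens, so short outputs hit the 128 cap while long outputs reduce the suggested concurrency.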

3. enable_chunked_prefill parameter description

When enable_chunked_prefill is enabled, the service processes long input sequences through dynamic chunking, which significantly improves GPU resource utilization. In this mode, max_num_batched_tokens no longer constrains the batch token count of the prefill phase (it limits the token count of a single prefill), so the max_num_partial_prefills parameter is introduced specifically to limit the number of partial batches processed concurrently.

To give short requests scheduling priority, the max_long_partial_prefills and long_prefill_token_threshold parameters are used together. The former limits the number of long requests in a single prefill batch; the latter defines the token threshold above which a request counts as long. The system gives batch space to short requests first, which reduces short-request latency in mixed workloads while keeping throughput stable.
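
The interaction of the three parameters can be illustrated with a simplified selection loop. This is a sketch of the idea only, and FastDeploy's real scheduler may behave differently: Request and pick_partial_prefills are hypothetical names, and truncating max_model_len*0.04 to an integer is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    request_id: str
    prompt_tokens: int


def pick_partial_prefills(waiting: List[Request],
                          max_model_len: int,
                          max_num_partial_prefills: int = 1,
                          max_long_partial_prefills: int = 1,
                          long_prefill_token_threshold: Optional[int] = None) -> List[Request]:
    """Select requests for concurrent partial prefill, capping long requests."""
    if long_prefill_token_threshold is None:
        # Documented default: max_model_len*0.04 (integer truncation assumed).
        long_prefill_token_threshold = int(max_model_len * 0.04)

    selected: List[Request] = []
    long_count = 0
    for req in waiting:
        if len(selected) == max_num_partial_prefills:
            break
        if req.prompt_tokens > long_prefill_token_threshold:
            if long_count == max_long_partial_prefills:
                continue  # long-request budget used up; leave room for short ones
            long_count += 1
        selected.append(req)
    return selected


queue = [Request("a", 4000), Request("b", 5000), Request("c", 100), Request("d", 50)]
chosen = pick_partial_prefills(queue, max_model_len=8192,
                               max_num_partial_prefills=3,
                               max_long_partial_prefills=1)
print([r.request_id for r in chosen])  # ['a', 'c', 'd']
```

Here the second long request ("b") is skipped so that the two short requests behind it are not delayed.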

4. api_key parameter description

To configure multiple keys at startup, repeat the --api-key flag. This takes precedence over the environment variable configuration.

  --api-key "key1"
  --api-key "key2"

To configure multiple keys through the environment variable, separate them with commas:

  export FD_API_KEY="key1,key2"

When making requests with curl, add the Authorization header. A request passes validation if its key matches any configured api_key.

curl -X POST "http://0.0.0.0:8265/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer key1" \
-d '{
  "messages": [
    {"role": "user", "content":"你好"}
  ],
  "stream": false,
  "return_token_ids": true,
  "chat_template_kwargs": {"enable_thinking": true}
}'

The system extracts key1 from the Authorization: Bearer header and validates it against the configured keys.
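
The precedence and matching rules described in this section can be sketched as follows. This is not FastDeploy's implementation: resolve_api_keys and is_authorized are hypothetical names, and allowing every request when no key is configured is an assumption.

```python
from typing import List, Mapping, Optional, Sequence


def resolve_api_keys(cli_keys: Optional[Sequence[str]],
                     env: Mapping[str, str]) -> List[str]:
    """Return the active keys; --api-key values override FD_API_KEY."""
    if cli_keys:
        return list(cli_keys)
    raw = env.get("FD_API_KEY", "")
    # The environment variable holds comma-separated keys.
    return [key.strip() for key in raw.split(",") if key.strip()]


def is_authorized(authorization_header: Optional[str],
                  valid_keys: Sequence[str]) -> bool:
    """Check an 'Authorization: Bearer <key>' header against the active keys."""
    if not valid_keys:
        return True  # assumption: no configured keys means validation is off
    if not authorization_header:
        return False
    scheme, _, token = authorization_header.partition(" ")
    # Any matching key passes.
    return scheme == "Bearer" and token.strip() in valid_keys


keys = resolve_api_keys(None, {"FD_API_KEY": "key1,key2"})
print(keys)                                  # ['key1', 'key2']
print(is_authorized("Bearer key1", keys))    # True
print(is_authorized("Bearer other", keys))   # False
```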

When using the openai SDK for requests, pass the api_key parameter:

from openai import OpenAI

# Pass one of the keys configured via --api-key or FD_API_KEY.
client = OpenAI(
    api_key="your-api-key-here",
    base_url="http://localhost:8000/v1"
)