Model Configuration

February 6, 2026

Model Parameters

The following tables show the parameters in the config.pbtxt of the models in all_models/inflight_batcher_llm that can be modified before deployment. For optimal performance or custom parameters, please refer to perf_best_practices.

The names of the parameters listed below are the values in the config.pbtxt that can be modified using the fill_template.py script.

NOTE: For fields whose value contains a comma (e.g. gpu_device_ids, participant_ids), you need to escape the comma with a backslash. For example, to set gpu_device_ids to 0,1, run python3 fill_template.py -i config.pbtxt "gpu_device_ids:0\,1".
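
To illustrate why the backslash is needed, here is a minimal sketch of how a substitution string could be split on unescaped commas. It assumes substitutions are passed as comma-separated key:value pairs; parse_substitutions is a hypothetical helper written for this example and is not the actual fill_template.py implementation.

```python
import re


def parse_substitutions(spec: str) -> dict[str, str]:
    """Split 'key:value,key:value' on unescaped commas (illustrative only)."""
    result = {}
    # Split only on commas that are NOT preceded by a backslash.
    for pair in re.split(r"(?<!\\),", spec):
        key, value = pair.split(":", 1)
        # An escaped comma becomes a literal comma inside the value.
        result[key] = value.replace("\\,", ",")
    return result


subs = parse_substitutions(r"gpu_device_ids:0\,1,triton_max_batch_size:64")
print(subs)  # {'gpu_device_ids': '0,1', 'triton_max_batch_size': '64'}
```

Without the backslash, the string would be split after 0, and the trailing 1 would no longer be a valid key:value pair.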

The mandatory parameters must be set for the model to run. The optional parameters are not required but can be set to customize the model.

ensemble model

See here to learn more about ensemble models.

Mandatory parameters

| Name | Description |
| --- | --- |
| triton_max_batch_size | The maximum batch size that the Triton model instance will run with. Note that for the tensorrt_llm model, the actual runtime batch size can be larger than triton_max_batch_size. The runtime batch size is determined by the TRT-LLM scheduler based on a number of parameters, such as the number of available requests in the queue and the trtllm-build parameters used to build the engine (such as max_num_tokens and max_batch_size). |
| logits_datatype | The data type for context and generation logits. |

preprocessing model

Mandatory parameters

| Name | Description |
| --- | --- |
| triton_max_batch_size | The maximum batch size that Triton should use with the model. |
| tokenizer_dir | The path to the tokenizer for the model. |
| preprocessing_instance_count | The number of instances of the model to run. |
| max_queue_delay_microseconds | The maximum queue delay in microseconds. Setting this parameter to a value greater than 0 can improve the chances that two requests arriving within max_queue_delay_microseconds will be scheduled in the same TRT-LLM iteration. |
| max_queue_size | The maximum number of requests allowed in the TRT-LLM queue before rejecting new requests. |

Optional parameters

| Name | Description |
| --- | --- |
| add_special_tokens | The add_special_tokens flag used by HF tokenizers. |
| multimodal_model_path | The vision engine path used in multimodal workflow. |
| engine_dir | The path to the engine for the model. This parameter is only needed for multimodal processing to extract the vocab_size from the engine_dir's config.json for fake_prompt_id mappings. |

multimodal_encoders model

Mandatory parameters

| Name | Description |
| --- | --- |
| triton_max_batch_size | The maximum batch size that Triton should use with the model. |
| max_queue_delay_microseconds | The maximum queue delay in microseconds. Setting this parameter to a value greater than 0 can improve the chances that two requests arriving within max_queue_delay_microseconds will be scheduled in the same TRT-LLM iteration. |
| max_queue_size | The maximum number of requests allowed in the TRT-LLM queue before rejecting new requests. |
| multimodal_model_path | The vision engine path used in multimodal workflow. |
| hf_model_path | The Huggingface model path used for llava_onevision and mllama models. |

postprocessing model

Mandatory parameters

| Name | Description |
| --- | --- |
| triton_max_batch_size | The maximum batch size that Triton should use with the model. |
| tokenizer_dir | The path to the tokenizer for the model. |
| postprocessing_instance_count | The number of instances of the model to run. |

Optional parameters

| Name | Description |
| --- | --- |
| skip_special_tokens | The skip_special_tokens flag used by HF detokenizers. |

tensorrt_llm model

The majority of the tensorrt_llm model parameters and input/output tensors can be mapped to parameters in the TRT-LLM C++ runtime API defined in executor.h. Please refer to the Doxygen comments in executor.h for a more detailed description of the parameters below.

Mandatory parameters

| Name | Description |
| --- | --- |
| triton_backend | The backend to use for the model. Set to tensorrtllm to utilize the C++ TRT-LLM backend implementation. Set to python to utilize the TRT-LLM Python runtime. |
| triton_max_batch_size | The maximum batch size that the Triton model instance will run with. Note that for the tensorrt_llm model, the actual runtime batch size can be larger than triton_max_batch_size. The runtime batch size is determined by the TRT-LLM scheduler based on a number of parameters, such as the number of available requests in the queue and the trtllm-build parameters used to build the engine (such as max_num_tokens and max_batch_size). |
| decoupled_mode | Whether to use decoupled mode. Must be set to true for requests setting the stream tensor to true. |
| max_queue_delay_microseconds | The maximum queue delay in microseconds. Setting this parameter to a value greater than 0 can improve the chances that two requests arriving within max_queue_delay_microseconds will be scheduled in the same TRT-LLM iteration. |
| max_queue_size | The maximum number of requests allowed in the TRT-LLM queue before rejecting new requests. |
| engine_dir | The path to the engine for the model. |
| batching_strategy | The batching strategy to use. Set to inflight_fused_batching when enabling in-flight batching support. To disable in-flight batching, set to V1. |
| encoder_input_features_data_type | The dtype for the input tensor encoder_input_features. For the mllama model, this must be TYPE_BF16. For other models like whisper, this is TYPE_FP16. |
| logits_datatype | The data type for context and generation logits. |

Optional parameters

  • General
| Name | Description |
| --- | --- |
| encoder_engine_dir | When running encoder-decoder models, this is the path to the folder that contains the model configuration and engine for the encoder model. |
| max_attention_window_size | When using techniques like sliding window attention, the maximum number of tokens that are attended to in order to generate one token. By default, all tokens in the sequence are attended to. (default=max_sequence_length) |
| sink_token_length | Number of sink tokens to always keep in the attention window. |
| exclude_input_in_output | Set to true to only return completion tokens in a response. Set to false to return the prompt tokens concatenated with the generated tokens. (default=false) |
| cancellation_check_period_ms | The time for the cancellation check thread to sleep before doing the next check. It checks whether any of the currently active requests have been cancelled through Triton and prevents further execution of them. (default=100) |
| stats_check_period_ms | The time for the statistics reporting thread to sleep before doing the next check. (default=100) |
| recv_poll_period_ms | The time for the receiving thread in orchestrator mode to sleep before doing the next check. (default=0) |
| iter_stats_max_iterations | The maximum number of iterations for which to keep statistics. (default=ExecutorConfig::kDefaultIterStatsMaxIterations) |
| request_stats_max_iterations | The maximum number of iterations for which to keep per-request statistics. (default=executor::kDefaultRequestStatsMaxIterations) |
| normalize_log_probs | Controls whether log probabilities are normalized. Set to false to skip normalization of output_log_probs. (default=true) |
| gpu_device_ids | Comma-separated list of GPU IDs to use for this model. Use semicolons to separate multiple instances of the model. If not provided, the model will use all visible GPUs. (default=unspecified) |
| participant_ids | Comma-separated list of MPI ranks to use for this model. Mandatory when using orchestrator mode with -disable-spawn-process. (default=unspecified) |
| num_nodes | Number of MPI nodes to use for this model. (default=1) |
| gpu_weights_percent | Set to a number between 0.0 and 1.0 to specify the percentage of weights that reside on the GPU instead of on the CPU and being streamed in at runtime. Values less than 1.0 are only supported for an engine built with weight_streaming on. (default=1.0) |
  • KV cache

Note that the parameter enable_trt_overlap has been removed from the config.pbtxt. This option allowed the execution of two micro-batches to overlap in order to hide CPU overhead. Optimization work has since reduced the CPU overhead, and overlapping micro-batches was found to provide no additional benefit.

| Name | Description |
| --- | --- |
| max_tokens_in_paged_kv_cache | The maximum size of the KV cache in number of tokens. If unspecified, the value is interpreted as 'infinite'. The KV cache allocation is the minimum of max_tokens_in_paged_kv_cache and the value derived from kv_cache_free_gpu_mem_fraction below. (default=unspecified) |
| kv_cache_free_gpu_mem_fraction | Set to a number between 0 and 1 to indicate the maximum fraction of GPU memory (after loading the model) that may be used for KV cache. (default=0.9) |
| cross_kv_cache_fraction | Set to a number between 0 and 1 to indicate the maximum fraction of KV cache that may be used for cross attention; the rest will be used for self attention. This parameter should be set for encoder-decoder models ONLY. (default=0.5) |
| kv_cache_host_memory_bytes | Enable offloading to host memory for the given byte size. |
| enable_kv_cache_reuse | Set to true to reuse previously computed KV cache values (e.g. for a system prompt). |
  • LoRA cache
| Name | Description |
| --- | --- |
| lora_cache_optimal_adapter_size | Optimal adapter size used to size cache pages. Typically, optimally sized adapters will fit exactly into 1 cache page. (default=8) |
| lora_cache_max_adapter_size | Used to set the minimum size of a cache page. Pages must be at least large enough to fit a single module, single layer adapter_size maxAdapterSize row of weights. (default=64) |
| lora_cache_gpu_memory_fraction | Fraction of GPU memory used for LoRA cache. Computed as a fraction of the memory left over after the engine is loaded and after the KV cache is loaded. (default=0.05) |
| lora_cache_host_memory_bytes | Size of host LoRA cache in bytes. (default=1G) |
| lora_prefetch_dir | Folder to store the LoRA weights we hope to load during engine initialization. |
  • Decoding mode
| Name | Description |
| --- | --- |
| max_beam_width | The beam width value of requests that will be sent to the executor. (default=1) |
| decoding_mode | Set to one of the following: {top_k, top_p, top_k_top_p, beam_search, medusa, redrafter, lookahead, eagle} to select the decoding mode. The top_k mode exclusively uses the Top-K algorithm for sampling. The top_p mode exclusively uses the Top-P algorithm for sampling. The top_k_top_p mode employs both Top-K and Top-P algorithms, depending on the runtime sampling params of the request. Note that the top_k_top_p option requires more memory and has a longer runtime than using top_k or top_p individually; therefore, it should be used only when necessary. beam_search uses the beam search algorithm. If not specified, the default is top_k_top_p if max_beam_width == 1; otherwise, beam_search is used. When a Medusa model is used, the medusa decoding mode should be set. However, TensorRT-LLM detects a loaded Medusa model and overwrites the decoding mode to medusa with a warning. The same applies to ReDrafter, Lookahead and Eagle. |
  • Optimization
| Name | Description |
| --- | --- |
| enable_chunked_context | Set to true to enable context chunking. (default=false) |
| multi_block_mode | Set to false to disable multi block mode. (default=true) |
| enable_context_fmha_fp32_acc | Set to true to enable FMHA runner FP32 accumulation. (default=false) |
| cuda_graph_mode | Set to true to enable cuda graph. (default=false) |
| cuda_graph_cache_size | Sets the size of the CUDA graph cache, in numbers of CUDA graphs. (default=0) |
  • Scheduling
| Name | Description |
| --- | --- |
| batch_scheduler_policy | Set to max_utilization to greedily pack as many requests as possible in each current in-flight batching iteration. This maximizes the throughput but may result in overheads due to request pause/resume if KV cache limits are reached during execution. Set to guaranteed_no_evict to guarantee that a started request is never paused. (default=guaranteed_no_evict) |
  • Medusa
| Name | Description |
| --- | --- |
| medusa_choices | To specify Medusa choices tree in the format of e.g. "{0, 0, 0}, {0, 1}". By default, mc_sim_7b_63 choices are used. |
  • Eagle
| Name | Description |
| --- | --- |
| eagle_choices | To specify default per-server Eagle choices tree in the format of e.g. "{0, 0, 0}, {0, 1}". By default, mc_sim_7b_63 choices are used. |
  • Guided decoding
| Name | Description |
| --- | --- |
| guided_decoding_backend | Set to xgrammar to activate the guided decoder. |
| tokenizer_dir | Guided decoding with the tensorrt_llm python backend requires the tokenizer's information. |
| xgrammar_tokenizer_info_path | Guided decoding with the tensorrt_llm C++ backend requires xgrammar's tokenizer info in 'json' format. |
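
Two of the defaults described in the optional parameters above are derived from other values: the KV cache size (the minimum of max_tokens_in_paged_kv_cache and the value derived from kv_cache_free_gpu_mem_fraction) and the default decoding_mode (which depends on max_beam_width). The sketch below restates both rules in Python. It is a simplified illustration, not the TRT-LLM implementation: free_gpu_bytes and bytes_per_token are hypothetical inputs standing in for quantities the runtime computes internally.

```python
from typing import Optional


def effective_kv_cache_tokens(
    free_gpu_bytes: int,          # hypothetical: free GPU memory after model load
    bytes_per_token: int,         # hypothetical: KV cache bytes needed per token
    kv_cache_free_gpu_mem_fraction: float = 0.9,
    max_tokens_in_paged_kv_cache: Optional[int] = None,
) -> int:
    """Return the smaller of the explicit token cap and the memory-derived cap."""
    budget_bytes = int(free_gpu_bytes * kv_cache_free_gpu_mem_fraction)
    tokens_from_fraction = budget_bytes // bytes_per_token
    if max_tokens_in_paged_kv_cache is None:
        # Unspecified is interpreted as 'infinite', so only memory limits apply.
        return tokens_from_fraction
    return min(max_tokens_in_paged_kv_cache, tokens_from_fraction)


def default_decoding_mode(max_beam_width: int = 1) -> str:
    """Default used when decoding_mode is not specified."""
    return "top_k_top_p" if max_beam_width == 1 else "beam_search"


# With 1 GiB free, 1 MiB per token and a fraction of 0.5, memory allows 512 tokens.
print(effective_kv_cache_tokens(1 << 30, 1 << 20, 0.5))        # 512
print(effective_kv_cache_tokens(1 << 30, 1 << 20, 0.5, 100))   # 100
print(default_decoding_mode(4))                                # beam_search
```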

tensorrt_llm_bls model

See here to learn more about BLS models.

Mandatory parameters

| Name | Description |
| --- | --- |
| triton_max_batch_size | The maximum batch size that the model can handle. |
| decoupled_mode | Whether to use decoupled mode. |
| bls_instance_count | The number of instances of the model to run. When using the BLS model instead of the ensemble, you should set the number of model instances to the maximum batch size supported by the TRT engine to allow concurrent request execution. |
| logits_datatype | The data type for context and generation logits. |

Optional parameters

  • General
| Name | Description |
| --- | --- |
| accumulate_tokens | Used in the streaming mode to call the postprocessing model with all accumulated tokens, instead of only one token. This might be necessary for certain tokenizers. |
  • Speculative decoding

The BLS model supports speculative decoding. The target and draft Triton models are set with the parameters tensorrt_llm_model_name and tensorrt_llm_draft_model_name. Speculative decoding is performed by setting num_draft_tokens in the request. use_draft_logits may be set to use logits comparison speculative decoding. Note that return_generation_logits and return_context_logits are not supported when using speculative decoding. Also note that requests with a batch size greater than 1 are not currently supported with speculative decoding.

| Name | Description |
| --- | --- |
| tensorrt_llm_model_name | The name of the TensorRT-LLM model to use. |
| tensorrt_llm_draft_model_name | The name of the TensorRT-LLM draft model to use. |
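
As an illustration of the request-side settings, the sketch below builds a JSON payload for a speculative decoding request to the tensorrt_llm_bls model and enforces the restrictions stated above. The field names come from the input tables in this document. build_speculative_request is a hypothetical helper, and the assumption that the payload would be sent to Triton's HTTP generate endpoint for the BLS model is mine, not something this page specifies.

```python
import json


def build_speculative_request(prompt: str, max_tokens: int, num_draft_tokens: int,
                              use_draft_logits: bool = False, **extra) -> dict:
    """Build a single-prompt request body that enables speculative decoding."""
    if num_draft_tokens < 1:
        raise ValueError("num_draft_tokens must be at least 1")
    # Logits outputs are not supported together with speculative decoding.
    for unsupported in ("return_generation_logits", "return_context_logits"):
        if extra.get(unsupported):
            raise ValueError(f"{unsupported} is not supported with speculative decoding")
    payload = {
        "text_input": prompt,            # one prompt only: batch size must be 1
        "max_tokens": max_tokens,
        "num_draft_tokens": num_draft_tokens,
        "use_draft_logits": use_draft_logits,
    }
    payload.update(extra)
    return payload


payload = build_speculative_request("What is in-flight batching?", 64, num_draft_tokens=5)
print(json.dumps(payload, indent=2))
```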

Model Input and Output

Below are the lists of input and output tensors for the tensorrt_llm and tensorrt_llm_bls models.

Common Inputs

| Name | Shape | Type | Description |
| --- | --- | --- | --- |
| end_id | [1] | int32 | End token ID. If not specified, defaults to -1 |
| pad_id | [1] | int32 | Padding token ID |
| temperature | [1] | float32 | Sampling Config param: temperature |
| repetition_penalty | [1] | float | Sampling Config param: repetitionPenalty |
| min_tokens | [1] | int32_t | Sampling Config param: minTokens |
| presence_penalty | [1] | float | Sampling Config param: presencePenalty |
| frequency_penalty | [1] | float | Sampling Config param: frequencyPenalty |
| seed | [1] | uint64_t | Sampling Config param: seed |
| return_log_probs | [1] | bool | When true, include log probs in the output. Note: This requires at least one sampling parameter to be set (e.g., runtime_top_k, runtime_top_p for tensorrt_llm model, or top_k, top_p for tensorrt_llm_bls model). |
| return_context_logits | [1] | bool | When true, include context logits in the output |
| return_generation_logits | [1] | bool | When true, include generation logits in the output |
| num_return_sequences | [1] | int32_t | Number of generated sequences per request. (Default=1) |
| beam_width | [1] | int32_t | Beam width for this request; set to 1 for greedy sampling (Default=1) |
| prompt_embedding_table | [1] | float16 (model data type) | P-tuning prompt embedding table |
| prompt_vocab_size | [1] | int32 | P-tuning prompt vocab size |
| return_perf_metrics | [1] | bool | When true, include perf metrics in the output, such as kv cache reuse stats |
| guided_decoding_guide_type | [1] | string | Guided decoding param: guide_type |
| guided_decoding_guide | [1] | string | Guided decoding param: guide |

The following LoRA inputs apply to both the tensorrt_llm and tensorrt_llm_bls models. The inputs are passed through the tensorrt_llm model, and the tensorrt_llm_bls model refers to the inputs from the tensorrt_llm model.

| Name | Shape | Type | Description |
| --- | --- | --- | --- |
| lora_task_id | [1] | uint64 | The unique task ID for the given LoRA. To perform inference with a specific LoRA for the first time, lora_task_id, lora_weights, and lora_config must all be given. The LoRA will be cached, so that subsequent requests for the same task only require lora_task_id. If the cache is full, the oldest LoRA will be evicted to make space for new ones. An error is returned if lora_task_id is not cached. |
| lora_weights | [ num_lora_modules_layers, D x Hi + Ho x D ] | float (model data type) | Weights for a LoRA adapter. See the config file for more details. |
| lora_config | [ num_lora_modules_layers, 3] | int32_t | Module identifier. See the config file for more details. |
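
The caching behavior of lora_task_id suggests a simple client-side pattern: send all three LoRA tensors the first time a task is used, and only lora_task_id afterwards. The sketch below shows one way to track this. lora_inputs and the sent_task_ids set are hypothetical client-side bookkeeping, not part of the backend. Because the server may evict an adapter when its cache is full, a real client would also need to resend the weights and config when the server reports that the task is not cached.

```python
def lora_inputs(task_id: int, sent_task_ids: set, weights=None, config=None) -> dict:
    """Choose which LoRA tensors to include in a request (client-side sketch)."""
    inputs = {"lora_task_id": task_id}
    if task_id not in sent_task_ids:
        # First use of this task: all three inputs must be given.
        if weights is None or config is None:
            raise ValueError("first request for a task needs lora_weights and lora_config")
        inputs["lora_weights"] = weights
        inputs["lora_config"] = config
        sent_task_ids.add(task_id)
    return inputs


sent = set()
first = lora_inputs(7, sent, weights=[[0.1, 0.2]], config=[[0, 0, 8]])
second = lora_inputs(7, sent)
print(sorted(first))   # ['lora_config', 'lora_task_id', 'lora_weights']
print(sorted(second))  # ['lora_task_id']
```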

Common Outputs

Note: the timing metrics outputs are represented as the number of nanoseconds since epoch.

| Name | Shape | Type | Description |
| --- | --- | --- | --- |
| cum_log_probs | [-1] | float | Cumulative log probabilities for each output |
| output_log_probs | [beam_width, -1] | float | Per-token log probabilities for each output. Only returned when return_log_probs is true and sampling parameters are set. |
| context_logits | [-1, vocab_size] | float | Context logits for input |
| generation_logits | [beam_width, seq_len, vocab_size] | float | Generation logits for each output |
| batch_index | [1] | int32 | Batch index |
| kv_cache_alloc_new_blocks | [1] | int32 | KV cache reuse metrics. Number of newly allocated blocks per request. Set the optional input return_perf_metrics to true to include kv_cache_alloc_new_blocks in the outputs. |
| kv_cache_reused_blocks | [1] | int32 | KV cache reuse metrics. Number of reused blocks per request. Set the optional input return_perf_metrics to true to include kv_cache_reused_blocks in the outputs. |
| kv_cache_alloc_total_blocks | [1] | int32 | KV cache reuse metrics. Number of total allocated blocks per request. Set the optional input return_perf_metrics to true to include kv_cache_alloc_total_blocks in the outputs. |
| arrival_time_ns | [1] | float | Time when the request was received by TRT-LLM. Set the optional input return_perf_metrics to true to include arrival_time_ns in the outputs. |
| first_scheduled_time_ns | [1] | float | Time when the request was first scheduled. Set the optional input return_perf_metrics to true to include first_scheduled_time_ns in the outputs. |
| first_token_time_ns | [1] | float | Time when the first token was generated. Set the optional input return_perf_metrics to true to include first_token_time_ns in the outputs. |
| last_token_time_ns | [1] | float | Time when the last token was generated. Set the optional input return_perf_metrics to true to include last_token_time_ns in the outputs. |
| acceptance_rate | [1] | float | Acceptance rate of the speculative decoding model. Set the optional input return_perf_metrics to true to include acceptance_rate in the outputs. |
| total_accepted_draft_tokens | [1] | int32 | Number of tokens accepted by the target model in speculative decoding. Set the optional input return_perf_metrics to true to include total_accepted_draft_tokens in the outputs. |
| total_draft_tokens | [1] | int32 | Maximum number of draft tokens acceptable by the target model in speculative decoding. Set the optional input return_perf_metrics to true to include total_draft_tokens in the outputs. |
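
Since the timing outputs are all nanosecond timestamps, the useful latency figures are their differences. The sketch below derives queueing time, time to first token, and total latency in milliseconds. It assumes the four *_ns outputs have already been read from a response into a plain dict; latency_breakdown_ms and the result keys are hypothetical names chosen for this example.

```python
NS_PER_MS = 1e6


def latency_breakdown_ms(metrics: dict) -> dict:
    """Convert epoch-nanosecond timestamps into per-request latencies in ms."""
    arrival = metrics["arrival_time_ns"]
    return {
        "queue_ms": (metrics["first_scheduled_time_ns"] - arrival) / NS_PER_MS,
        "time_to_first_token_ms": (metrics["first_token_time_ns"] - arrival) / NS_PER_MS,
        "total_ms": (metrics["last_token_time_ns"] - arrival) / NS_PER_MS,
    }


breakdown = latency_breakdown_ms({
    "arrival_time_ns": 1_000_000_000,
    "first_scheduled_time_ns": 1_002_000_000,
    "first_token_time_ns": 1_050_000_000,
    "last_token_time_ns": 1_250_000_000,
})
print(breakdown)
# {'queue_ms': 2.0, 'time_to_first_token_ms': 50.0, 'total_ms': 250.0}
```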

Unique Inputs for tensorrt_llm model

| Name | Shape | Type | Description |
| --- | --- | --- | --- |
| input_ids | [-1] | int32 | Input token IDs |
| input_lengths | [1] | int32 | Input lengths |
| request_output_len | [1] | int32 | Requested output length |
| draft_input_ids | [-1] | int32 | Draft input IDs |
| decoder_input_ids | [-1] | int32 | Decoder input IDs |
| decoder_input_lengths | [1] | int32 | Decoder input lengths |
| draft_logits | [-1, -1] | float32 | Draft logits |
| draft_acceptance_threshold | [1] | float32 | Draft acceptance threshold |
| stop_words_list | [2, -1] | int32 | List of stop words |
| bad_words_list | [2, -1] | int32 | List of bad words |
| embedding_bias | [-1] | string | Embedding bias words |
| runtime_top_k | [1] | int32 | Top-k value for runtime top-k sampling |
| runtime_top_p | [1] | float32 | Top-p value for runtime top-p sampling |
| runtime_top_p_min | [1] | float32 | Minimum value for runtime top-p sampling |
| runtime_top_p_decay | [1] | float32 | Decay value for runtime top-p sampling |
| runtime_top_p_reset_ids | [1] | int32 | Reset IDs for runtime top-p sampling |
| len_penalty | [1] | float32 | Controls how to penalize longer sequences in beam search (Default=0.f) |
| early_stopping | [1] | bool | Enable early stopping |
| beam_search_diversity_rate | [1] | float32 | Beam search diversity rate |
| stop | [1] | bool | Stop flag |
| streaming | [1] | bool | Enable streaming |
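
The [2, -1] shape of stop_words_list and bad_words_list is not explained on this page. The sketch below shows one plausible packing, based on my understanding of how the preprocessing model prepares these tensors: the first row holds the token IDs of all words concatenated, and the second row holds the cumulative end offset of each word, padded with -1. Treat this layout as an assumption and verify it against the preprocessing model before relying on it. to_word_list is a hypothetical helper that takes already-tokenized words.

```python
from itertools import accumulate


def to_word_list(token_id_lists: list) -> list:
    """Pack tokenized words into a [2, N] layout (assumed format, see above)."""
    words = [ids for ids in token_id_lists if ids]  # drop empty tokenizations
    flat_ids = [tid for ids in words for tid in ids]
    offsets = list(accumulate(len(ids) for ids in words))
    width = max(1, len(flat_ids))
    # Pad both rows to the same width: IDs with 0, offsets with -1.
    flat_ids += [0] * (width - len(flat_ids))
    offsets += [-1] * (width - len(offsets))
    return [flat_ids, offsets]


# Two stop words: one tokenized as [5, 6], the other as [7].
print(to_word_list([[5, 6], [7]]))  # [[5, 6, 7], [2, 3, -1]]
```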

Unique Outputs for tensorrt_llm model

| Name | Shape | Type | Description |
| --- | --- | --- | --- |
| output_ids | [-1, -1] | int32 | Output token IDs |
| sequence_length | [-1] | int32 | Sequence length |

Unique Inputs for tensorrt_llm_bls model

| Name | Shape | Type | Description |
| --- | --- | --- | --- |
| text_input | [-1] | string | Prompt text |
| decoder_text_input | [1] | string | Decoder input text |
| image_input | [3, 224, 224] | float16 | Input image |
| max_tokens | [-1] | int32 | Number of tokens to generate |
| bad_words | [2, num_bad_words] | int32 | Bad words list |
| stop_words | [2, num_stop_words] | int32 | Stop words list |
| top_k | [1] | int32 | Sampling Config param: topK |
| top_p | [1] | float32 | Sampling Config param: topP |
| length_penalty | [1] | float32 | Sampling Config param: lengthPenalty |
| stream | [1] | bool | When true, stream out tokens as they are generated. When false, return only when the full generation has completed (Default=false) |
| embedding_bias_words | [-1] | string | Embedding bias words |
| embedding_bias_weights | [-1] | float32 | Embedding bias weights |
| num_draft_tokens | [1] | int32 | Number of tokens to get from draft model during speculative decoding |
| use_draft_logits | [1] | bool | Use logit comparison during speculative decoding |

Unique Outputs for tensorrt_llm_bls model

| Name | Shape | Type | Description |
| --- | --- | --- | --- |
| text_output | [-1] | string | Text output |

Some tips for model configuration

Below are some tips for configuring models for optimal performance. These recommendations are based on our experiments and may not apply to all use cases. For guidance on other parameters, please refer to the perf_best_practices.

  • Setting the instance_count for models to better utilize inflight batching

    The instance_count parameter in the config.pbtxt file specifies the number of instances of the model to run. Ideally, this should be set to match the maximum batch size supported by the TRT engine, as this allows for concurrent request execution and reduces performance bottlenecks. However, it will also consume more CPU memory resources. While the optimal value isn't something we can determine in advance, it generally shouldn't be set to a very small value, such as 1. For most use cases, we have found that setting instance_count to 5 works well across a variety of workloads in our experiments.

  • Adjusting max_batch_size and max_num_tokens to optimize inflight batching

    max_batch_size and max_num_tokens are important parameters for optimizing inflight batching. You can modify max_batch_size in the model configuration file, while max_num_tokens is set during the conversion to a TRT-LLM engine using the trtllm-build command. Tuning these parameters is necessary for different scenarios, and experimentation is currently the best approach to finding optimal values. Generally, the total number of requests should be lower than max_batch_size, and the total tokens should be less than max_num_tokens.
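
    The last guideline can be checked with a few lines of code when sizing a workload. The sketch below is a deliberately simplified model, not the TRT-LLM scheduler: it only tests whether a set of concurrent requests stays within both limits. fits_within_limits is a hypothetical helper, and treating the limits as inclusive is an assumption.

```python
def fits_within_limits(request_token_counts: list,
                       max_batch_size: int, max_num_tokens: int) -> bool:
    """True if the requests respect both the request-count and token limits."""
    within_batch = len(request_token_counts) <= max_batch_size
    within_tokens = sum(request_token_counts) <= max_num_tokens
    return within_batch and within_tokens


# Three requests with 512, 1024 and 256 tokens: 3 requests, 1792 tokens in total.
print(fits_within_limits([512, 1024, 256], max_batch_size=8, max_num_tokens=2048))  # True
print(fits_within_limits([512, 1024, 256], max_batch_size=8, max_num_tokens=1024))  # False
```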