Model Server Parameters {#ovmsdocsparameters}

June 16, 2026

Model Configuration Options

| Option | Value format | Description |
|---|---|---|
| `"model_name"/"name"` | `string` | Model name exposed over gRPC and REST API (use `model_name` on the command line, `name` in the JSON config). |
| `"model_path"/"base_path"` | `string` | If using a Google Cloud Storage, Azure Storage or S3 path, see the cloud storage guide. The path may look as follows:<br>`"/opt/ml/models/model"`<br>`"gs://bucket/models/model"`<br>`"s3://bucket/models/model"`<br>`"azure://bucket/models/model"`<br>The path can also be relative to the `config.json` location (use `model_path` on the command line, `base_path` in the JSON config).<br><br>For local filesystem paths only, it can also point directly to a single model file (`.xml`, `.onnx`, `.pdmodel`, `.pdiparams`, `.pb`, `.tflite`). In this mode, model version 1 is exposed. |
| `"shape"` | `tuple/json/"auto"` | Optional; takes precedence over `batch_size`. The `shape` argument reshapes the model loaded in the model server to fit the given parameters. `shape` accepts three forms of values:<br>* `auto` - The model server reloads the model with the shape that matches the input data matrix.<br>* a tuple, such as `(1,3,224,224)` - The tuple defines the shape to use for all incoming requests for models with a single input.<br>* a dictionary of shapes, such as `{"input1":"(1,3,224,224)","input2":"(1,3,50,50)", "input3":"auto"}` - This option defines the shape of every included input in the model.<br><br>Some models don't support the reshape operation. If the model can't be reshaped, it keeps its original parameters and all requests with an incompatible input format result in an error. See the logs for more information about specific errors.<br><br>Learn more about supported model graph layers, including all limitations, in the Shape Inference Document. |
| `"batch_size"` | `integer/"auto"` | Optional. By default, the batch size is derived from the model, as defined through the OpenVINO Model Optimizer. `batch_size` is useful for sequential inference requests of the same batch size.<br><br>Some models, such as object detection, don't work correctly with the `batch_size` parameter. With these models, the output's first dimension doesn't represent the batch size. You can set the batch size for these models by using network reshaping and setting the `shape` parameter appropriately.<br><br>By default, the batch size is taken from the first dimension of the first input. For example, if the input shape is `(1, 3, 225, 225)`, the batch size is set to 1. If you set `batch_size` to a numerical value, the model batch size is changed when the service starts.<br><br>`batch_size` also accepts the value `auto`. With `auto`, the served model batch size is set according to the incoming data at run time. The model is reloaded each time the input data changes the batch size, so the first request after a change may see a delayed response. |
| `"layout"` | `json/string` | `layout` is an optional argument that defines or changes the layout of model input and output tensors. To change the layout (add a transposition step), specify `<target layout>:<source layout>`. Example: `NHWC:NCHW` means that the user sends input data in `NHWC` layout while the model is in `NCHW` layout.<br><br>When specified without the colon separator, it doesn't add a transposition but can determine the batch dimension. E.g. `--layout CN` makes the prediction service treat the second dimension as the batch size.<br><br>When the model has multiple inputs or the output layout has to be changed, use the JSON format. Set the mapping, such as: `{"input1":"NHWC:NCHW","input2":"HWN:NHW","output1":"CN:NC"}`.<br><br>If not specified, the layout is inherited from the model. |
| `"mean"` | `array/float/tuple` | Optional. The value subtracted from pixel values when preprocessing input data. It may be a float value, a tuple or an array. The tuple or array length should be the same as the number of channels. |
| `"scale"` | `array/float/tuple` | Optional. The value by which pixel values are divided when preprocessing input data. It may be a float value, a tuple or an array. The tuple or array length should be the same as the number of channels. |
| `"color_format"` | `string` | Optional. Defines or changes the color format of model input tensors. To change the color format, specify `<target color format>:<source color format>`, in the same way as for the `layout` option. Possible options: `RGB`, `BGR`, `GRAY`, `NV12`, `NV12_2`, `I420` or `I420_3`. |
| `"precision"` | `string` | Optional. Changes the precision of model input tensors. To change the precision, specify `<target precision>:<source precision>`, in the same way as for `layout` or `color_format`. Possible options: `fp64`, `fp32`, `fp16`, `uint1`, `int8`, `uint8`, `int16`, `uint16`, `int32`, `uint32`, `int64`, `uint64` or `bf16`. |
| `"model_version_policy"` | `json/string` | Optional. The model version policy lets you decide which versions of a model the OpenVINO Model Server serves. By default, the server serves the latest version. One reason to use this argument is to control the server memory consumption. The accepted format is JSON or string. Examples:<br>`{"latest": { "num_versions":2 } }`<br>`{"specific": { "versions":[1, 3] } }`<br>`{"all": {} }` |
| `"plugin_config"` | `json/string` | List of device plugin parameters. For the full list, refer to the OpenVINO documentation and the performance tuning guide. Example:<br>`{"PERFORMANCE_HINT": "LATENCY"}` |
| `"nireq"` | `integer` | The size of the internal request queue. When set to 0 or left unset, the value is calculated automatically based on available resources. |
| `"target_device"` | `string` | Device name to be used to execute inference operations. Accepted values are: `"CPU"/"GPU"/"MULTI"/"HETERO"` |
| `"metrics_enable"` | `bool` | Flag enabling the metrics endpoint on `rest_port`. |
| `"metrics_list"` | `string` | Comma-separated list of metrics. If unset, only default metrics are enabled. |
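The `<target layout>:<source layout>` convention from the `layout` row can be illustrated with a small sketch. It is not the server's implementation: the helper names are hypothetical, and it assumes layouts made only of single-letter dimensions (no `...` or `?` placeholders). It shows how a value such as `NHWC:NCHW` maps to an axis permutation, and how a value without a colon only identifies the batch dimension.

```python
def parse_layout(spec):
    """Split a layout value into (target, source); source is None without a colon."""
    target, sep, source = spec.partition(":")
    return target, (source if sep else None)


def transposition(spec):
    """Axis permutation that turns request data (target layout) into the
    model's layout (source layout), or None when no colon is given."""
    target, source = parse_layout(spec)
    if source is None:
        return None  # no transposition step is added
    if sorted(target) != sorted(source):
        raise ValueError(f"layouts {target!r} and {source!r} use different dimensions")
    # For every axis of the model layout, find its position in the request layout.
    return [target.index(axis) for axis in source]


def batch_axis(spec):
    """Index of the batch dimension 'N' in the layout the client sends, or None."""
    target, _ = parse_layout(spec)
    return target.index("N") if "N" in target else None
```

For example, `transposition("NHWC:NCHW")` yields the order in which the axes of an `NHWC` request would be read to produce `NCHW` data, and `batch_axis("CN")` points at the second dimension.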

Note: Specifying `config_path` (used for serving multiple models) is mutually exclusive with passing model parameters on the command line.

| Option | Value format | Description |
|---|---|---|
| `config_path` | `string` | Absolute path to the JSON configuration file. |
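As an illustration of how the JSON-config names (`name`, `base_path`) from the table above might be combined in a configuration file, here is a minimal sketch that generates one. The `model_config_list` / `config` wrapper keys are an assumption not stated in this document, and the model name `resnet` is hypothetical; check the configuration file documentation before relying on this layout.

```python
import json


def model_entry(name, base_path, **options):
    """Build one model entry using the JSON-config option names."""
    config = {"name": name, "base_path": base_path}
    config.update(options)
    return {"config": config}  # assumed wrapper key


config = {
    "model_config_list": [  # assumed top-level key
        model_entry(
            "resnet",  # hypothetical model name
            "/opt/ml/models/model",
            layout="NHWC:NCHW",
            model_version_policy={"latest": {"num_versions": 2}},
            plugin_config={"PERFORMANCE_HINT": "LATENCY"},
        )
    ]
}

# Text that could be written to the file passed as config_path.
config_text = json.dumps(config, indent=2)
```

The resulting text would be saved to a file and its absolute path passed as `config_path`, instead of passing model parameters on the command line.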

Server configuration options

Configuration options for the server are defined only via command-line options and determine configuration common for all served models.

| Option | Value format | Description |
|---|---|---|
| `port` | `integer` | Number of the port used by the gRPC server. |
| `rest_port` | `integer` | Number of the port used by the HTTP server (if not provided or set to 0, the HTTP server is not launched). |
| `grpc_bind_address` | `string` | Comma-separated list of IPv4/IPv6 network interface addresses or hostnames to which the gRPC server binds. Default: all interfaces (`0.0.0.0`). |
| `rest_bind_address` | `string` | Comma-separated list of IPv4/IPv6 network interface addresses or hostnames to which the REST server binds. Default: all interfaces (`0.0.0.0`). |
| `grpc_workers` | `integer` | Number of gRPC server instances (must be from 1 to the CPU core count). The default value is 1, which is optimal for most use cases. Consider a higher value when expecting heavy load. |
| `rest_workers` | `integer` | Number of HTTP server threads. Effective when `rest_port` > 0. The default value is set based on the number of CPUs. |
| `file_system_poll_wait_seconds` | `integer` | Time interval, in seconds, between checks for config and model version changes. Default value is 1. A value of zero disables change monitoring. |
| `custom_node_resources_cleaner_interval_seconds` | `integer` | Time interval, in seconds, between two consecutive resource cleanup scans. Default is 1. Must be greater than 0. See custom node development. |
| `cpu_extension` | `string` | Optional path to a library with custom layer implementations. |
| `log_level` | `"DEBUG"/"INFO"/"ERROR"` | Serving logging level. |
| `log_path` | `string` | Optional path to the log file. |
| `cache_dir` | `string` | Path (absolute or relative to the current directory) to the model cache storage. Caching is enabled if this parameter is defined or if the default path `/opt/cache` exists. |
| `grpc_channel_arguments` | `string` | Comma-separated list of arguments to be passed to the gRPC server (e.g. `grpc.max_connection_age_ms=2000`). |
| `grpc_max_threads` | `string` | Maximum number of threads that can be used by the gRPC server. The default value depends on the number of CPUs. |
| `grpc_memory_quota` | `string` | gRPC server buffer memory quota. Default value is 2147483648 (2GB). |
| `help` | `NA` | Shows the help message and exits. |
| `version` | `NA` | Shows the binary version. |
| `allow_credentials` | `bool` (default: false) | Whether to allow credentials in CORS requests. |
| `allowed_headers` | `string` (default: *) | Comma-separated list of allowed headers in CORS requests. |
| `allowed_methods` | `string` (default: *) | Comma-separated list of allowed methods in CORS requests. |
| `allowed_origins` | `string` (default: *) | Comma-separated list of allowed origins in CORS requests. |
| `api_key_file` | `string` | Path to the text file with the API key for generative endpoints `/v3/`. The value of the first line is used. If not specified, the server uses the `API_KEY` environment variable. If neither is set, requests do not require authorization. |
| `allowed_local_media_path` | `string` | Path to the directory containing images to include in requests. If unset, local filesystem images in requests are not supported. |
| `allowed_media_domains` | `string` | Comma-separated list of media domains from which URLs can be used as input for LLMs. Set to "all" to disable this restriction. If unset, URLs in requests are not supported. |
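A few of the constraints in this table can be checked before launching the server. The sketch below is only illustrative: both helper names are hypothetical, the `--name value` argument form is inferred from examples elsewhere in this document (such as `--layout CN`), and the checks cover only the constraints stated above.

```python
import os


def check_server_options(opts, cpu_count=None):
    """Return a list of problems found in a dict of server options."""
    cpu_count = cpu_count or os.cpu_count() or 1
    problems = []
    workers = opts.get("grpc_workers", 1)
    if not 1 <= workers <= cpu_count:
        problems.append(f"grpc_workers must be from 1 to {cpu_count}")
    if "rest_workers" in opts and opts.get("rest_port", 0) <= 0:
        problems.append("rest_workers has no effect unless rest_port > 0")
    if opts.get("custom_node_resources_cleaner_interval_seconds", 1) <= 0:
        problems.append("custom_node_resources_cleaner_interval_seconds must be greater than 0")
    return problems


def to_argv(opts):
    """Render options as a flat '--name value' argument list."""
    argv = []
    for name, value in opts.items():
        argv += [f"--{name}", str(value)]
    return argv
```

A caller could run `check_server_options` first and only pass the result of `to_argv` to the server binary when the returned list is empty.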

Config management mode options

Configuration options for the config management mode, which is used to manage the config file in the model repository.

| Option | Value format | Description |
|---|---|---|
| `model_repository_path` | `string` | Path to the model repository. This path is prefixed to the relative model path. |
| `list_models` | `NA` | List all model paths in the model repository. |
| `model_name` | `string` | Name of the model as visible in serving. If `--model_path` is not provided, the path is deduced from the name. |
| `model_path` | `string` | Optional. Path to the model repository. If the path is relative, it is prefixed with `--model_repository_path`. |
| `add_to_config` | `NA` | Directive to add a new model to the config file. |
| `remove_from_config` | `NA` | Directive to remove a model from the config file. |
| `config_path` | `string` | Path to the configuration file. |
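The interplay between `model_repository_path`, `model_name` and `model_path` described above can be sketched as follows. This is an approximation under stated assumptions: the function name is hypothetical, POSIX-style paths are assumed, and "deduced from the name" is taken to mean that the model name itself is used as the relative path, which the table does not spell out.

```python
import posixpath


def resolve_model_path(model_repository_path, model_name, model_path=None):
    """Return the effective model location for config management mode."""
    # Assumption: without model_path, the model name serves as the relative path.
    path = model_path if model_path else model_name
    if posixpath.isabs(path):
        return path  # absolute paths are used as given
    # Relative paths are prefixed with the model repository path.
    return posixpath.join(model_repository_path, path)
```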

Pull mode configuration options

Shared configuration options for the pull mode and the pull & start mode. When the `--pull` parameter is present, OVMS only pulls the model and does not serve it.

Pull Mode Options

| Option | Value format | Description |
|---|---|---|
| `--pull` | `NA` | Runs the server in pull mode to download the model from the Hugging Face repository. |
| `--source_model` | `string` | Name of the model in the Hugging Face repository. If not set, `model_name` is used. |
| `--model_repository_path` | `string` | Directory where all required model files will be saved. |
| `--model_name` | `string` | Name of the model as exposed externally by the server. |
| `--target_device` | `string` | Device name to be used to execute inference operations. Accepted values are: `"CPU"/"GPU"/"MULTI"/"HETERO"` |
| `--task` | `string` | Task type the model will support (`text_generation`, `embeddings`, `rerank`, `image_generation`). |
| `--overwrite_models` | `NA` | If set, an existing model with the same name will be overwritten. If not set, the server will use existing model files if available. |
| `--gguf_filename` | `string` | Filename of the wanted quantization type from the Hugging Face GGUF repository. |

NOTE: If you want to use a model that is split into several .gguf files, specify the filename of the first part only, e.g. `--gguf_filename model-name-00001-of-00002.gguf`.
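The rule in the note can be sketched as a small helper that maps any part of a split model to the filename of its first part. The `-NNNNN-of-NNNNN.gguf` suffix pattern is inferred from the single example above and may not cover every repository's naming; the function name is hypothetical.

```python
import re

# Suffix pattern inferred from "model-name-00001-of-00002.gguf".
_SPLIT_SUFFIX = re.compile(r"-(\d{5})-of-(\d{5})\.gguf$")


def first_part(filename):
    """Return the first-part filename for a split GGUF file, else the name unchanged."""
    match = _SPLIT_SUFFIX.search(filename)
    if match is None:
        return filename  # not a split file
    total = match.group(2)
    return filename[: match.start()] + f"-00001-of-{total}.gguf"
```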

Pull Mode Options for optimum-cli mode

When pulling models from outside the OpenVINO organization, OVMS uses the optimum-cli API internally. You can set additional parameters for this mode.

| Option | Value format | Description |
|---|---|---|
| `--extra_quantization_params` | `string` | Add advanced quantization parameters. Check the optimum-intel documentation. Example: `--sym --group-size -1 --ratio 1.0 --awq --scale-estimation --dataset wikitext2` |
| `--weight-format` | `string` | Model precision used in optimum-cli export with conversion. Default: `int8`. |
| `--vocoder` | `string` | The vocoder model to use for text2speech. For example `microsoft/speecht5_hifigan`. |

There are also additional environment variables that may change the behavior of pulling:

Basic Environment Variables for Pull Mode

| Variable | Value format | Description |
|---|---|---|
| `HF_ENDPOINT` | `string` | Default: `https://huggingface.co`. For users in China, set to `https://www.modelscope.cn/models` or `https://hf-mirror.com` if needed. |
| `HF_TOKEN` | `string` | Authentication token required for accessing some models from Hugging Face. |
| `https_proxy` | `string` | If set, model downloads will use this proxy. |
| `OVMS_MODEL_REPOSITORY_PATH` | `string` | If set, it defines the default value for `--model_repository_path`, and the default `--config_path` becomes `config.json` inside that model repository path. |
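The defaulting behavior of `OVMS_MODEL_REPOSITORY_PATH` can be sketched as below. It assumes that explicitly passed `--model_repository_path` and `--config_path` values take priority over the environment variable, which this document implies by calling them defaults but does not state outright; the function name is hypothetical.

```python
import os
import posixpath


def pull_defaults(environ=None, model_repository_path=None, config_path=None):
    """Fill in unset paths from OVMS_MODEL_REPOSITORY_PATH."""
    environ = os.environ if environ is None else environ
    repo_env = environ.get("OVMS_MODEL_REPOSITORY_PATH")
    if model_repository_path is None:
        model_repository_path = repo_env
    if config_path is None and repo_env:
        # The default config file is config.json inside the repository path.
        config_path = posixpath.join(repo_env, "config.json")
    return model_repository_path, config_path
```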

Advanced Environment Variables for Pull Mode

| Variable | Format | Description |
|---|---|---|
| `GIT_OPT_SET_SERVER_CONNECT_TIMEOUT` | `int` | Timeout to attempt connections to a remote server. Default value 4000 ms. |
| `GIT_OPT_SET_SERVER_TIMEOUT` | `int` | Timeout for reading from and writing to a remote server. Default value 4000 ms. |
| `GIT_OPT_SET_SSL_CERT_LOCATIONS` | `string` | Path to check for SSL certificates. |
| `GIT_OPT_SET_ENABLE_SEARCH_PATHS` | `int` | When set to 1, the pull functionality reads host-level git configuration locations like `~/.gitconfig`. Default value 0. |

Task specific parameters for different tasks (text generation/image generation/embeddings/rerank) are listed below:

Text generation

| Option | Value format | Description |
|---|---|---|
| `--max_num_seqs` | `integer` | The maximum number of sequences that can be processed together. Default: 256. |
| `--pipeline_type` | `string` | Type of the pipeline to be used. Choices: `LM`, `LM_CB`, `VLM`, `VLM_CB`, `AUTO`. Default: `AUTO`. |
| `--enable_prefix_caching` | `bool` | Enables the algorithm that caches prompt tokens. Default: true. |
| `--max_num_batched_tokens` | `integer` | The maximum number of tokens that can be batched together. |
| `--cache_size` | `integer` | KV cache size in GB. Default: 0, which means dynamic allocation. |
| `--draft_source_model` | `string` | HF model name or path to the local folder with a PyTorch or OpenVINO draft model. |
| `--dynamic_split_fuse` | `bool` | Enables the dynamic split fuse algorithm. Default: true. |
| `--max_prompt_len` | `integer` | Sets the NPU-specific property for the maximum number of tokens in the prompt. |
| `--kv_cache_precision` | `string` | Reducing the KV cache precision to `u8` lowers the cache size consumption. Accepted values: `u8` or empty (default). |
| `--model_distribution_policy` | `string` | `TENSOR_PARALLEL` distributes a tensor to multiple sockets/devices and processes it in parallel. `PIPELINE_PARALLEL` distributes different tensors to be processed by each device. Accepted values: `TENSOR_PARALLEL`, `PIPELINE_PARALLEL` or empty (default). |
| `--reasoning_parser` | `string` | Type of parser to use for reasoning content extraction from model output. Currently supported: [`qwen3`, `gptoss`, `gemma4`] |
| `--tool_parser` | `string` | Type of parser to use for tool call extraction from model output. Currently supported: [`llama3`, `phi4`, `hermes3`, `mistral`, `qwen3coder`, `gptoss`, `devstral`, `lfm2`, `gemma4`] |
| `--enable_tool_guided_generation` | `bool` | Enables enforcing the tool schema during generation. Requires setting a response parser. Default: false. |

Image generation

| Option | Value format | Description |
|---|---|---|
| `--resolution` | `string` | Allowed resolutions as a space-separated list of `WxH` values (W = width, H = height). If not specified, inherited from the model. If only one is specified, the pipeline is reshaped to static. A static shape is required for the NPU device. |
| `--max_resolution` | `string` | Maximum allowed resolution in the format `WxH` (W = width, H = height). If not specified, inherited from the model. |
| `--default_resolution` | `string` | Default resolution in the format `WxH` when not specified by the client. If not specified, inherited from the model. |
| `--max_num_images_per_prompt` | `integer` | Maximum number of images a client can request per prompt in a single request. In the 2025.2 release only 1 image generation per request is supported. |
| `--num_images_per_prompt` | `integer` | Number of images a client is allowed to request. Can only be used when the resolution parameter is specified and static. By default, inherited from GenAI (1). For dynamic pipelines, by default only `max_num_images_per_prompt` limits the batch size. |
| `--guidance_scale` | `integer` | Guidance scale used for static pipeline reshape. Can only be used when the resolution parameter is specified and static. By default, inherited from GenAI (7.5). |
| `--default_num_inference_steps` | `integer` | Default number of inference steps when not specified by the client. |
| `--max_num_inference_steps` | `integer` | Maximum number of inference steps a client can request for a given model. |
| `--num_streams` | `integer` | Number of parallel execution streams for image generation models. Use at least 2 on 2-socket CPU systems. |
| `--source_loras` | `string` | LoRA adapters for image generation. Comma-separated list in the format `alias=source`. The source can be: an HF repo (`org/repo`), an HF repo with an explicit file (`org/repo@file.safetensors`), a direct URL (`https://url/file.safetensors`), a local path (`/path/to/file.safetensors`), or a composite referencing other aliases (`@alias1:weight+@alias2:weight`). |
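The `WxH` format used by `--resolution` can be illustrated with a short parser. This is a sketch with hypothetical function names, not the server's own parsing code; it only reflects the rules stated in the table: a space-separated list of `WxH` values, with a single entry meaning a static pipeline.

```python
def parse_resolutions(value):
    """Parse a space-separated list of WxH entries into (width, height) tuples."""
    resolutions = []
    for item in value.split():
        width, sep, height = item.lower().partition("x")
        if not sep or not width.isdigit() or not height.isdigit():
            raise ValueError(f"expected WxH, got {item!r}")
        resolutions.append((int(width), int(height)))
    return resolutions


def is_static(value):
    """A single allowed resolution means the pipeline is reshaped to static."""
    return len(parse_resolutions(value)) == 1
```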

Embeddings

| Option | Value format | Description |
|---|---|---|
| `--num_streams` | `integer` | The number of parallel execution streams to use for the model. Use at least 2 on 2-socket CPU systems. Default: 1. |
| `--normalize` | `bool` | Normalize the embeddings. Default: true. |
| `--truncate` | `bool` | Truncate input when it exceeds the model context length. Default: false. |
| `--pooling` | `string` | Pooling option. One of: `CLS`, `LAST`, `MEAN`. Default: `CLS`. |
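To make the `--pooling` and `--normalize` options more concrete, the sketch below applies the conventional meaning of these terms to a list of per-token vectors: `CLS` takes the first token, `LAST` the last token, and `MEAN` the element-wise average, followed by optional L2 normalization. These definitions are assumptions based on common usage; the document does not describe how the server implements them (for example, how padding tokens are handled).

```python
import math


def pool(token_vectors, mode="CLS"):
    """Reduce per-token vectors to a single embedding vector."""
    if not token_vectors:
        raise ValueError("at least one token vector is required")
    if mode == "CLS":
        return list(token_vectors[0])
    if mode == "LAST":
        return list(token_vectors[-1])
    if mode == "MEAN":
        count = len(token_vectors)
        return [sum(column) / count for column in zip(*token_vectors)]
    raise ValueError(f"unknown pooling mode: {mode}")


def normalize(vector):
    """Scale a vector to unit L2 length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return list(vector) if norm == 0 else [x / norm for x in vector]
```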

Rerank

| Option | Value format | Description |
|---|---|---|
| `--num_streams` | `integer` | The number of parallel execution streams to use for the model. Use at least 2 on 2-socket CPU systems. Default: 1. |
| `--max_allowed_chunks` | `integer` | Maximum allowed chunks. Default: 10000. |