OpenAI API responses endpoint {#ovmsdocsrestapiresponses}

May 26, 2026

Note: This endpoint works only with LLM graphs.

API Reference

OpenVINO Model Server now includes the responses endpoint, which follows the OpenAI API. See the OpenAI API Reference for more information on the API. The endpoint is exposed at the following path:

http://server_name:port/v3/responses

Example request

curl http://localhost/v3/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3",
    "input": "What is OpenVINO?"
  }'

Example response

{
  "id": "resp-1716825108",
  "object": "response",
  "created_at": 1716825108,
  "completed_at": 1716825110,
  "error": null,
  "model": "llama3",
  "status": "completed",
  "parallel_tool_calls": true,
  "store": true,
  "text": { "format": { "type": "text" } },
  "tool_choice": "auto",
  "tools": [],
  "truncation": "disabled",
  "metadata": {},
  "output": [
    {
      "id": "msg-0",
      "type": "message",
      "role": "assistant",
      "status": "completed",
      "content": [
        {
          "type": "output_text",
          "text": "OpenVINO is an open-source toolkit ...",
          "annotations": []
        }
      ]
    }
  ],
  "usage": {
    "input_tokens": 5,
    "output_tokens": 42,
    "total_tokens": 47
  }
}
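
The generated text sits inside the nested output array rather than at the top level of the response. A minimal sketch of how a client might pull it out is shown below; the helper name extract_output_text is hypothetical, and the sample dictionary is a trimmed copy of the example response above.

```python
def extract_output_text(response: dict) -> str:
    """Concatenate the text of every output_text part found in message items."""
    chunks = []
    for item in response.get("output", []):
        if item.get("type") != "message":
            continue  # skip function_call and reasoning items
        for part in item.get("content", []):
            if part.get("type") == "output_text":
                chunks.append(part.get("text", ""))
    return "".join(chunks)


# Trimmed copy of the example response shown above.
sample = {
    "status": "completed",
    "output": [
        {
            "id": "msg-0",
            "type": "message",
            "role": "assistant",
            "status": "completed",
            "content": [
                {
                    "type": "output_text",
                    "text": "OpenVINO is an open-source toolkit ...",
                    "annotations": [],
                }
            ],
        }
    ],
}

print(extract_output_text(sample))
```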

For VLM models, the request can include images:

curl http://localhost/v3/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llava",
    "input": [
        {
            "role": "user",
            "content": [
                {
                    "type": "input_text",
                    "text": "What is on the picture?"
                },
                {
                    "type": "input_image",
                    "image_url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBD ..."
                }
            ]
        }
    ],
    "max_output_tokens": 128
}'
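
The same request body can be assembled in Python. The sketch below assumes the image is already available as raw bytes and encodes it as a base64 data URL, as in the curl example; the helper name build_image_request and the four sample bytes (a JPEG header prefix, not a complete image) are illustrative only.

```python
import base64
import json


def build_image_request(model: str, prompt: str, image_bytes: bytes,
                        mime_type: str = "image/jpeg",
                        max_output_tokens: int = 128) -> dict:
    """Build a /v3/responses request body with one text part and one image part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "input": [
            {
                "role": "user",
                "content": [
                    {"type": "input_text", "text": prompt},
                    {
                        "type": "input_image",
                        "image_url": f"data:{mime_type};base64,{encoded}",
                    },
                ],
            }
        ],
        "max_output_tokens": max_output_tokens,
    }


# Placeholder bytes: only the first bytes of a JPEG header, for illustration.
payload = build_image_request("llava", "What is on the picture?", b"\xff\xd8\xff\xe0")
print(json.dumps(payload, indent=2))
```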

Request

Generic

| Param | Type | Description |
|---|---|---|
| model | string (required) | Name of the model to use. From the administrator's point of view, it is the name assigned to a MediaPipe graph configured to schedule generation using the desired model. |
| input | string or array (required) | The input to generate a response for. Accepts a plain string or an array of message items with input_text / input_image types. |
| stream | bool (optional, default: false) | If set to true, partial message deltas are sent to the client as server-sent events as they become available, with the stream terminated by a data: [DONE] message. See the Streaming events section for details. |
| max_output_tokens | integer (optional) | An upper bound for the number of tokens that can be generated. If not set, the generation stops once the EOS token is generated. If max_tokens_limit is set in graph.pbtxt, it is used as the default value. |
| stop | string/array of strings (optional) | Up to 4 sequences where the API will stop generating further tokens. If stream is set to false, the matched stop string is not included in the output by default. If stream is set to true, the matched stop string is included in the output by default. This can be changed with the include_stop_str_in_output parameter, but setting include_stop_str_in_output=false with stream=true is invalid. |
| ignore_eos | bool (default: false) | Whether to ignore the EOS token and continue generating tokens after the EOS token is generated. |
| include_stop_str_in_output | bool (default: false if stream=false, true if stream=true) | Whether to include the matched stop string in the output. Setting it to false when stream=true is an invalid configuration and results in an error. |
| logprobs ⚠️ | bool (default: false) | Include the log probabilities of the returned output tokens. In stream mode logprobs are not supported. |
| response_format | object (optional) | An object specifying the format that the model must output. Setting it to { "type": "json_schema", "json_schema": {...} } enables Structured Outputs. Additionally accepts the XGrammar structural tags format. The OpenAI Responses API uses text.format instead (not supported in OVMS). |
| tools ⚠️ | array (optional) | A list of tools the model may call. Currently, only function tools are supported. OpenAI also supports built-in tools (web_search, file_search, code_interpreter, etc.) and MCP tools. OVMS additionally accepts a flat {type, name, parameters} format alongside the nested {type, function: {name, parameters}} format. See the OpenAI API reference for more details. |
| tool_choice | string or object (optional) | Controls which (if any) tool is called by the model. none means the model will not call any tool and instead generates a message. auto means the model can pick between generating a message or calling one or more tools. required means that the model should call at least one tool. Specifying a particular function via {"type": "function", "function": {"name": "my_function"}} forces the model to call that tool. |
| reasoning ⚠️ | object (optional) | Configuration for reasoning/thinking mode. The effort field accepts "low", "medium", or "high"; any value enables thinking mode (enable_thinking: true is injected into chat template kwargs). The summary field is accepted but ignored. |
| chat_template_kwargs | object (optional) | Additional keyword arguments passed to the chat template. When reasoning is also provided, enable_thinking: true is merged into these kwargs. |
| skip_special_tokens | bool (default: true) | Whether to remove special tokens (e.g. `<\|endoftext\|>`, `<\|im_end\|>`) from the generated output. Set to false to include them, which is useful when the model uses special tokens to encode structured information (e.g. bounding boxes, reasoning markers). When false, any tool or reasoning parser configured on the endpoint is silently disabled for the request, so the raw token stream is returned. This option works with most detokenizers exported with OpenVINO Tokenizers 2024.5 or later, unless they are based on custom ops. |
| stream_options | | Not supported in Responses API. Usage statistics are always included in the response.completed event. |
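
The interaction between stream, stop and include_stop_str_in_output is easy to get wrong, so a client may want to check a request before sending it. The sketch below mirrors only the rules stated in the table; resolve_stop_options is a hypothetical client-side helper, and the server's own validation and error messages may differ.

```python
def resolve_stop_options(request: dict) -> dict:
    """Apply the documented defaults and reject combinations described as invalid."""
    stream = bool(request.get("stream", False))

    # "stop" may be a single string or a list of strings.
    stop = request.get("stop")
    if isinstance(stop, str):
        stop = [stop]
    stop = list(stop or [])
    if len(stop) > 4:
        raise ValueError("at most 4 stop sequences are allowed")

    # Documented default: false when stream=false, true when stream=true.
    include = request.get("include_stop_str_in_output", stream)
    if stream and not include:
        raise ValueError("include_stop_str_in_output=false is invalid when stream=true")

    return {"stream": stream, "stop": stop, "include_stop_str_in_output": include}


print(resolve_stop_options({"stop": "\n"}))
print(resolve_stop_options({"stream": True, "stop": ["END", "STOP"]}))
```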

Beam search sampling specific

| Param | Type | Description |
|---|---|---|
| n | integer (default: 1) | Number of output sequences to return for the given prompt. The value must satisfy 1 <= n <= best_of. For Responses API streaming, only n=1 is supported. |
| best_of | integer (default: 1) | Number of output sequences that are generated from the prompt. From these best_of sequences, the top n sequences are returned. best_of must be greater than or equal to n. This is treated as the beam width for beam search sampling. |
| length_penalty | float (default: 1.0) | Exponential penalty to the length that is used with beam-based generation. It is applied as an exponent to the sequence length, which in turn is used to divide the score of the sequence. Since the score is the log likelihood of the sequence (i.e. negative), length_penalty > 0.0 promotes longer sequences, while length_penalty < 0.0 encourages shorter sequences. |
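
The effect of length_penalty can be checked with a small calculation. The sketch below assumes the score is the sum of token log probabilities divided by the sequence length raised to length_penalty, as the description states; the function name and the sample log probabilities are made up for illustration.

```python
def beam_score(token_logprobs, length_penalty=1.0):
    """Sequence score: summed log likelihood divided by length ** length_penalty."""
    return sum(token_logprobs) / (len(token_logprobs) ** length_penalty)


short_seq = [-0.5, -0.5]   # summed log likelihood -1.0, length 2
long_seq = [-0.5] * 4      # summed log likelihood -2.0, length 4

for penalty in (-1.0, 1.0, 2.0):
    print(penalty, beam_score(short_seq, penalty), beam_score(long_seq, penalty))
```

Because the scores are negative, dividing by a larger number moves them closer to zero, which is why a positive exponent favours the longer sequence and a negative one favours the shorter sequence.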

Multinomial sampling specific

| Param | Type | Description |
|---|---|---|
| temperature | float (default: 1.0) | The value is used to modulate token probabilities for multinomial sampling. It enables multinomial sampling when set to > 0.0. |
| top_p | float (default: 1.0) | Controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
| min_p | float (default: 0.0) | Minimum probability threshold relative to the most likely token. Tokens with probability below min_p × the top token probability are filtered out. 0.0 (default) disables the filter. Typical values: 0.05 to 0.1. Must be in [0.0, 1.0). |
| top_k | int (default: 40) | Controls the number of top tokens to consider. When multinomial sampling is active, defaults to 40 if not set. Set to -1 to consider all tokens. |
| repetition_penalty | float (default: 1.0) | Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1.0 encourage the model to use new tokens, while values < 1.0 encourage the model to repeat tokens. 1.0 means no penalty. |
| frequency_penalty | float (default: 0.0) | Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. |
| presence_penalty | float (default: 0.0) | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. |
| seed | integer (default: random) | Random seed for generation in range [0, 4294967295]. Omit to use a random seed (non-deterministic). Set explicitly to get reproducible output. Note: rng_seed set in generation_config.json is not honoured for multinomial sampling; only a per-request seed is applied. |
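
To make the min_p rule concrete, the sketch below applies the threshold described in the table to a toy distribution. It is an illustration of the filtering rule only, not the server's sampling code; the function name, the token labels, and the renormalisation step are assumptions.

```python
def min_p_filter(probs: dict, min_p: float) -> dict:
    """Drop tokens whose probability is below min_p * (top token probability)."""
    if not 0.0 <= min_p < 1.0:
        raise ValueError("min_p must be in [0.0, 1.0)")
    threshold = min_p * max(probs.values())
    kept = {tok: p for tok, p in probs.items() if p >= threshold}
    # Renormalise so the remaining probabilities sum to 1 (assumed here).
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


toy = {"a": 0.6, "b": 0.3, "c": 0.07, "d": 0.03}
print(sorted(min_p_filter(toy, 0.1)))   # threshold is 0.1 * 0.6
print(sorted(min_p_filter(toy, 0.0)))   # filter disabled
```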

Speculative decoding specific

Note that the parameters below are valid only for a speculative decoding pipeline. See the speculative decoding demo for details on how to prepare and serve such a pipeline.

| Param | Type | Description |
|---|---|---|
| num_assistant_tokens | int | Defines how many tokens the draft model should generate before the main model validates them. Cannot be used with assistant_confidence_threshold. |
| assistant_confidence_threshold | float | Determines the confidence level for continuing generation. If the draft model generates a token with confidence below this threshold, it stops generation for the current cycle and the main model starts validation. Cannot be used with num_assistant_tokens. |

Prompt lookup decoding specific

Note that the parameters below are valid only for a prompt lookup pipeline. To serve one, add "prompt_lookup": true to plugin_config in your graph config node options.

| Param | Type | Description |
|---|---|---|
| num_assistant_tokens | int | Number of candidate tokens proposed after an n-gram match is found. |
| max_ngram_size | int | The maximum n-gram size to use when looking for matches in the prompt. |
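
The role of the two parameters can be illustrated with a simplified version of the general prompt lookup idea: take the trailing n-gram of the token sequence, look for an earlier occurrence of it, and propose the tokens that followed. The sketch below is not the pipeline's actual implementation; the function name and the integer token IDs are hypothetical.

```python
def propose_candidates(tokens, max_ngram_size, num_assistant_tokens):
    """Propose continuation tokens by matching the trailing n-gram earlier in the sequence."""
    # Try the longest n-gram first, then fall back to shorter ones.
    for n in range(min(max_ngram_size, len(tokens) - 1), 0, -1):
        suffix = tokens[-n:]
        # Scan backwards so the most recent earlier occurrence wins.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == suffix:
                return tokens[start + n:start + n + num_assistant_tokens]
    return []


sequence = [1, 2, 3, 4, 5, 1, 2, 3]
print(propose_candidates(sequence, max_ngram_size=3, num_assistant_tokens=2))
```

In a real pipeline the proposed tokens would then be validated by the main model, in the same way draft tokens are validated in speculative decoding.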

Unsupported params from OpenAI Responses API:

  • instructions
  • previous_response_id
  • conversation
  • context_management
  • text
  • truncation
  • top_logprobs
  • include
  • store
  • metadata
  • parallel_tool_calls
  • max_tool_calls
  • background
  • prompt
  • prompt_cache_key
  • prompt_cache_retention
  • service_tier
  • safety_identifier
  • user

Response

| Param | Type | Description |
|---|---|---|
| id | string | A unique identifier for the response. OVMS uses timestamp-based IDs (e.g. resp-1716825108). |
| object | string | Always response. |
| created_at | integer | The Unix timestamp (in seconds) of when the response was created. |
| completed_at | integer | The Unix timestamp (in seconds) of when the response was completed. Only present when status is completed. |
| incomplete_details | object or null | Details about why the response is incomplete. Contains {"reason": "max_tokens"} when generation was truncated due to the token limit. null otherwise. |
| error | object or null | Error information. null when no error occurred. |
| model | string | The model used for the response. |
| status | string | completed or incomplete for unary requests; transitions from in_progress to completed/incomplete during streaming. |
| output | array | A list of output items. May include items of type message, function_call, or reasoning. See Output item types below. |
| output[].content[].text | string | The generated text content (for message type items). |
| output[].content[].annotations | array | Always an empty array (annotations not yet supported). |
| usage | object | Usage statistics: input_tokens, output_tokens, total_tokens. |
| tool_choice | string or object | Echoed back from the request. |
| tools | array | Echoed back from the request. |
| max_output_tokens | integer | Echoed back from the request (if set). |
| parallel_tool_calls ⚠️ | bool | Hardcoded to true in OVMS. |
| store ⚠️ | bool | Hardcoded to true in OVMS. |
| temperature ⚠️ | float | Echoed back from the request. Only included when explicitly provided. OpenAI always returns this field (default: 1.0). |
| text ⚠️ | object | Hardcoded to {"format": {"type": "text"}} in OVMS. |
| top_p ⚠️ | float | Echoed back from the request. Only included when explicitly provided. OpenAI always returns this field (default: 1.0). |
| truncation ⚠️ | string | Hardcoded to "disabled" in OVMS. |
| metadata ⚠️ | object | Hardcoded to {} in OVMS. |
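
Since a unary response can end as either completed or incomplete, a client typically needs to branch on status before using the output. A minimal sketch using only the fields listed above follows; describe_status is a hypothetical helper and the sample dictionaries are made up.

```python
def describe_status(response: dict) -> str:
    """Summarise how a response ended, based on status, usage and incomplete_details."""
    status = response.get("status")
    if status == "completed":
        usage = response.get("usage", {})
        return f"completed, {usage.get('output_tokens', 0)} output tokens"
    if status == "incomplete":
        # incomplete_details may be null, so guard before reading the reason.
        details = response.get("incomplete_details") or {}
        return f"incomplete ({details.get('reason', 'unknown')})"
    if response.get("error") is not None:
        return f"error: {response['error']}"
    return f"status: {status}"


print(describe_status({"status": "completed", "usage": {"output_tokens": 42}}))
print(describe_status({"status": "incomplete",
                       "incomplete_details": {"reason": "max_tokens"}}))
```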

Output item types

The output array may contain the following item types:

| Type | Description |
|---|---|
| message | A text message from the assistant. Contains id, type, role, status, and content array with output_text entries. |
| function_call | A tool/function call. Contains id, type, status, call_id, name, and arguments. Emitted when the model invokes a tool. |
| reasoning | Reasoning output (for models with thinking/reasoning enabled via chat_template_kwargs). Contains id, type, and summary array with summary_text entries. |
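
When tools are in use, a client usually needs to pick the function_call items out of the output array. The sketch below assumes that arguments arrives as a JSON-encoded string, as in the OpenAI API; the helper name, the get_weather tool and the sample values are hypothetical.

```python
import json


def collect_function_calls(response: dict) -> list:
    """Return call_id, name and decoded arguments for every function_call item."""
    calls = []
    for item in response.get("output", []):
        if item.get("type") != "function_call":
            continue  # ignore message and reasoning items
        calls.append({
            "call_id": item.get("call_id"),
            "name": item.get("name"),
            # Assumption: arguments is a JSON string; empty means no arguments.
            "arguments": json.loads(item.get("arguments") or "{}"),
        })
    return calls


sample_response = {
    "output": [
        {"id": "rs-0", "type": "reasoning", "summary": []},
        {
            "id": "fc-0",
            "type": "function_call",
            "status": "completed",
            "call_id": "call_1",
            "name": "get_weather",
            "arguments": "{\"city\": \"Paris\"}",
        },
    ]
}
print(collect_function_calls(sample_response))
```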

Unsupported response fields from OpenAI service:

  • instructions (echoed back)
  • output_text (convenience field)

Streaming events

When stream is set to true, the server emits server-sent events in the following order:

Standard text generation events

| Event | When emitted | Description |
|---|---|---|
| response.created | After execution is scheduled | Contains the full response object with status: "in_progress". |
| response.in_progress | When the model starts producing tokens | Signals that the response is actively being processed. Emitted as part of the first streaming chunk. |
| response.output_item.added | After response.in_progress | A new output item (message) has been initialized. Contains output_index and the item object. |
| response.content_part.added | After response.output_item.added | A new content part (output_text) has been initialized. Contains output_index, content_index, item_id and the part object. |
| response.output_text.delta | For each text chunk during generation | Contains the text delta, output_index, content_index, and item_id. May be emitted many times. |
| response.output_text.done | When text generation is finalized | Contains the full accumulated text. |
| response.content_part.done | After response.output_text.done | The content part is complete. Contains the final part object with full text. |
| response.output_item.done | After response.content_part.done | The output item is complete. Contains the final item object with status: "completed". |
| response.completed | Last event before [DONE] | Contains the full response object with status: "completed" and usage statistics. |
| response.incomplete | Last event before [DONE] (when truncated) | Emitted instead of response.completed when generation was stopped due to the max_output_tokens limit. Contains the response object with status: "incomplete" and incomplete_details. |
| response.failed | On error during generation | Contains the response object with status: "failed" and error details. |

Reasoning events (for models with thinking enabled)

When using models that support reasoning (e.g., via chat_template_kwargs: {"enable_thinking": true}), the following additional events may be emitted before the standard message events:

| Event | When emitted | Description |
|---|---|---|
| response.output_item.added | When reasoning begins | A reasoning output item (type: "reasoning") is added at output_index: 0. |
| response.reasoning_summary_part.added | After reasoning item added | A reasoning summary part has been initialized. Contains output_index, summary_index, and item_id. |
| response.reasoning_summary_text.delta | For each reasoning text chunk | Contains the reasoning text delta. |
| response.reasoning_summary_text.done | When reasoning is finalized | Contains the full accumulated reasoning text. |
| response.reasoning_summary_part.done | After reasoning text done | The reasoning summary part is complete. |
| response.output_item.done | After reasoning part done | The reasoning output item is complete. |

When reasoning is present, the subsequent message output item will have output_index: 1 instead of 0.

Function call events (for tool calling)

When the model generates tool/function calls, the following events are emitted (after reasoning events if present, before or instead of message events):

| Event | When emitted | Description |
|---|---|---|
| response.output_item.added | When a function call begins | A function call output item (type: "function_call") is added. Contains output_index and the item object with call_id, name, and empty arguments. |
| response.function_call_arguments.delta | For each arguments chunk | Contains the arguments text delta, item_id, output_index, and call_id. |
| response.function_call_arguments.done | When arguments are complete | Contains the full accumulated arguments. |
| response.output_item.done | After arguments done | The function call output item is complete. |

All events include a monotonically increasing sequence_number field.

The stream is terminated by a data: [DONE] message.
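
A client consuming the stream has to read the data: lines, stop at [DONE], and stitch the text deltas together. The sketch below works on already-received lines rather than a live connection. It assumes each data payload is a JSON object carrying type, delta and response fields as in the OpenAI streaming format; those field names, the helper name and the sample lines are assumptions rather than something stated on this page.

```python
import json


def accumulate_stream(lines):
    """Collect output text deltas and the final response object from SSE lines."""
    text_parts = []
    final_response = None
    last_seq = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        event = json.loads(data)

        # sequence_number is documented as monotonically increasing.
        seq = event.get("sequence_number")
        if seq is not None:
            if last_seq is not None and seq < last_seq:
                raise ValueError("sequence_number went backwards")
            last_seq = seq

        event_type = event.get("type")
        if event_type == "response.output_text.delta":
            text_parts.append(event.get("delta", ""))
        elif event_type in ("response.completed", "response.incomplete",
                            "response.failed"):
            final_response = event.get("response")
    return "".join(text_parts), final_response


sample_lines = [
    'data: {"type": "response.output_text.delta", "sequence_number": 4, "delta": "Open"}',
    '',
    'data: {"type": "response.output_text.delta", "sequence_number": 5, "delta": "VINO"}',
    '',
    'data: {"type": "response.completed", "sequence_number": 9, "response": {"status": "completed"}}',
    '',
    'data: [DONE]',
]
text, final = accumulate_stream(sample_lines)
print(text, final)
```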

NOTE: The OpenAI Python client supports a limited list of parameters. Parameters native to OpenVINO Model Server can be passed inside the generic container parameter extra_body. Below is an example of how to pass the top_k value this way.

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")
response = client.responses.create(
    model="llama3",
    input="What is OpenVINO?",
    max_output_tokens=100,
    extra_body={"top_k": 1},
    stream=False
)

References

LLM quick start guide

End to end demo with LLM model serving over OpenAI API

Code snippets

LLM calculator