OpenAI Responses API subset (/v1/responses)

May 18, 2026

mlxcel serve and mlxcel-server expose a Phase-1 subset of OpenAI's Responses API. Basic client.responses.create(...) and streaming flows can be used with the OpenAI Python SDK when base_url points at the mlxcel server, but this is not a full implementation of every OpenAI Responses feature.

Implementation source map:

| Module | Responsibility |
| --- | --- |
| src/server/types/responses_request.rs | Request types. |
| src/server/types/responses_response.rs | Response types. |
| src/server/types/responses_stream.rs | SSE event enum. |
| src/server/responses_translator.rs | Responses ↔ chat-completions translation. |
| src/server/responses_store.rs | In-memory response store. |
| src/server/conversation_store.rs | In-memory conversation transcript store. |
| src/server/routes/responses.rs | Route handlers. |

Implemented endpoints

| Method | Path | Description |
| --- | --- | --- |
| POST | /v1/responses | Create a response, either non-streaming or streaming. |
| GET | /v1/responses/{id} | Retrieve a stored response. |
| DELETE | /v1/responses/{id} | Delete a stored response. |
| POST | /v1/responses/{id}/cancel | Best-effort cancellation / cancellation marking. |

Aliases without /v1 are also mounted for the same implemented routes: /responses, /responses/{id}, and /responses/{id}/cancel.
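As a rough illustration of the per-response routes, the sketch below builds (but does not send) standard-library requests for retrieve, delete, and cancel. The helper name `build_response_requests` and the id `resp_example` are invented for the example; the base URL is the local address used in the Quickstart.

```python
import urllib.request


def build_response_requests(base_url: str, response_id: str) -> dict:
    """Build unsent Request objects for the per-response routes."""
    root = base_url.rstrip("/")
    item = f"{root}/responses/{response_id}"
    return {
        "retrieve": urllib.request.Request(item, method="GET"),
        "delete": urllib.request.Request(item, method="DELETE"),
        # The cancel route is a POST; an empty body is assumed here.
        "cancel": urllib.request.Request(f"{item}/cancel", data=b"", method="POST"),
    }


reqs = build_response_requests("http://127.0.0.1:8080/v1", "resp_example")
for name, req in reqs.items():
    print(name, req.get_method(), req.full_url)
```

Passing any of these objects to `urllib.request.urlopen` would send it to a running server. Dropping `/v1` from the base URL should target the alias routes instead.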

The following OpenAI-style surfaces are not mounted in this implementation:

  • GET /v1/responses/{id}/input_items
  • POST /v1/responses/compact
  • POST /v1/responses/input_tokens

Quickstart

from openai import OpenAI

# Point the OpenAI SDK at the local mlxcel server.
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="sk-local")

# Non-streaming request.
resp = client.responses.create(
    model="qwen3-0.6b-4bit",
    input="Reply with: hello",
    max_output_tokens=64,
)
print(resp.status)
print(resp.output_text)

# Streaming request: print each typed event, then read the assembled final response.
with client.responses.stream(
    model="qwen3-0.6b-4bit",
    input="Count to 5.",
    max_output_tokens=64,
) as stream:
    for event in stream:
        print(event.type, getattr(event, "delta", ""))
    final = stream.get_final_response()
    print(final.usage)

Supported request fields

| Field | Status | Notes |
| --- | --- | --- |
| model | required | Must match the loaded model alias/path accepted by the server. |
| input | supported | String or typed input item array. |
| instructions | supported | Prepended as a system-style message; not inherited through previous_response_id. |
| tools | function-only | Only {"type":"function", ...} is accepted. |
| tool_choice | supported subset | String or named function choice compatible with chat-completions tooling. |
| parallel_tool_calls | accepted | Forwarded to existing tool-call handling. |
| text.format | supported subset | text and json_schema shapes are handled through existing structured-output code. |
| reasoning | echoed/advisory | Recorded and echoed; model-specific thinking behavior remains template/runtime dependent. |
| conversation | supported | String id or { "id": "..." }; uses in-memory conversation store. |
| previous_response_id | supported | Rehydrates stored prior input/output items. Mutually exclusive with conversation. |
| store | supported | Defaults to true; false skips persistence. |
| stream | supported | Streams typed SSE events. |
| max_output_tokens | supported | Must be greater than zero. |
| max_tool_calls | supported | Soft cap on emitted function-call items. |
| temperature, top_p, top_logprobs | supported subset | Mapped to chat-completions sampling fields. |
| metadata | supported | Maximum 16 entries. |
| prompt_cache_key | accepted | Forwarded to prompt-cache plumbing. |
| user, safety_identifier | accepted | user is used when both are present; safety_identifier is used as a fallback. |
| background | rejected when true | Async polling is not implemented. |
| truncation | only disabled | Other values, including auto, return 400. |
| service_tier | accepted | Echoed/ignored; no scheduling tier is implemented. |
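Several of the constraints in this table can be checked on the client before a request is sent. The following is a minimal sketch of such a pre-flight check, not the server's validation code; the function name `preflight_errors` and the error wording are hypothetical.

```python
def preflight_errors(request: dict) -> list[str]:
    """Return problems that the documented field rules would reject."""
    errors = []
    if not request.get("model"):
        errors.append("model is required")
    max_tokens = request.get("max_output_tokens")
    if max_tokens is not None and max_tokens <= 0:
        errors.append("max_output_tokens must be greater than zero")
    if len(request.get("metadata") or {}) > 16:
        errors.append("metadata allows at most 16 entries")
    if request.get("background") is True:
        errors.append("background=true is not implemented")
    if request.get("truncation") not in (None, "disabled"):
        errors.append("truncation must be 'disabled'")
    if request.get("conversation") and request.get("previous_response_id"):
        errors.append("conversation and previous_response_id are mutually exclusive")
    return errors


ok = {"model": "qwen3-0.6b-4bit", "input": "hi", "max_output_tokens": 64}
bad = {
    "model": "qwen3-0.6b-4bit",
    "input": "hi",
    "max_output_tokens": 0,
    "truncation": "auto",
    "background": True,
}
print(preflight_errors(ok))
for problem in preflight_errors(bad):
    print("-", problem)
```

A check like this only mirrors the table; the server remains the authority on what it accepts.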

Input items

Phase 1 supports these typed items:

[
  {"type":"message", "role":"user", "content":"hello"},
  {"type":"message", "role":"system", "content":[{"type":"text", "text":"sys"}]},
  {"type":"function_call", "call_id":"call_abc", "name":"f", "arguments":"{}"},
  {"type":"function_call_output", "call_id":"call_abc", "output":"ok"},
  {"type":"reasoning", "content":[{"type":"reasoning_text", "text":"..."}]}
]

developer role is treated like system. Reasoning input items are accepted but are not fed back into the next prompt.

Message content parts reuse mlxcel's chat-completions content part types. This means text, image_url, video_url, and input_audio can deserialize, but actual execution still depends on the loaded model's media support. OpenAI Responses-specific input_image / input_file compatibility is not complete.
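To show how the typed items fit together, here is a small sketch that pairs a `function_call` item with its `function_call_output` for a follow-up request. The helper `tool_followup_input` is hypothetical, and JSON-encoding non-string results is an assumption; the source only shows a plain string `output`.

```python
import json


def tool_followup_input(function_call: dict, result) -> list:
    """Echo a function_call item and pair it with its function_call_output."""
    return [
        {
            "type": "function_call",
            "call_id": function_call["call_id"],
            "name": function_call["name"],
            "arguments": function_call["arguments"],
        },
        {
            "type": "function_call_output",
            "call_id": function_call["call_id"],
            # Assumption: non-string results are serialized to a JSON string.
            "output": result if isinstance(result, str) else json.dumps(result),
        },
    ]


call = {"type": "function_call", "call_id": "call_abc", "name": "f", "arguments": "{}"}
items = [{"type": "message", "role": "user", "content": "hello"}]
items += tool_followup_input(call, {"ok": True})
print(json.dumps(items, indent=2))
```

The resulting list can be passed as `input` in the next request; the matching `call_id` is what ties the output back to the call.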

Response shape

Responses use an OpenAI-like object shape:

{
  "id": "resp_...",
  "object": "response",
  "created_at": 1234.0,
  "completed_at": 1235.0,
  "status": "completed",
  "model": "...",
  "output": [
    {"type":"reasoning", "id":"rs_...", "status":"completed", "content":[...]},
    {"type":"function_call", "id":"fc_...", "call_id":"call_...", "name":"...", "arguments":"{}", "status":"completed"},
    {"type":"message", "id":"msg_...", "role":"assistant", "status":"completed", "content":[...]}
  ],
  "output_text": "...",
  "usage": {
    "input_tokens": 12,
    "output_tokens": 34,
    "total_tokens": 46,
    "input_tokens_details": {"cached_tokens": 0},
    "output_tokens_details": {"reasoning_tokens": 0}
  }
}

Several request fields are echoed back when present. Treat this as compatibility surface, not as proof that every echoed field changes runtime behavior.
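Given the shape above, a client working with the raw JSON might pull out function calls and token usage roughly as follows. This is a sketch over a hand-written sample dict; the helper names and the sample values (`get_weather`, `Oslo`) are invented, and only fields shown in the response shape are read.

```python
import json


def function_calls(response: dict) -> list:
    """Return (name, parsed_arguments, call_id) for each function_call output item."""
    calls = []
    for item in response.get("output", []):
        if item.get("type") == "function_call":
            calls.append((item["name"], json.loads(item["arguments"]), item["call_id"]))
    return calls


def usage_line(response: dict) -> str:
    """Format the token counts from the usage block."""
    usage = response["usage"]
    return f'{usage["input_tokens"]} in + {usage["output_tokens"]} out = {usage["total_tokens"]}'


sample = {
    "id": "resp_example",
    "object": "response",
    "status": "completed",
    "output": [
        {
            "type": "function_call",
            "id": "fc_example",
            "call_id": "call_example",
            "name": "get_weather",
            "arguments": '{"city": "Oslo"}',
            "status": "completed",
        }
    ],
    "output_text": "",
    "usage": {"input_tokens": 12, "output_tokens": 34, "total_tokens": 46},
}

print(function_calls(sample))
print(usage_line(sample))
```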

Streaming events

SSE frames are typed and include a monotonic sequence_number per response. Phase 1 emits events such as:

  • response.created
  • response.in_progress
  • response.output_item.added
  • response.content_part.added
  • response.output_text.delta
  • response.output_text.done
  • response.content_part.done
  • response.output_item.done
  • response.function_call_arguments.delta
  • response.function_call_arguments.done
  • response.reasoning_text.delta
  • response.reasoning_text.done
  • response.completed
  • failure/incomplete/error events on error paths
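For clients that read the stream without the SDK, the sketch below parses SSE frames and concatenates text deltas. It assumes the usual `event:` / `data:` framing with a JSON payload carrying `type`, `sequence_number`, and `delta`, and it treats "monotonic" as strictly increasing; both are assumptions, and the sample frames are hand-written rather than captured from a server.

```python
import json


def parse_sse(raw: str):
    """Yield the JSON payload of each SSE frame (frames end with a blank line)."""
    for frame in raw.split("\n\n"):
        data_lines = [
            line[5:].lstrip(" ") for line in frame.splitlines() if line.startswith("data:")
        ]
        if data_lines:
            yield json.loads("\n".join(data_lines))


def collect_text(events) -> str:
    """Concatenate output_text deltas and check that sequence numbers increase."""
    last_seq, parts = -1, []
    for event in events:
        seq = event["sequence_number"]
        if seq <= last_seq:
            raise ValueError(f"sequence_number did not increase: {seq}")
        last_seq = seq
        if event["type"] == "response.output_text.delta":
            parts.append(event["delta"])
    return "".join(parts)


raw = (
    'event: response.created\n'
    'data: {"type": "response.created", "sequence_number": 0}\n\n'
    'event: response.output_text.delta\n'
    'data: {"type": "response.output_text.delta", "sequence_number": 1, "delta": "Hel"}\n\n'
    'event: response.output_text.delta\n'
    'data: {"type": "response.output_text.delta", "sequence_number": 2, "delta": "lo"}\n\n'
    'event: response.completed\n'
    'data: {"type": "response.completed", "sequence_number": 3}\n\n'
)
print(collect_text(parse_sse(raw)))
```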

Response and conversation stores

The stores are in memory and are bounded by entry count and TTL.

| Flag | Default | Env var | Notes |
| --- | --- | --- | --- |
| --responses-store-max-entries | 1024 | LLAMA_ARG_RESPONSES_STORE_MAX_ENTRIES | 0 disables response persistence. |
| --responses-store-ttl-secs | 3600 | LLAMA_ARG_RESPONSES_STORE_TTL_SECS | 0 disables TTL. |
| --conversation-store-max-entries | 256 | LLAMA_ARG_CONVERSATION_STORE_MAX_ENTRIES | 0 disables conversations. |
| --conversation-store-ttl-secs | 3600 | LLAMA_ARG_CONVERSATION_STORE_TTL_SECS | 0 disables TTL. |

When response storage is disabled, retrieve/delete/cancel-by-id and previous_response_id chaining return an error. When conversation storage is disabled, requests using conversation return an error.
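The interaction between the entry cap, the TTL, and the "0 disables" settings can be pictured with a toy model. The class below is an illustrative Python sketch, not the Rust store in src/server/responses_store.rs; in particular, evicting the oldest entry first is an assumption, since the eviction policy is not stated here.

```python
import time
from collections import OrderedDict


class BoundedTtlStore:
    """Toy in-memory store bounded by entry count and TTL (in seconds)."""

    def __init__(self, max_entries=1024, ttl_secs=3600, clock=time.monotonic):
        self.max_entries, self.ttl_secs, self.clock = max_entries, ttl_secs, clock
        self._items = OrderedDict()  # key -> (stored_at, value)

    def put(self, key, value) -> bool:
        if self.max_entries == 0:
            return False  # max_entries == 0 means persistence is off
        self._items[key] = (self.clock(), value)
        self._items.move_to_end(key)
        while len(self._items) > self.max_entries:
            self._items.popitem(last=False)  # assumed policy: evict oldest first
        return True

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        # ttl_secs == 0 disables expiry entirely.
        if self.ttl_secs > 0 and self.clock() - stored_at > self.ttl_secs:
            del self._items[key]
            return None
        return value


now = [0.0]  # fake clock so the example does not need to sleep
store = BoundedTtlStore(max_entries=2, ttl_secs=10, clock=lambda: now[0])
for key in ("resp_a", "resp_b", "resp_c"):
    store.put(key, {"id": key})
print(store.get("resp_a"))  # None: pushed out by the entry cap
now[0] = 11.0
print(store.get("resp_c"))  # None: older than ttl_secs
```

In this model a lookup that returns `None` corresponds to the error cases described above, such as chaining from a response that has been evicted or was never stored.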

Chaining semantics

  • previous_response_id loads the stored response's input items and output items as prior conversation history, then appends the new input.
  • conversation loads and appends to an in-memory transcript by id.
  • The two fields are mutually exclusive.
  • instructions from the referenced prior response are not carried over.
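The `previous_response_id` rule can be sketched as a simple concatenation. The stored-record layout used here (keys `instructions`, `input`, `output`) and the function name `build_history` are hypothetical stand-ins for whatever the response store actually keeps.

```python
def build_history(stored: dict, new_items: list) -> list:
    """Prior input items, then prior output items, then the new input."""
    # stored["instructions"] is deliberately not read: it is not carried over.
    return list(stored["input"]) + list(stored["output"]) + list(new_items)


stored = {
    "instructions": "Answer briefly.",
    "input": [{"type": "message", "role": "user", "content": "hello"}],
    "output": [
        {"type": "function_call", "call_id": "call_abc", "name": "f", "arguments": "{}"}
    ],
}
new_items = [{"type": "function_call_output", "call_id": "call_abc", "output": "ok"}]

history = build_history(stored, new_items)
print([item["type"] for item in history])
```

Because instructions are not inherited, a client that wants the same system-style guidance on every turn would need to resend `instructions` with each request.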

Unsupported tool types

Only function tools are accepted. Built-in/external tool types such as web_search, file_search, computer_use_preview, code_interpreter, image_generation, mcp, custom, apply_patch, and function_shell return 400 responses. mlxcel does not execute external tools for the Responses API.
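A client that shares tool lists across providers may want to filter them before calling mlxcel. The sketch below splits a list into function tools and everything else; `split_tools` is a hypothetical helper, and the `name` / `parameters` fields on the function tool follow OpenAI's published function-tool shape, which is assumed rather than confirmed for mlxcel.

```python
def split_tools(tools: list) -> tuple:
    """Separate function tools from tool types that would be rejected with 400."""
    accepted = [tool for tool in tools if tool.get("type") == "function"]
    rejected = [tool.get("type") for tool in tools if tool.get("type") != "function"]
    return accepted, rejected


tools = [
    {
        "type": "function",
        "name": "get_weather",  # illustrative name, not from the source
        "parameters": {"type": "object", "properties": {"city": {"type": "string"}}},
    },
    {"type": "web_search"},
    {"type": "mcp"},
]

accepted, rejected = split_tools(tools)
print(len(accepted), "accepted; rejected types:", rejected)
```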

Differences from OpenAI's full API

Notable gaps:

  • no background job mode;
  • no input-items pagination endpoint;
  • no server-side compaction endpoint;
  • no token-count endpoint;
  • no built-in tools or MCP connector execution;
  • no disk-persisted response store;
  • incomplete OpenAI Responses multimodal part compatibility.