How to use the Apple Foundation Model from Python

April 15, 2026

Call Apple's on-device Foundation Model from Python using the official openai SDK, pointed at a local apfel --serve. 100% on-device, zero API cost, no network required for inference.

This guide shows the canonical patterns: one-shot completion, streaming, JSON mode, error handling, tool calling, and a real text-summarization example. Every code block was run against a live apfel server; the output below each snippet is the real unedited stdout.

Runnable scripts + tests: Arthur-Ficial/apfel-guides-lab/scripts/python.

Prerequisites

  • macOS 26+ Tahoe, Apple Silicon, Apple Intelligence enabled
  • brew install apfel
  • apfel --serve running (default port 11434)
  • Python 3.11+
  • pip install openai (or uv add openai)
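
Before running any of the snippets, it can help to confirm that something is listening on the default port. The check below is a minimal sketch using only the standard library. The helper name server_is_up is hypothetical, and a successful TCP connect only shows that some process accepts connections on that port, not that it is apfel.

```python
import socket


def server_is_up(host="localhost", port=11434, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection tries every address the host name resolves to.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False


print("port 11434 is open" if server_is_up() else "nothing on 11434 - run apfel --serve first")
```

If the check fails, start apfel --serve in another terminal and run it again.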

1. One-shot chat completion

Point the openai SDK at your local apfel server and call chat.completions.create:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="apple-foundationmodel",
    messages=[
        {"role": "user", "content": "In one sentence, what is the Swift programming language?"},
    ],
    max_tokens=80,
)

print((response.choices[0].message.content or "").strip())

Real output:

Swift is a modern, high-performance, and safe programming language developed by Apple for developing iOS, macOS, watchOS, and tvOS applications.

Lab script: 01_oneshot.py.

2. Streaming

Pass stream=True and iterate. Guard against empty choices on the final usage chunk:

import sys
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="apple-foundationmodel",
    messages=[{"role": "user", "content": "List three Apple silicon chips, one per line."}],
    max_tokens=80,
    stream=True,
)

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content or ""
    sys.stdout.write(delta)
    sys.stdout.flush()
print()

Real output:

Apple M1  
Apple M2  
Apple M3

Lab script: 02_stream.py.
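
If you need the full text as a string rather than live terminal output, the same guard carries over. The sketch below is not part of the lab scripts: collect_stream and fake_chunk are hypothetical names, and the chunks are simulated with SimpleNamespace objects shaped like the SDK's stream chunks so the logic can be tried without a server.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        # The final usage chunk carries no choices; skip it.
        if not chunk.choices:
            continue
        # delta.content can be None, so fall back to an empty string.
        parts.append(chunk.choices[0].delta.content or "")
    return "".join(parts)


def fake_chunk(text):
    """Hypothetical stand-in shaped like one streamed chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


# Simulated stream: text, a None delta, more text, then an empty-choices chunk.
demo = [
    fake_chunk("Apple M1\n"),
    fake_chunk(None),
    fake_chunk("Apple M2"),
    SimpleNamespace(choices=[]),
]
print(collect_stream(demo))
```

With a live server, pass the stream object returned by chat.completions.create(..., stream=True) instead of demo.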

3. JSON mode / structured output

Request response_format={"type": "json_object"} and parse the reply. apfel may wrap the JSON in markdown fences; the fence-strip regex below handles both fenced and bare output:

import json, re
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="apple-foundationmodel",
    messages=[{
        "role": "user",
        "content": "Return JSON with fields 'chip', 'year', 'cores'. Describe the Apple M1 chip. Return ONLY JSON.",
    }],
    response_format={"type": "json_object"},
    max_tokens=120,
)

raw = (response.choices[0].message.content or "").strip()
raw = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw, flags=re.MULTILINE).strip()
data = json.loads(raw)
print(json.dumps(data, indent=2, sort_keys=True))

Real output:

{
  "chip": "Apple M1",
  "cores": 8,
  "year": 2020
}

Lab script: 03_json.py.
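
The fence-strip step can be pulled into a small helper and exercised without a server. This is a sketch built on the same regex as above. The name parse_model_json is hypothetical, and the sample strings are made up for illustration rather than captured from apfel.

```python
import json
import re

# Three backticks, built from a single one to keep the pattern easy to read.
FENCE = "`" * 3

# Same pattern as in the snippet above: an opening fence (optionally tagged
# json) at the start of a line, or a closing fence at the end of a line.
FENCE_RE = re.compile(rf"^{FENCE}(?:json)?\s*|\s*{FENCE}$", re.MULTILINE)


def parse_model_json(raw: str):
    """Strip optional markdown fences, then parse the remainder as JSON."""
    cleaned = FENCE_RE.sub("", raw.strip()).strip()
    return json.loads(cleaned)


# Made-up samples: one fenced reply, one bare reply.
fenced = FENCE + 'json\n{"chip": "Apple M1", "cores": 8}\n' + FENCE
bare = '{"year": 2020}'

print(parse_model_json(fenced))
print(parse_model_json(bare))
```

json.loads still raises JSONDecodeError if the model returns something that is not JSON at all, so callers may want to catch that separately.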

4. Error handling

apfel returns honest HTTP errors for unsupported features. Embeddings, for example, return 501, which the SDK raises as APIStatusError:

from openai import APIStatusError, OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

try:
    client.embeddings.create(
        model="apple-foundationmodel",
        input="apfel runs 100% on-device.",
    )
except APIStatusError as e:
    print(f"Got expected error: HTTP {e.status_code} - {e.message}")

Real output:

Got expected error: HTTP 501 - Error code: 501 - {'error': {'message': "Embeddings not supported by Apple's on-device model.", 'type': 'invalid_request_error'}}

Lab script: 04_errors.py.

5. Tool calling

Define a tool schema, send a prompt, handle the tool call, post the result, get the final answer:

import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current temperature in Celsius for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string", "description": "City name"}},
            "required": ["city"],
        },
    },
}]


def get_weather(city: str, **_: object) -> str:
    fake = {"Vienna": 14, "Cupertino": 19, "Tokyo": 11}
    return json.dumps({"city": city, "temp_c": fake.get(city, 15)})


messages = [{"role": "user", "content": "What is the temperature in Vienna right now?"}]

first = client.chat.completions.create(
    model="apple-foundationmodel", messages=messages, tools=TOOLS, max_tokens=256,
)

msg = first.choices[0].message
messages.append(msg.model_dump(exclude_none=True))

if msg.tool_calls:
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        result = get_weather(**args)
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})

    final = client.chat.completions.create(
        model="apple-foundationmodel", messages=messages, max_tokens=120,
    )
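    print((final.choices[0].message.content or "").strip())
else:
    # No tool call: the model answered (or declined) directly.
    print((msg.content or "").strip())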

Real output:

The current temperature in Vienna is 14°C.

Lab script: 05_tools.py.
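
With more than one tool, the "handle the tool call" step usually becomes a lookup by name. The sketch below shows one way to do that with a plain dictionary. TOOL_REGISTRY and run_tool are hypothetical names, not part of apfel or the openai SDK, and the error payload shape is an assumption chosen for illustration.

```python
import json


def get_weather(city: str, **_: object) -> str:
    """Same fake lookup as in the example above."""
    fake = {"Vienna": 14, "Cupertino": 19, "Tokyo": 11}
    return json.dumps({"city": city, "temp_c": fake.get(city, 15)})


# Hypothetical registry: tool name -> Python callable.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool(name: str, arguments: str) -> str:
    """Run one tool call and always return a JSON string for the tool message."""
    func = TOOL_REGISTRY.get(name)
    if func is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        args = json.loads(arguments)
    except json.JSONDecodeError:
        return json.dumps({"error": "arguments were not valid JSON"})
    if not isinstance(args, dict):
        return json.dumps({"error": "arguments must be a JSON object"})
    try:
        return func(**args)
    except TypeError as exc:
        # Raised when a required argument such as city is missing.
        return json.dumps({"error": f"bad arguments: {exc}"})


print(run_tool("get_weather", '{"city": "Vienna"}'))
print(run_tool("get_time", "{}"))
```

In the loop above, result = run_tool(call.function.name, call.function.arguments) would replace the direct get_weather(**args) call, so a malformed or unknown call produces an error message for the model instead of an exception.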

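6. Real example - summarize a file from stdin

Read the text from stdin, ask for a one-paragraph summary via a system prompt, and print the result: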

import sys
from openai import OpenAI

text = sys.stdin.read().strip()
if not text:
    sys.exit("usage: cat file.txt | python 06_example.py")

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="apple-foundationmodel",
    messages=[
        {"role": "system", "content": "You are a concise summarizer. Reply with one short paragraph."},
        {"role": "user", "content": f"Summarize:\n\n{text}"},
    ],
    max_tokens=150,
)
print((response.choices[0].message.content or "").strip())

Run it by piping a file in:

cat README.md | python 06_example.py

Real output (summarizing a paragraph about the M1 chip):

The Apple M1 chip, released in November 2020, was Apple's first ARM-based system-on-a-chip for Mac computers. It features an 8-core CPU with four performance and four efficiency cores, plus an integrated GPU with up to 8 cores. The chip combines CPU, GPU, memory, and neural engine on a single die, delivering significant performance-per-watt improvements over the Intel chips it replaced.

Lab script: 06_example.py.

Troubleshooting

  • Connection refused on port 11434 - run apfel --serve first.
  • Embeddings not supported - apfel is text-only; use sentence-transformers or another embedder for vectors.
  • JSONDecodeError in JSON mode - keep the fence-strip regex; apfel sometimes wraps JSON in ```json ... ```.
  • Empty streaming output - make sure your client handles the final usage chunk with empty choices. The if not chunk.choices: continue above covers it.
  • Model refuses a tool call - small on-device models occasionally decline. Retry the whole call.
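
"Retry the whole call" can be wrapped in a small helper. The sketch below is generic and stdlib-only. The name retry, the default of three attempts, and the half-second delay are assumptions for illustration, not values recommended by apfel.

```python
import time


def retry(call, attempts=3, delay=0.5, accept=bool):
    """Call call() up to `attempts` times until accept(result) is true.

    Returns the first accepted result, or the last result if none passed.
    """
    result = None
    for attempt in range(attempts):
        result = call()
        if accept(result):
            return result
        # Pause between attempts, but not after the last one.
        if attempt < attempts - 1:
            time.sleep(delay)
    return result


# Simulated model that declines twice, then produces a tool call.
outcomes = iter([None, None, "tool call"])
print(retry(lambda: next(outcomes), attempts=3, delay=0))
```

For the tool-calling example, call would wrap the first chat.completions.create request and accept would check whether response.choices[0].message.tool_calls is non-empty.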

Tested with

  • apfel v1.0.3
  • macOS 26.3.1, Apple Silicon
  • Python 3.11 / openai 2.31.0
  • Date: 2026-04-16

Full runnable test suite + captured outputs: apfel-guides-lab/tests/test_python.py.

See also