Runtime, Providers, and Setup
May 29, 2026
This doc covers first-run setup, local runtime management, and online-provider routing.
Primary Files
- AC/Core/AppController+RuntimeSetup.swift
- AC/Services/RuntimeSetupService.swift
- AC/Services/DependencyInstallerService.swift
- AC/Services/LocalModelRuntime.swift
- AC/Services/OnlineModelService.swift
- AC/Services/OnlineProviderRouting.swift
- AC/Services/PromoRedemptionService.swift
- ACShared/AITier.swift
- AC/UI/OnboardingDialogView.swift
- AC/UI/OnboardingWizardView.swift
- AC/UI/Settings/AITab.swift
Local Runtime
The local path is llama.cpp plus managed model artifacts.
Key facts:
- runtime repo remote: https://github.com/ggml-org/llama.cpp.git
- pinned commit lives in RuntimeSetupService.pinnedLlamaCommit
- preferred install root is under ~/Library/Application Support/AC/runtime
- legacy installs under ~/accountycat are still detected
RuntimeSetupService owns:
- runtime diagnostics
- free-disk-space checks
- clone / fetch / checkout / build
- managed Hugging Face cache paths
- warm-up and readiness polling
- cleanup helpers for managed models
Build-tool PATH for setup subprocesses
A GUI app launched from Finder inherits launchd's minimal PATH
(/usr/bin:/bin:/usr/sbin:/sbin), which omits the Homebrew prefixes
(/opt/homebrew/bin, /usr/local/bin) where users install cmake/ninja. AC
finds the tools itself via resolvedToolPath (which checks those prefixes
explicitly), but the build subprocesses it spawns do not. So AC's lookup could
locate ninja while cmake, running under the minimal PATH, still failed at
configure time with "CMake unable to find a build program corresponding to
Ninja". runStreaming therefore prepends
the standard tool prefixes to the subprocess PATH (augmentedSubprocessPATH),
and installRuntime additionally passes the resolved ninja path to cmake via
-DCMAKE_MAKE_PROGRAM. This only reproduces in a packaged .app launched from
Finder; running from Xcode masks it because Xcode passes the developer's shell
PATH.
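The PATH augmentation described above can be sketched as a pure function. This is a minimal illustration, not the actual augmentedSubprocessPATH implementation: the names augmentedPATH, subprocessEnvironment, and standardToolPrefixes are hypothetical, and the prefix list is assumed to be just the two Homebrew locations named above.

```swift
import Foundation

/// Prefixes where Homebrew installs cmake/ninja (Apple silicon, then Intel).
/// Assumed list; the real one may contain more entries.
let standardToolPrefixes = ["/opt/homebrew/bin", "/usr/local/bin"]

/// Returns `basePATH` with the standard tool prefixes prepended,
/// skipping any prefix that is already present.
func augmentedPATH(from basePATH: String) -> String {
    let existing = basePATH.split(separator: ":").map(String.init)
    let missing = standardToolPrefixes.filter { !existing.contains($0) }
    return (missing + existing).joined(separator: ":")
}

/// Builds a subprocess environment from the parent's, replacing only PATH.
/// Falls back to launchd's minimal PATH when the parent has none.
func subprocessEnvironment(parent: [String: String]) -> [String: String] {
    var environment = parent
    let base = parent["PATH"] ?? "/usr/bin:/bin:/usr/sbin:/sbin"
    environment["PATH"] = augmentedPATH(from: base)
    return environment
}
```

The result would typically be assigned to Process.environment before launching git, cmake, or llama-cli, so every child sees the same tool prefixes that the in-app lookup checked.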
Model download authentication (HF token)
Models download via llama-cli -hf <id>, which is a single-stream libcurl fetch from the
Hugging Face CDN. Hugging Face's response headers explicitly ask for a token "to enable
higher rate limits and faster downloads," and rate-limits unauthenticated traffic.
HuggingFaceTokenService.fetchToken() fetches a token from
https://accountycat.com/api/hf-token (override with AC_HF_TOKEN_ENDPOINT), caches it in
memory for an hour, and returns nil on any failure. installRuntime fetches it and passes
it to warmUpRuntime, which only sets HF_TOKEN in the llama.cpp env when non-empty. No
token → unauthenticated download, identical to the prior behavior — a server outage or
missing key never blocks setup.
Server contract: a 200 JSON body { "token": "hf_..." } enables auth; anything else
(including 204) → unauthenticated fallback. The app sends an X-AC-Install header (the
per-install id) so the endpoint can rate-limit abuse via Upstash if desired.
The token must be fine-grained, read-only, public-repos-only. It is effectively public (any user can extract it from the endpoint), so it guards only access to files that are already public — a leak exposes nothing private; worst case the server rotates it. Never use a write-scoped or account-wide token here.
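A minimal sketch of the fail-open contract, assuming the response body is exactly { "token": "..." }. The names TokenResponse, parseToken, and llamaEnvironment are hypothetical and do not mirror HuggingFaceTokenService; networking and the one-hour cache are left out.

```swift
import Foundation

/// Assumed response shape: { "token": "hf_..." }.
struct TokenResponse: Decodable {
    let token: String
}

/// Returns a token only for a 200 response with a non-empty `token` field.
/// Every other outcome (204, bad JSON, empty token) maps to nil,
/// which means an unauthenticated download.
func parseToken(statusCode: Int, body: Data) -> String? {
    guard statusCode == 200,
          let response = try? JSONDecoder().decode(TokenResponse.self, from: body),
          !response.token.isEmpty
    else { return nil }
    return response.token
}

/// Adds HF_TOKEN to the subprocess environment only when a token exists.
func llamaEnvironment(base: [String: String], token: String?) -> [String: String] {
    var environment = base
    if let token = token, !token.isEmpty {
        environment["HF_TOKEN"] = token
    }
    return environment
}
```

Because every failure collapses to nil, callers need no error handling: setup proceeds the same way whether or not the endpoint is reachable.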
Measured download throughput is logged to the activity log as setup-download-speed
(X MB in Ys (Z MB/s) [authenticated|unauthenticated]) on real downloads only, so we can
tell in the wild whether users are throttle-capped or pipe-limited. Note: unauthenticated
speed is highly variable (measured 1.4–6.8 MB/s on the same file/connection), so the token is
a cheap insurance/rate-limit win, not a guaranteed speedup — let the telemetry decide whether
it's worth deeper investment (e.g. a self-hosted mirror).
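For reference, a log line in the shape described above could be produced roughly like this. It is a sketch under assumptions: downloadSpeedSummary is a hypothetical name, MB is taken as 1,000,000 bytes, and one decimal place is an arbitrary choice.

```swift
import Foundation

/// Formats "X MB in Ys (Z MB/s) [authenticated|unauthenticated]".
/// Returns nil when nothing was downloaded, so cache hits are not logged.
func downloadSpeedSummary(bytes: Int64, seconds: Double, authenticated: Bool) -> String? {
    guard bytes > 0, seconds > 0 else { return nil }
    let megabytes = Double(bytes) / 1_000_000  // assumed decimal megabytes
    let rate = megabytes / seconds
    let mode = authenticated ? "authenticated" : "unauthenticated"
    let numbers = String(format: "%.1f MB in %.1fs (%.1f MB/s)", megabytes, seconds, rate)
    return "\(numbers) [\(mode)]"
}
```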
Setup Guardrails
When changing first run or setup, preserve all of these:
- free-disk-space verification before large writes
- cleanup of interrupted downloads / partial state
- user-readable subprocess failures
- an explicit "setup is done" signal
- cancellable installs: switching away from Local (or to another tier whose model is already present) must actually stop an in-flight install, not just orphan it
Setup bugs are high-impact because they block the whole product.
Install cancellation
RuntimeSetupService.runStreaming wraps the subprocess wait in
withTaskCancellationHandler and calls process.terminate() on cancel, then throws
CancellationError. Swift task cancellation is cooperative, so without the explicit
terminate the git / cmake / llama-cli subprocess would keep running to completion
after the user switched away — leaving installingRuntime == true, which pins
setupStatus at .installing until an app restart.
AppController.cancelRuntimeInstall() is the single entry point that cancels the install
task, stops byte-progress polling, and resets installingRuntime + all setupProgress* /
setupDownloaded/TotalBytes state. It is invoked from updateMonitoringInferenceBackend
(switch to OpenRouter) and the local-tier-change path in applyTierToActiveBackend. The
install task's own cancellation branch deliberately leaves these flags for the canceller
to reset.
Terminating mid-download leaves llama.cpp's *.downloadInProgress blobs in place; the
next Local install resumes them via range requests rather than restarting from zero.
Download progress
Progress is reported two ways, with byte-level data preferred:
- AppController.startDownloadProgressPolling(modelIdentifier:) polls RuntimeSetupService.downloadedModelBytes(for:) (sum of blobs/ file sizes, including *.downloadInProgress partials) every ~0.6s, and fetches the expected total once from the Hugging Face tree API via RuntimeSetupService.expectedDownloadBytes(for:) (best-effort; nil on failure). This drives a determinate bar and the "X of Y" display in AITab.
- AppController.updateSetupProgress(from:) scrapes percentages from subprocess log lines as a fallback, but defers to byte polling whenever a real total is known (setupTotalBytes != nil) so the two don't fight.
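The precedence between the two progress sources can be sketched as a single resolver. The SetupProgress type and resolveProgress function are hypothetical; they only illustrate the rule that a known byte total wins over scraped percentages.

```swift
/// Which progress source is driving the UI.
enum SetupProgress: Equatable {
    case determinate(fraction: Double)  // byte polling with a known total
    case scraped(percent: Double)       // fallback parsed from log lines
    case indeterminate                  // nothing usable yet
}

/// Prefers byte-level data whenever a real total is known,
/// so scraped percentages never overwrite it.
func resolveProgress(downloadedBytes: Int64,
                     totalBytes: Int64?,
                     scrapedPercent: Double?) -> SetupProgress {
    if let total = totalBytes, total > 0 {
        let raw = Double(downloadedBytes) / Double(total)
        return .determinate(fraction: min(1.0, max(0.0, raw)))
    }
    if let percent = scrapedPercent {
        return .scraped(percent: percent)
    }
    return .indeterminate
}
```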
Surfacing failures
setupErrorMessage is rendered in both OnboardingDialogView and the AITab
local-model section (with a "Try again" button). A failure after onboarding must not be
silent — the AI tab is where post-onboarding local-model management lives.
Local Runtime Request Coordination
LocalModelRuntime runs a single shared llama.cpp server process for all inference (monitoring and chat).
Two request counters gate concurrent access:
- activeSharedServerRequests: incremented for every in-flight server request (monitoring or chat)
- activeInteractiveRequests: incremented only for user-facing chat requests via withInteractiveRequest { }
When the server needs to be reconfigured (different model or capacity), LocalModelRuntime waits up to 60 seconds for activeSharedServerRequests to reach zero before stopping the old server. This prevents mid-request kills and double-RAM fallbacks.
BrainService reads hasInteractiveRequestInFlight() at the start of each monitoring tick and defers evaluation when a chat request is in flight (see monitoring-pipeline.md, "Deterministic Gates").
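A simplified, synchronous sketch of the two-counter scheme. The RequestCounters class, withServerRequest, and isIdle are hypothetical names; the real LocalModelRuntime is presumably asynchronous and also implements the 60-second drain wait, which is omitted here.

```swift
import Foundation

final class RequestCounters {
    private let lock = NSLock()
    private var activeSharedServerRequests = 0
    private var activeInteractiveRequests = 0

    /// Monitoring-style request: counts toward the shared total only.
    func withServerRequest<T>(_ body: () throws -> T) rethrows -> T {
        adjust(shared: 1, interactive: 0)
        defer { adjust(shared: -1, interactive: 0) }
        return try body()
    }

    /// Chat request: counts toward both totals.
    func withInteractiveRequest<T>(_ body: () throws -> T) rethrows -> T {
        adjust(shared: 1, interactive: 1)
        defer { adjust(shared: -1, interactive: -1) }
        return try body()
    }

    /// Checked by the monitoring tick before it evaluates.
    func hasInteractiveRequestInFlight() -> Bool {
        lock.lock(); defer { lock.unlock() }
        return activeInteractiveRequests > 0
    }

    /// True when the server could be stopped without killing a request.
    func isIdle() -> Bool {
        lock.lock(); defer { lock.unlock() }
        return activeSharedServerRequests == 0
    }

    private func adjust(shared: Int, interactive: Int) {
        lock.lock(); defer { lock.unlock() }
        activeSharedServerRequests += shared
        activeInteractiveRequests += interactive
    }
}
```

Using defer for the decrement keeps the counters balanced even when the request body throws, which matters because a leaked count would block every later reconfiguration.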
Shared llama-server defaults
LocalModelRuntime keeps one shared llama-server alive for local monitoring and chat. The shared server is launched with a fixed performance-oriented flag set that does not participate in capacity reuse decisions:
- -ngl 999 to offload all supported layers to Metal
- --threads <perf-core-count> where the count comes from hw.perflevel0.physicalcpu when available, else ProcessInfo.processInfo.activeProcessorCount
- --cache-type-k q8_0
- --cache-type-v q8_0
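The flag set can be assembled as below. This is a sketch: sharedServerFlags is a hypothetical name, and the performance-core count is passed in as a parameter rather than read from sysctl so the fallback rule stays visible and testable.

```swift
/// Builds the fixed performance flags for the shared llama-server.
/// `perfCoreCount` stands in for hw.perflevel0.physicalcpu and is nil
/// when that sysctl is unavailable (for example on Intel Macs).
func sharedServerFlags(perfCoreCount: Int?, activeProcessorCount: Int) -> [String] {
    let threads: Int
    if let cores = perfCoreCount, cores > 0 {
        threads = cores
    } else {
        threads = activeProcessorCount  // fallback named in the list above
    }
    return [
        "-ngl", "999",
        "--threads", String(threads),
        "--cache-type-k", "q8_0",
        "--cache-type-v", "q8_0",
    ]
}
```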
Prompt caching is enabled on every shared-server request via cache_prompt: true. AC currently uses a single shared slot and relies on the fact that the monitoring/system prompts are prefix-stable, so no explicit slot management is needed.
KV-cache quantization is a memory-saving tradeoff for local monitoring: q8 K/V materially lowers cache RAM pressure while keeping quality stable on the small Q4/Q5 local models AC targets for v1.0.
Local prompt budgets
The default staged local runtime profile in ACShared/ACPromptSets.swift currently uses:
| Stage | ctxSize | batchSize | ubatchSize |
|---|---|---|---|
| perception_vision | 6144 | 2048 | 512 |
| perception_title | 2048 | 512 | 256 |
| decision | 3072 | 512 | 256 |
| online_decision | 3072 | 512 | 256 |
| nudge_copy | 2048 | 512 | 256 |
| appeal_review | 2048 | 512 | 256 |
| policy_memory | 3072 | 512 | 256 |
| safelist_appeal | 2048 | 768 | 384 |
The shared server therefore runs at the largest requested capacity (ctxSize = 6144 for the vision stage) and smaller text stages reuse that server without forcing a restart.
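The capacity rule follows directly from the table. In this sketch, StageBudget, sharedServerContextSize, and needsRestart are hypothetical names, and the reuse check compares ctxSize only, which is an assumption about how reuse is decided.

```swift
struct StageBudget {
    let name: String
    let ctxSize: Int
    let batchSize: Int
    let ubatchSize: Int
}

/// Values copied from the table above.
let stageBudgets = [
    StageBudget(name: "perception_vision", ctxSize: 6144, batchSize: 2048, ubatchSize: 512),
    StageBudget(name: "perception_title", ctxSize: 2048, batchSize: 512, ubatchSize: 256),
    StageBudget(name: "decision", ctxSize: 3072, batchSize: 512, ubatchSize: 256),
    StageBudget(name: "online_decision", ctxSize: 3072, batchSize: 512, ubatchSize: 256),
    StageBudget(name: "nudge_copy", ctxSize: 2048, batchSize: 512, ubatchSize: 256),
    StageBudget(name: "appeal_review", ctxSize: 2048, batchSize: 512, ubatchSize: 256),
    StageBudget(name: "policy_memory", ctxSize: 3072, batchSize: 512, ubatchSize: 256),
    StageBudget(name: "safelist_appeal", ctxSize: 2048, batchSize: 768, ubatchSize: 384),
]

/// The shared server is sized for the most demanding stage.
func sharedServerContextSize(for stages: [StageBudget]) -> Int {
    return stages.map { $0.ctxSize }.max() ?? 0
}

/// A stage forces a restart only if it asks for more than the server has.
func needsRestart(currentContextSize: Int, requested: StageBudget) -> Bool {
    return requested.ctxSize > currentContextSize
}
```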
Prompt overflow guard
PromptBudgetGuard sits in front of local shared-server requests. It:
- estimates prompt size heuristically from text length plus vision tile count
- prefers preserving the rendered text payload intact: when the heuristic says a request is too large, AC first grows the per-request shared-server context budget (up to a bounded ceiling) instead of immediately trimming text
- on vision requests, progressively reduces the image max dimension before falling back to text truncation
- optionally calls POST /tokenize on the local llama-server when the heuristic is already close to the context limit
- only as a last resort, trims the user-prompt tail proportionally and verifies the reduced prompt once via /tokenize
- records prompt_budget_truncated telemetry/activity when truncation actually happens
If tokenization fails or times out, AC falls back to the heuristic path and continues the request rather than blocking inference.
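The escalation order can be sketched as a decision function. Everything numeric here is hypothetical (characters per token, tokens per tile, the context ceiling, the minimum image dimension), as are the type and function names; the /tokenize verification step is omitted, and the real PromptBudgetGuard may order or size these steps differently.

```swift
/// What the guard decides to do next for one request.
enum BudgetAction: Equatable {
    case sendAsIs
    case growContext(to: Int)
    case shrinkImage(maxDimension: Int)
    case truncateText(keepingFraction: Double)
}

struct BudgetRequest {
    var textCharacters: Int
    var visionTiles: Int
    var imageMaxDimension: Int?  // nil for text-only requests
}

// Hypothetical tuning constants, chosen only for illustration.
let charactersPerToken = 4
let tokensPerTile = 256
let contextCeiling = 8192
let minimumImageDimension = 512

/// Heuristic estimate: text length plus vision tile count.
func estimatedTokens(for request: BudgetRequest) -> Int {
    return request.textCharacters / charactersPerToken
        + request.visionTiles * tokensPerTile
}

func nextAction(for request: BudgetRequest, contextSize: Int) -> BudgetAction {
    let estimate = estimatedTokens(for: request)
    if estimate <= contextSize { return .sendAsIs }
    // 1. Prefer growing the per-request context over touching the text.
    if estimate <= contextCeiling { return .growContext(to: estimate) }
    // 2. On vision requests, halve the image bound while it stays usable.
    //    The caller is expected to recompute visionTiles and ask again.
    if let dimension = request.imageMaxDimension,
       dimension / 2 >= minimumImageDimension {
        return .shrinkImage(maxDimension: dimension / 2)
    }
    // 3. Last resort: keep only the share of text that fits under the ceiling.
    let textTokens = request.textCharacters / charactersPerToken
    let roomForText = max(0, contextCeiling - request.visionTiles * tokensPerTile)
    let fraction = textTokens > 0
        ? min(1.0, Double(roomForText) / Double(textTokens))
        : 0
    return .truncateText(keepingFraction: fraction)
}
```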
Monitoring Backend Selection
MonitoringConfiguration.inferenceBackend selects the backend:
- .local
- .openRouter
The current default is local inference.
When local inference is active and macOS Low Power Mode is on, AppController sets localModelLowPowerNotice = true. ChatPanelView renders this as a dismissible yellow banner. The notice auto-clears when either condition goes away and the dismissed state resets at that point so it reappears if the user re-enters the same condition.
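The notice's dismiss-and-reset behavior is a small state machine. A sketch, assuming a hypothetical LowPowerNotice value type; in the app the inputs would come from the backend setting and the system Low Power Mode state.

```swift
struct LowPowerNotice {
    private(set) var isVisible = false
    private(set) var dismissed = false

    /// Call whenever the backend or Low Power Mode state changes.
    mutating func update(localInferenceActive: Bool, lowPowerModeOn: Bool) {
        let conditionHolds = localInferenceActive && lowPowerModeOn
        if !conditionHolds {
            // Leaving the condition forgets the dismissal,
            // so the banner returns on the next entry.
            dismissed = false
        }
        isVisible = conditionHolds && !dismissed
    }

    /// Called when the user closes the banner.
    mutating func dismiss() {
        dismissed = true
        isVisible = false
    }
}
```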
Model selection is split by text vs image where supported:
- onlineModelIdentifierText
- onlineModelIdentifierImage
- localModelIdentifierText
- localModelIdentifierImage
AITier supplies the user-facing defaults.
Limited Trial Redemption
The onboarding wizard includes an optional promo-code step before normal backend selection.
- A user without a code skips the step and follows the existing Local or BYOK setup flow.
- A valid code is redeemed through PromoRedemptionService, which contacts https://accountycat.com/api/redeem with the code and a random per-install identifier.
- The endpoint returns a capped OpenRouter key. AC saves that key using the same Keychain-backed BYOK credential path, switches to OpenRouter with the Balanced tier, and sends the user directly to permissions and completion.
- After activation, monitoring and chat traffic goes directly from AC to OpenRouter; it is not proxied through the AccountyCat site.
- The trial path is preserved through the permission-related app relaunch, and users can replace the trial key or switch to Local mode later in Settings -> AI.
Online Routing
OnlineModelService owns HTTP execution, retry behavior, telemetry, and OpenRouter fallback handling.
OnlineProviderRouting owns:
- active provider selection
- direct-OpenAI toggle lookup
- ZDR (Zero Data Retention) toggle lookup
- provider-specific API-key lookup
- effective model identifier when direct OpenAI is enabled
Current behavior:
- default online path: OpenRouter
- API keys live in macOS Keychain via OnlineProviderCredentialStore
- ZDR toggle lives in UserDefaults via OnlineProviderRoutingStore; on by default, opt-out via the AI tab's advanced section behind an explicit confirmation alert
- direct-OpenAI routing code exists in OnlineProviderRouting but its UI was removed from AITab; the toggle can be re-exposed if needed (see docs/experiments/direct-openai-routing.md)
- ConnectivityService provides a lightweight NWPathMonitor-backed reachability signal used by BrainService to pause online monitoring quickly when the machine is offline
- online monitoring may transiently use the online text-only pipeline after repeated vision timeouts, but this degradation lives only in BrainService; it does not rewrite MonitoringConfiguration
OpenRouter request shape
Each OpenRouter request sets:
- response_format: {"type": "json_object"}
- reasoning: {"enabled": false} whenever options.thinkingEnabled is false. This is the documented OpenRouter shape; the older {"max_reasoning_tokens": 0} form was silently ignored by some providers (notably Together-served Kimi) and produced empty completions.
- max_tokens: the larger of options.maxTokens and OnlineModelService.openRouterMinMaxTokens (currently 1500). Local stage configs are tuned for llama.cpp memory pre-allocation; online billing is per-actual-token, so the floor avoids finish_reason=length when a provider emits hidden reasoning before content.
- provider: ZDR flag (per OnlineProviderRouting.isZDREnforced()), allow_fallbacks: true, require_parameters: true, sort: "latency", and a preferred_max_latency profile per request source.
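A sketch of how such a request body might be assembled. RequestOptions and openRouterBody are hypothetical names; the "zdr" key inside provider is an assumption about how the ZDR flag is encoded, and the preferred_max_latency profile and the messages array are omitted because their shapes are not given here.

```swift
import Foundation

struct RequestOptions {
    var maxTokens: Int
    var thinkingEnabled: Bool
}

/// Floor applied to max_tokens for online requests (value from the text above).
let openRouterMinMaxTokens = 1500

func openRouterBody(model: String,
                    options: RequestOptions,
                    zdrEnforced: Bool) -> [String: Any] {
    // Assumed key name for the ZDR flag; the other keys are listed above.
    let provider: [String: Any] = [
        "zdr": zdrEnforced,
        "allow_fallbacks": true,
        "require_parameters": true,
        "sort": "latency",
    ]
    var body: [String: Any] = [
        "model": model,
        "response_format": ["type": "json_object"],
        "max_tokens": max(options.maxTokens, openRouterMinMaxTokens),
        "provider": provider,
    ]
    // Reasoning is switched off explicitly rather than with a zero budget.
    if !options.thinkingEnabled {
        body["reasoning"] = ["enabled": false]
    }
    return body
}
```

The dictionary is JSON-serializable as written, so it can be passed to JSONSerialization.data(withJSONObject:) once the messages are added.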
Fallback chain
OnlineModelService.requestFallbackModelIdentifiers builds the per-request chain in three layers, then filters by OpenRouterHealthStatsService.sortedHealthyModels:
- The non-:free variant of the requested model, when applicable.
- Tier alternatives. Image requests fall back to AITier.economy.byokModelIdentifierImage then AITier.smartest.byokModelIdentifierImage. Text requests interleave the balanced image model first (it handles text well) then the economy and smartest text models.
- For premium-path requests (the first few successful monitoring/chat calls of a session), the balanced and smartest image models are appended as extra runway.
The chain is capped to maxOpenRouterModelsArrayCount (currently 3) and passed to OpenRouter via the models array.
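The layering, health filter, and cap can be sketched as follows. The names paidVariant and fallbackChain are hypothetical, the health check is reduced to a closure instead of OpenRouterHealthStatsService, and applying the cap to the fallback list alone (not counting the requested model) is an assumption.

```swift
/// Cap on the models array (value from the text above).
let maxOpenRouterModelsArrayCount = 3

/// Returns the paid identifier for a ":free" model, or nil otherwise.
func paidVariant(of model: String) -> String? {
    let suffix = ":free"
    guard model.hasSuffix(suffix) else { return nil }
    return String(model.dropLast(suffix.count))
}

/// Builds the fallback list: paid variant first, then tier alternatives,
/// skipping duplicates, the requested model itself, and unhealthy models.
func fallbackChain(requested: String,
                   tierAlternatives: [String],
                   isHealthy: (String) -> Bool) -> [String] {
    var candidates: [String] = []
    if let paid = paidVariant(of: requested) {
        candidates.append(paid)
    }
    candidates.append(contentsOf: tierAlternatives)

    var chain: [String] = []
    for model in candidates {
        if model == requested || chain.contains(model) || !isHealthy(model) {
            continue
        }
        chain.append(model)
    }
    return Array(chain.prefix(maxOpenRouterModelsArrayCount))
}
```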
OpenRouter billing failures are not model reliability failures. A 402 (or a
provider response that clearly says the account/key has no credits, budget, or
balance) is surfaced as an OpenRouter budget problem, is not retried as a rate
limit, and does not count toward model bans. If every fallback candidate is
locally banned, AC keeps the original fallback chain so an external recovery
such as topping up credits can be observed without waiting for local ban expiry.
When key info later shows usable budget after a connection problem, AC clears
local OpenRouter bans and drops the cooldown.
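A sketch of the two rules above: billing failures are classified separately, and a fully banned chain is kept rather than emptied. FailureKind, classify, and effectiveChain are hypothetical names, and the keyword match is a deliberately crude stand-in for whatever "clearly says" means in the real classifier. Only the budget case is specified by the text; how other failures are scored is outside this sketch.

```swift
import Foundation

enum FailureKind: Equatable {
    case budget     // surfaced to the user; no retry, no ban accounting
    case rateLimit
    case other
}

/// Hypothetical keywords suggesting an exhausted account or key.
let budgetHints = ["credit", "budget", "balance"]

func classify(statusCode: Int, message: String) -> FailureKind {
    let lowered = message.lowercased()
    if statusCode == 402 || budgetHints.contains(where: { lowered.contains($0) }) {
        return .budget
    }
    if statusCode == 429 {
        return .rateLimit
    }
    return .other
}

/// Drops locally banned models, unless that would leave nothing:
/// then the original chain is kept so an external recovery
/// (such as a credit top-up) can be observed immediately.
func effectiveChain(_ chain: [String], banned: Set<String>) -> [String] {
    let usable = chain.filter { !banned.contains($0) }
    return usable.isEmpty ? chain : usable
}
```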
Tier → model identifiers
Defaults live in ACShared/AITier.swift. As of v1.0:
| Tier | Text | Image |
|---|---|---|
| Economy | deepseek/deepseek-v4-flash | qwen/qwen3.5-9b |
| Balanced (Default) | deepseek/deepseek-v4-flash | qwen/qwen3.6-35b-a3b |
| Smartest | moonshotai/kimi-k2.6 | moonshotai/kimi-k2.6 |
If you change a tier's model, also update the friendly-name lookups in AppController.shortModelName and AppController.veryShortModelName (AC/Core/AppController+RuntimeSetup.swift) so the menu bar, settings, and onboarding render the new model.
Request Sources
Online requests are tagged by source:
- chat
- chat-action resolution
- policy memory
- memory consolidation
- monitoring text
- monitoring vision
- safelist appeal
This matters because fallback behavior and telemetry are source-aware.
Practical Ownership
If the change is about:
- install/build/download behavior: start in RuntimeSetupService and AC/Core/AppController+RuntimeSetup.swift
- local inference execution or runtime stdout/stderr handling: start in LocalModelRuntime
- remote HTTP failures or fallback chains: start in OnlineModelService
- provider toggles or key lookup: start in OnlineProviderRouting
- settings/onboarding copy or controls: start in AITab and the onboarding views