nnInteractive Server / Client
June 26, 2026
The default nnInteractiveInferenceSession runs the model in the same Python
process as your GUI. If the machine running the GUI does not have a powerful
GPU, you can instead run the model on a remote box and drive it over the
network using nnInteractiveRemoteInferenceSession — a drop-in replacement
with the same public API as the local session.
    [GUI client A] ─┐
                    │  HTTP
    [GUI client B] ─┼────► nninteractive-server ──► one shared model on GPU
                    │      (per-client sessions:    (loaded once at startup)
    [GUI client C] ─┘       image, target_buffer,
                            interactions per session)
The server loads the model once at startup and hosts up to --max-sessions
concurrent client sessions. Each client gets its own session (its own image,
target buffer, and interactions) via a lease token; the client library
handles the lease handshake transparently. Predictions are GPU-serialized
across sessions — two clients can preprocess images at the same time, but
only one prediction runs at a time.
This document covers how to start the server, point a client at it, the concurrency / session model, and common deployment gotchas.
Installation
There are two distributions, both providing the same nnInteractive import namespace:
- GPU / server machine: pip install nnInteractive. This is the full stack: torch, nnU-Net, fastapi and uvicorn, the local inference engine, and the nninteractive-server entry point. It depends on nninteractive-client, so it also includes the remote client.
- GUI / client machine (lightweight, torch-free): pip install nninteractive-client. This is a separate, much smaller distribution that pulls in only the wire stack (numpy, httpx, blosc2) and ships just nnInteractive.inference.remote. This is all nnInteractiveRemoteInferenceSession needs, and it avoids pulling in torch / nnU-Net.
The two are layered, not mutually exclusive: the full package depends on the client and
ships disjoint files, so they coexist cleanly. A client-only machine can be upgraded to the
full stack at any time with pip install nnInteractive — no uninstall, no --force-reinstall.
Client code is identical either way. Both distributions expose
from nnInteractive.inference.remote import nnInteractiveRemoteInferenceSession, so a GUI written against the remote session runs unchanged whether it has the lightweight client or the full package installed. If a client-only install reaches for a full-only feature (local inference, the server, nninteractive-server), it gets a clear error telling it to pip install nnInteractive. (The nninteractive-server command is only provided by the full package.)
torch is optional for the client.
nninteractive-client does not depend on torch. The remote session works with numpy target_buffers out of the box, and transparently supports torch.Tensor buffers too if torch happens to be importable in the host (e.g. 3D Slicer).
Models: list, download, use by name
You don't have to hand the server a checkpoint folder — it can pull official models straight
from the manifest and you select them by name. Models are stored under
$NNINTERACTIVE_MODEL_DIR (default ~/.nninteractive) and downloaded on first use.
# See which models exist and which are already downloaded
nninteractive-available-models
# Optionally pre-download one by id (the server also downloads on first use)
nninteractive-download-model nnInteractive_v1.0
Both commands (and the server) honor NNINTERACTIVE_MODEL_DIR to change where models are
stored, e.g. export NNINTERACTIVE_MODEL_DIR=/data/nninteractive_models.
Use --model-dir only for a custom / local checkpoint folder that isn't in the manifest
(and then also pass --fold).
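The storage-location rule described above can be sketched in a few lines. This is an illustration only: resolve_model_dir is a hypothetical helper written for this example, not part of the nnInteractive API, and the real tools may resolve the path differently (for instance in how they treat an empty variable).

```python
import os
from pathlib import Path


def resolve_model_dir(environ=None) -> Path:
    """Return the model storage directory as described in this section.

    Hypothetical helper: uses $NNINTERACTIVE_MODEL_DIR when it is set and
    non-empty, otherwise falls back to ~/.nninteractive.
    """
    env = os.environ if environ is None else environ
    override = env.get("NNINTERACTIVE_MODEL_DIR")
    if override:
        return Path(override).expanduser()
    return Path.home() / ".nninteractive"


# Passing a plain dict keeps the example independent of the real environment.
print(resolve_model_dir({"NNINTERACTIVE_MODEL_DIR": "/data/nninteractive_models"}))
print(resolve_model_dir({}))
```

A GUI could use something like this to show the user where downloaded models will end up before any download starts.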
Starting the server
Start the server by model name with --model; it resolves the model at startup
(downloading it on first use if needed), loads it once, and subsequent client requests reuse the
loaded model. Omit --model to use the manifest's default model.
nninteractive-server \
--model nnInteractive_v1.0 \
--host 0.0.0.0 \
--port 1527 \
--device cuda:0 \
--api-key "$(openssl rand -hex 32)"
(To serve a custom checkpoint folder instead, swap --model nnInteractive_v1.0 for
--model-dir /path/to/checkpoint_folder --fold all.)
| Flag | Description |
|---|---|
| --model | Official model id from the manifest (e.g. nnInteractive_v1.0), downloaded on first use into $NNINTERACTIVE_MODEL_DIR (default ~/.nninteractive). List ids with nninteractive-available-models. Mutually exclusive with --model-dir. If neither --model nor --model-dir is given, the manifest's default model is used. |
| --model-dir | Path to a custom trained model folder containing inference_info.json (or legacy inference_session_class.json), plans.json, dataset.json, and fold_*/checkpoint_*.pth. Use this only for checkpoints not in the manifest. Mutually exclusive with --model; requires --fold. |
| --fold | 0, 1, …, or all. Only relevant with --model-dir. If omitted, the server auto-detects when exactly one fold_* folder is present. |
| --checkpoint | Checkpoint filename inside the fold folder. Default: checkpoint_final.pth. |
| --host | Bind address. 127.0.0.1 (default): local only; 0.0.0.0: listen on all interfaces. |
| --port | TCP port. Default: 1527. |
| --device | Torch device string, e.g. cuda, cuda:0, cpu. Default: cuda. |
| --torch-n-threads | CPU threads for torch. Default: 8. |
| --no-torch-compile | Disable compiling the network with torch.compile (compile is on by default). With compile enabled the server runs a dummy warmup forward pass at startup to trigger the (slow) one-time compilation up front, so clients never see the first-prediction delay: startup just takes longer, every prediction is faster, and the cost is amortized across the long-lived process. Pass this flag to skip compilation (faster startup, or to work around a compile/backend issue). |
| --no-autozoom | Disable adaptive zoom-out (rarely needed; on by default). |
| --max-sessions | Maximum number of concurrent client sessions. Each holds its own image, target buffer, and interaction state; the network module (and therefore its weights) is shared by reference across all sessions, so there is exactly one copy on the GPU regardless of session count. Predictions stay GPU-serialized across sessions. Default: 1 (single-tenant, same behavior as before). |
| --idle-timeout-seconds | Inactivity timeout in seconds after which a session is reaped and its slot freed. Refreshed only by real user actions (set_image, add_*_interaction, …), not by heartbeats, so a connected-but-idle client is still reaped here. Default: 600 (10 min). |
| --liveness-timeout-seconds | Liveness timeout in seconds: a session is reaped if the server sees no request at all (not even a heartbeat) from the client for this long. This is how a crashed or disconnected client's slot is reclaimed quickly. The client heartbeats automatically at half this interval. Keep it well below --idle-timeout-seconds. Default: 60. |
| --api-key | Bearer token required on every request. See Authentication below. |
| --verbose | Verbose session-side logging. |
| --log-level | uvicorn log level (info, warning, error, …). Default: info. |
A successful startup looks like:
... INFO ... Loading checkpoint from /path/to/checkpoint_folder ...
session initialized
... INFO ... Checkpoint loaded; serving on http://0.0.0.0:1527
INFO: Uvicorn running on http://0.0.0.0:1527 (Press CTRL+C to quit)
You can sanity-check the server from anywhere that can reach the port:
curl http://<server-host>:1527/healthz
# -> {"ok":true}
Using the client
The nnInteractiveRemoteInferenceSession mirrors the public API of
nnInteractiveInferenceSession (set_image, set_target_buffer,
add_bbox_interaction, add_point_interaction, add_scribble_interaction,
add_lasso_interaction, add_initial_seg_interaction, reset_interactions,
set_do_autozoom) and exposes the same capability attributes
(supported_interactions, channel_mapping, num_interaction_channels,
supports_initial_label, supports_zero_shot_label_refinement,
preferred_scribble_thickness, interaction_decay, original_image_shape,
do_autozoom).
Minimal usage:
from nnInteractive.inference.remote import nnInteractiveRemoteInferenceSession
import numpy as np
session = nnInteractiveRemoteInferenceSession(
server_url="http://gpu-box.lab:1527",
api_key="…", # optional; see Authentication
)
session.set_image(image_4d) # numpy, [C, X, Y, Z]
target_buffer = np.zeros(image_4d.shape[1:], dtype=np.uint8)
session.set_target_buffer(target_buffer)
session.add_bbox_interaction([[40, 80], [50, 90], [30, 31]],
include_interaction=True)
# target_buffer is now updated in place with the predicted region.
session.add_point_interaction([60, 70, 30], include_interaction=True)
# … and so on. Same calls as the local session.
target_buffer is mutated in place exactly the same way as with the local
session. Under the hood, the server returns just the bbox region it touched
(blosc2-compressed), and the client writes that into your buffer — typical
binary masks compress to a tiny fraction of their raw size, so this stays
fast even on slow links.
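To make the "send only the touched region, then patch the buffer in place" idea concrete, here is a small self-contained sketch. It rests on several assumptions made purely for illustration: zlib stands in for blosc2, flat bytearray volumes stand in for numpy arrays, and the helper names (row_start, extract_region, write_region) are invented here. It does not reproduce the actual wire format.

```python
import zlib

SHAPE = (64, 64, 64)  # X, Y, Z of a toy volume, one byte per voxel


def row_start(x, y, z, shape=SHAPE):
    """Flat offset of voxel (x, y, z) in a C-ordered byte buffer."""
    return (x * shape[1] + y) * shape[2] + z


def extract_region(buf, bbox, shape=SHAPE):
    """Copy the voxels inside bbox = ((x0, x1), (y0, y1), (z0, z1)), end-exclusive."""
    (x0, x1), (y0, y1), (z0, z1) = bbox
    out = bytearray()
    for x in range(x0, x1):
        for y in range(y0, y1):
            start = row_start(x, y, z0, shape)
            out += buf[start:start + (z1 - z0)]
    return bytes(out)


def write_region(buf, bbox, region, shape=SHAPE):
    """Write a region produced by extract_region back into buf, in place."""
    (x0, x1), (y0, y1), (z0, z1) = bbox
    width = z1 - z0
    pos = 0
    for x in range(x0, x1):
        for y in range(y0, y1):
            start = row_start(x, y, z0, shape)
            buf[start:start + width] = region[pos:pos + width]
            pos += width


# "Server" side: a mask whose only foreground is a 16 x 16 x 16 cube.
bbox = ((20, 36), (20, 36), (20, 36))
n_voxels = SHAPE[0] * SHAPE[1] * SHAPE[2]
server_mask = bytearray(n_voxels)
write_region(server_mask, bbox, b"\x01" * (16 ** 3))
payload = zlib.compress(extract_region(server_mask, bbox))  # stand-in for blosc2

# "Client" side: decompress and patch the caller's buffer in place.
client_buffer = bytearray(n_voxels)
write_region(client_buffer, bbox, zlib.decompress(payload))

print(f"full volume: {n_voxels} bytes, payload on the wire: {len(payload)} bytes")
```

The point of the sketch is the size gap between the full volume and the payload: only the bounding box travels, and a uniform binary region compresses to a few dozen bytes.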
Timeouts
The client uses per-phase timeouts so "server unreachable" is reported quickly while real predictions still get the time they need:
| Constructor kwarg | Default | Covers | On expiry |
|---|---|---|---|
| connect_timeout | 10 s | TCP / TLS handshake | httpx.ConnectTimeout |
| read_timeout | 60 s | server thinking time per call (predictions observed at 100 ms – ~10 s) | httpx.ReadTimeout |
| write_timeout | 120 s | uploading the request body (mostly set_image) | httpx.WriteTimeout |
| pool_timeout | 10 s | acquiring a connection from the pool | httpx.PoolTimeout |
All four are subclasses of httpx.TimeoutException, which itself is a
subclass of httpx.HTTPError — catch HTTPError for a generic "something
went wrong with the server" and TimeoutException for "the server didn't
respond in time."
import httpx

try:
    session.add_point_interaction([60, 70, 30], include_interaction=True)
except httpx.ConnectTimeout:
    # Server is unreachable. Likely down, wrong host/port, or a firewall.
    ...
except httpx.ReadTimeout:
    # Server accepted the request but didn't finish in read_timeout seconds.
    # Either the prediction is unusually slow or the server is stuck.
    ...
except httpx.HTTPStatusError as e:
    # Server responded with 4xx/5xx. e.response.status_code / e.response.text
    ...
Probing reachability — session.ping()
For a "Test connection" button in a GUI, the client exposes:
ok: bool = session.ping(timeout=5.0) # GET /healthz with a tight timeout
It returns True if the server answered 200 and False on any HTTP /
network error (timeout, refused connection, wrong auth, proxy
interception). Non-raising on purpose so UI code can just check the bool.
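For readers curious what a non-raising probe of this kind involves, here is a minimal sketch using only the standard library. probe_healthz is a hypothetical stand-in written for this illustration: it is not how session.ping() is implemented, it sends no API key, and it only assumes the /healthz endpoint shown earlier.

```python
import http.client
import urllib.error
import urllib.request


def probe_healthz(server_url: str, timeout: float = 5.0) -> bool:
    """Return True only if GET <server_url>/healthz answers with HTTP 200."""
    url = server_url.rstrip("/") + "/healthz"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # URLError covers refused connections and error statuses (HTTPError),
        # OSError covers socket-level timeouts, HTTPException covers malformed
        # responses, and ValueError covers URLs urllib cannot parse.
        return False


# Every failure mode collapses to False, so UI code can branch on the bool.
reachable = probe_healthz("http://127.0.0.1:1527", timeout=1.0)
print("server reachable:", reachable)
```

Swallowing every error is deliberate for a "Test connection" button; for diagnostics you would log the exception before returning False.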
One-line swap from local to remote
# Local
session = nnInteractiveInferenceSession(device=torch.device("cuda"))
session.initialize_from_trained_model_folder("/path/to/checkpoint", use_fold="all")
# Remote — same API from here on
session = nnInteractiveRemoteInferenceSession("http://gpu-box:1527", api_key=KEY)
Note: on the remote session, initialize_from_trained_model_folder() is a
no-op (with a warning). The server already loaded the checkpoint at startup.
Switching checkpoints at runtime is on the roadmap.
Concurrency and sessions
The server hosts up to --max-sessions concurrent client sessions. Each client
holds its own session — its own image, target buffer, and interaction state —
while the network module itself (the nn.Module instance, its weights, and
its buffers) is shared by reference across every session. There is exactly
one network and one copy of the weights resident on the GPU regardless of
how many sessions are active. This gives multiple researchers on one GPU box
independent state without duplicating the model. Sharing is safe because
inference runs under @torch.inference_mode() and a global GPU lock
serializes predict-capable endpoints, so two sessions never touch the
network concurrently and nothing mutates it after startup.
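The locking discipline described above can be illustrated with plain threads. This is a toy model under stated assumptions, not the server's code: gpu_lock, sessions, and predict are names invented for the example, and time.sleep stands in for the forward pass on the shared network.

```python
import threading
import time

gpu_lock = threading.Lock()  # one lock per server process
sessions = {"A": {"n_predictions": 0}, "B": {"n_predictions": 0}}  # per-client state

stats = {"running": 0, "max_running": 0}  # how many predictions overlap
stats_lock = threading.Lock()


def predict(session_id: str) -> None:
    """Stand-in for a predict-capable endpoint: the forward pass holds gpu_lock."""
    with gpu_lock:
        with stats_lock:
            stats["running"] += 1
            stats["max_running"] = max(stats["max_running"], stats["running"])
        time.sleep(0.01)  # pretend forward pass on the single shared network
        sessions[session_id]["n_predictions"] += 1
        with stats_lock:
            stats["running"] -= 1


# Two clients fire five predictions each, all at the same time.
threads = [
    threading.Thread(target=predict, args=(sid,))
    for sid in ("A", "B")
    for _ in range(5)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("max concurrent predictions:", stats["max_running"])
print("per-session counts:", {sid: s["n_predictions"] for sid, s in sessions.items()})
```

Each session keeps its own counters (standing in for image, buffer, and interactions), while the lock guarantees that at most one prediction is inside the critical section at any moment.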
How a client gets a session
A session is claimed automatically when nnInteractiveRemoteInferenceSession(...)
is constructed (the client posts to /claim and stores a lease token, which
then rides on every subsequent request). The client also releases the session
automatically on close() (or context-manager exit). Users and GUI authors
never see the lease token — it's a private implementation detail.
with nnInteractiveRemoteInferenceSession(server_url, api_key=KEY) as session:
    session.set_image(image)
    session.set_target_buffer(buf)
    session.add_point_interaction([60, 70, 30], include_interaction=True)
# context manager exit -> client posts /release -> server frees the slot.
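As a mental model for the claim / release cycle, the following sketch keeps a table of session slots keyed by lease token. Everything in it is an assumption made for illustration: LeaseTable, AtCapacity, and the token format are invented here, and the real server's bookkeeping may look quite different.

```python
import secrets


class AtCapacity(Exception):
    """Stand-in for the condition the client reports as ServerAtCapacityError."""


class LeaseTable:
    """Toy in-memory model of per-client session slots keyed by lease token."""

    def __init__(self, max_sessions: int = 1):
        self.max_sessions = max_sessions
        self.sessions = {}  # lease token -> per-session state

    def claim(self) -> str:
        if len(self.sessions) >= self.max_sessions:
            raise AtCapacity("all session slots are in use")
        token = secrets.token_hex(16)
        self.sessions[token] = {"image": None, "target_buffer": None, "interactions": []}
        return token

    def release(self, token: str) -> None:
        # Releasing an unknown or already released token is treated as a no-op.
        self.sessions.pop(token, None)


table = LeaseTable(max_sessions=2)
first = table.claim()
second = table.claim()
try:
    table.claim()
except AtCapacity:
    print("third client has to wait")
table.release(first)
third = table.claim()  # the freed slot is available again
print("active sessions:", len(table.sessions))
```

The token only identifies which slot a request belongs to; as the text notes, access control is a separate concern handled by the API key.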
GPU serialization
Two clients each calling add_point_interaction(..., run_prediction=True) at the
same moment will see one of the two predictions wait briefly for the other to
finish. This is by design — there is only one GPU. Non-prediction calls
(set_image, set_target_buffer, reset_interactions, add_*_interaction(..., run_prediction=False)) do not contend on this lock and run concurrently across
sessions.
If predictions feel slow under concurrent load, the answer is more GPUs (run one
nninteractive-server per GPU on different ports), not raising --max-sessions.
Session expiry: two independent timeouts
The server reaps a session for either of two distinct reasons, each with its own timeout:
- Liveness (--liveness-timeout-seconds, default 60 s): the client process stopped responding entirely (crash, kill, network drop). The client library automatically heartbeats in the background (a daemon thread, every half the liveness timeout), so a healthy client never trips this. When a client dies, its heartbeats stop and the server frees the slot within roughly one liveness timeout, instead of holding it for the full idle timeout. You don't have to do anything to get this; it's on by default.
- Idle / inactivity (--idle-timeout-seconds, default 600 s = 10 min): the client is alive and heartbeating, but the user hasn't done anything. This timer is refreshed only by real interactions (set_image, set_target_buffer, add_*_interaction, reset_interactions, …), not by heartbeats. So a window left open with no clicks is still reclaimed at the idle timeout.
After a reap (for either reason), the next request from the client raises
SessionExpiredError. A session may also be reaped by a server restart.
from nnInteractive.inference.remote import SessionExpiredError

try:
    session.add_point_interaction([60, 70, 30], include_interaction=True)
except SessionExpiredError:
    # The server-side session is gone. There is nothing to restore: the
    # image, target buffer, and the chain of interactions only exist on the
    # server, and they were dropped when the lease was reaped. The user has
    # to start the segmentation workflow over.
    session = nnInteractiveRemoteInferenceSession(server_url, api_key=KEY)
    session.set_image(image)
    session.set_target_buffer(buf)
    # GUI should surface: "Your session timed out. Please redo your prompts."
Note: session.heartbeat() proves liveness only — it does not postpone the
idle timeout. If you want users to keep a session across long idle stretches,
raise --idle-timeout-seconds on the server; there is no client-side way to
suppress the inactivity reap.
session.lease_status() is a read-only probe: it returns the remaining seconds
until the idle timeout without touching either clock — useful for a "your
session expires in N seconds" UI badge.
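The interplay of the two clocks is easy to get wrong, so here is a deterministic toy model of it. The Lease class and its method names are hypothetical and exist only for this illustration; the defaults mirror the documented 600 s idle and 60 s liveness timeouts, and a fake clock replaces real time.

```python
import time


class Lease:
    """Toy model of the two reaping clocks for a single session."""

    def __init__(self, idle_timeout=600.0, liveness_timeout=60.0, clock=time.monotonic):
        self.idle_timeout = idle_timeout
        self.liveness_timeout = liveness_timeout
        self.clock = clock
        now = clock()
        self.last_seen = now      # refreshed by any request, heartbeats included
        self.last_activity = now  # refreshed by real user actions only

    def heartbeat(self):
        self.last_seen = self.clock()

    def user_action(self):
        self.last_seen = self.last_activity = self.clock()

    def idle_seconds_left(self):
        """Read-only probe: touches neither clock."""
        return max(0.0, self.idle_timeout - (self.clock() - self.last_activity))

    def expired(self):
        now = self.clock()
        dead = now - self.last_seen > self.liveness_timeout
        idle = now - self.last_activity > self.idle_timeout
        return dead or idle


# Drive the model with a fake clock so the example is deterministic.
fake_now = [0.0]
lease = Lease(clock=lambda: fake_now[0])

# A healthy but idle client: heartbeats every 30 s for 10 minutes.
for t in range(30, 601, 30):
    fake_now[0] = float(t)
    lease.heartbeat()
print(lease.expired(), lease.idle_seconds_left())  # False 0.0 (exactly at the limit)

fake_now[0] = 601.0
lease.heartbeat()
print(lease.expired())  # True: idle timeout passed even though heartbeats continue
```

Calling user_action() at any point would reset both clocks, which is why only real interactions keep a session alive past the idle timeout.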
Capacity (--max-sessions)
If all --max-sessions slots are in use when a new client tries to connect,
constructing the remote session raises ServerAtCapacityError. Typical
handling is "wait a moment and retry":
import time
from nnInteractive.inference.remote import (
nnInteractiveRemoteInferenceSession,
ServerAtCapacityError,
)
for attempt in range(6):
    try:
        session = nnInteractiveRemoteInferenceSession(server_url, api_key=KEY)
        break
    except ServerAtCapacityError:
        time.sleep(10)
else:
    raise SystemExit("server has been at capacity for too long")
For GUI developers
A few contract points worth respecting when wiring this into a GUI:
- Construct the session in a worker thread. HTTP + prediction both block; doing this on the UI thread freezes the app.
- Catch SessionExpiredError around every interaction call. A timed-out session cannot be restored: the image, target buffer, and accumulated prompts are all server-side state that has been freed. The GUI must claim a new session, call set_image and set_target_buffer again, and ask the user to redo their prompts. Show a clear "session timed out" message so the user understands why they're being asked to start over.
- Call session.close() on app quit (or use the with statement). The destructor also releases the lease, but explicit close is preferred so the server frees the slot immediately for other users.
- You don't need to drive heartbeat() yourself. The session auto-heartbeats from a background thread to keep the server from reaping it as a dead client. This does not extend the idle timeout, though: a window left idle past --idle-timeout-seconds is still reaped. Raise that flag on the server if your UX expects users to sit idle for long stretches.
- Surface ServerAtCapacityError as "server is full, try again later". It's a transient operator-level condition; the user can't fix it from the GUI.
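The worker-thread advice above can be sketched without any GUI toolkit. In this illustration FakeSession and SessionExpired are stand-ins invented for the example (the real classes come from nnInteractive.inference.remote), and the events list plays the role of whatever signal or queue your toolkit uses to get back onto the UI thread.

```python
from concurrent.futures import ThreadPoolExecutor
import time


class SessionExpired(Exception):
    """Stand-in for SessionExpiredError."""


class FakeSession:
    """Stand-in for the remote session; the real call blocks on HTTP + prediction."""

    def add_point_interaction(self, coordinates, include_interaction=True):
        time.sleep(0.05)  # pretend network round trip and GPU time


executor = ThreadPoolExecutor(max_workers=1)  # a single worker keeps calls in order
events = []  # what a GUI would turn into status messages or redraws


def on_done(future):
    # Runs once the call has finished; a real GUI would hop back to its UI thread here.
    try:
        future.result()
        events.append("redraw")
    except SessionExpired:
        events.append("session timed out, please redo your prompts")


session = FakeSession()
future = executor.submit(session.add_point_interaction, [60, 70, 30], include_interaction=True)
future.add_done_callback(on_done)
events.append("ui still responsive")  # reached right away, before the call returns
executor.shutdown(wait=True)
print(events)
```

Using one worker rather than a pool is a design choice for this sketch: interactions on a single session are order-dependent, so serializing them on the client side avoids surprising interleavings.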
Authentication
Authentication is a static bearer token shared by everyone who can use the
server. The server requires it if it was started with --api-key; otherwise
it accepts every request without checking.
The bearer token gates access to the server as a whole — anyone who has
it can claim a session. The lease token (issued per client at /claim) is a
separate, per-session ownership mechanism handled transparently by the
client; it is not a second authentication layer and a GUI user never sees
it.
On the server
Pick a strong, random key (anything 32+ random bytes is fine):
export NN_INTERACTIVE_API_KEY="$(openssl rand -hex 32)"
nninteractive-server --model nnInteractive_v1.0 --host 0.0.0.0 --port 1527
# (alternatively: pass --api-key "$KEY" on the command line)
The server reads --api-key first, then falls back to the
NN_INTERACTIVE_API_KEY environment variable. If neither is set, the
server logs a warning at startup and accepts unauthenticated requests.
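The precedence rule and the bearer check can be summarized in a short sketch. It is a simplified model under assumptions: resolve_api_key and is_authorized are hypothetical helpers, the header is assumed to use the standard "Authorization: Bearer <token>" form, and the constant-time comparison is a common good practice rather than a documented property of the server.

```python
import hmac


def resolve_api_key(cli_value, environ):
    """--api-key wins; otherwise fall back to NN_INTERACTIVE_API_KEY; else no auth."""
    return cli_value or environ.get("NN_INTERACTIVE_API_KEY") or None


def is_authorized(headers, api_key):
    """Check an 'Authorization: Bearer <token>' header against the shared key."""
    if api_key is None:
        return True  # server started without a key: every request is accepted
    scheme, _, token = headers.get("Authorization", "").partition(" ")
    if scheme.lower() != "bearer":
        return False
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(token.encode(), api_key.encode())


key = resolve_api_key(None, {"NN_INTERACTIVE_API_KEY": "s3cret"})
print(is_authorized({"Authorization": "Bearer s3cret"}, key))  # True
print(is_authorized({"Authorization": "Bearer wrong"}, key))   # False
print(is_authorized({}, key))                                  # False
```

Note how a missing key on both the command line and in the environment makes is_authorized accept everything, which matches the unauthenticated mode the startup warning refers to.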
On the client
session = nnInteractiveRemoteInferenceSession(
server_url="http://gpu-box:1527",
api_key="…",
)
If api_key= is omitted, the client falls back to the
NN_INTERACTIVE_API_KEY environment variable. If the server requires a key
and the client didn't pass one (or passed the wrong one), the very first
request — the capabilities fetch inside __init__ — raises an HTTP 401,
so you find out at session construction time, not later in a prediction.
Rotation: change the key, restart the server, update the client. There is no login flow.
Single-user secure setup: SSH tunnel
This pattern is for a single user only. It binds the server to the GPU box's loopback interface, which means the user running the SSH tunnel is the only one who can reach it. If you want multiple researchers to share one server, skip this section and go to Multi-user deployment below.
If only you will be using the GPU box, the simplest secure setup is to bind
the server to 127.0.0.1 on the GPU box and forward a port over SSH. The
server is unreachable from any other machine; only your SSH session can
talk to it. Start the server with --max-sessions 1 for this pattern —
nobody else can claim a session anyway.
On the GPU box:
nninteractive-server \
--model nnInteractive_v1.0 \
--host 127.0.0.1 --port 1527 \
--max-sessions 1
On the client box:
ssh -N -L 1527:127.0.0.1:1527 you@gpu-box.lab
# Leave this running in a terminal. Now http://127.0.0.1:1527 on the
# client points at the server's 127.0.0.1:1527.
session = nnInteractiveRemoteInferenceSession("http://127.0.0.1:1527")
# No api_key needed — the server is only reachable through your SSH session.
For laptops / unstable links, autossh keeps the tunnel up:
autossh -M 0 -o "ServerAliveInterval=30" -o "ServerAliveCountMax=3" \
-N -L 1527:127.0.0.1:1527 you@gpu-box.lab
Multi-user deployment
For multiple users sharing one server, bind to 0.0.0.0 (or to a non-loopback
interface reachable on your network), pick a --max-sessions value that fits
your GPU, and set an API key:
nninteractive-server \
--model nnInteractive_v1.0 \
--host 0.0.0.0 --port 1527 \
--max-sessions 4 \
--api-key "$(openssl rand -hex 32)"
Distribute the API key to your users via whatever channel you'd use for any other shared credential. Every authorized client claims its own session automatically on construction; users do not coordinate.
Add TLS. The server itself does not terminate TLS. Put it behind a
reverse proxy (nginx, caddy, traefik) that adds HTTPS, especially if the
traffic leaves a trusted network. The proxy should pass through the
Authorization, X-Lease-Token, X-Meta, and Content-Type headers
unchanged and not buffer the response body (the server streams compressed
prediction diffs).
Proxy gotcha
If your client machine has HTTP_PROXY / HTTPS_PROXY / ALL_PROXY set —
common on corporate networks — httpx (which the client uses) will route
every request through the proxy by default, including localhost ones.
Symptoms are 403 responses with HTML error pages from the proxy instead of
JSON from the server, even with the correct API key.
Fix: add the server's host (or 127.0.0.1/localhost for an SSH tunnel) to
NO_PROXY:
export NO_PROXY="127.0.0.1,localhost,gpu-box.lab"
export no_proxy="$NO_PROXY" # both casings — some tools only read one
Then run your client program in the same shell. To make this permanent, add the lines to your shell rc file or to the launcher script that starts the GUI.
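If the GUI is started from a Python launcher rather than a shell, the same fix can be applied in-process. The helper below is a hypothetical sketch, not part of the client library, and it assumes the HTTP library reads the proxy variables when the session is constructed, so it should run before nnInteractiveRemoteInferenceSession(...) is created. The proxy and intranet host names are made up.

```python
import os


def add_no_proxy_hosts(hosts, environ=None):
    """Merge hosts into NO_PROXY / no_proxy without dropping existing entries."""
    env = os.environ if environ is None else environ
    existing = env.get("NO_PROXY") or env.get("no_proxy") or ""
    merged = [h.strip() for h in existing.split(",") if h.strip()]
    for host in hosts:
        if host not in merged:
            merged.append(host)
    value = ",".join(merged)
    env["NO_PROXY"] = value
    env["no_proxy"] = value  # set both casings, as in the shell snippet above
    return value


# Demonstrated on a plain dict; call without `environ` to modify os.environ itself.
env = {"HTTPS_PROXY": "http://proxy.corp:3128", "no_proxy": "intranet.corp"}
print(add_no_proxy_hosts(["127.0.0.1", "localhost", "gpu-box.lab"], env))
```

Merging instead of overwriting matters on corporate machines, where NO_PROXY often already lists internal hosts that other tools depend on.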
Troubleshooting
- httpx.HTTPStatusError: 401 Unauthorized on session construction: the server was started with --api-key but the client didn't pass it (or passed the wrong one). Set api_key= or NN_INTERACTIVE_API_KEY.
- HTML error pages instead of JSON: almost always an HTTP proxy intercepting the request. See Proxy gotcha.
- ConnectionRefusedError / httpx.ConnectError: server isn't running, port is wrong, or a firewall is blocking it. Check curl http://<host>:<port>/healthz from the client machine, or call session.ping() from your GUI's "Test connection" path.
- httpx.ConnectTimeout (after ~10 s): TCP/TLS handshake didn't complete. The host is not answering at all (down or unroutable), or a firewall is silently dropping packets; a reachable host with nothing listening on the port usually produces a refused connection instead. Tune via connect_timeout= on the session constructor.
- httpx.ReadTimeout (after ~60 s): server accepted the request but didn't finish in time. Either the prediction is unusually slow on that hardware/volume, or the server is wedged. Tune via read_timeout= if your workload legitimately needs more.
- RuntimeWarning: nnInteractiveRemoteInferenceSession ignores initialize_from_trained_model_folder(): expected. The server picked the checkpoint at startup; this method is a no-op on the remote session.
- Predictions seem to hang the GUI: every add_*_interaction(..., run_prediction=True) call blocks until the server finishes. Run the remote session from a worker thread in the GUI, exactly as you would for a slow local prediction.
- SessionExpiredError: the server reaped your session, either because the user was inactive longer than --idle-timeout-seconds, because the client stopped heartbeating for longer than --liveness-timeout-seconds (usually a crash or network drop; the background heartbeat keeps a healthy client well clear of this), or because the server was restarted. A timed-out session cannot be restored; the image, target buffer, and prompts have been freed on the server. Construct a new nnInteractiveRemoteInferenceSession, call set_image + set_target_buffer, and prompt the user to redo their interactions. To allow longer idle stretches, raise --idle-timeout-seconds on the server (heartbeats do not postpone the idle timeout).
- ServerAtCapacityError on construction: every session slot is in use (--max-sessions reached). Wait and retry, ask the operator to bump --max-sessions, or scale out with more nninteractive-server processes on more GPUs.
- Predictions feel slower with multiple users: expected, because predictions are serialized on the GPU across all sessions. Two clients each adding a point at the same time will see one wait briefly for the other. For higher throughput, run multiple nninteractive-server processes on multiple GPUs and route clients across them.
Limitations (current version)
- Predictions are GPU-serialized within one server process: multiple clients can hold sessions and preprocess concurrently, but predictions run one at a time. For higher throughput across many concurrent users, run multiple nninteractive-server processes on different GPUs.
- Authentication is a single shared bearer token: anyone with the API key can claim a session. There is no per-user identity or quota.
- The checkpoint loaded at startup is fixed for the lifetime of the server process. Switch-by-name is planned.
- The server does not terminate TLS itself: front it with a reverse proxy for any multi-user deployment, or use the single-user SSH-tunnel pattern when only one user needs access.
- No retry/reconnect logic in the client: a network blip or SessionExpiredError raises through to the caller; the GUI is expected to handle this.