Inference Engine Runtime (Patio)

June 12, 2026 · View on GitHub

A Python-based sidecar runtime for AI inference engines on Kubernetes. Patio provides a unified interface between workload controllers (such as RoleBasedGroup) and inference engines like SGLang and vLLM.

Overview

Patio is a lightweight FastAPI server that runs alongside inference engine containers and provides:

LoRA Adapter Management — Dynamically load and unload LoRA adapters at runtime without restarting the engine
Unified Prometheus Metrics — Scrape, normalize, and re-expose inference engine metrics with a patio: prefix
Distributed Topology Management — Register workers with a central router, heartbeat-based recovery, and graceful shutdown
Health Check & Readiness — Standard Kubernetes liveness and readiness probes

Architecture

┌─────────────────────────────────────────────────────────────┐
│  Pod                                                         │
│  ┌──────────────────┐    ┌─────────────────────────────────┐│
│  │  Inference Engine │    │  Patio Sidecar (port 9091)      ││
│  │  (SGLang/vLLM)   │◄──►│                                 ││
│  │                  │    │  - LoRA API                       ││
│  │  :8000           │    │  - Metrics (/metrics)            ││
│  │                  │    │  - Topology Client               ││
│  │                  │    │  - Health (/health)              ││
│  └──────────────────┘    └─────────────────────────────────┘│
└─────────────────────────────────────────────────────────────┘

Patio acts as a proxy and management layer, allowing RBG to control inference engines through a standardized HTTP API.

Features

LoRA Adapter Management

Dynamically manage LoRA adapters without engine restarts:

# Load a LoRA adapter
curl -X POST http://localhost:9091/load_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "my-adapter", "lora_path": "/models/my-adapter"}'

# Unload a LoRA adapter
curl -X POST http://localhost:9091/unload_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "my-adapter"}'

Patio proxies these requests to the underlying engine (SGLang or vLLM) using their native APIs.

Unified Prometheus Metrics

Patio scrapes the inference engine's /metrics endpoint and normalizes metric names:

Engine Metric	Patio Metric
`sglang:num_running_reqs`	`patio:num_requests_running`
`vllm:num_requests_running`	`patio:num_requests_running`
`sglang:num_prompt_tokens_total`	`patio:input_tokens_total`

Access metrics at http://localhost:9091/metrics in Prometheus text format.

Distributed Topology Management

For multi-node inference deployments, Patio manages worker registration and discovery:

Worker Registration — Automatically registers with a central router on startup
Heartbeat — Periodic heartbeats ensure automatic recovery if the router restarts
Graceful Shutdown — Unregisters from the router on SIGTERM/SIGINT

Configure via environment variables:

env:
  - name: TOPO_TYPE
    value: "SGLang"
  - name: SGL_ROUTER_ROLE_NAME
    value: "router"
  - name: SGL_ROUTER_PORT
    value: "8000"

Quick Start

Prerequisites

Python 3.8+
Kubernetes cluster with RoleBasedGroup installed
Inference engine (SGLang or vLLM) running in the same pod

Installation

Install dependencies:

pip install -r requirements.txt

For development, also install test dependencies:

pip install -r requirements-dev.txt

Running Locally

Start the Patio server:

python -m patio.app --host 127.0.0.1 --port 9091

Command Line Options

Option	Default	Description
`--host`	`0.0.0.0`	Host to listen on
`--port`	`9091`	Port to listen on
`--log-level`	`INFO`	Logging level (DEBUG, INFO, WARNING, ERROR)
`--enable-fastapi-docs`	`false`	Enable FastAPI documentation endpoints
`--scrape-engine-metrics`	`true`	Enable scraping of engine metrics

Environment Variables

Variable	Description	Default
`INFERENCE_ENGINE`	Inference engine type (`sglang` or `vllm`)	`sglang`
`INFERENCE_ENGINE_VERSION`	Engine version	`v0.5.3`
`INFERENCE_ENGINE_ENDPOINT`	Engine endpoint URL	`http://localhost:8000`
`TOPO_TYPE`	Topology type (`SGLang` or `None`)	`None`
`GROUP_NAME`	RBG group name	`None`
`ROLE_NAME`	RBG role name	`None`
`ROLE_INDEX`	RBG role index	`None`
`HEARTBEAT_INTERVAL`	Topology heartbeat interval (seconds)	`30`

Usage with RoleBasedGroup

Patio is typically deployed as a sidecar container within a RoleBasedGroup role. Use the ClusterEngineRuntimeProfile CRD (defined in RBG) to inject Patio automatically:

apiVersion: workloads.x-k8s.io/v1alpha2
kind: ClusterEngineRuntimeProfile
metadata:
  name: patio-runtime
spec:
  containers:
    - name: patio
      image: rolebasedgroup/rbgs-patio-runtime:latest
      ports:
        - containerPort: 9091
          name: patio
      env:
        - name: INFERENCE_ENGINE
          value: "sglang"
        - name: INFERENCE_ENGINE_ENDPOINT
          value: "http://localhost:8000"
        - name: TOPO_TYPE
          value: "SGLang"
      resources:
        requests:
          cpu: "100m"
          memory: "128Mi"
---
apiVersion: workloads.x-k8s.io/v1alpha2
kind: RoleBasedGroup
metadata:
  name: my-inference
spec:
  roles:
    - name: inference
      replicas: 2
      engineRuntimes:
        - profileName: patio-runtime
      standalonePattern:
        template:
          spec:
            containers:
              - name: sglang
                image: lmsysorg/sglang:latest
                ports:
                  - containerPort: 8000

API Reference

Server Endpoints

Method	Path	Description
`GET`	`/`	Server status
`GET`	`/health`	Liveness probe
`GET`	`/metrics`	Prometheus metrics

LoRA Endpoints

Method	Path	Description
`POST`	`/load_lora_adapter`	Load a LoRA adapter
`POST`	`/unload_lora_adapter`	Unload a LoRA adapter

Request Models

LoadLoraAdapterRequest:

{
  "lora_name": "string",
  "lora_path": "string"
}

UnLoadLoraAdapterRequest:

{
  "lora_name": "string"
}

Project Structure

inference-engine-runtime/
├── app.py                  # Main entry point (FastAPI + uvicorn)
├── config.py               # Configuration constants
├── envs.py                 # Environment variable parsing
├── logger.py               # Logging configuration
├── api/                    # HTTP API layer
│   ├── server_router.py    # Server endpoints
│   ├── lora_router.py      # LoRA endpoints
│   └── protocol.py         # Request/response models
├── engine/                 # Inference engine abstraction
│   ├── base.py             # Base engine interface
│   ├── sglang_engine.py    # SGLang implementation
│   └── vllm_engine.py      # vLLM implementation
├── metrics/                # Prometheus metrics
│   ├── metrics.py          # Built-in metrics
│   ├── engine_collector.py # Engine metrics scraper
│   └── standard_rules.py   # Metric normalization
├── topo/                   # Topology management
│   ├── factory.py          # Client/server factories
│   ├── client/             # Topology clients
│   └── server/             # Topology servers
├── tests/                  # Unit and E2E tests
└── doc/                    # Additional documentation

Development

Running Tests

# Run all unit tests
python -m pytest tests/mock_tests -v

# Run specific test file
python -m pytest tests/mock_tests/test_sglang_engine.py -v

# Run with coverage
python -m pytest tests/mock_tests --cov=patio --cov-report=term-missing

Building Docker Image

docker build -t inference-engine-runtime:latest .

License

Apache License 2.0. See LICENSE.

Acknowledgments

Patio was originally developed as part of the RoleBasedGroup project and has been extracted into a standalone project for independent development and deployment.