Inference Engine Runtime (Patio)

June 12, 2026 · View on GitHub

License

A Python-based sidecar runtime for AI inference engines on Kubernetes. Patio provides a unified interface between workload controllers (such as RoleBasedGroup) and inference engines like SGLang and vLLM.

Overview

Patio is a lightweight FastAPI server that runs alongside inference engine containers and provides:

  • LoRA Adapter Management — Dynamically load and unload LoRA adapters at runtime without restarting the engine
  • Unified Prometheus Metrics — Scrape, normalize, and re-expose inference engine metrics with a patio: prefix
  • Distributed Topology Management — Register workers with a central router, heartbeat-based recovery, and graceful shutdown
  • Health Check & Readiness — Standard Kubernetes liveness and readiness probes

Architecture

┌─────────────────────────────────────────────────────────────┐
│  Pod                                                         │
│  ┌──────────────────┐    ┌─────────────────────────────────┐│
│  │  Inference Engine │    │  Patio Sidecar (port 9091)      ││
│  │  (SGLang/vLLM)   │◄──►│                                 ││
│  │                  │    │  - LoRA API                       ││
│  │  :8000           │    │  - Metrics (/metrics)            ││
│  │                  │    │  - Topology Client               ││
│  │                  │    │  - Health (/health)              ││
│  └──────────────────┘    └─────────────────────────────────┘│
└─────────────────────────────────────────────────────────────┘

Patio acts as a proxy and management layer, allowing RBG to control inference engines through a standardized HTTP API.

Features

LoRA Adapter Management

Dynamically manage LoRA adapters without engine restarts:

# Load a LoRA adapter
curl -X POST http://localhost:9091/load_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "my-adapter", "lora_path": "/models/my-adapter"}'

# Unload a LoRA adapter
curl -X POST http://localhost:9091/unload_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "my-adapter"}'

Patio proxies these requests to the underlying engine (SGLang or vLLM) using their native APIs.

Unified Prometheus Metrics

Patio scrapes the inference engine's /metrics endpoint and normalizes metric names:

Engine MetricPatio Metric
sglang:num_running_reqspatio:num_requests_running
vllm:num_requests_runningpatio:num_requests_running
sglang:num_prompt_tokens_totalpatio:input_tokens_total

Access metrics at http://localhost:9091/metrics in Prometheus text format.

Distributed Topology Management

For multi-node inference deployments, Patio manages worker registration and discovery:

  • Worker Registration — Automatically registers with a central router on startup
  • Heartbeat — Periodic heartbeats ensure automatic recovery if the router restarts
  • Graceful Shutdown — Unregisters from the router on SIGTERM/SIGINT

Configure via environment variables:

env:
  - name: TOPO_TYPE
    value: "SGLang"
  - name: SGL_ROUTER_ROLE_NAME
    value: "router"
  - name: SGL_ROUTER_PORT
    value: "8000"

Quick Start

Prerequisites

  • Python 3.8+
  • Kubernetes cluster with RoleBasedGroup installed
  • Inference engine (SGLang or vLLM) running in the same pod

Installation

Install dependencies:

pip install -r requirements.txt

For development, also install test dependencies:

pip install -r requirements-dev.txt

Running Locally

Start the Patio server:

python -m patio.app --host 127.0.0.1 --port 9091

Command Line Options

OptionDefaultDescription
--host0.0.0.0Host to listen on
--port9091Port to listen on
--log-levelINFOLogging level (DEBUG, INFO, WARNING, ERROR)
--enable-fastapi-docsfalseEnable FastAPI documentation endpoints
--scrape-engine-metricstrueEnable scraping of engine metrics

Environment Variables

VariableDescriptionDefault
INFERENCE_ENGINEInference engine type (sglang or vllm)sglang
INFERENCE_ENGINE_VERSIONEngine versionv0.5.3
INFERENCE_ENGINE_ENDPOINTEngine endpoint URLhttp://localhost:8000
TOPO_TYPETopology type (SGLang or None)None
GROUP_NAMERBG group nameNone
ROLE_NAMERBG role nameNone
ROLE_INDEXRBG role indexNone
HEARTBEAT_INTERVALTopology heartbeat interval (seconds)30

Usage with RoleBasedGroup

Patio is typically deployed as a sidecar container within a RoleBasedGroup role. Use the ClusterEngineRuntimeProfile CRD (defined in RBG) to inject Patio automatically:

apiVersion: workloads.x-k8s.io/v1alpha2
kind: ClusterEngineRuntimeProfile
metadata:
  name: patio-runtime
spec:
  containers:
    - name: patio
      image: rolebasedgroup/rbgs-patio-runtime:latest
      ports:
        - containerPort: 9091
          name: patio
      env:
        - name: INFERENCE_ENGINE
          value: "sglang"
        - name: INFERENCE_ENGINE_ENDPOINT
          value: "http://localhost:8000"
        - name: TOPO_TYPE
          value: "SGLang"
      resources:
        requests:
          cpu: "100m"
          memory: "128Mi"
---
apiVersion: workloads.x-k8s.io/v1alpha2
kind: RoleBasedGroup
metadata:
  name: my-inference
spec:
  roles:
    - name: inference
      replicas: 2
      engineRuntimes:
        - profileName: patio-runtime
      standalonePattern:
        template:
          spec:
            containers:
              - name: sglang
                image: lmsysorg/sglang:latest
                ports:
                  - containerPort: 8000

API Reference

Server Endpoints

MethodPathDescription
GET/Server status
GET/healthLiveness probe
GET/metricsPrometheus metrics

LoRA Endpoints

MethodPathDescription
POST/load_lora_adapterLoad a LoRA adapter
POST/unload_lora_adapterUnload a LoRA adapter

Request Models

LoadLoraAdapterRequest:

{
  "lora_name": "string",
  "lora_path": "string"
}

UnLoadLoraAdapterRequest:

{
  "lora_name": "string"
}

Project Structure

inference-engine-runtime/
├── app.py                  # Main entry point (FastAPI + uvicorn)
├── config.py               # Configuration constants
├── envs.py                 # Environment variable parsing
├── logger.py               # Logging configuration
├── api/                    # HTTP API layer
│   ├── server_router.py    # Server endpoints
│   ├── lora_router.py      # LoRA endpoints
│   └── protocol.py         # Request/response models
├── engine/                 # Inference engine abstraction
│   ├── base.py             # Base engine interface
│   ├── sglang_engine.py    # SGLang implementation
│   └── vllm_engine.py      # vLLM implementation
├── metrics/                # Prometheus metrics
│   ├── metrics.py          # Built-in metrics
│   ├── engine_collector.py # Engine metrics scraper
│   └── standard_rules.py   # Metric normalization
├── topo/                   # Topology management
│   ├── factory.py          # Client/server factories
│   ├── client/             # Topology clients
│   └── server/             # Topology servers
├── tests/                  # Unit and E2E tests
└── doc/                    # Additional documentation

Development

Running Tests

# Run all unit tests
python -m pytest tests/mock_tests -v

# Run specific test file
python -m pytest tests/mock_tests/test_sglang_engine.py -v

# Run with coverage
python -m pytest tests/mock_tests --cov=patio --cov-report=term-missing

Building Docker Image

docker build -t inference-engine-runtime:latest .

License

Apache License 2.0. See LICENSE.

Acknowledgments

Patio was originally developed as part of the RoleBasedGroup project and has been extracted into a standalone project for independent development and deployment.