Air-Gapped Deployment Guide

May 1, 2026

Deploy LLMKube in environments without internet access. This guide covers deploying models from local file paths, private registries, and pre-downloaded GGUF files.

Use Cases

Government/Defense: Classified networks with no internet access
Healthcare: HIPAA-compliant environments with restricted egress
Finance: Air-gapped trading systems and compliance environments
Edge: Remote locations with limited or no connectivity
Corporate: Private networks with strict firewall rules

Prerequisites

Kubernetes cluster (v1.11.3+) with no internet access
LLMKube operator installed (see offline installation)
Pre-downloaded GGUF model file(s)
llmkube CLI installed on a workstation with cluster access

Quick Start: Deploy from Local Path

Step 1: Pre-download the Model

On a machine with internet access, download the GGUF file:

# Example: Download Llama 3.1 8B
curl -L -o llama-3.1-8b-q4_k_m.gguf \
  "https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"

# Verify the file
ls -lh llama-3.1-8b-q4_k_m.gguf
# Should show ~4.9GB
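Because the file still has to cross the air gap, it can help to record a checksum next to it before transfer and re-check it on the far side. The following is a minimal sketch using GNU coreutils `sha256sum`. The helper names `write_checksum` and `verify_checksum` and the `.sha256` sidecar convention are illustrative assumptions, not part of LLMKube.

```shell
# Sketch: record a checksum before transfer, verify it after transfer.
# write_checksum / verify_checksum are hypothetical helper names.

write_checksum() {
  # Writes "<digest>  <filename>" to <filename>.sha256 beside the model.
  # The sidecar stores a relative name, so it verifies in any directory.
  local file="$1"
  ( cd "$(dirname "$file")" &&
    sha256sum "$(basename "$file")" > "$(basename "$file").sha256" )
}

verify_checksum() {
  # Succeeds only if the file still matches its sidecar checksum.
  local file="$1"
  ( cd "$(dirname "$file")" &&
    sha256sum --check --quiet "$(basename "$file").sha256" )
}

# Demo on a small stand-in file (substitute the real GGUF path).
demo_dir="$(mktemp -d)"
printf 'not a real model\n' > "$demo_dir/demo.gguf"
write_checksum "$demo_dir/demo.gguf"
verify_checksum "$demo_dir/demo.gguf" && echo "checksum OK"
```

Run `write_checksum` on the connected machine, transfer both files together, and run `verify_checksum` on the destination before deploying.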

Step 2: Transfer to Air-Gapped Environment

Transfer the GGUF file to a location accessible by your Kubernetes nodes:

# Option A: Copy to a shared NFS mount
cp llama-3.1-8b-q4_k_m.gguf /mnt/nfs/models/

# Option B: Copy to each node (if no shared storage)
scp llama-3.1-8b-q4_k_m.gguf node1:/mnt/models/
scp llama-3.1-8b-q4_k_m.gguf node2:/mnt/models/

# Option C: Use a PersistentVolume
# (see PVC-based deployment below)

Step 3: Deploy with CLI

# Deploy using local path
llmkube deploy my-llama --gpu \
  --source /mnt/models/llama-3.1-8b-q4_k_m.gguf \
  --cpu 4 \
  --memory 8Gi \
  --gpu-layers 32

# Or use catalog defaults with local override
llmkube deploy llama-3.1-8b --gpu \
  --source-override /mnt/models/llama-3.1-8b-q4_k_m.gguf

Step 4: Verify Deployment

# Check model status (should show "Copying" then "Ready")
llmkube list models

# Check service status
llmkube list services

# Test the endpoint
kubectl port-forward svc/my-llama 8080:8080 &
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello!"}]}'
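The port-forward above is started in the background, so the first request can race it, and the model may still be loading. A small retry wrapper is one way to handle this. The sketch below is generic shell; `retry` is a hypothetical helper, and the attempt count and delay are arbitrary choices.

```shell
# retry MAX DELAY CMD...: run CMD until it succeeds or MAX attempts are used.
retry() {
  local max="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Usage against the forwarded port (not executed here):
#   retry 10 2 curl --silent --fail --output /dev/null \
#     http://localhost:8080/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d '{"messages":[{"role":"user","content":"Hello!"}]}'
```

`curl --fail` makes HTTP error responses count as failures, so the loop keeps waiting until the endpoint actually answers.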

Deployment Options

Option 1: Absolute File Path

The simplest approach for nodes with local model storage:

apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
  name: local-llama
spec:
  source: /mnt/models/llama-3.1-8b-q4_k_m.gguf
  format: gguf
  hardware:
    accelerator: cuda
    gpu:
      enabled: true
      count: 1

Requirements:

Model file must exist at the same path on all nodes where pods may run
Use node affinity to pin pods to nodes that hold the model, or a DaemonSet to place the file on every node

Option 2: file:// URL

Equivalent to absolute path, but explicit about the scheme:

apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
  name: local-llama
spec:
  source: file:///mnt/models/llama-3.1-8b-q4_k_m.gguf
  format: gguf

Option 3: PVC Source (Recommended for Air-Gapped)

Mount a model directly from an existing PersistentVolumeClaim — no download, no HostPath, portable across nodes:

apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
  name: pvc-llama
spec:
  source: pvc://my-models-pvc/llama-3.1-8b-q4_k_m.gguf
  format: gguf
  hardware:
    accelerator: cuda
    gpu:
      enabled: true
      count: 1

The controller validates the PVC exists and is Bound, then sets the model to Ready immediately. The InferenceService mounts the PVC read-only — no init container or download step needed.

CLI equivalent:

llmkube deploy my-llama --gpu \
  --source pvc://my-models-pvc/llama-3.1-8b-q4_k_m.gguf

Requirements:

PVC must exist in the same namespace as the Model
PVC must be Bound
Use ReadOnlyMany access mode for multi-replica deployments
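The requirements above can be checked before creating the Model. This sketch assumes the source layout is `pvc://<claim-name>/<path-inside-volume>`, which is inferred from the examples in this guide rather than from a specification. `parse_pvc_source` and `pvc_is_bound` are hypothetical helpers; the `kubectl get pvc` call with a `jsonpath` output is standard kubectl.

```shell
# parse_pvc_source SOURCE: print "<claim> <path>" for a pvc:// source.
# Assumed layout: pvc://<claim-name>/<path-inside-volume>.
parse_pvc_source() {
  local src="$1" rest
  case "$src" in
    pvc://*/*) ;;
    *) echo "not a pvc:// source: $src" >&2; return 1 ;;
  esac
  rest="${src#pvc://}"
  printf '%s %s\n' "${rest%%/*}" "${rest#*/}"
}

# pvc_is_bound CLAIM [NAMESPACE]: succeed when the claim's phase is Bound.
pvc_is_bound() {
  local claim="$1" ns="${2:-default}" phase
  phase="$(kubectl get pvc "$claim" -n "$ns" -o jsonpath='{.status.phase}')" || return 1
  [ "$phase" = "Bound" ]
}

parse_pvc_source "pvc://my-models-pvc/llama-3.1-8b-q4_k_m.gguf"
# Against a live cluster: pvc_is_bound my-models-pvc default
```

Pass the namespace of the Model as the second argument, since the claim must live in the same namespace.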

Option 4: Private HTTP Server

For environments with an internal model server:

apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
  name: internal-llama
spec:
  source: http://model-server.internal.corp:8080/models/llama-3.1-8b-q4_k_m.gguf
  format: gguf

Set up a simple model server:

# On your internal server
cd /path/to/models
python3 -m http.server 8080

Offline Operator Installation

Option 1: Pre-built Container Images

On a machine with internet access, pull and save the images:

# Pull images
docker pull ghcr.io/defilantech/llmkube:v0.4.9
docker pull ghcr.io/ggml-org/llama.cpp:server-cuda13

# Save to tar files
docker save ghcr.io/defilantech/llmkube:v0.4.9 > llmkube-controller.tar
docker save ghcr.io/ggml-org/llama.cpp:server-cuda13 > llama-server-cuda.tar

Transfer the tar files to the air-gapped environment, then load the images on each node or push them to your private registry:

# Load directly on nodes
docker load < llmkube-controller.tar
docker load < llama-server-cuda.tar

# Or push to private registry
docker load < llmkube-controller.tar
docker tag ghcr.io/defilantech/llmkube:v0.4.9 registry.internal/llmkube:v0.4.9
docker push registry.internal/llmkube:v0.4.9
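The commands above retag and push only the controller image. When several images need mirroring, the tag and push commands can be generated in a loop. The sketch below prints the commands instead of running them. `mirror_commands` is a hypothetical helper, and the target name for the llama.cpp image simply follows the same pattern the guide uses for the controller, which is an assumption. How the operator is pointed at a mirrored runtime image is not covered here.

```shell
# mirror_commands REGISTRY IMAGE...: print the docker commands that would
# retag each image under REGISTRY, keeping only the final name and tag.
mirror_commands() {
  local registry="$1" image target
  shift
  for image in "$@"; do
    target="$registry/${image##*/}"
    printf 'docker tag %s %s\n' "$image" "$target"
    printf 'docker push %s\n' "$target"
  done
}

mirror_commands registry.internal \
  ghcr.io/defilantech/llmkube:v0.4.9 \
  ghcr.io/ggml-org/llama.cpp:server-cuda13
```

Review the printed commands, then pipe the output to `sh` on a host where the images have already been loaded.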

Option 2: Helm with Private Registry

# Add Helm repo (on connected machine)
helm repo add llmkube https://defilantech.github.io/LLMKube
helm pull llmkube/llmkube --untar

# Transfer chart to air-gapped environment, then install:
helm install llmkube ./llmkube \
  --namespace llmkube-system --create-namespace \
  --set image.repository=registry.internal/llmkube \
  --set image.tag=v0.4.9

CLI Commands for Air-Gapped Deployments

# Deploy from local file
llmkube deploy my-model --gpu \
  --source /mnt/models/model.gguf

# Deploy from PVC (recommended)
llmkube deploy my-model --gpu \
  --source pvc://my-models-pvc/model.gguf

# Deploy with SHA256 integrity verification
llmkube deploy my-model --gpu \
  --source http://model-server.internal:8080/model.gguf \
  --sha256 a1b2c3d4...

# Deploy catalog model with local file override
llmkube deploy llama-3.1-8b --gpu \
  --source-override /mnt/models/llama-3.1-8b-q4_k_m.gguf

# Deploy from file:// URL
llmkube deploy my-model --gpu \
  --source file:///mnt/models/model.gguf

# Deploy from internal HTTP server
llmkube deploy my-model --gpu \
  --source http://model-server.internal:8080/model.gguf

Storage Strategies

Shared Storage (NFS/GlusterFS)

Best for multi-node clusters:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: model-storage
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadOnlyMany
  nfs:
    server: nfs.internal
    path: /exports/models
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage
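spec:
  # An empty class disables dynamic provisioning so the claim binds to the PV above
  storageClassName: ""
  volumeName: model-storage
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 100Gi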

Local Storage with Node Affinity

For single-node or node-specific deployments:

apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: local-llama
spec:
  modelRef: local-llama
  replicas: 1
  # Add node selector to ensure pod lands on node with model
  # (configure via Deployment after creation)
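One way to apply the node selector mentioned in the comment is a merge patch on the generated Deployment. This is a sketch under two assumptions: that the Deployment carries the same name as the InferenceService, and that the controller does not revert manual changes on its next reconcile. `node_selector_patch` is a hypothetical helper; `kubernetes.io/hostname` is the standard node label.

```shell
# node_selector_patch HOSTNAME: print a merge patch that pins the pod
# template to the node whose kubernetes.io/hostname label is HOSTNAME.
node_selector_patch() {
  local hostname="$1"
  printf '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/hostname":"%s"}}}}}\n' "$hostname"
}

node_selector_patch node1

# Against a live cluster (Deployment name is an assumption):
#   kubectl patch deployment local-llama --type merge \
#     -p "$(node_selector_patch node1)"
```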

Troubleshooting

Model Status Shows "Failed"

# Check model status
kubectl describe model my-model

# Common issues:
# - "file does not exist" - Model path is incorrect or not accessible
# - "permission denied" - File permissions issue
# - "copy incomplete" - Disk space or I/O error

File Not Found on Node

# Verify file exists on the node (kubectl debug mounts the node's root filesystem at /host)
kubectl debug node/NODE_NAME -it --image=busybox -- ls -la /host/mnt/models/

# Check if path is mounted in the pod
kubectl exec -it POD_NAME -- ls -la /models/

Permission Denied

# Check file permissions (should be readable by container user)
ls -la /mnt/models/model.gguf

# Fix permissions
chmod 644 /mnt/models/model.gguf
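Before changing permissions, a quick check can show whether the file is already readable by users other than its owner and group. The sketch below uses GNU `stat`, as found on most Linux nodes. `others_can_read` is a hypothetical helper, and testing the "other" read bit is a conservative assumption, since the container user may also gain access through ownership or group membership.

```shell
# others_can_read FILE: succeed when the "other" read bit is set.
others_can_read() {
  local mode
  mode="$(stat -c '%a' "$1")" || return 1
  # A leading 0 makes the shell treat the mode as octal.
  [ $(( 0$mode & 4 )) -ne 0 ]
}

# Demo on a temporary stand-in file.
f="$(mktemp)"
chmod 600 "$f"
others_can_read "$f" || echo "not readable by other users: $f"
chmod 644 "$f"
others_can_read "$f" && echo "readable by other users: $f"
```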

SHA256 Integrity Verification

LLMKube supports built-in SHA256 verification for model integrity — critical for compliance environments where model provenance must be assured.

Spec-Level Verification

Set the expected hash in the Model spec. The controller computes the hash after download and fails the model if it doesn't match:

apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
  name: verified-llama
spec:
  source: http://model-server.internal:8080/llama-3.1-8b-q4_k_m.gguf
  sha256: "a1b2c3d4e5f6...64-char-hex-string..."
  format: gguf

CLI Usage

# Deploy with integrity verification
llmkube deploy my-model --gpu \
  --source http://model-server.internal:8080/model.gguf \
  --sha256 a1b2c3d4e5f6...

# Compute hash of a local file first
sha256sum /mnt/models/model.gguf
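`sha256sum` prints the digest followed by the file name, while the `--sha256` flag takes only the digest. The sketch below extracts the digest and checks that it looks like a complete 64-character hex string before it is used. `file_sha256` and `is_sha256_hex` are hypothetical helpers, and whether LLMKube accepts uppercase digests is not stated in this guide, so the check only accepts the lowercase form that `sha256sum` emits.

```shell
# file_sha256 FILE: print only the hex digest of FILE.
file_sha256() {
  sha256sum "$1" | cut -d ' ' -f 1
}

# is_sha256_hex VALUE: succeed for exactly 64 lowercase hex characters.
is_sha256_hex() {
  printf '%s' "$1" | grep -Eq '^[0-9a-f]{64}$'
}

# Demo on a small stand-in file (substitute the real model path).
f="$(mktemp)"
printf 'abc' > "$f"
digest="$(file_sha256 "$f")"
is_sha256_hex "$digest" && echo "--sha256 $digest"
```

The format check catches truncated or placeholder values before they reach the Model spec.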

Status Tracking

The computed SHA256 is always stored in status.sha256, even when no expected hash is provided. This lets you audit deployed models:

kubectl get model my-model -o jsonpath='{.status.sha256}'
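For an audit, the stored value can be compared against a digest recorded when the model was approved. This sketch wraps the `kubectl` query above; `audit_model` is a hypothetical helper, and its return codes are an arbitrary convention.

```shell
# audit_model NAME EXPECTED: compare a Model's status.sha256 with a
# known-good digest. Returns 0 on match, 1 on mismatch, 2 if the query fails.
audit_model() {
  local name="$1" expected="$2" actual
  actual="$(kubectl get model "$name" -o jsonpath='{.status.sha256}')" || return 2
  if [ "$actual" = "$expected" ]; then
    echo "$name: OK"
  else
    echo "$name: MISMATCH (status has '$actual')" >&2
    return 1
  fi
}

# Against a live cluster:
#   audit_model my-model "$(sha256sum /mnt/models/model.gguf | cut -d ' ' -f 1)"
```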

Note: SHA256 verification is not available for pvc:// sources since the controller does not mount the PVC.

Security Considerations

Model Integrity: Use the sha256 field to verify model checksums automatically
File Permissions: Restrict model file access to necessary users/groups
Network Segmentation: Ensure internal model servers are properly firewalled
Audit Logging: Track model deployments and access events through the controller's structured logs and Prometheus metrics. For SOC 2 / HIPAA / FedRAMP environments, forward operator logs to your SIEM via Vector, Fluent Bit, or your existing logging stack.

Next Steps

GPU Setup Guide - Configure GPU acceleration
Model Cache Guide - Manage cached models
Multi-GPU Deployment - Scale to multiple GPUs

Support

Issues: GitHub Issues
Documentation: README.md