Air-Gapped Deployment Guide
May 1, 2026
Deploy LLMKube in environments without internet access. This guide covers deploying models from local file paths, private registries, and pre-downloaded GGUF files.
Use Cases
- Government/Defense: Classified networks with no internet access
- Healthcare: HIPAA-compliant environments with restricted egress
- Finance: Air-gapped trading systems and compliance environments
- Edge: Remote locations with limited or no connectivity
- Corporate: Private networks with strict firewall rules
Prerequisites
- Kubernetes cluster (v1.11.3+) with no internet access
- LLMKube operator installed (see offline installation)
- Pre-downloaded GGUF model file(s)
- llmkube CLI installed on a workstation with cluster access
Quick Start: Deploy from Local Path
Step 1: Pre-download the Model
On a machine with internet access, download the GGUF file:
# Example: Download Llama 3.1 8B
curl -L -o llama-3.1-8b-q4_k_m.gguf \
"https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"
# Verify the file
ls -lh llama-3.1-8b-q4_k_m.gguf
# Should show ~4.9GB
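A size check alone will not catch an HTML error page or login redirect saved under a .gguf name. The sketch below checks for the four-byte "GGUF" magic at the start of the file before you spend time transferring it. The helper name is_gguf is hypothetical and not part of the llmkube CLI.

```shell
# is_gguf FILE: succeed if FILE starts with the 4-byte GGUF magic.
# Hypothetical helper for illustration; not part of llmkube.
is_gguf() {
  [ -f "$1" ] || return 1
  # Strip NUL bytes so the shell does not warn on arbitrary binary input.
  magic=$(head -c 4 "$1" | tr -d '\000')
  [ "$magic" = "GGUF" ]
}
```

For example, `is_gguf llama-3.1-8b-q4_k_m.gguf || echo "not a GGUF file" >&2` flags a bad download. It only inspects the header, so it complements a full checksum rather than replacing it.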
Step 2: Transfer to Air-Gapped Environment
Transfer the GGUF file to a location accessible by your Kubernetes nodes:
# Option A: Copy to a shared NFS mount
cp llama-3.1-8b-q4_k_m.gguf /mnt/nfs/models/
# Option B: Copy to each node (if no shared storage)
scp llama-3.1-8b-q4_k_m.gguf node1:/mnt/models/
scp llama-3.1-8b-q4_k_m.gguf node2:/mnt/models/
# Option C: Use a PersistentVolume
# (see PVC-based deployment below)
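Multi-gigabyte files can be truncated or corrupted on removable media or slow links, so it is worth confirming the copy before deploying. The sketch below assumes GNU sha256sum is available on both sides and that a sidecar file named FILE.sha256 was written next to the model on the connected machine. The sidecar naming and the verify_transfer helper are conventions chosen for this example only.

```shell
# verify_transfer FILE: check FILE against the sidecar FILE.sha256.
# Assumes the sidecar was created on the connected machine with:
#   sha256sum model.gguf > model.gguf.sha256
# Hypothetical helper; returns 2 if the sidecar is missing, 1 on mismatch.
verify_transfer() {
  file=$1
  dir=$(dirname "$file")
  base=$(basename "$file")
  [ -f "$dir/$base.sha256" ] || { echo "missing $base.sha256" >&2; return 2; }
  # Run from the file's directory so the relative name in the sidecar resolves.
  (cd "$dir" && sha256sum -c "$base.sha256" >/dev/null 2>&1)
}
```

Run it once per destination, for example `verify_transfer /mnt/models/llama-3.1-8b-q4_k_m.gguf` on each node when using Option B.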
Step 3: Deploy with CLI
# Deploy using local path
llmkube deploy my-llama --gpu \
--source /mnt/models/llama-3.1-8b-q4_k_m.gguf \
--cpu 4 \
--memory 8Gi \
--gpu-layers 32
# Or use catalog defaults with local override
llmkube deploy llama-3.1-8b --gpu \
--source-override /mnt/models/llama-3.1-8b-q4_k_m.gguf
Step 4: Verify Deployment
# Check model status (should show "Copying" then "Ready")
llmkube list models
# Check service status
llmkube list services
# Test the endpoint
kubectl port-forward svc/my-llama 8080:8080 &
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages":[{"role":"user","content":"Hello!"}]}'
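Because the port-forward runs in the background, the first curl can fire before the tunnel is listening and fail with a connection error. One way to handle this is a small retry wrapper, sketched below. The retry helper is hypothetical, and the attempt count and delay are arbitrary choices.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND...: rerun COMMAND until it succeeds
# or ATTEMPTS is exhausted. Hypothetical helper, not part of llmkube.
retry() {
  attempts=$1; delay=$2; shift 2
  n=1
  while ! "$@"; do
    [ "$n" -ge "$attempts" ] && return 1
    n=$((n + 1))
    sleep "$delay"
  done
}

# Example (same request as above, retried up to 10 times, 1 second apart):
#   retry 10 1 curl -sf http://localhost:8080/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d '{"messages":[{"role":"user","content":"Hello!"}]}'
```

The `-f` flag makes curl return a non-zero status on HTTP errors, so the wrapper also retries while the server is still starting up.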
Deployment Options
Option 1: Absolute File Path
The simplest approach for nodes with local model storage:
apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
name: local-llama
spec:
source: /mnt/models/llama-3.1-8b-q4_k_m.gguf
format: gguf
hardware:
accelerator: cuda
gpu:
enabled: true
count: 1
Requirements:
- Model file must exist at the same path on all nodes where pods may run
- Use a DaemonSet or node affinity to ensure pods land on nodes with the model
Option 2: file:// URL
Equivalent to absolute path, but explicit about the scheme:
apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
name: local-llama
spec:
source: file:///mnt/models/llama-3.1-8b-q4_k_m.gguf
format: gguf
Option 3: PVC Source (Recommended for Air-Gapped)
Mount a model directly from an existing PersistentVolumeClaim — no download, no HostPath, portable across nodes:
apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
name: pvc-llama
spec:
source: pvc://my-models-pvc/llama-3.1-8b-q4_k_m.gguf
format: gguf
hardware:
accelerator: cuda
gpu:
enabled: true
count: 1
The controller validates the PVC exists and is Bound, then sets the model to Ready immediately. The InferenceService mounts the PVC read-only — no init container or download step needed.
CLI equivalent:
llmkube deploy my-llama --gpu \
--source pvc://my-models-pvc/llama-3.1-8b-q4_k_m.gguf
Requirements:
- PVC must exist in the same namespace as the Model
- PVC must be Bound
- Use ReadOnlyMany access mode for multi-replica deployments
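Since the controller requires the PVC to exist and be Bound, it can save a round trip to check that before deploying. The sketch below extracts the claim name from a pvc:// source and reads the claim's phase with kubectl. The helper names are hypothetical, and the fallback to the default namespace is an assumption; pass the namespace of your Model explicitly.

```shell
# pvc_claim SOURCE: print the claim name from a pvc://CLAIM/PATH source.
# Hypothetical helper; fails if SOURCE does not use the pvc:// scheme.
pvc_claim() {
  rest=${1#pvc://}
  [ "$rest" != "$1" ] || return 1
  printf '%s\n' "${rest%%/*}"
}

# pvc_is_bound SOURCE [NAMESPACE]: succeed if the referenced PVC is Bound.
# NAMESPACE defaults to "default" (an assumption made for this sketch).
pvc_is_bound() {
  claim=$(pvc_claim "$1") || return 1
  phase=$(kubectl get pvc "$claim" -n "${2:-default}" \
    -o jsonpath='{.status.phase}') || return 1
  [ "$phase" = "Bound" ]
}
```

For example, `pvc_is_bound pvc://my-models-pvc/llama-3.1-8b-q4_k_m.gguf my-namespace` returns non-zero while the claim is still Pending or if it does not exist.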
Option 4: Private HTTP Server
For environments with an internal model server:
apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
name: internal-llama
spec:
source: http://model-server.internal.corp:8080/models/llama-3.1-8b-q4_k_m.gguf
format: gguf
Set up a simple model server:
# On your internal server
cd /path/to/models
python3 -m http.server 8080
Offline Operator Installation
Option 1: Pre-built Container Images
- On a machine with internet access, pull and save the images:
# Pull images
docker pull ghcr.io/defilantech/llmkube:v0.4.9
docker pull ghcr.io/ggml-org/llama.cpp:server-cuda13
# Save to tar files
docker save ghcr.io/defilantech/llmkube:v0.4.9 > llmkube-controller.tar
docker save ghcr.io/ggml-org/llama.cpp:server-cuda13 > llama-server-cuda.tar
- Transfer tar files to the air-gapped environment
- Load images on each node or into your private registry:
# Load directly on nodes
docker load < llmkube-controller.tar
docker load < llama-server-cuda.tar
# Or push to a private registry (repeat the load, tag, and push steps for the llama.cpp server image)
docker load < llmkube-controller.tar
docker tag ghcr.io/defilantech/llmkube:v0.4.9 registry.internal/llmkube:v0.4.9
docker push registry.internal/llmkube:v0.4.9
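When more than one image has to be mirrored, the tag and push steps are easy to script. The sketch below keeps only the final name:tag component of each upstream reference, which matches the ghcr.io/defilantech/llmkube to registry.internal/llmkube mapping used above. Applying the same naming to the llama.cpp image is an assumption, and both helper names are hypothetical.

```shell
# internal_ref IMAGE REGISTRY: map an upstream image reference to a private
# one by keeping only the final name:tag component.
internal_ref() {
  printf '%s/%s\n' "$2" "${1##*/}"
}

# mirror_images REGISTRY IMAGE...: tag and push each already-loaded image.
# Set DRY_RUN=1 to print the docker commands instead of running them.
mirror_images() {
  registry=$1; shift
  for image in "$@"; do
    target=$(internal_ref "$image" "$registry")
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "docker tag $image $target"
      echo "docker push $target"
    else
      docker tag "$image" "$target" && docker push "$target" || return 1
    fi
  done
}
```

A dry run such as `DRY_RUN=1 mirror_images registry.internal ghcr.io/defilantech/llmkube:v0.4.9 ghcr.io/ggml-org/llama.cpp:server-cuda13` shows the commands before anything is pushed. How the inference pods are pointed at the mirrored llama.cpp image is outside the scope of this sketch.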
Option 2: Helm with Private Registry
# Add Helm repo (on connected machine)
helm repo add llmkube https://defilantech.github.io/LLMKube
helm pull llmkube/llmkube --untar
# Transfer chart to air-gapped environment, then install:
helm install llmkube ./llmkube \
--namespace llmkube-system --create-namespace \
--set image.repository=registry.internal/llmkube \
--set image.tag=v0.4.9
CLI Commands for Air-Gapped Deployments
# Deploy from local file
llmkube deploy my-model --gpu \
--source /mnt/models/model.gguf
# Deploy from PVC (recommended)
llmkube deploy my-model --gpu \
--source pvc://my-models-pvc/model.gguf
# Deploy with SHA256 integrity verification
llmkube deploy my-model --gpu \
--source http://model-server.internal:8080/model.gguf \
--sha256 a1b2c3d4...
# Deploy catalog model with local file override
llmkube deploy llama-3.1-8b --gpu \
--source-override /mnt/models/llama-3.1-8b-q4_k_m.gguf
# Deploy from file:// URL
llmkube deploy my-model --gpu \
--source file:///mnt/models/model.gguf
# Deploy from internal HTTP server
llmkube deploy my-model --gpu \
--source http://model-server.internal:8080/model.gguf
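The commands above differ mainly in the form of the --source value. A wrapper script can classify the source first to catch typos early, for example a relative path that matches none of the supported forms. The sketch below is a hypothetical pre-flight helper, not how llmkube itself parses sources. Treating https:// like http:// is an assumption, since this guide only shows http:// sources.

```shell
# source_kind SOURCE: print which source form SOURCE uses, or fail if it
# matches none of the forms shown in this guide. Hypothetical helper.
source_kind() {
  case "$1" in
    pvc://*)            echo pvc ;;
    file://*)           echo file ;;
    http://*|https://*) echo http ;;   # https is assumed, not shown in this guide
    /*)                 echo path ;;
    *)                  echo unknown; return 1 ;;
  esac
}
```

For example, `source_kind models/model.gguf` prints `unknown` and returns non-zero because the path is not absolute.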
Storage Strategies
Shared Storage (NFS/GlusterFS)
Best for multi-node clusters:
apiVersion: v1
kind: PersistentVolume
metadata:
name: model-storage
spec:
capacity:
storage: 100Gi
accessModes:
- ReadOnlyMany
nfs:
server: nfs.internal
path: /exports/models
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: model-storage
spec:
accessModes:
- ReadOnlyMany
resources:
requests:
storage: 100Gi
Local Storage with Node Affinity
For single-node or node-specific deployments:
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
name: local-llama
spec:
modelRef: local-llama
replicas: 1
# Add node selector to ensure pod lands on node with model
# (configure via Deployment after creation)
Troubleshooting
Model Status Shows "Failed"
# Check model status
kubectl describe model my-model
# Common issues:
# - "file does not exist" - Model path is incorrect or not accessible
# - "permission denied" - File permissions issue
# - "copy incomplete" - Disk space or I/O error
File Not Found on Node
# Verify file exists on the node
kubectl debug node/NODE_NAME -it --image=busybox -- ls -la /mnt/models/
# Check if path is mounted in the pod
kubectl exec -it POD_NAME -- ls -la /models/
Permission Denied
# Check file permissions (should be readable by container user)
ls -la /mnt/models/model.gguf
# Fix permissions
chmod 644 /mnt/models/model.gguf
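Reading `ls -la` output by eye is error-prone when many files are involved. The sketch below tests whether a file has the read bit set for "other", which is what a container running under an unrelated non-root UID relies on. The helper name is hypothetical, and the check does not cover parent directories, which also need to be traversable.

```shell
# readable_by_others FILE: succeed if FILE has the "other" read bit set.
# Hypothetical helper; a missing file also counts as not readable.
readable_by_others() {
  [ -n "$(find "$1" -maxdepth 0 -perm -004 2>/dev/null)" ]
}
```

It can be looped over a model directory, for example `for f in /mnt/models/*.gguf; do readable_by_others "$f" || echo "fix: $f"; done`.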
SHA256 Integrity Verification
LLMKube supports built-in SHA256 verification for model integrity — critical for compliance environments where model provenance must be assured.
Spec-Level Verification
Set the expected hash in the Model spec. The controller computes the hash after download and fails the model if it doesn't match:
apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
name: verified-llama
spec:
source: http://model-server.internal:8080/llama-3.1-8b-q4_k_m.gguf
sha256: "a1b2c3d4e5f6...64-char-hex-string..."
format: gguf
CLI Usage
# Deploy with integrity verification
llmkube deploy my-model --gpu \
--source http://model-server.internal:8080/model.gguf \
--sha256 a1b2c3d4e5f6...
# Compute hash of a local file first
sha256sum /mnt/models/model.gguf
Status Tracking
The computed SHA256 is always stored in status.sha256, even when no expected hash is provided. This lets you audit deployed models:
kubectl get model my-model -o jsonpath='{.status.sha256}'
Note: SHA256 verification is not available for pvc:// sources, because the controller never mounts the PVC and therefore cannot read the file to hash it.
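For an audit, the hash reported in status.sha256 can be compared with the hash of a trusted local copy of the same file. The sketch below assumes both values are lowercase hex, as sha256sum prints them, and that the Model is not a pvc:// source. The audit_model helper is hypothetical.

```shell
# audit_model NAME FILE: compare the hash in the Model status with the hash
# of a trusted local copy. Returns 0 on match, 1 on mismatch, 2 if the
# status could not be read. Hypothetical helper, not part of llmkube.
audit_model() {
  reported=$(kubectl get model "$1" -o jsonpath='{.status.sha256}') || return 2
  expected=$(sha256sum "$2" | cut -d ' ' -f 1)
  if [ "$reported" = "$expected" ]; then
    echo "OK $1"
  else
    echo "MISMATCH $1: status=$reported local=$expected"
    return 1
  fi
}
```

For example, `audit_model my-model /mnt/models/model.gguf` prints a single OK or MISMATCH line that can be captured as audit evidence.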
Security Considerations
- Model Integrity: Use the sha256 field to verify model checksums automatically
- File Permissions: Restrict model file access to necessary users/groups
- Network Segmentation: Ensure internal model servers are properly firewalled
- Audit Logging: Track model deployments and access events through the controller's structured logs and Prometheus metrics. For SOC 2 / HIPAA / FedRAMP environments, forward operator logs to your SIEM via Vector, Fluent Bit, or your existing logging stack.
Next Steps
- GPU Setup Guide - Configure GPU acceleration
- Model Cache Guide - Manage cached models
- Multi-GPU Deployment - Scale to multiple GPUs
Support
- Issues: GitHub Issues
- Documentation: README.md