Chapter 4: Server Mode
April 13, 2026
Welcome to Chapter 4: Server Mode. In this part of llama.cpp Tutorial: Local LLM Inference, you will build an intuitive mental model first, then move into concrete implementation details and practical production tradeoffs.
Run llama.cpp as an OpenAI-compatible HTTP server for API access and integration with applications.
Server Architecture
flowchart LR
C[Client: OpenAI SDK / curl] -->|POST /v1/chat/completions| S[llama-server :8080]
S --> Q[Request Queue]
Q --> I[llama.cpp Inference Engine]
I --> M[GGUF Model in RAM/VRAM]
I -->|token stream or full response| S
S -->|JSON / SSE stream| C
Overview
llama.cpp includes a built-in HTTP server that provides an OpenAI-compatible API. This allows you to use any OpenAI client or library with your local models.
Starting the Server
Basic Server Setup
# Start server with default settings
./llama-server -m model.gguf
# With custom host and port
./llama-server -m model.gguf --host 0.0.0.0 --port 8080
# With verbose logging
./llama-server -m model.gguf --verbose
Advanced Server Configuration
# Production server configuration
./llama-server -m model.gguf \
--host 0.0.0.0 \
--port 8000 \
--threads $(nproc) \
--ctx-size 4096 \
--batch-size 512 \
--ubatch-size 512 \
--gpu-layers 0 \
--flash-attn \
--mlock \
--parallel 1 \
--cont-batching \
--metrics \
--log-format json
API Endpoints
The server provides OpenAI-compatible endpoints:
Chat Completions
import requests
# POST /v1/chat/completions
response = requests.post("http://localhost:8080/v1/chat/completions",
json={
"model": "local-model",
"messages": [
{"role": "user", "content": "Hello!"}
],
"temperature": 0.7,
"max_tokens": 100
}
)
result = response.json()
print(result["choices"][0]["message"]["content"])
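Indexing straight into `choices` raises an opaque `KeyError` when the server answers with an error body. The sketch below shows one way to guard that lookup. `extract_reply` is a hypothetical helper, and both sample dictionaries are hand-written to follow the OpenAI-style response shape rather than captured from llama-server.

```python
def extract_reply(result: dict) -> str:
    """Pull the assistant text out of a chat-completions response body.

    Hypothetical helper: assumes an OpenAI-style body with either a
    "choices" list or an "error" entry.
    """
    if "error" in result:
        err = result["error"]
        message = err.get("message", err) if isinstance(err, dict) else err
        raise RuntimeError(f"server returned an error: {message}")
    choices = result.get("choices") or []
    if not choices:
        raise RuntimeError("response contained no choices")
    return choices[0]["message"]["content"]


# Hand-written sample bodies, not real server output
ok = {"choices": [{"message": {"role": "assistant", "content": "Hello there!"}}]}
failed = {"error": {"code": 400, "message": "bad request", "type": "invalid_request_error"}}

print(extract_reply(ok))
try:
    extract_reply(failed)
except RuntimeError as exc:
    print(exc)
```

In real use, pass `response.json()` to the helper after checking `response.status_code`.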
Completions (Legacy)
# POST /v1/completions
response = requests.post("http://localhost:8080/v1/completions",
json={
"model": "local-model",
"prompt": "The capital of France is",
"max_tokens": 50,
"temperature": 0.1
}
)
result = response.json()
print(result["choices"][0]["text"])
Streaming Responses
# Enable streaming
import json

import requests

response = requests.post("http://localhost:8080/v1/chat/completions",
    json={
        "model": "local-model",
        "messages": [{"role": "user", "content": "Tell me a story"}],
        "stream": True
    },
    stream=True
)
for line in response.iter_lines():
    if line:
        line = line.decode('utf-8')
        if line.startswith('data: '):
            data = line[6:]  # Remove 'data: ' prefix
            if data == '[DONE]':
                break
            try:
                chunk = json.loads(data)
                # "content" may be missing or null in role-only and final chunks
                content = chunk["choices"][0]["delta"].get("content") or ""
                print(content, end="", flush=True)
            except json.JSONDecodeError:
                continue
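The per-line parsing in the streaming loop can be pulled into a small function so it can be exercised without a running server. This is a minimal sketch: `extract_delta` is a hypothetical name, and the sample lines are hand-written to mimic OpenAI-style stream chunks.

```python
import json
from typing import Optional


def extract_delta(line: bytes) -> Optional[str]:
    """Return the text carried by one SSE line, or None if there is none.

    Hypothetical helper: assumes OpenAI-style "data: {...}" chunk lines.
    """
    text = line.decode("utf-8").strip()
    if not text.startswith("data: "):
        return None  # blank separator lines and other SSE fields
    payload = text[len("data: "):]
    if payload == "[DONE]":
        return None  # end-of-stream sentinel
    try:
        chunk = json.loads(payload)
    except json.JSONDecodeError:
        return None  # skip malformed lines
    choices = chunk.get("choices") or []
    if not choices:
        return None
    # "content" can be missing or null in role-only and final chunks
    return choices[0].get("delta", {}).get("content") or None


# Hand-written sample lines, not real server output
sample = [
    b'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    b'data: {"choices": [{"delta": {"content": "Once"}}]}',
    b"",
    b'data: {"choices": [{"delta": {"content": " upon"}}]}',
    b"data: [DONE]",
]
story = "".join(d for d in map(extract_delta, sample) if d)
print(story)
```

With a live server, the same function can be applied to each item of `response.iter_lines()`.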
Using OpenAI Client
Drop-in replacement for OpenAI API:
from openai import OpenAI
# Point to your llama.cpp server
client = OpenAI(
api_key="not-needed", # Any string works
base_url="http://localhost:8080/v1"
)
# Use like normal OpenAI API
response = client.chat.completions.create(
model="local-model", # Model name doesn't matter
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"}
],
temperature=0.7,
max_tokens=100
)
print(response.choices[0].message.content)
Server Configuration Options
Performance Tuning
# High-performance configuration
# --gpu-layers: offload layers to the GPU if one is available
# --flash-attn: flash attention for speed
# --mlock:      lock the model in RAM
# --no-mmap:    load the model fully instead of memory-mapping it
# --numa:       NUMA-aware memory allocation (takes a strategy, e.g. distribute)
./llama-server -m model.gguf \
--threads $(nproc) \
--ctx-size 4096 \
--batch-size 2048 \
--ubatch-size 512 \
--gpu-layers 35 \
--flash-attn \
--mlock \
--no-mmap \
--numa distribute
Multi-Model Support
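# A llama-server process started with --model serves one model;
# repeating --model does not load several. Run one instance per model,
# each with its own port and an alias (the model name the API reports).
./llama-server --model models/llama-7b.gguf --alias llama --port 8080 &
./llama-server --model models/mistral-7b.gguf --alias mistral --port 8081 &
./llama-server --model models/codellama.gguf --alias code --port 8082 &
wait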
Context and Memory
# Optimize for different use cases
# --ctx-size 8192:     large context for analysis
# --rope-scaling yarn: YaRN RoPE scaling
# --rope-scale 2.0:    context scale factor
./llama-server -m model.gguf \
--ctx-size 8192 \
--rope-scaling yarn \
--rope-scale 2.0 \
--yarn-ext-factor 1.0 \
--yarn-attn-factor 1.0 \
--yarn-beta-fast 32 \
--yarn-beta-slow 1
Authentication and Security
API Key Authentication
# Enable authentication
./llama-server -m model.gguf \
--api-key sk-your-secret-key
# Use in requests
headers = {"Authorization": "Bearer sk-your-secret-key"}
response = requests.post("http://localhost:8080/v1/chat/completions",
headers=headers,
json={...}
)
CORS Configuration
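# CORS options vary between llama-server versions; confirm that your build
# lists them before relying on the commands below
./llama-server --help | grep -i cors
# Enable CORS for web applications
./llama-server -m model.gguf \
--cors \
--cors-origin http://localhost:3000,http://localhost:5173
# Or allow all origins (development only)
./llama-server -m model.gguf --cors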
Monitoring and Metrics
Server Metrics
# Enable Prometheus metrics
./llama-server -m model.gguf --metrics
# Access metrics at /metrics
curl http://localhost:8080/metrics
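The `/metrics` endpoint returns Prometheus text exposition format, which is easy to inspect from a script when no Prometheus instance is running. The sketch below parses un-labelled samples into a dictionary. `parse_prometheus_text` is a hypothetical helper, and the `example:` metric names in the sample are placeholders, not the names llama-server actually emits.

```python
def parse_prometheus_text(body: str) -> dict:
    """Parse un-labelled samples from Prometheus text exposition format.

    Simplifying assumptions: no labels and no trailing timestamps.
    """
    metrics = {}
    for raw in body.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and "# HELP" / "# TYPE" comments
        name, _, value = line.rpartition(" ")
        if not name or "{" in name:
            continue  # this sketch ignores labelled samples
        try:
            metrics[name] = float(value)
        except ValueError:
            continue  # not a numeric sample
    return metrics


# Placeholder sample; real metric names come from your server's /metrics output
SAMPLE = """\
# HELP example:tokens_predicted_total Number of generation tokens processed.
# TYPE example:tokens_predicted_total counter
example:tokens_predicted_total 128
# HELP example:requests_processing Number of requests being processed.
# TYPE example:requests_processing gauge
example:requests_processing 1
"""

metrics = parse_prometheus_text(SAMPLE)
for name, value in metrics.items():
    print(f"{name} = {value}")
```

Against a live server, the input would be `requests.get("http://localhost:8080/metrics").text`.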
Health Checks
# Health endpoint
response = requests.get("http://localhost:8080/health")
if response.status_code == 200:
print("Server is healthy")
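A single health request is not enough at startup, because the server may refuse connections or report a non-200 status while the model is still loading. The sketch below polls until a probe succeeds. `wait_until_healthy` and `http_probe` are hypothetical names, and the demo uses a fake probe so it runs without a server.

```python
import time
import urllib.request
from typing import Callable


def wait_until_healthy(probe: Callable[[], bool], attempts: int = 30,
                       delay: float = 1.0) -> bool:
    """Call probe() until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # no sleep after the final attempt
    return False


def http_probe(url: str = "http://localhost:8080/health") -> bool:
    """Return True when GET url answers 200; any network error counts as False."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:  # URLError and HTTPError (non-2xx) are OSError subclasses
        return False


# Demo with a fake probe that succeeds on the third call
answers = iter([False, False, True])
ready = wait_until_healthy(lambda: next(answers), attempts=5, delay=0)
print("ready" if ready else "gave up")
```

Against a real server, call `wait_until_healthy(http_probe)` before sending the first request.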
Logging
# Structured JSON logging
./llama-server -m model.gguf \
--log-format json \
--verbose
# Log to file
./llama-server -m model.gguf \
--log-file server.log \
--log-verbosity 1
Load Balancing and Scaling
Multiple Server Instances
#!/bin/bash
# start_multiple_servers.sh
ports=(8080 8081 8082)
models=("llama-7b.gguf" "mistral-7b.gguf" "codellama.gguf")
for i in "${!ports[@]}"; do
port=${ports[$i]}
model=${models[$i]}
echo "Starting server on port $port with $model"
./llama-server -m "models/$model" \
--port $port \
--host 0.0.0.0 \
--threads $(nproc) &
done
wait
Load Balancer Configuration
# nginx.conf
# Round-robin balancing assumes every backend serves the same model
upstream llama_backend {
    server localhost:8080;
    server localhost:8081;
    server localhost:8082;
}
server {
    listen 80;
    server_name api.yourdomain.com;
    location / {
        proxy_pass http://llama_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        # Pass streamed (SSE) tokens through as they arrive
        proxy_buffering off;
        # Timeout settings (non-streamed generations can take longer than 60s)
        proxy_connect_timeout 60s;
        proxy_send_timeout 60s;
        proxy_read_timeout 300s;
    }
}
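The startup script earlier in this section puts a different model on each port, so a request for one model must not land on another model's instance. One option is to route by model name on the client side, sketched below. The `BACKENDS` table and the short model names are illustrative assumptions that mirror the ports in that script.

```python
# Illustrative mapping: model name -> instance started by the script above
BACKENDS = {
    "llama": "http://localhost:8080",
    "mistral": "http://localhost:8081",
    "code": "http://localhost:8082",
}


def backend_for(model: str) -> str:
    """Return the base URL of the instance that serves the given model."""
    try:
        return BACKENDS[model]
    except KeyError:
        known = ", ".join(sorted(BACKENDS))
        raise ValueError(f"unknown model {model!r}; expected one of: {known}") from None


def chat_url(model: str) -> str:
    """Build the chat-completions URL for the given model."""
    return backend_for(model) + "/v1/chat/completions"


print(chat_url("mistral"))
```

Round-robin balancing, as in the nginx configuration, fits the other case: several instances that all serve the same model.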
Integration Examples
LangChain Integration
from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
# Use llama.cpp server with LangChain
llm = ChatOpenAI(
model="local-model",
openai_api_key="dummy",
openai_api_base="http://localhost:8080/v1",
temperature=0.7
)
chain = LLMChain(
llm=llm,
prompt=PromptTemplate(
input_variables=["topic"],
template="Explain {topic} in simple terms for a beginner."
)
)
result = chain.run(topic="quantum computing")
print(result)
FastAPI Integration
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import requests

app = FastAPI()

class ChatRequest(BaseModel):
    message: str
    temperature: float = 0.7
    max_tokens: int = 100

class ChatResponse(BaseModel):
    response: str

# Plain "def" endpoints run in FastAPI's thread pool, so the blocking
# requests call does not stall the event loop
@app.post("/chat", response_model=ChatResponse)
def chat(request: ChatRequest):
    try:
        response = requests.post("http://localhost:8080/v1/chat/completions",
            json={
                "model": "local-model",
                "messages": [{"role": "user", "content": request.message}],
                "temperature": request.temperature,
                "max_tokens": request.max_tokens
            },
            timeout=60
        )
    except requests.exceptions.Timeout:
        raise HTTPException(status_code=504, detail="Request timeout")
    except requests.exceptions.RequestException as e:
        raise HTTPException(status_code=500, detail=str(e))
    # Checked outside the try block so this HTTPException is not re-wrapped
    if response.status_code != 200:
        raise HTTPException(status_code=500, detail="LLM server error")
    result = response.json()
    return ChatResponse(
        response=result["choices"][0]["message"]["content"]
    )

# Health check
@app.get("/health")
def health():
    try:
        response = requests.get("http://localhost:8080/health", timeout=5)
        return {"status": "healthy" if response.status_code == 200 else "unhealthy"}
    except requests.exceptions.RequestException:
        return {"status": "unhealthy"}
Docker Compose Setup
# docker-compose.yml
version: '3.8'
services:
  llama-server:
    build:
      context: .
      dockerfile: Dockerfile.server
    ports:
      - "8080:8080"
    environment:
      - LLAMA_MODEL_PATH=/models/model.gguf
    volumes:
      - ./models:/models:ro
    deploy:
      resources:
        limits:
          cpus: '4.0'
          memory: 8G
        reservations:
          cpus: '2.0'
          memory: 4G
  api-gateway:
    image: nginx:alpine
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - llama-server
  monitoring:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
# Dockerfile.server
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y \
build-essential \
cmake \
git \
&& rm -rf /var/lib/apt/lists/*
# Clone and build llama.cpp
RUN git clone https://github.com/ggerganov/llama.cpp && \
cd llama.cpp && \
mkdir build && \
cd build && \
cmake .. -DCMAKE_BUILD_TYPE=Release && \
make -j$(nproc) llama-server
WORKDIR /llama.cpp/build/bin
EXPOSE 8080
CMD ["./llama-server", "-m", "/models/model.gguf", "--host", "0.0.0.0", "--port", "8080"]
Production Deployment
Systemd Service
# /etc/systemd/system/llama-server.service
[Unit]
Description=llama.cpp Server
After=network.target
[Service]
Type=simple
User=llama
Group=llama
WorkingDirectory=/opt/llama.cpp
ExecStart=/opt/llama.cpp/build/bin/llama-server \
-m /opt/models/model.gguf \
--host 0.0.0.0 \
--port 8080 \
--threads 8 \
--ctx-size 4096 \
--mlock
Restart=always
RestartSec=5
# --mlock needs permission to lock the model's memory
LimitMEMLOCK=infinity
[Install]
WantedBy=multi-user.target
Process Management
# Install service
sudo systemctl daemon-reload
sudo systemctl enable llama-server
sudo systemctl start llama-server
# Monitor service
sudo systemctl status llama-server
sudo journalctl -u llama-server -f
Performance Optimization
GPU Acceleration
# NVIDIA CUDA
# --tensor-split takes per-GPU proportions; 1,1 splits evenly across two GPUs
./llama-server -m model.gguf \
--gpu-layers 35 \
--main-gpu 0 \
--tensor-split 1,1
# AMD ROCm (Linux)
./llama-server -m model.gguf \
--gpu-layers 35 \
--main-gpu 0
# Apple Metal (enabled by default in macOS builds; no runtime flag needed)
./llama-server -m model.gguf \
--gpu-layers 35
Memory Optimization
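# Large model optimization
# A q8_0 KV cache takes roughly half the memory of the default f16 cache;
# quantizing the V cache requires flash attention
./llama-server -m model.gguf \
--ctx-size 4096 \
--rope-scaling yarn \
--rope-scale 2.0 \
--mlock \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--flash-attn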
Concurrent Requests
# Handle multiple requests
# --parallel:      number of requests served concurrently (slots)
# --cont-batching: continuous batching
# --batch-size:    batch size
# --ubatch-size:   micro batch size
./llama-server -m model.gguf \
--parallel 4 \
--cont-batching \
--batch-size 2048 \
--ubatch-size 512
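When sizing `--parallel`, keep in mind how the context window is shared. The arithmetic below assumes that `--ctx-size` is divided evenly among the slots, which is how llama-server has traditionally behaved. Builds that use a unified KV cache may behave differently, so treat this as an estimate. Both function names are hypothetical.

```python
def tokens_per_slot(ctx_size: int, parallel: int) -> int:
    """Context tokens available to each request when ctx_size is split evenly."""
    if parallel < 1:
        raise ValueError("parallel must be at least 1")
    return ctx_size // parallel


def ctx_size_for(tokens_per_request: int, parallel: int) -> int:
    """Total --ctx-size needed so that every slot gets tokens_per_request."""
    if parallel < 1:
        raise ValueError("parallel must be at least 1")
    return tokens_per_request * parallel


# With --ctx-size 4096 and --parallel 4, each request gets a quarter of the window
print(tokens_per_slot(4096, 4))
# To keep 4096 tokens per request across 4 slots, raise --ctx-size accordingly
print(ctx_size_for(4096, 4))
```

Under that assumption, raising `--parallel` without raising `--ctx-size` shortens the prompt each request can hold.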
Troubleshooting
Common Issues
Server won't start:
# Check model file exists
ls -la model.gguf
# Check port availability
netstat -tlnp | grep 8080
# Run with verbose logging
./llama-server -m model.gguf --verbose
Out of memory:
# Reduce context size
./llama-server -m model.gguf --ctx-size 2048
# Use lower quantization
# Try Q3_K or Q2_K models
Slow responses:
# Increase threads
./llama-server -m model.gguf --threads $(nproc)
# Enable GPU layers
./llama-server -m model.gguf --gpu-layers 35
# Use flash attention
./llama-server -m model.gguf --flash-attn
Connection refused:
# Check server is running
ps aux | grep llama-server
# Check firewall
sudo ufw status
sudo ufw allow 8080
Best Practices
- Resource Planning: Calculate memory requirements before deployment
- Health Checks: Implement proper health checks and monitoring
- Security: Use authentication and restrict network access
- Scaling: Plan for load balancing and horizontal scaling
- Monitoring: Set up comprehensive logging and metrics
- Updates: Keep llama.cpp updated for performance improvements
- Testing: Thoroughly test your API endpoints before production
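For the resource-planning item, a rough memory estimate is the GGUF file size plus the KV cache. The sketch below computes the KV cache part from the standard formula: two tensors (K and V) per layer, each holding `n_ctx * n_kv_heads * head_dim` elements. The function name is hypothetical, the example uses the published Llama 2 7B shape (32 layers, 32 KV heads, head dimension 128), and compute buffers and other runtime overhead are not included.

```python
def estimate_kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                            n_ctx: int, bytes_per_element: float = 2.0) -> int:
    """Estimate KV cache size: K and V tensors for every layer and position.

    bytes_per_element is 2 for an f16 cache and 4 for f32.
    """
    elements = 2 * n_layers * n_ctx * n_kv_heads * head_dim
    return int(elements * bytes_per_element)


GIB = 1024 ** 3

# Llama 2 7B shape with a 4096-token context and an f16 cache
kv_bytes = estimate_kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, n_ctx=4096)
print(f"KV cache: {kv_bytes / GIB:.1f} GiB")  # prints "KV cache: 2.0 GiB"
```

Add the size of the model file to this figure and leave headroom before choosing `--ctx-size`, `--parallel`, or a container memory limit.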
The server mode makes llama.cpp accessible via standard HTTP APIs, enabling integration with any application that supports OpenAI-compatible endpoints. This is the most practical way to use llama.cpp in production applications.
What Problem Does This Solve?
Most teams struggle here because the hard part is not writing more code but drawing clear boundaries between llama.cpp, the server process, and the model, so that behavior stays predictable as complexity grows.
In practical terms, this chapter helps you avoid three common failures:
- coupling core logic too tightly to one implementation path
- missing the handoff boundaries between setup, execution, and validation
- shipping changes without clear rollback or observability strategy
After working through this chapter, you should be able to reason about Chapter 4: Server Mode as an operating subsystem inside llama.cpp Tutorial: Local LLM Inference, with explicit contracts for inputs, state transitions, and outputs.
Use the implementation notes around gguf, response, size as your checklist when adapting these patterns to your own repository.
How it Works Under the Hood
Under the hood, Chapter 4: Server Mode usually follows a repeatable control path:
- Context bootstrap: initialize runtime config and prerequisites for llama.
- Input normalization: shape incoming data so server receives stable contracts.
- Core execution: run the main logic branch and propagate intermediate state through model.
- Policy and safety checks: enforce limits, auth scopes, and failure boundaries.
- Output composition: return canonical result payloads for downstream consumers.
- Operational telemetry: emit logs/metrics needed for debugging and performance tuning.
When debugging, walk this sequence in order and confirm each stage has explicit success/failure conditions.
Source Walkthrough
Key source files in ggerganov/llama.cpp:
- examples/server/server.cpp -- HTTP server implementation; OpenAI-compatible route handlers for /v1/chat/completions, /v1/completions, /v1/embeddings
- examples/server/utils.hpp -- JSON serialization helpers for request/response objects
- examples/server/public/index.html -- built-in web UI served at http://localhost:8080
Suggested trace: find the /v1/chat/completions handler in server.cpp to see how messages are tokenized, queued, and streamed back as SSE.