local-claude
April 28, 2026 · View on GitHub
Run Claude Code with local LLMs instead of the Anthropic API. Keeps the default claude command untouched — use local-claude when you want to go fully offline, offload inference to a remote GPU, or experiment with Apple Intelligence on-device.
Why
Claude Code is an excellent coding agent, but it requires an Anthropic API subscription. This wrapper redirects it to a local or remote inference server (LM Studio, llama.cpp, a remote llama.cpp via SSH, or Apple Intelligence via apfel) so you can experiment with open-weight models at zero cost.
The default claude command remains unchanged — your cloud subscription is never affected.
How it works
local-claude ──► sets env vars ──► claude --model <detected-model>
│
├── ANTHROPIC_BASE_URL → local/remote server
├── ANTHROPIC_API_KEY → "local"
├── CLAUDE_CONFIG_DIR → ~/.claude-local (isolated)
└── CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC → 1
The script:
- Queries the server's
/v1/modelsendpoint (or lists remote GGUF files via SSH) - Lets you pick a model (or auto-selects if only one is loaded)
- For llama.cpp backends: auto-starts
llama-server(locally or via SSH) and kills it on exit - For llama.cpp backends: auto-detects the smallest same-family model for speculative decoding (with automatic fallback if the draft model is incompatible)
- Launches
claudewith the right environment
Prerequisites
- Claude Code installed (
claudein PATH) - Python 3 (for model selection logic)
- One of:
Install
# Clone
git clone https://github.com/lucaspwo/local-claude.git
cd local-claude
# Copy scripts to PATH
cp local-claude apfel-proxy.py ~/.local/bin/
chmod +x ~/.local/bin/local-claude
Usage
With LM Studio (default)
Start LM Studio, load a model, enable the local server (port 1234), then:
local-claude
With llama.cpp (local)
Place .gguf files in ~/Models/gguf/ (or set MODELS_DIR), then:
local-claude --backend llama
The script starts llama-server, lets you pick a model, and automatically stops the server when you exit Claude Code.
With remote llama.cpp (via SSH)
Offload inference to a remote machine (e.g., a desktop with a dedicated GPU). The script SSHs into the remote host, lists available GGUF models, starts llama-server there, and stops it when you exit:
local-claude --backend remote-llama
This is ideal for using a lightweight laptop (e.g., MacBook) as a client while a more powerful machine (e.g., a PC with an NVIDIA RTX GPU) handles inference.
With Apple Intelligence (macOS 26+)
Uses Apple's on-device foundation model via apfel. Requires Apple Silicon, macOS 26 Tahoe, and Apple Intelligence enabled.
brew tap Arthur-Ficial/tap && brew install apfel
local-claude --backend apfel
Important limitations: Apple Intelligence has a 4096-token context window — far too small for Claude Code's tool schemas and system prompt. The backend runs in chat-only mode (
--bare --tools ""): you can have conversations, but the agent cannot use tools (edit files, run commands, etc.). A lightweight proxy (apfel-proxy.py) translates between the Anthropic Messages API that Claude Code speaks and the OpenAI Chat Completions API that apfel exposes.
With a pre-running remote server
If you prefer to manage the remote server yourself, use the remote backend to connect to any already-running OpenAI-compatible server:
local-claude --backend remote --host 192.0.2.100 --port 8091
Shell aliases (optional)
# ~/.zshrc or ~/.bashrc
alias sl='local-claude' # LM Studio
alias sllama='local-claude --backend llama' # llama.cpp (local)
alias sremote='local-claude --backend remote-llama' # llama.cpp (remote via SSH)
alias sapfel='local-claude --backend apfel' # Apple Intelligence
# With SpecStory session recording
alias slocal='specstory run claude -c local-claude --no-cloud-sync'
alias sllama='specstory run claude -c "local-claude --backend llama" --no-cloud-sync'
alias sremote='specstory run claude -c "local-claude --backend remote-llama" --no-cloud-sync'
Backends
| Backend | Server management | Use case |
|---|---|---|
lmstudio (default) | Connects to running LM Studio | GUI-based model management |
llama | Starts/stops local llama-server | Local inference with llama.cpp |
remote-llama | Starts/stops llama-server on remote host via SSH | Offload to a remote GPU |
remote | Connects to any running server | Manual server management |
apfel | Starts apfel + API proxy | Apple Intelligence on-device (chat only) |
Configuration
All settings are via environment variables — no config files needed.
| Variable | Default | Description |
|---|---|---|
LCC_HOST | — | Remote server host IP (used by remote and remote-llama backends only) |
LCC_PORT | 8091 | Remote server port (remote and remote-llama backends) |
LLAMA_PORT | 8090 | llama.cpp local server port |
LLAMA_SERVER | ~/git/llama.cpp/build/bin/llama-server | Path to llama-server binary |
MODELS_DIR | ~/Models/gguf | Directory containing .gguf model files |
LLAMA_DRAFT | (auto-detected) | Explicit path to draft model for speculative decoding |
APFEL_PORT | 11434 | apfel server port |
APFEL_ARGS | — | Extra arguments for apfel --serve (e.g., --cors --max-concurrent 5) |
REMOTE_SSH_HOST | (required) | SSH host for remote-llama backend |
REMOTE_MODELS_DIR | (required) | Absolute path to GGUF directory on the remote host (e.g., /home/<user>/models/gguf) |
REMOTE_LLAMA_DIR | (required) | Absolute path to llama-server directory on the remote host (e.g., /home/<user>/git/llama.cpp/build/bin) |
REMOTE_LLAMA_TQ3_DIR | (required with --tq3) | Absolute path to llama-server directory for the TQ3 fork (e.g., /home/<user>/git/llama.cpp-tq3/build/bin) |
LLAMA_CTX_SIZE | 65536 (or 32768 with --tq3) | Context size for llama and remote-llama backends |
Examples
# Use a remote LM Studio server
local-claude --host 192.0.2.62
# Use a specific draft model
LLAMA_DRAFT=~/Models/gguf/qwen2.5-0.5b-instruct-q8_0.gguf local-claude --backend llama
# Remote llama.cpp via SSH (all 4 vars are required)
REMOTE_SSH_HOST=myserver \
REMOTE_MODELS_DIR=/home/lucas/models/gguf \
REMOTE_LLAMA_DIR=/home/lucas/git/llama.cpp/build/bin \
LCC_HOST=192.0.2.5 \
local-claude --backend remote-llama
# Connect to a pre-running remote server
local-claude --backend remote --host 192.0.2.5 --port 8091
Speculative decoding
When using the llama or remote-llama backends, the script automatically enables speculative decoding if it finds a smaller model from the same family in the models directory.
How it works: A small "draft" model generates candidate tokens that the larger "target" model verifies in a single batch. Accepted tokens are free — rejected ones get regenerated normally. The result is identical output at higher throughput.
Example: With Qwen2.5-7B as target and Qwen2.5-0.5B as draft:
| Target | Draft | Platform | Tokens/sec | Speedup |
|---|---|---|---|---|
| 7B Q8_0 | (none) | Apple M4 Pro (24 GB) | 29 t/s | — |
| 7B Q8_0 | 0.5B Q8_0 | Apple M4 Pro (24 GB) | 57 t/s | ~2x |
| 7B Q8_0 | 0.5B Q8_0 | RTX 4070 Ti SUPER (16 GB) | 177 t/s | ~6x |
| 7B Q8_0 | 1.5B Q4_K_M | Apple M4 Pro (24 GB) | 46 t/s | 1.6x |
| 7B Q8_0 | 3B Q4_K_M | Apple M4 Pro (24 GB) | 36 t/s | 1.2x |
Key insight: The smallest draft model wins. The 3B draft is slower than 1.5B despite higher acceptance rate — verification overhead dominates. The script picks the smallest by default.
To override auto-detection:
LLAMA_DRAFT=/path/to/draft.gguf local-claude --backend llama
MCP servers
If you use MCP servers with Claude Code and want them available in LM Studio too, add them to ~/.lmstudio/mcp.json:
{
"mcpServers": {
"chrome-devtools": {
"type": "stdio",
"command": "npx",
"args": ["-y", "chrome-devtools-mcp@latest"]
},
"a11y-accessibility": {
"type": "stdio",
"command": "npx",
"args": ["-y", "a11y-mcp-server"]
}
}
}
Tip: If
npxisn't in LM Studio's PATH, use the full path (e.g.,~/.nvm/versions/node/v20.20.2/bin/npx) and add aPATHentry inenv.
Architecture notes
Config isolation
The script sets CLAUDE_CONFIG_DIR=~/.claude-local so the local session uses a separate config directory. This prevents conflicts with your cloud claude.ai login — claude and local-claude can coexist without auth errors.
Server lifecycle (llama.cpp)
When using --backend llama or --backend remote-llama, the script:
- Starts
llama-serveras a background process (locally or via SSH) - Registers a
trap EXIThandler to kill it - Runs
claudeas a subprocess (notexec) so the trap survives - When Claude Code exits (
/exit, Ctrl+C, etc.), the trap fires and stops the server
For lmstudio and remote, the script uses exec claude since there's no server lifecycle to manage.
Remote llama.cpp setup
The remote-llama backend expects:
- SSH access to the remote host (configured in
~/.ssh/configor viaREMOTE_SSH_HOST) llama-serverbinary (Linux) on the remote host- GGUF model files in
REMOTE_MODELS_DIRon the remote host (use absolute paths —~is not expanded over SSH) - The remote server listens on
0.0.0.0so it's accessible from the network
Split GGUF support
Large models (e.g., 14B+) are often split into multiple .gguf files. The script handles this transparently:
- Only shows the first part in model selection (e.g.,
qwen2.5-14b-instruct-q4_k_m.ggufinstead of all 3 parts) - Passes the first part to
llama-server, which loads the remaining parts automatically - Draft model detection skips non-first split parts
Context size
Claude Code's system prompt uses ~27K tokens. The script defaults to --ctx-size 65536. If you encounter "exceeds context size" errors, increase this value in the script. Larger context = more RAM/VRAM usage.
Troubleshooting
| Problem | Solution |
|---|---|
| "Auth conflict" error | The script should handle this. If not, run claude /logout in a separate terminal |
| "exceeds context size" | Model context too small. Edit --ctx-size in the script or use a larger value |
| llama-server won't start | Check /tmp/llama-server.log for details (local or remote) |
| Remote server not responding | Check SSH connectivity, firewall rules, and that the port is not in use |
| Port already in use | Another service may be using the port. Try a different LCC_PORT |
| Speculative decoding not activating | Ensure draft model is same family (e.g., both Qwen2.5). Check script output for "Draft model" line |
| Draft model fails to load | Some model pairs are incompatible in certain llama.cpp versions (invalid vector subscript). The script retries without speculative decoding automatically |
| LM Studio speculative decoding error | Disable it in LM Studio's model settings — it conflicts with MLX batched inference |
| apfel "context_length_exceeded" | Apple Intelligence has a hard 4096-token limit. The backend already uses --bare --tools "" to minimize context. Keep messages short |
| apfel proxy connection refused | The apfel server crashed (known issue with FoundationModels framework). Restart with local-claude --backend apfel |
| apfel "model does not exist" | The proxy should rewrite all model names. Check /tmp/apfel-proxy.log for details |
| Model too slow | Use a smaller quantization or smaller model. 7B Q8_0 + 0.5B draft is a good sweet spot |
| CUDA not loading on remote host | Run nvidia-smi over SSH. If it fails, the NVIDIA driver isn't installed or the user can't access it. Prebuilt llama.cpp Linux binaries bundle the CUDA runtime — only the driver is required |
Setting up a remote Ubuntu Server host with NVIDIA GPU
Complete step-by-step guide to set up an Ubuntu Server PC as a remote llama.cpp inference server. This was tested with an NVIDIA RTX 4070 Ti SUPER (16 GB VRAM) and Qwen2.5 models.
1. Enable SSH access
Ubuntu Server ships with openssh-server enabled by default. If not:
sudo apt update && sudo apt install -y openssh-server
sudo systemctl enable --now ssh
On the client machine, add an entry to ~/.ssh/config:
Host my-remote-pc
HostName <IP or Tailscale address>
User <username>
IdentityFile ~/.ssh/id_ed25519
Test: ssh my-remote-pc "uname -a" — should show a Linux kernel.
2. Install the NVIDIA driver and CUDA toolkit
llama.cpp publishes no prebuilt Ubuntu+CUDA binary, so we need both the driver and the CUDA toolkit (used at build time).
# Driver — pick the recommended one for the GPU
sudo ubuntu-drivers install
# Or pin a specific version (CUDA 13.x needs driver ≥ 580; CUDA 12.x needs ≥ 545):
sudo apt install -y nvidia-driver-580
sudo reboot
After reboot, verify the driver:
nvidia-smi
# Should show the GPU model and driver version
Install the CUDA toolkit. Easiest path is NVIDIA's apt repo (more recent than nvidia-cuda-toolkit from Ubuntu's archive):
# https://developer.nvidia.com/cuda-downloads — pick "deb (network)" for your Ubuntu version
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit # Pulls the latest available (13.x at time of writing)
# Add to PATH (in ~/.bashrc):
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
nvcc --version # Verify
3. Build llama.cpp with CUDA support
sudo apt install -y build-essential cmake git
# Use absolute paths — ~ is not expanded over SSH
mkdir -p "$HOME/git"
git clone https://github.com/ggml-org/llama.cpp.git "$HOME/git/llama.cpp"
cd "$HOME/git/llama.cpp"
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j "$(nproc)"
# The binary lands in build/bin/
ls -lh build/bin/llama-server
Verify CUDA is wired in:
"$HOME/git/llama.cpp/build/bin/llama-server" --help 2>&1 | head -5
# Should show: "ggml_cuda_init: found 1 CUDA devices"
# If it only shows "load_backend: loaded CPU backend", CUDA wasn't detected at build time
Set REMOTE_LLAMA_DIR to /home/<user>/git/llama.cpp/build/bin (where llama-server lives).
Create the models directory:
mkdir -p "$HOME/models/gguf" # Or use a larger drive, e.g., /data/models/gguf
Optional — TQ3 fork. If you also want the TQ3 quantization fork (used by
local-claude --backend remote-llama --tq3), clone and build it side-by-side as$HOME/git/llama.cpp-tq3, then setREMOTE_LLAMA_TQ3_DIRto itsbuild/bindirectory.
4. Download GGUF models
Download models from Hugging Face. For Qwen2.5 with speculative decoding:
MODELS="$HOME/models/gguf" # Adjust to your storage path
# Main model — Qwen2.5-7B-Instruct Q8_0 (~8 GB, fits in 16 GB VRAM)
# Note: this model is split into 3 files, download ALL of them
for i in 1 2 3; do
curl -L -o "$MODELS/qwen2.5-7b-instruct-q8_0-0000${i}-of-00003.gguf" \
"https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q8_0-0000${i}-of-00003.gguf"
done
# Draft model — Qwen2.5-0.5B-Instruct Q8_0 (~645 MB, for speculative decoding)
curl -L -o "$MODELS/qwen2.5-0.5b-instruct-q8_0.gguf" \
"https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF/resolve/main/qwen2.5-0.5b-instruct-q8_0.gguf"
# Optional: other sizes
# 3B Q4_K_M (~2 GB) — fast, lower quality
curl -L -o "$MODELS/qwen2.5-3b-instruct-q4_k_m.gguf" \
"https://huggingface.co/Qwen/Qwen2.5-3B-Instruct-GGUF/resolve/main/qwen2.5-3b-instruct-q4_k_m.gguf"
# 14B Q4_K_M (~9 GB, split) — better quality, tight fit in 16 GB VRAM
for i in 1 2 3; do
curl -L -o "$MODELS/qwen2.5-14b-instruct-q4_k_m-0000${i}-of-00003.gguf" \
"https://huggingface.co/Qwen/Qwen2.5-14B-Instruct-GGUF/resolve/main/qwen2.5-14b-instruct-q4_k_m-0000${i}-of-00003.gguf"
done
Tip: Keep llama.cpp binaries on a fast SSD, but models can live on a slower drive — they're read sequentially into VRAM at startup and not accessed from disk again.
5. Configure the client
On your local machine (the one running Claude Code), set the environment variables. Best done in your shell profile:
# ~/.zshrc or ~/.bashrc
export REMOTE_SSH_HOST="my-remote-pc" # SSH config host name
export REMOTE_MODELS_DIR="/home/lucas/models/gguf" # absolute path on the remote host
export REMOTE_LLAMA_DIR="/home/lucas/git/llama.cpp/build/bin" # absolute path on the remote host
export LCC_HOST="192.0.2.5" # IP of remote host (reachable from client)
Then run:
local-claude --backend remote-llama
6. Gotchas we discovered
- Use absolute paths in
REMOTE_MODELS_DIR/REMOTE_LLAMA_DIR. The script passes these values inside single quotes over SSH, so~is not expanded on the remote side. Use/home/<user>/...(or wherever the data lives). - Port 8090 collisions are rare on Ubuntu Server, but the script still defaults to 8091. If that's taken, set
LCC_PORTto another value. - CUDA toolkit is NOT required. The prebuilt Linux binaries include the CUDA runtime. Only the NVIDIA display driver must be installed.
- Disk space: Check free space before downloading models. A full disk causes
curl: (23) Failure writing output to destinationerrors without clear explanation. - Split GGUF files: Some models (7B Q8_0, 14B Q4_K_M) are split into multiple files on Hugging Face. Download all parts. The script auto-detects them and only shows the model name once in the selection menu.
- Firewall: if
ufwis enabled, allow the port:sudo ufw allow 8091/tcp.
VRAM sizing guide
| Model | Quantization | VRAM (approx) | Quality | Fits 8 GB | Fits 16 GB | Fits 24 GB |
|---|---|---|---|---|---|---|
| 0.5B | Q8_0 | ~0.7 GB | Draft only | ✅ | ✅ | ✅ |
| 3B | Q4_K_M | ~2.5 GB | Basic | ✅ | ✅ | ✅ |
| 7B | Q8_0 | ~9.3 GB | Good | ❌ | ✅ | ✅ |
| 14B | Q4_K_M | ~10 GB | Better | ❌ | ✅ | ✅ |
| 14B | Q8_0 | ~16 GB | Best 14B | ❌ | ⚠️ tight | ✅ |
The draft model (0.5B) adds ~0.7 GB on top. With a 7B Q8_0 + 0.5B draft, total VRAM is ~10 GB.
Credits
Inspired by this XDA article on running Claude Code with local LLMs.
License
MIT