LLM Setup for FileScopeMCP
April 29, 2026
FileScopeMCP uses llama.cpp's llama-server as a local OpenAI-compatible LLM backend. The default model is Qwen3.6 35B A3B MoE (UD-IQ4_XS quant, ~3B active params per token). The model alias llm-model is what the broker expects — pass --alias llm-model on the llama-server command line.
Without a running llama-server, FileScopeMCP still works for file tracking and dependency analysis — you just won't get auto-generated summaries, concepts, or change-impact assessments.
Pick the guide that matches your setup:
- Same machine (Linux/macOS) — default — llama-server and FileScopeMCP on the same Linux or macOS host (this is what agent runtimes like Hermes use)
- Remote / LAN server — llama-server on a different machine on your network
- WSL2 + Windows GPU — alternative for Windows users: FileScopeMCP in WSL2, llama-server on the Windows host for GPU access
Same Machine (Linux/macOS)
./setup-llm.sh
This prints a platform-specific setup guide. It does NOT install anything for you — you build or install llama.cpp yourself.
Linux
Build from source with the backend that matches your GPU:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# NVIDIA:
cmake -B build -DGGML_CUDA=ON
# AMD (Vulkan is the recommended backend — see WSL2 section for why):
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
Or run the CUDA Docker image:
docker run --gpus all -p 8880:8880 \
-v $HOME/models:/models \
ghcr.io/ggml-org/llama.cpp:server-cuda \
-m /models/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf \
--alias llm-model -c 98304 -n 32768 -ngl 99 --n-cpu-moe 14 \
-fa on --no-mmap --mlock -b 2048 -ub 512 \
--cache-type-k q8_0 --cache-type-v q8_0 --swa-full \
--no-context-shift --ctx-checkpoints 128 --checkpoint-every-n-tokens 4096 \
--cache-ram 4096 --jinja --reasoning-format deepseek --reasoning-budget 4096 \
--host 0.0.0.0 --port 8880
Run ./setup-llm.sh --launch to print the exact launch command for the native binary.
macOS
brew install llama.cpp
Metal is the default backend — no configuration needed. Launch with the same command ./setup-llm.sh --launch prints.
First run
The GGUF file must already exist at the path specified by -m. Download it before launching. llama-server does not accept HTTP traffic until the model is fully loaded.
Verify with:
./setup-llm.sh --status
No broker config changes are needed — the default broker.default.json template points at localhost:8880, and the broker auto-copies it to ~/.filescope/broker.json on first start if the file is missing.
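Because llama-server does not accept HTTP traffic until the model is loaded, a script that starts it may want to wait before checking status. The sketch below polls the server's /health endpoint, assuming it returns HTTP 200 only once loading has finished. The function name wait_for_llama and the 300-second default are illustrative and not part of setup-llm.sh.

```shell
# Poll llama-server until it reports ready, or give up after a timeout.
# Assumption: /health answers with HTTP 200 only after the model is loaded.
wait_for_llama() {
  url="${1:-http://localhost:8880/health}"   # endpoint to poll
  timeout_s="${2:-300}"                      # total seconds to wait
  waited=0
  while [ "$waited" -lt "$timeout_s" ]; do
    # -f makes curl exit non-zero on HTTP errors (for example while loading)
    if curl -sf --max-time 2 "$url" >/dev/null 2>&1; then
      echo "llama-server is ready"
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "llama-server not ready after ${timeout_s}s" >&2
  return 1
}
```

A typical use would be `wait_for_llama http://localhost:8880/health 600 && ./setup-llm.sh --status`, so the status check only runs once the server answers.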
Run as a systemd service (Linux only)
Once your launch script (typically ~/start-llama-server.sh, the command ./setup-llm.sh --launch prints) is in place, register llama-server as a systemd unit so it auto-starts on boot, restarts on failure, and logs to journalctl:
sudo ./setup-llm.sh --install-service
The unit lives at /etc/systemd/system/llama-server.service (template at monitoring/systemd/llama-server.service). It captures stdout/stderr to the journal, sets OOMScoreAdjust=-500 so the kernel picks lighter workloads first under memory pressure, and writes a start-time metric used by the optional monitoring dashboard. Override the launch-script path with --start-script /path/to/script if it isn't at $HOME/start-llama-server.sh.
The --install-service flag is refused under WSL2; see the WSL2 + Windows GPU section instead.
Tail logs with:
sudo journalctl -u llama-server -f
WSL2 + Windows GPU
Alternative setup for Windows users with a dedicated GPU. WSL2 doesn't give native GPU access to llama.cpp, so llama-server runs on Windows and FileScopeMCP connects to it across the WSL2 boundary. If you're on a native Linux host (including Hermes on Ubuntu), use the Same Machine guide above instead.
Step 1: Pick the Windows binary
Download the llama.cpp Windows release from github.com/ggml-org/llama.cpp/releases. Pick the zip that matches your GPU.
| GPU | File | Backend |
|---|---|---|
| AMD RDNA2/RDNA3 (RX 6800 XT, RX 7900 XT, etc.) | llama-*-bin-win-vulkan-x64.zip | Vulkan |
| NVIDIA | llama-*-bin-win-cuda-12.X-x64.zip | CUDA (no toolkit required for prebuilt) |
| Intel Arc | llama-*-bin-win-vulkan-x64.zip | Vulkan |
For AMD: use Vulkan, NOT ROCm. Two reasons:
- The ROCm backend is broken on Windows 11 since llama.cpp build b8152 (Issue #19943) — models load CPU-only.
- Vulkan matches or beats ROCm on RDNA2 in practice (0-50% faster), and the gap widens for MoE models.
No HIP SDK, no Visual Studio, no ROCm SDK needed.
Step 2: Extract the zip
Right-click → Extract All → enter C:\llama.cpp. The zip may or may not create a nested subfolder. Find the folder that actually contains llama-server.exe:
Get-ChildItem -Recurse -Filter llama-server.exe C:\llama.cpp
Note the exact folder — you will cd into it in Step 4.
Step 3: Open port 8880 in Windows Firewall
In an elevated PowerShell (Run as Administrator):
New-NetFirewallRule -DisplayName "llama-server 8880" `
-Direction Inbound -Action Allow -Protocol TCP -LocalPort 8880
If your WSL2 network interface is on the Public profile (unusual — usually Private), ensure the rule covers both.
Step 4: Launch llama-server
In PowerShell, from the folder that contains llama-server.exe:
cd C:\llama.cpp # or the nested subfolder from Step 2
.\llama-server.exe `
-m $HOME\models\Qwen3.6-35B-A3B-UD-IQ4_XS.gguf `
--alias llm-model `
-c 98304 `
-n 32768 `
-ngl 99 `
--n-cpu-moe 14 `
-fa on `
--no-mmap `
--mlock `
-b 2048 -ub 512 `
--cache-type-k q8_0 --cache-type-v q8_0 `
--swa-full `
--no-context-shift `
--ctx-checkpoints 128 `
--checkpoint-every-n-tokens 4096 `
--cache-ram 4096 `
--jinja `
--reasoning-format deepseek `
--reasoning-budget 4096 `
--chat-template-kwargs '{"preserve_thinking":true}' `
--temp 0.6 `
--top-p 0.95 `
--top-k 20 `
--min-p 0.0 `
--presence-penalty 0.0 `
--repeat-penalty 1.0 `
--host 0.0.0.0 --port 8880 `
--metrics `
-np 1
Flag breakdown:
- -ngl 99: offload all layers to GPU
- --n-cpu-moe 14: keep routed expert FFNs in system RAM for 14 layers (tuned for 16GB VRAM). Raise to 99 if you hit OOM; lower for more speed if you have headroom.
- -fa on: flash attention
- --no-mmap --mlock: disable memory mapping, lock model in RAM for consistent performance
- -b 2048 -ub 512: logical and physical batch size
- --cache-type-k q8_0 --cache-type-v q8_0: KV cache in int8. Do NOT use q4_0 on gfx1030 (known segfault, Issue #15107).
- --swa-full: full sliding window attention
- --no-context-shift --ctx-checkpoints 128 --checkpoint-every-n-tokens 4096: context management with periodic checkpoints
- --cache-ram 4096: 4GB RAM cache for context checkpoints
- --jinja: enable Jinja chat template
- --reasoning-format deepseek --reasoning-budget 4096: enable reasoning mode with 4K token budget
- --chat-template-kwargs '{"preserve_thinking":true}': preserve thinking blocks in output
- -c 98304: 96K context window
- -n 32768: max tokens per generation
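The token counts in the launch command are powers-of-two multiples, which is easy to confirm with shell arithmetic. The sketch below also counts how many 4096-token intervals fit in a full context. Treating each interval as one checkpoint is an assumption made here for illustration; the guide does not spell out how the two checkpoint flags interact.

```shell
# Derive the token counts used in the launch command (1K = 1024 tokens).
ctx_tokens=$((96 * 1024))        # value passed to -c
gen_tokens=$((32 * 1024))        # value passed to -n
checkpoint_interval=4096         # value passed to --checkpoint-every-n-tokens

# Number of 4096-token intervals in a completely full context.
intervals=$((ctx_tokens / checkpoint_interval))

echo "context=$ctx_tokens generation=$gen_tokens intervals=$intervals"
# prints: context=98304 generation=32768 intervals=24
```

Under that assumption, 24 intervals sit well below the --ctx-checkpoints 128 limit.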
RAM requirement: --n-cpu-moe streams routed experts from system RAM. At --n-cpu-moe 14, keep ~12GB of system RAM free beyond what Windows itself uses; at --n-cpu-moe 99, keep ~20GB. The --cache-ram 4096 flag reserves an additional 4GB for context checkpoints.
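The RAM figures above add up as follows. This is a small sketch that only covers the two documented settings (14 and 99); the function name ram_budget_gb is hypothetical, and values in between are not interpolated because the guide gives no figures for them.

```shell
# Sum the free system RAM suggested above for a given --n-cpu-moe value.
# The 12 GB and 20 GB figures come from the guide; other values are undocumented.
ram_budget_gb() {
  n_cpu_moe="$1"
  cache_ram_mb="${2:-4096}"      # matches --cache-ram 4096
  case "$n_cpu_moe" in
    14) experts_gb=12 ;;
    99) experts_gb=20 ;;
    *)  echo "no documented figure for --n-cpu-moe $n_cpu_moe" >&2; return 1 ;;
  esac
  echo $((experts_gb + cache_ram_mb / 1024))
}

ram_budget_gb 14   # prints 16
ram_budget_gb 99   # prints 24
```

So with the default flags, plan for roughly 16GB of free system RAM, or 24GB when every layer's experts are kept on the CPU side.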
Step 5: Configure the broker in WSL
mkdir -p ~/.filescope
cp ~/FileScopeMCP/broker.windows-host.json ~/.filescope/broker.json
The wsl-host placeholder in broker.windows-host.json is auto-resolved by the broker at startup: src/broker/config.ts runs ip route show default | awk '{print $3}' to find the Windows host IP and rewrites baseURL in memory. Manual editing is rarely needed.
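To see what that resolution amounts to, here is a shell approximation of the same idea. It is not the code in src/broker/config.ts, and the helper names and the sample URL http://wsl-host:8880/v1 are assumptions for illustration. The example feeds in canned route output so it runs anywhere.

```shell
# Extract the gateway (third field) from `ip route show default` output on stdin.
gateway_from_route() {
  awk '$1 == "default" { print $3; exit }'
}

# Replace the wsl-host placeholder in a URL with a concrete IP.
resolve_wsl_host() {
  url="$1"
  ip="$2"
  printf '%s\n' "$url" | sed "s/wsl-host/$ip/"
}

# Canned route output; on a real WSL2 shell, pipe in `ip route show default`.
ip=$(printf 'default via 172.20.0.1 dev eth0 proto kernel\n' | gateway_from_route)
resolve_wsl_host "http://wsl-host:8880/v1" "$ip"
# prints: http://172.20.0.1:8880/v1
```

The address 172.20.0.1 is only a sample; the real gateway differs per machine and can change between WSL2 restarts, which is presumably why the broker resolves it at startup rather than storing it.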
Step 6: Verify from WSL
curl http://$(ip route show default | awk '{print $3}'):8880/v1/models
Expected: JSON with data[].id containing llm-model.
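If you would rather script this check than read the JSON by eye, something like the following could work. It is a minimal sketch that greps for the id field instead of parsing JSON properly; the function name has_model_alias is hypothetical, and the sample response is a trimmed example in the OpenAI list-models shape.

```shell
# Succeed if the /v1/models JSON on stdin lists a model with the given id.
has_model_alias() {
  alias_name="${1:-llm-model}"
  grep -q "\"id\"[[:space:]]*:[[:space:]]*\"$alias_name\""
}

# Canned response; a live check would pipe in the curl output from this step.
sample='{"object":"list","data":[{"id":"llm-model","object":"model"}]}'
if printf '%s\n' "$sample" | has_model_alias llm-model; then
  echo "alias found"
else
  echo "alias missing"
fi
# prints: alias found
```

If the alias is missing, the most likely cause is a launch command without --alias llm-model.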
Step 7: Restart Claude Code
Start (or restart) a Claude Code session in your project. FileScopeMCP auto-spawns the broker, which connects to llama-server on Windows. Verify end-to-end with:
./setup-llm.sh --status
Or call status() from an MCP tool in Claude Code.
Remote / LAN Server
llama-server runs on a different machine on your network.
1. On the remote machine: Launch llama-server with the full flag set from Step 4 above, ensuring --host 0.0.0.0 --port 8880 --alias llm-model are present.
2. In WSL / on the FileScopeMCP machine:
mkdir -p ~/.filescope
cp ~/FileScopeMCP/broker.remote-lan.json ~/.filescope/broker.json
3. Edit ~/.filescope/broker.json and replace the placeholder address in baseURL (192.168.1.100 in the template, shown as YOUR_SERVER_IP below) with the actual IP of the remote machine:
{
"llm": {
"provider": "openai-compatible",
"model": "llm-model",
"baseURL": "http://YOUR_SERVER_IP:8880/v1",
"maxTokensPerCall": 1024
},
"jobTimeoutMs": 120000,
"maxQueueSize": 1000
}
4. Verify connectivity:
curl http://<remote-ip>:8880/v1/models
5. Restart Claude Code.
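Step 3 above can also be scripted. The sketch below swaps a placeholder host for the real address; the function name set_broker_host and the address 10.0.0.5 in the usage note are made up for illustration. The placeholder is used as a sed pattern, so dots in it match any character, which is harmless for this purpose.

```shell
# Replace a placeholder host in a broker config with the real server IP.
# Writes to a temp file first so the edit works with both GNU and BSD sed.
set_broker_host() {
  config="$1"        # path to broker.json
  placeholder="$2"   # placeholder text currently in baseURL
  server_ip="$3"     # actual IP of the llama-server machine
  tmp="$config.tmp"
  sed "s|$placeholder|$server_ip|g" "$config" > "$tmp" && mv "$tmp" "$config"
}
```

For example, `set_broker_host ~/.filescope/broker.json 192.168.1.100 10.0.0.5` would point the broker at a server on 10.0.0.5.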
WSL + Windows Troubleshooting
If FileScopeMCP runs in WSL2 and llama-server runs on Windows, work through these checks in order.
1. Is llama-server running on Windows?
Check the PowerShell window it was launched in. If you closed that window, llama-server is gone — relaunch it with the command from Step 4.
2. Is llama-server listening on all interfaces?
In a Windows terminal:
netstat -an | findstr 8880
You should see 0.0.0.0:8880. If you see 127.0.0.1:8880, you forgot --host 0.0.0.0 on the launch command.
3. Can WSL reach the Windows host?
From WSL:
ip route show default | awk '{print $3}'
curl http://$(ip route show default | awk '{print $3}'):8880/v1/models
If curl hangs or returns "Connection refused":
- Firewall: The inbound rule from Step 3 may not be active. Re-run New-NetFirewallRule in an elevated PowerShell.
- VPN/proxy: Some VPN software changes WSL2 networking. Try disconnecting the VPN temporarily.
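curl's exit status helps tell these cases apart: 7 means the connection could not be established, and 28 means the request timed out. The helper below is a hypothetical sketch that maps those codes to the hints in this check. The link between a timeout and a blocked firewall is a common pattern, not a guarantee.

```shell
# Map a curl exit status to the likely cause described in this check.
explain_curl_exit() {
  case "$1" in
    0)  echo "reachable" ;;
    7)  echo "connection refused: is llama-server running with --host 0.0.0.0?" ;;
    28) echo "timed out: check the Windows Firewall rule or VPN" ;;
    *)  echo "curl failed with exit code $1" ;;
  esac
}
```

Run curl with a bound on the wait, for example `curl -s --connect-timeout 5 http://<windows-host-ip>:8880/v1/models`, then call `explain_curl_exit $?` right after it.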
4. Is the broker config correct?
cat ~/.filescope/broker.json
baseURL should contain wsl-host:8880 (auto-resolved at startup) or the literal Windows host IP on port 8880. If it points at a different port or still has localhost, re-copy the template:
cp ~/FileScopeMCP/broker.windows-host.json ~/.filescope/broker.json
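The same inspection can be done mechanically. This sketch assumes the config keeps "baseURL" on a single line, as in the JSON example in the Remote / LAN section; check_base_url is a hypothetical helper, not part of FileScopeMCP.

```shell
# Report whether the baseURL in a broker config looks right for the WSL2 setup.
check_base_url() {
  config="${1:-$HOME/.filescope/broker.json}"
  # Grab the line holding the baseURL key; bail out if there is none.
  line=$(grep '"baseURL"' "$config") || { echo "no baseURL in $config"; return 1; }
  case "$line" in
    *localhost*) echo "baseURL still points at localhost"; return 1 ;;
    *:8880*)     echo "baseURL looks ok"; return 0 ;;
    *)           echo "baseURL is not on port 8880"; return 1 ;;
  esac
}
```

Calling `check_base_url` with no argument inspects ~/.filescope/broker.json; a non-zero exit means the template should be re-copied as shown above.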
5. Is the broker process running?
ps aux | grep broker | grep -v grep
cat ~/.filescope/broker.log
Common errors:
- ECONNREFUSED: llama-server isn't reachable (go back to checks 2-3).
- Stale socket file: remove it and let the broker respawn:
rm ~/.filescope/broker.sock
Then restart your Claude Code session.
6. Is wsl-host resolving correctly?
ip route show default
This should print one line whose third field is the Windows host gateway IP. If this fails (unusual), edit ~/.filescope/broker.json and replace wsl-host with the literal IP.