LLM Setup for FileScopeMCP

April 29, 2026

FileScopeMCP uses llama.cpp's llama-server as a local OpenAI-compatible LLM backend. The default model is Qwen3.6 35B A3B MoE (UD-IQ4_XS quant, ~3B active params per token). The model alias llm-model is what the broker expects — pass --alias llm-model on the llama-server command line.

Without a running llama-server, FileScopeMCP still works for file tracking and dependency analysis — you just won't get auto-generated summaries, concepts, or change-impact assessments.

Pick the guide that matches your setup:

  • Same machine (Linux/macOS), the default — llama-server and FileScopeMCP on the same Linux or macOS host (this is what agent runtimes like Hermes use)
  • Remote / LAN server — llama-server on a different machine on your network
  • WSL2 + Windows GPU — alternative for Windows users: FileScopeMCP in WSL2, llama-server on the Windows host for GPU access

Same Machine (Linux/macOS)

./setup-llm.sh

This prints a platform-specific setup guide. It does NOT install anything for you — you build or install llama.cpp yourself.

Linux

Build from source with the backend that matches your GPU:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Configure with ONE of the following, depending on your GPU.
# NVIDIA:
cmake -B build -DGGML_CUDA=ON
# AMD (Vulkan is the recommended backend; see the WSL2 section for why):
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
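The choice between the two configure lines can also be scripted. The sketch below maps a GPU vendor name to the CMake flag shown above; the helper name backend_flag and the vendor strings are illustrative assumptions, not part of setup-llm.sh.

```shell
# Map a GPU vendor name to the llama.cpp CMake backend flag used above.
# backend_flag is a hypothetical helper, not part of FileScopeMCP.
backend_flag() {
  case "$1" in
    nvidia) echo "-DGGML_CUDA=ON" ;;
    amd)    echo "-DGGML_VULKAN=ON" ;;
    *)      echo "unknown vendor: $1" >&2; return 1 ;;
  esac
}

# Example: configure and build for an AMD GPU
# cmake -B build "$(backend_flag amd)" && cmake --build build --config Release -j
```

Returning a non-zero status for an unknown vendor keeps the build from silently falling back to a CPU-only configuration.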

Or run the CUDA Docker image:

docker run --gpus all -p 8880:8880 \
  -v $HOME/models:/models \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf \
  --alias llm-model -c 98304 -n 32768 -ngl 99 --n-cpu-moe 14 \
  -fa on --no-mmap --mlock -b 2048 -ub 512 \
  --cache-type-k q8_0 --cache-type-v q8_0 --swa-full \
  --no-context-shift --ctx-checkpoints 128 --checkpoint-every-n-tokens 4096 \
  --cache-ram 4096 --jinja --reasoning-format deepseek --reasoning-budget 4096 \
  --host 0.0.0.0 --port 8880

Run ./setup-llm.sh --launch to print the exact launch command for the native binary.

macOS

brew install llama.cpp

Metal is the default backend — no configuration needed. Launch with the same command ./setup-llm.sh --launch prints.

First run

The GGUF file must already exist at the path specified by -m. Download it before launching. llama-server does not accept HTTP traffic until the model is fully loaded.
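Both conditions above can be checked from a script before anything depends on the server. This is a minimal sketch, assuming llama-server's /health endpoint (which returns HTTP 200 once the model is loaded); the function names and the 300-second default are illustrative, not part of FileScopeMCP.

```shell
# Hypothetical pre-flight helpers; names and defaults are illustrative.

# Succeed only if the GGUF file exists and is non-empty.
model_present() {
  [ -s "$1" ]
}

# Poll llama-server's /health endpoint until it answers with HTTP 200,
# or give up after roughly $2 seconds (default 300).
wait_until_ready() {
  base_url=$1
  timeout=${2:-300}
  waited=0
  while [ "$waited" -lt "$timeout" ]; do
    # -f makes curl fail on HTTP errors such as 503 while the model loads
    if curl -sf -o /dev/null --max-time 2 "$base_url/health"; then
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 1
}

# Usage:
# model_present "$HOME/models/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf" || echo "download the model first"
# wait_until_ready "http://localhost:8880" && echo "llama-server is ready"
```

A refused connection and a 503 response are treated the same way here, since both mean the server is not ready yet.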

Verify with:

./setup-llm.sh --status

No broker config changes are needed — the default broker.default.json template points at localhost:8880, and the broker auto-copies it to ~/.filescope/broker.json on first start if the file is missing.

Run as a systemd service (Linux only)

Once your launch script (typically ~/start-llama-server.sh, the command ./setup-llm.sh --launch prints) is in place, register llama-server as a systemd unit so it auto-starts on boot, restarts on failure, and logs to journalctl:

sudo ./setup-llm.sh --install-service

The unit lives at /etc/systemd/system/llama-server.service (template at monitoring/systemd/llama-server.service). It captures stdout/stderr to the journal, sets OOMScoreAdjust=-500 so the kernel's OOM killer prefers other processes under memory pressure, and writes a start-time metric used by the optional monitoring dashboard. If your launch script isn't at $HOME/start-llama-server.sh, override the path with --start-script /path/to/script.

--install-service is refused under WSL2; see the WSL2 + Windows GPU section instead.

Tail logs with:

sudo journalctl -u llama-server -f
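For orientation, the unit described in this section might look roughly like the output of the sketch below. This is an assumption-laden illustration, not the shipped template: the real file is monitoring/systemd/llama-server.service, and the helper name render_unit is hypothetical. The start-time metric and any User= setting are left out.

```shell
# Print a minimal llama-server unit to stdout. Illustrative only: the real
# template is monitoring/systemd/llama-server.service and may differ.
render_unit() {
  start_script=${1:-$HOME/start-llama-server.sh}
  cat <<EOF
[Unit]
Description=llama-server (llama.cpp)
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=$start_script
Restart=on-failure
OOMScoreAdjust=-500
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
EOF
}

# Usage: inspect the result before installing anything
# render_unit "$HOME/start-llama-server.sh"
```

Restart=on-failure and the journal output settings correspond to the restart and logging behaviour described above; prefer sudo ./setup-llm.sh --install-service for the actual installation.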

WSL2 + Windows GPU

Alternative setup for Windows users with a dedicated GPU. WSL2 doesn't give native GPU access to llama.cpp, so llama-server runs on Windows and FileScopeMCP connects to it across the WSL2 boundary. If you're on a native Linux host (including Hermes on Ubuntu), use the Same Machine guide above instead.

Step 1: Pick the Windows binary

Download the llama.cpp Windows release from github.com/ggml-org/llama.cpp/releases. Pick the zip that matches your GPU.

GPU | File | Backend
AMD RDNA2/RDNA3 (RX 6800 XT, RX 7900 XT, etc.) | llama-*-bin-win-vulkan-x64.zip | Vulkan
NVIDIA | llama-*-bin-win-cuda-12.X-x64.zip | CUDA (no toolkit required for prebuilt)
Intel Arc | llama-*-bin-win-vulkan-x64.zip | Vulkan

For AMD: use Vulkan, NOT ROCm. Two reasons:

  1. The ROCm backend is broken on Windows 11 since llama.cpp build b8152 (Issue #19943) — models load CPU-only.
  2. Vulkan matches or beats ROCm on RDNA2 in practice (up to 50% faster), and the gap widens for MoE models.

No HIP SDK, no Visual Studio, no ROCm SDK needed.

Step 2: Extract the zip

Right-click → Extract All → enter C:\llama.cpp. The zip may or may not create a nested subfolder. Find the folder that actually contains llama-server.exe:

Get-ChildItem -Recurse -Filter llama-server.exe C:\llama.cpp

Note the exact folder — you will cd into it in Step 4.

Step 3: Open port 8880 in Windows Firewall

In an elevated PowerShell (Run as Administrator):

New-NetFirewallRule -DisplayName "llama-server 8880" `
  -Direction Inbound -Action Allow -Protocol TCP -LocalPort 8880

If your WSL2 network interface is on the Public profile (unusual — usually Private), ensure the rule covers both.

Step 4: Launch llama-server

In PowerShell, from the folder that contains llama-server.exe:

cd C:\llama.cpp  # or the nested subfolder from Step 2
.\llama-server.exe `
  -m "$HOME\models\Qwen3.6-35B-A3B-UD-IQ4_XS.gguf" `
  --alias llm-model `
  -c 98304 `
  -n 32768 `
  -ngl 99 `
  --n-cpu-moe 14 `
  -fa on `
  --no-mmap `
  --mlock `
  -b 2048 -ub 512 `
  --cache-type-k q8_0 --cache-type-v q8_0 `
  --swa-full `
  --no-context-shift `
  --ctx-checkpoints 128 `
  --checkpoint-every-n-tokens 4096 `
  --cache-ram 4096 `
  --jinja `
  --reasoning-format deepseek `
  --reasoning-budget 4096 `
  --chat-template-kwargs '{"preserve_thinking":true}' `
  --temp 0.6 `
  --top-p 0.95 `
  --top-k 20 `
  --min-p 0.0 `
  --presence-penalty 0.0 `
  --repeat-penalty 1.0 `
  --host 0.0.0.0 --port 8880 `
  --metrics `
  -np 1

Flag breakdown:

  • -ngl 99 — offload all layers to GPU
  • --n-cpu-moe 14 — keep routed expert FFNs in system RAM for 14 layers (tuned for 16GB VRAM). Raise to 99 if you hit OOM; lower for more speed if you have headroom.
  • -fa on — flash attention
  • --no-mmap --mlock — disable memory mapping, lock model in RAM for consistent performance
  • -b 2048 -ub 512 — logical and physical batch size
  • --cache-type-k q8_0 --cache-type-v q8_0 — KV cache in int8. Do NOT use q4_0 on gfx1030 — known segfault (Issue #15107).
  • --swa-full — use a full-size sliding window attention (SWA) cache
  • --no-context-shift --ctx-checkpoints 128 --checkpoint-every-n-tokens 4096 — context management with periodic checkpoints
  • --cache-ram 4096 — 4GB RAM cache for context checkpoints
  • --jinja — enable Jinja chat template
  • --reasoning-format deepseek --reasoning-budget 4096 — enable reasoning mode with 4K token budget
  • --chat-template-kwargs '{"preserve_thinking":true}' — preserve thinking blocks in output
  • -c 98304 — 96K context window
  • -n 32768 — max tokens per generation

RAM requirement: --n-cpu-moe streams routed experts from system RAM. At --n-cpu-moe 14, keep ~12GB of system RAM free beyond what Windows itself uses; at --n-cpu-moe 99, keep ~20GB. The --cache-ram 4096 flag reserves an additional 4GB for context checkpoints.
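The arithmetic behind that requirement can be written down as a small lookup. This sketch only encodes the two settings documented above and adds the --cache-ram reservation; the function names are hypothetical, and other --n-cpu-moe values are deliberately rejected rather than interpolated.

```shell
# Documented expert-offload RAM needs (GB) for the two settings named above.
# Other --n-cpu-moe values are not documented here, so they return an error.
expert_ram_gb() {
  case "$1" in
    14) echo 12 ;;
    99) echo 20 ;;
    *)  return 1 ;;
  esac
}

# Total free RAM to plan for: expert offload plus --cache-ram (given in MiB).
ram_budget_gb() {
  experts=$(expert_ram_gb "$1") || return 1
  echo $(( experts + $2 / 1024 ))
}

# Usage:
# ram_budget_gb 14 4096   # 12 GB for experts + 4 GB cache
# ram_budget_gb 99 4096   # 20 GB for experts + 4 GB cache
```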

Step 5: Configure the broker in WSL

mkdir -p ~/.filescope
cp ~/FileScopeMCP/broker.windows-host.json ~/.filescope/broker.json

The wsl-host placeholder in broker.windows-host.json is auto-resolved by the broker at startup: src/broker/config.ts runs ip route show default | awk '{print $3}' to find the Windows host IP and rewrites baseURL in memory. Manual editing is rarely needed.
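The resolution step can be reproduced by hand to see what the broker will connect to. The sketch below is a shell illustration of the idea, not the actual logic in src/broker/config.ts; both function names are hypothetical.

```shell
# Extract the gateway (third field) from `ip route show default` output on stdin.
gateway_from_route() {
  awk '/^default/ {print $3; exit}'
}

# Replace the wsl-host placeholder in a URL with a concrete IP.
resolve_wsl_host() {
  printf '%s\n' "$1" | sed "s/wsl-host/$2/"
}

# On a WSL2 machine:
# gw=$(ip route show default | gateway_from_route)
# resolve_wsl_host "http://wsl-host:8880/v1" "$gw"
```

Reading the route table from stdin keeps the parsing step testable on machines that are not running WSL2.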

Step 6: Verify from WSL

curl http://$(ip route show default | awk '{print $3}'):8880/v1/models

Expected: JSON with data[].id containing llm-model.
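To check the response without reading the JSON by eye, a small filter can look for the alias. This is a sketch using grep instead of a JSON parser, so it assumes the id key and its value sit on one line; has_model_alias is a hypothetical helper name.

```shell
# Succeed if the JSON on stdin lists a model whose id is exactly the alias.
# Uses grep rather than a JSON parser to avoid extra dependencies, so it
# assumes the "id" key and its value appear on the same line.
has_model_alias() {
  alias_name=${1:-llm-model}
  grep -q "\"id\"[[:space:]]*:[[:space:]]*\"$alias_name\""
}

# Usage on WSL2:
# curl -s "http://$(ip route show default | awk '{print $3}'):8880/v1/models" \
#   | has_model_alias && echo "alias OK"
```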

Step 7: Restart Claude Code

Start (or restart) a Claude Code session in your project. FileScopeMCP auto-spawns the broker, which connects to llama-server on Windows. Verify end-to-end with:

./setup-llm.sh --status

Or call status() from an MCP tool in Claude Code.


Remote / LAN Server

llama-server runs on a different machine on your network.

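1. On the remote machine: Launch llama-server with the full flag set from Step 4 of the WSL2 + Windows GPU guide (on Linux or macOS, replace the PowerShell backtick line continuations with backslashes), making sure --host 0.0.0.0 --port 8880 --alias llm-model are present.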

2. In WSL / on the FileScopeMCP machine:

mkdir -p ~/.filescope
cp ~/FileScopeMCP/broker.remote-lan.json ~/.filescope/broker.json

3. Edit ~/.filescope/broker.json and replace the host in baseURL (192.168.1.100 in the template, shown as YOUR_SERVER_IP here) with the actual IP of the remote machine:

{
  "llm": {
    "provider": "openai-compatible",
    "model": "llm-model",
    "baseURL": "http://YOUR_SERVER_IP:8880/v1",
    "maxTokensPerCall": 1024
  },
  "jobTimeoutMs": 120000,
  "maxQueueSize": 1000
}
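The edit in step 3 can also be done non-interactively. The following is a sketch under the assumption that baseURL has the form http://HOST:8880/... as in the JSON above; set_broker_host is a hypothetical helper, and 192.168.1.42 is an example address.

```shell
# Print broker.json content from stdin with the baseURL host replaced by $1.
# Only a baseURL of the form http://HOST:8880/... is touched.
set_broker_host() {
  sed "s#\(\"baseURL\"[[:space:]]*:[[:space:]]*\"http://\)[^:/\"]*\(:8880\)#\1$1\2#"
}

# Usage (writes to a temp file first so a failed edit cannot truncate the config):
# set_broker_host 192.168.1.42 < ~/.filescope/broker.json > /tmp/broker.json \
#   && mv /tmp/broker.json ~/.filescope/broker.json
```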

4. Verify connectivity:

curl http://YOUR_SERVER_IP:8880/v1/models

5. Restart Claude Code.


WSL2 + Windows Troubleshooting

If FileScopeMCP runs in WSL2 and llama-server runs on Windows, work through these checks in order.

1. Is llama-server running on Windows?

Check the PowerShell window it was launched in. If you closed that window, llama-server is gone — relaunch it with the command from Step 4.

2. Is llama-server listening on all interfaces?

In a Windows terminal:

netstat -an | findstr 8880

You should see 0.0.0.0:8880. If you see 127.0.0.1:8880, you forgot --host 0.0.0.0 on the launch command.

3. Can WSL reach the Windows host?

From WSL:

ip route show default | awk '{print $3}'
curl http://$(ip route show default | awk '{print $3}'):8880/v1/models

If curl hangs or returns "Connection refused":

  • Firewall: The inbound rule from Step 3 may not be active. Re-run New-NetFirewallRule in an elevated PowerShell.
  • VPN/proxy: Some VPN software changes WSL2 networking. Try disconnecting the VPN temporarily.
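curl's exit status helps tell the two failure modes apart. The sketch below relies on curl's documented exit codes 7 (failed to connect) and 28 (operation timed out); the mapping from each code to a likely cause is a heuristic, and explain_curl_exit is a hypothetical helper.

```shell
# Translate a curl exit status into the most likely cause from the list above.
# Exit codes 7 and 28 are curl's documented codes for "failed to connect" and
# "operation timed out"; the suggested causes are heuristics, not certainties.
explain_curl_exit() {
  case "$1" in
    0)  echo "reachable" ;;
    7)  echo "connection refused: check that llama-server is running with --host 0.0.0.0" ;;
    28) echo "timed out: check the Windows Firewall rule and any VPN" ;;
    *)  echo "curl exit $1: see the curl man page" ;;
  esac
}

# Usage on WSL2:
# curl -s -o /dev/null --max-time 5 \
#   "http://$(ip route show default | awk '{print $3}'):8880/v1/models"
# explain_curl_exit $?
```

Passing --max-time matters here: without it a silently dropped connection hangs instead of producing exit code 28.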

4. Is the broker config correct?

cat ~/.filescope/broker.json

baseURL should contain wsl-host:8880 (auto-resolved at startup) or the literal Windows host IP on port 8880. If it points at a different port or still has localhost, re-copy the template:

cp ~/FileScopeMCP/broker.windows-host.json ~/.filescope/broker.json
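The rule for this check can be expressed as a small classifier. This is a sketch that encodes only the two conditions named above (no localhost, port 8880); check_base_url is a hypothetical helper and does not read broker.json itself.

```shell
# Classify a baseURL string for the WSL2 + Windows setup.
# Prints "ok" or a short reason; returns non-zero when the URL needs fixing.
check_base_url() {
  case "$1" in
    *localhost*|*127.0.0.1*) echo "points at localhost"; return 1 ;;
    *:8880/*|*:8880)         echo "ok" ;;
    *)                       echo "wrong port"; return 1 ;;
  esac
}

# Usage:
# check_base_url "http://wsl-host:8880/v1"
```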

5. Is the broker process running?

ps aux | grep broker | grep -v grep
cat ~/.filescope/broker.log

Common errors:

  • ECONNREFUSED — llama-server isn't reachable (go back to checks 2-3).
  • Stale socket file — remove it and let the broker respawn:
    rm ~/.filescope/broker.sock
    
    Then restart your Claude Code session.

6. Is wsl-host resolving correctly?

ip route show default

This should print one line whose third field is the Windows host gateway IP. If this fails (unusual), edit ~/.filescope/broker.json and replace wsl-host with the literal IP.