OpenCode + Candle-vLLM
June 11, 2026 ยท View on GitHub
This guide connects OpenCode directly to candle-vllm through the built-in OpenAI-compatible /v1/chat/completions endpoint.
OpenCode -> Candle-vLLM (OpenAI-compatible)
1) Start candle-vLLM (at port 8000)
cargo run --release --features cuda,nccl,flashinfer,cutlass -- \
--m Qwen/Qwen3.6-27B-FP8 \
--d 0 \
--p 8000 \
--kv-fraction 0.6 \
--enforce-parser qwen_coder
If you prefer FlashAttention, replace flashinfer with flashattn.
2) Discover the served model name
curl http://localhost:8000/v1/models
Use the returned id in the OpenCode config.
3) Configure OpenCode
Install OpenCode:
curl -fsSL https://opencode.ai/install | bash
Or:
npm i -g opencode-ai
Create ~/.config/opencode/config.json:
{
"$schema": "https://opencode.ai/config.json",
"provider": {
"local-candle-vllm": {
"npm": "@ai-sdk/openai-compatible",
"name": "Candle-vLLM Local",
"options": {
"baseURL": "http://localhost:8000/v1"
},
"models": {
"qwen3-coder": {
"name": "Qwen/Qwen3.6-27B-FP8"
}
}
}
},
"model": "local-candle-vllm/qwen3-coder"
}
4) Run OpenCode
opencode
Notes
- Tool calls follow the normal OpenAI request/response loop.
- For Qwen coder models,
--enforce-parser qwen_coderis usually the most reliable parser setting. - If OpenCode reports a model mismatch, compare your configured model against
GET /v1/models.
Troubleshooting
Chat logger:
export CANDLE_VLLM_CHAT_LOGGER=1
Reasoning routing for tool-enabled requests:
export CANDLE_VLLM_STREAM_AS_REASONING_CONTENT=1
Set CANDLE_VLLM_STREAM_AS_REASONING_CONTENT=0 if the client expects reasoning
to stay inside content instead of separate reasoning_content fields.