amplifier-module-provider-vllm
May 5, 2026
vLLM provider module for Amplifier - Responses API integration for local/self-hosted LLMs.
Overview
This provider module integrates vLLM's OpenAI-compatible Responses API with Amplifier, enabling the use of open-weight models like gpt-oss-20b with full reasoning and tool calling support.
Key Features:
- Responses API only - Optimized for reasoning models (gpt-oss, etc.)
- Full reasoning support - Automatic reasoning block separation
- Tool calling - Complete tool integration via Responses API
- Local OR remote - Works against a local vLLM with no auth, or any remote / hosted endpoint with Bearer auth
- OpenAI-compatible - Uses OpenAI SDK under the hood
Installation
# Via uv (recommended)
uv pip install git+https://github.com/microsoft/amplifier-module-provider-vllm@main
# For development
git clone https://github.com/microsoft/amplifier-module-provider-vllm
cd amplifier-module-provider-vllm
uv pip install -e .
Note for GPT-OSS models: Token accounting requires vocab files, which are downloaded automatically to ~/.amplifier/cache/vocab/ on first use (requires internet access). If working offline, see "Token usage shows zeros" under Troubleshooting for manual setup.
vLLM Server Setup
This provider requires a running vLLM server. Example setup:
# Start vLLM server (basic)
vllm serve openai/gpt-oss-20b \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 2
# For production (recommended - full config in /etc/vllm/model.env)
sudo systemctl start vllm
Server requirements:
- vLLM version: ≥0.10.1 (tested with 0.10.1.1)
- Responses API: Automatically available (no special flags needed)
- Model: Any model compatible with vLLM (gpt-oss, Llama, Qwen, etc.)
Configuration
Minimal Configuration
providers:
  - module: provider-vllm
    source: git+https://github.com/microsoft/amplifier-module-provider-vllm@main
    config:
      base_url: "http://192.168.128.5:8000/v1"  # Your vLLM server
Full Configuration
providers:
  - module: provider-vllm
    source: git+https://github.com/microsoft/amplifier-module-provider-vllm@main
    config:
      # Connection
      base_url: "http://192.168.128.5:8000/v1"  # Required: vLLM server URL
      # Model settings
      default_model: "openai/gpt-oss-20b"       # Model name from vLLM
      max_tokens: 4096                          # Max output tokens
      temperature: 0.7                          # Sampling temperature
      # Reasoning
      reasoning: "high"                         # Reasoning effort: minimal|low|medium|high
      reasoning_summary: "detailed"             # Summary verbosity: auto|concise|detailed
      # Advanced
      enable_state: false                       # Server-side state (requires vLLM config)
      truncation: "auto"                        # Automatic context management
      timeout: 300.0                            # API timeout (seconds)
      priority: 100                             # Provider selection priority
      # Debug
      debug: true                               # Enable detailed logging
      raw_debug: false                          # Enable raw API I/O logging
      debug_truncate_length: 180                # Truncate long debug strings
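As a rough illustration of how these settings could translate into a Responses API request, the sketch below maps a config dict onto keyword arguments that the OpenAI SDK's responses.create() accepts. The helper name build_request_kwargs, the fallback values (copied from the example above, not from the module's real defaults), and the enable_state to store mapping are all assumptions for illustration; the provider's actual mapping may differ.

```python
from typing import Any


def build_request_kwargs(config: dict[str, Any], prompt: str) -> dict[str, Any]:
    """Hypothetical helper: map provider config keys to Responses API kwargs."""
    kwargs: dict[str, Any] = {
        # Fallbacks mirror the example config above (assumed, not authoritative)
        "model": config.get("default_model", "openai/gpt-oss-20b"),
        "input": prompt,
        "max_output_tokens": config.get("max_tokens", 4096),
        "temperature": config.get("temperature", 0.7),
        # Assumption: enable_state corresponds to server-side storage
        "store": bool(config.get("enable_state", False)),
    }
    if "truncation" in config:
        kwargs["truncation"] = config["truncation"]

    # Only send a reasoning block when an effort level is configured
    effort = config.get("reasoning")
    if effort:
        reasoning = {"effort": effort}
        summary = config.get("reasoning_summary")
        if summary:
            reasoning["summary"] = summary
        kwargs["reasoning"] = reasoning
    return kwargs
```

The resulting dict could then be unpacked into a call such as client.responses.create(**kwargs) on an AsyncOpenAI client whose base_url points at the vLLM server.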
Local vs remote — single source of truth
base_url is the single source of truth for whether this provider
instance is local or remote. The provider URL-parses it once at
construction and caches the result; everything downstream
(is_remote property, capability tagging in get_info() and
list_models()) flows from that one decision.
- Local: base_url resolves to localhost, 127.0.0.1, ::1, or 0.0.0.0. Capability tag: local. No auth required (api_key is ignored if it is the placeholder "EMPTY").
- Remote: anything else (LAN IP, public hostname, RunPod / Modal / Anyscale / Lambda Labs URL, or a vLLM-backed proxy like OpenRouter/Together/Fireworks). Capability tag: remote. Bearer auth is attached when api_key is set.
The is_remote property is purely informational — it does not change
how Bearer is attached (the OpenAI SDK does that whenever api_key is
non-empty regardless of host). It exists so routing matrices and other
downstream consumers can reason about the deployment shape.
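The classification described above can be sketched in a few lines. This is a minimal illustration using only the standard library; the names LOCAL_HOSTS, is_remote, and capability_tag are hypothetical and are not taken from the provider's source.

```python
from urllib.parse import urlsplit

# Hosts treated as local, per the list above
LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1", "0.0.0.0"}


def is_remote(base_url: str) -> bool:
    """Return True unless base_url points at one of the local hosts."""
    # urlsplit().hostname strips the port and the brackets around IPv6 literals
    host = (urlsplit(base_url).hostname or "").lower()
    return host not in LOCAL_HOSTS


def capability_tag(base_url: str) -> str:
    """Map the local/remote decision to the capability tag named above."""
    return "remote" if is_remote(base_url) else "local"
```

Note that under this rule a LAN address such as 192.168.128.5 counts as remote, even though it sits on the same network.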
Mixed local + remote (multi-instance)
To use both a local vLLM and a remote / hosted vLLM in the same
session, configure two provider instances. Amplifier supports multiple
named instances of the same provider module via the instance_id key:
# Default LOCAL instance — keeps the natural mount name "vllm"
[[providers]]
module = "amplifier-module-provider-vllm"
[providers.config]
base_url = "http://localhost:8000/v1"
default_model = "openai/gpt-oss-20b"
# Second instance — explicit `instance_id` makes it addressable as "vllm-remote"
[[providers]]
module = "amplifier-module-provider-vllm"
instance_id = "vllm-remote"
[providers.config]
base_url = "https://api.endpoints.anyscale.com/v1" # or your hosted vLLM URL
api_key = "${VLLM_API_KEY}"
default_model = "meta-llama/Llama-3.3-70B-Instruct"
A routing matrix can then target each independently:
roles:
  reasoning:
    candidates:
      - provider: vllm-remote
        model: "meta-llama/Llama-3.3-70B-Instruct"
      - provider: vllm
        model: "openai/gpt-oss-20b"
  fast:
    candidates:
      - provider: vllm
        model: "openai/gpt-oss-20b"
The kernel validates that at most one entry per module omits
instance_id (the "default" keeps the natural mount name); any
additional entries must specify an instance_id. See
amplifier-core/_session_init.py
for the exact contract.
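To make the rule concrete, here is a minimal sketch of the check as described above. It is not the kernel's implementation; the function name check_instance_ids and the error message are hypothetical, and it covers only the one constraint stated here.

```python
from collections import Counter


def check_instance_ids(providers: list[dict]) -> None:
    """Raise ValueError if more than one entry per module omits instance_id."""
    # Count, per module, the entries that rely on the natural mount name
    defaults = Counter(
        entry["module"] for entry in providers if not entry.get("instance_id")
    )
    offenders = sorted(module for module, count in defaults.items() if count > 1)
    if offenders:
        raise ValueError(
            "multiple entries without instance_id for: " + ", ".join(offenders)
        )
```

The two-instance configuration shown earlier passes this check, because only the local entry omits instance_id.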
Usage Examples
Basic Chat
import asyncio

from amplifier_core import AmplifierSession

config = {
    "session": {
        "orchestrator": "loop-basic",
        "context": "context-simple",
    },
    "providers": [{
        "module": "provider-vllm",
        "config": {
            "base_url": "http://192.168.128.5:8000/v1",
            "default_model": "openai/gpt-oss-20b",
        },
    }],
}


async def main():
    # `async with` is only valid inside a coroutine
    async with AmplifierSession(config=config) as session:
        response = await session.execute("Explain quantum computing")
        print(response)


asyncio.run(main())
With Reasoning
config = {
    "providers": [{
        "module": "provider-vllm",
        "config": {
            "base_url": "http://192.168.128.5:8000/v1",
            "default_model": "openai/gpt-oss-20b",
            "reasoning": "high",  # Enable high-effort reasoning
            "reasoning_summary": "detailed",
        },
    }],
    # ... rest of config
}
async with AmplifierSession(config=config) as session:
    # Model will show internal reasoning before answering
    response = await session.execute("Solve this complex problem...")
With Tool Calling
config = {
    "providers": [{
        "module": "provider-vllm",
        "config": {
            "base_url": "http://192.168.128.5:8000/v1",
            "default_model": "openai/gpt-oss-20b",
        },
    }],
    "tools": [{
        "module": "tool-bash",  # Enable bash tool
        "config": {},
    }],
    # ... rest of config
}
async with AmplifierSession(config=config) as session:
    # Model can call tools autonomously
    response = await session.execute("List the files in the current directory")
Architecture
This provider uses the OpenAI SDK with a custom base_url pointing to your vLLM server. Since vLLM implements the OpenAI-compatible Responses API, the integration is clean and direct.
Key components:
- VLLMProvider: Main provider class (handles Responses API calls)
- _constants.py: Configuration defaults and metadata keys
- _response_handling.py: Response parsing and content block conversion
Response flow:
ChatRequest → VLLMProvider.complete() → AsyncOpenAI.responses.create() →
vLLM Server → Response → Content blocks (Thinking + Text + ToolCall) → ChatResponse
Responses API Details
The vLLM provider uses the Responses API (/v1/responses) which provides:
- Structured reasoning: Separate reasoning blocks from response text
- Tool calling: Native function calling support
- Conversation state: Built-in multi-turn conversation handling
- Automatic continuation: Handles incomplete responses transparently
Tool format (vLLM Responses API):
{
  "type": "function",
  "name": "tool_name",
  "description": "Tool description",
  "parameters": {"type": "object", "properties": {...}}
}
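This flat shape differs from the Chat Completions tool format, which nests the same fields under a "function" key. If you already have tool definitions in that nested shape, a conversion along these lines may help; the helper name to_responses_tool is hypothetical and is not part of this module.

```python
def to_responses_tool(chat_tool: dict) -> dict:
    """Flatten a Chat Completions style tool into the Responses API shape."""
    fn = chat_tool["function"]
    return {
        "type": "function",
        "name": fn["name"],
        "description": fn.get("description", ""),
        # Fall back to an empty object schema when no parameters are declared
        "parameters": fn.get("parameters", {"type": "object", "properties": {}}),
    }
```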
Response structure:
{
  "output": [
    {"type": "reasoning", "content": [{"type": "reasoning_text", "text": "..."}]},
    {"type": "function_call", "name": "tool_name", "arguments": "{...}"},
    {"type": "message", "content": [{"type": "output_text", "text": "..."}]}
  ]
}
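To show how a response of this shape might be separated into thinking, tool-call, and text blocks, here is a minimal sketch that works on a plain dict. It is not the code in _response_handling.py; the function name split_output and the returned layout are assumptions for illustration.

```python
import json


def split_output(response: dict) -> dict:
    """Group Responses API output items into thinking, tool calls, and text."""
    blocks = {"thinking": [], "tool_calls": [], "text": []}
    for item in response.get("output", []):
        kind = item.get("type")
        parts = item.get("content") or []
        if kind == "reasoning":
            blocks["thinking"] += [
                p["text"] for p in parts if p.get("type") == "reasoning_text"
            ]
        elif kind == "function_call":
            # Arguments arrive as a JSON-encoded string
            blocks["tool_calls"].append(
                {"name": item["name"], "arguments": json.loads(item["arguments"])}
            )
        elif kind == "message":
            blocks["text"] += [
                p["text"] for p in parts if p.get("type") == "output_text"
            ]
    return blocks
```

Unknown item types are skipped rather than rejected, so the sketch keeps working if the server adds new output kinds.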
Debugging
Enable debug logging to see full request/response details:
config:
  debug: true      # Summary logging
  raw_debug: true  # Complete API I/O
Check logs:
# Find recent session
ls -lt ~/.amplifier/projects/*/sessions/*/events.jsonl | head -1
# View raw requests (--json-lines, Python 3.8+, formats each matching line separately)
grep '"event":"llm:request:raw"' <log-file> | python3 -m json.tool --json-lines
# View raw responses
grep '"event":"llm:response:raw"' <log-file> | python3 -m json.tool --json-lines
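The same filtering can be done in Python, which avoids depending on the exact spacing of the serialized JSON. This is a sketch that assumes each line of events.jsonl is a JSON object with a top-level "event" key, as the grep patterns above suggest; the function name iter_events is hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_events(lines: Iterable[str], event: str) -> Iterator[dict]:
    """Yield the JSON objects in a JSONL stream whose "event" field matches."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Skip partially written or corrupt lines
            continue
        if isinstance(record, dict) and record.get("event") == event:
            yield record
```

For example, open the log file and pass the file object as lines with event set to "llm:request:raw", then print each record with json.dumps(record, indent=2).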
Troubleshooting
Connection refused
Problem: Cannot connect to vLLM server
Solution:
# Check vLLM service status
sudo systemctl status vllm
# Verify server is listening
curl http://192.168.128.5:8000/health
# Check logs
sudo journalctl -u vllm -n 50
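The curl check above can also be scripted. The sketch below derives the /health URL from a configured base_url and probes it using only the standard library; the function names health_url and is_healthy are hypothetical, and the 5-second timeout is an arbitrary choice.

```python
import urllib.request
from urllib.parse import urlsplit, urlunsplit


def health_url(base_url: str) -> str:
    """Replace the path of base_url (e.g. /v1) with /health."""
    parts = urlsplit(base_url)
    return urlunsplit((parts.scheme, parts.netloc, "/health", "", ""))


def is_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if the server answers /health with HTTP 200."""
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, DNS failures, and HTTP errors
        return False
```

A False result only says the probe failed; the systemctl and journalctl commands above remain the place to look for the cause.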
Tool calling not working
Problem: Model responds with text instead of calling tools
Verification:
- ✅ vLLM version ≥0.10.1
- ✅ Using Responses API (not Chat Completions)
- ✅ Tools defined in request
Note: Tool calling works via the Responses API without special vLLM flags. If it's not working, check that the model supports tool calling.
No reasoning blocks
Problem: Responses don't include reasoning/thinking
Check:
- Is the reasoning parameter set in config? (minimal|low|medium|high)
- Is the model a reasoning model? (gpt-oss supports reasoning)
- Check raw debug logs to see if reasoning is in the API response
Token usage shows zeros
For GPT-OSS models: Token accounting is automatic but requires vocab files.
How it works:
- First use: Automatically downloads vocab files to ~/.amplifier/cache/vocab/
- Subsequent uses: Uses cached files
- No manual setup needed if you have internet access
What's computed:
- Input tokens: Accurate count using Harmony's tokenization (matches model training format)
- Output tokens: Approximate count based on visible output text
- Limitation: Output count doesn't include hidden reasoning channels (REST API limitation)
If auto-download fails (offline/air-gapped):
# Manual setup for offline environments
mkdir -p ~/.amplifier/cache/vocab
# Download vocab files (on a machine with internet)
curl -sS -o ~/.amplifier/cache/vocab/o200k_base.tiktoken \
https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken
curl -sS -o ~/.amplifier/cache/vocab/cl100k_base.tiktoken \
https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken
# Transfer ~/.amplifier/cache/vocab/ directory to offline machine
# Then set environment variable:
export TIKTOKEN_ENCODINGS_BASE=~/.amplifier/cache/vocab
Check logs for:
- [TOKEN_ACCOUNTING] Downloading Harmony vocab files to ~/.amplifier/cache/vocab/... (first use)
- [TOKEN_ACCOUNTING] Loaded Harmony GPT-OSS encoder (success)
- [TOKEN_ACCOUNTING] Injected usage: input=X, output=Y (active)
Development
# Clone and install
git clone https://github.com/microsoft/amplifier-module-provider-vllm
cd amplifier-module-provider-vllm
uv pip install -e .
# Run tests
pytest tests/
# Check types and lint
make check
Testing
See ai_working/vllm-investigation/ for comprehensive test scripts:
- test_provider_simple.py - Basic provider functionality test
- 06_test_responses_correct_format.py - Responses API format validation
- 04_test_tool_calling.py - Tool calling verification
License
MIT
Contributing
Note
This project is not currently accepting external contributions, but we're actively working toward opening this up. We value community input and look forward to collaborating in the future. For now, feel free to fork and experiment!
Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit Contributor License Agreements.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
Trademarks
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.