amplifier-module-provider-vllm

May 5, 2026

vLLM provider module for Amplifier - Responses API integration for local/self-hosted LLMs.

Overview

This provider module integrates vLLM's OpenAI-compatible Responses API with Amplifier, enabling the use of open-weight models like gpt-oss-20b with full reasoning and tool calling support.

Key Features:

  • Responses API only - Optimized for reasoning models (gpt-oss, etc.)
  • Full reasoning support - Automatic reasoning block separation
  • Tool calling - Complete tool integration via Responses API
  • Local OR remote - Works against a local vLLM with no auth, or any remote / hosted endpoint with Bearer auth
  • OpenAI-compatible - Uses OpenAI SDK under the hood

Installation

# Via uv (recommended)
uv pip install git+https://github.com/microsoft/amplifier-module-provider-vllm@main

# For development
git clone https://github.com/microsoft/amplifier-module-provider-vllm
cd amplifier-module-provider-vllm
uv pip install -e .

Note for GPT-OSS models: Token accounting requires vocab files that are automatically downloaded to ~/.amplifier/cache/vocab/ on first use (requires internet access). If working offline, see "Token usage shows zeros" in the Troubleshooting section for manual setup.

vLLM Server Setup

This provider requires a running vLLM server. Example setup:

# Start vLLM server (basic)
vllm serve openai/gpt-oss-20b \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 2

# For production (recommended - full config in /etc/vllm/model.env)
sudo systemctl start vllm

Server requirements:

  • vLLM version: ≥0.10.1 (tested with 0.10.1.1)
  • Responses API: Automatically available (no special flags needed)
  • Model: Any model compatible with vLLM (gpt-oss, Llama, Qwen, etc.)

Configuration

Minimal Configuration

providers:
  - module: provider-vllm
    source: git+https://github.com/microsoft/amplifier-module-provider-vllm@main
    config:
      base_url: "http://192.168.128.5:8000/v1"  # Your vLLM server

Full Configuration

providers:
  - module: provider-vllm
    source: git+https://github.com/microsoft/amplifier-module-provider-vllm@main
    config:
      # Connection
      base_url: "http://192.168.128.5:8000/v1"  # Required: vLLM server URL

      # Model settings
      default_model: "openai/gpt-oss-20b"  # Model name from vLLM
      max_tokens: 4096                      # Max output tokens
      temperature: 0.7                      # Sampling temperature

      # Reasoning
      reasoning: "high"                     # Reasoning effort: minimal|low|medium|high
      reasoning_summary: "detailed"         # Summary verbosity: auto|concise|detailed

      # Advanced
      enable_state: false                   # Server-side state (requires vLLM config)
      truncation: "auto"                    # Automatic context management
      timeout: 300.0                        # API timeout (seconds)
      priority: 100                         # Provider selection priority

      # Debug
      debug: true                           # Enable detailed logging
      raw_debug: false                      # Enable raw API I/O logging
      debug_truncate_length: 180            # Truncate long debug strings
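The enumerated values in the comments above can be checked before a session starts. The following is a minimal sketch, assuming a hypothetical helper named `check_provider_config` that is not part of this module; the allowed values are copied from the configuration comments, and the provider's own validation may differ.

```python
# Hypothetical helper, not part of the provider: it only mirrors the
# value lists given in the configuration comments.
REASONING_EFFORTS = {"minimal", "low", "medium", "high"}
REASONING_SUMMARIES = {"auto", "concise", "detailed"}


def check_provider_config(config: dict) -> list[str]:
    """Return a list of human-readable problems found in a config dict."""
    problems = []
    if not config.get("base_url"):
        problems.append("base_url is required")

    effort = config.get("reasoning")
    if effort is not None and effort not in REASONING_EFFORTS:
        problems.append(f"reasoning must be one of {sorted(REASONING_EFFORTS)}")

    summary = config.get("reasoning_summary")
    if summary is not None and summary not in REASONING_SUMMARIES:
        problems.append(f"reasoning_summary must be one of {sorted(REASONING_SUMMARIES)}")

    return problems
```

An empty list means none of the checked keys looked wrong; keys the sketch does not know about are ignored.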

Local vs remote — single source of truth

base_url is the single source of truth for whether this provider instance is local or remote. The provider URL-parses it once at construction and caches the result; everything downstream (is_remote property, capability tagging in get_info() and list_models()) flows from that one decision.

  • Local: base_url resolves to localhost, 127.0.0.1, ::1, or 0.0.0.0. Capability tag: local. No auth required (api_key is ignored if it is the placeholder "EMPTY").
  • Remote: anything else (LAN IP, public hostname, RunPod / Modal / Anyscale / Lambda Labs URL, or a vLLM-backed proxy like OpenRouter/Together/Fireworks). Capability tag: remote. Bearer auth is attached when api_key is set.

The is_remote property is purely informational — it does not change how Bearer is attached (the OpenAI SDK does that whenever api_key is non-empty regardless of host). It exists so routing matrices and other downstream consumers can reason about the deployment shape.
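The local/remote decision described above can be sketched with the standard library. This is an illustration only: the names `is_remote_url` and `capability_tag` are hypothetical, and the sketch compares the hostname literally rather than performing any DNS resolution, which may differ from what the provider does.

```python
from urllib.parse import urlparse

# Hosts listed above as "local"; everything else is treated as remote.
LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1", "0.0.0.0"}


def is_remote_url(base_url: str) -> bool:
    """Classify a base_url as remote (True) or local (False).

    urlparse() lowercases the hostname and strips the brackets from
    IPv6 literals, so "http://[::1]:8000/v1" yields the host "::1".
    A URL with no hostname at all falls through to "remote".
    """
    host = urlparse(base_url).hostname or ""
    return host not in LOCAL_HOSTS


def capability_tag(base_url: str) -> str:
    """Map the classification to the capability tag named in the text."""
    return "remote" if is_remote_url(base_url) else "local"
```

Computing the result once and storing it, as the provider is described as doing, keeps every downstream consumer consistent with a single parse of base_url.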

Mixed local + remote (multi-instance)

To use both a local vLLM and a remote / hosted vLLM in the same session, configure two provider instances. Amplifier supports multiple named instances of the same provider module via the instance_id key:

# Default LOCAL instance — keeps the natural mount name "vllm"
[[providers]]
module = "amplifier-module-provider-vllm"
[providers.config]
base_url = "http://localhost:8000/v1"
default_model = "openai/gpt-oss-20b"

# Second instance — explicit `instance_id` makes it addressable as "vllm-remote"
[[providers]]
module = "amplifier-module-provider-vllm"
instance_id = "vllm-remote"
[providers.config]
base_url = "https://api.endpoints.anyscale.com/v1"  # or your hosted vLLM URL
api_key = "${VLLM_API_KEY}"
default_model = "meta-llama/Llama-3.3-70B-Instruct"

A routing matrix can then target each independently:

roles:
  reasoning:
    candidates:
      - provider: vllm-remote
        model: "meta-llama/Llama-3.3-70B-Instruct"
      - provider: vllm
        model: "openai/gpt-oss-20b"
  fast:
    candidates:
      - provider: vllm
        model: "openai/gpt-oss-20b"

The kernel validates that at most one entry per module omits instance_id (the "default" keeps the natural mount name); any additional entries must specify an instance_id. See amplifier-core/_session_init.py for the exact contract.
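The rule stated above (at most one entry per module may omit instance_id) can be illustrated as follows. This is a sketch under assumptions, not the kernel's code: `modules_with_duplicate_defaults` is a hypothetical name, and the authoritative contract remains the one in amplifier-core.

```python
from collections import Counter


def modules_with_duplicate_defaults(providers: list[dict]) -> list[str]:
    """Return modules that have more than one entry without instance_id."""
    # Count, per module, the entries that rely on the natural mount name.
    defaults = Counter(
        entry["module"] for entry in providers if not entry.get("instance_id")
    )
    return sorted(module for module, count in defaults.items() if count > 1)
```

With the two-instance configuration shown earlier, the result is an empty list, because only the local entry omits instance_id.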

Usage Examples

Basic Chat

import asyncio

from amplifier_core import AmplifierSession

config = {
    "session": {
        "orchestrator": "loop-basic",
        "context": "context-simple"
    },
    "providers": [{
        "module": "provider-vllm",
        "config": {
            "base_url": "http://192.168.128.5:8000/v1",
            "default_model": "openai/gpt-oss-20b"
        }
    }]
}


async def main() -> None:
    # `async with` and `await` are only valid inside a coroutine
    async with AmplifierSession(config=config) as session:
        response = await session.execute("Explain quantum computing")
        print(response)


asyncio.run(main())

With Reasoning

config = {
    "providers": [{
        "module": "provider-vllm",
        "config": {
            "base_url": "http://192.168.128.5:8000/v1",
            "default_model": "openai/gpt-oss-20b",
            "reasoning": "high",  # Enable high-effort reasoning
            "reasoning_summary": "detailed"
        }
    }],
    # ... rest of config
}

async with AmplifierSession(config=config) as session:
    # Model will show internal reasoning before answering
    response = await session.execute("Solve this complex problem...")

With Tool Calling

config = {
    "providers": [{
        "module": "provider-vllm",
        "config": {
            "base_url": "http://192.168.128.5:8000/v1",
            "default_model": "openai/gpt-oss-20b"
        }
    }],
    "tools": [{
        "module": "tool-bash",  # Enable bash tool
        "config": {}
    }],
    # ... rest of config
}

async with AmplifierSession(config=config) as session:
    # Model can call tools autonomously
    response = await session.execute("List the files in the current directory")

Architecture

This provider uses the OpenAI SDK with a custom base_url pointing to your vLLM server. Since vLLM implements the OpenAI-compatible Responses API, the integration is clean and direct.

Key components:

  • VLLMProvider: Main provider class (handles Responses API calls)
  • _constants.py: Configuration defaults and metadata keys
  • _response_handling.py: Response parsing and content block conversion

Response flow:

ChatRequest → VLLMProvider.complete() → AsyncOpenAI.responses.create()
→ vLLM Server → Response → Content blocks (Thinking + Text + ToolCall) → ChatResponse

Responses API Details

The vLLM provider uses the Responses API (/v1/responses) which provides:

  1. Structured reasoning: Separate reasoning blocks from response text
  2. Tool calling: Native function calling support
  3. Conversation state: Built-in multi-turn conversation handling
  4. Automatic continuation: Handles incomplete responses transparently

Tool format (vLLM Responses API):

{
  "type": "function",
  "name": "tool_name",
  "description": "Tool description",
  "parameters": {"type": "object", "properties": {...}}
}
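The Responses API tool format shown above is flat, whereas Chat Completions nests the same fields under a "function" key. The sketch below converts one shape to the other; `to_responses_tool` is a hypothetical helper for illustration and is not claimed to be how the provider builds its requests.

```python
def to_responses_tool(chat_tool: dict) -> dict:
    """Flatten a Chat Completions style tool into the Responses API shape.

    Input:  {"type": "function", "function": {"name": ..., ...}}
    Output: {"type": "function", "name": ..., "description": ..., "parameters": ...}
    """
    fn = chat_tool["function"]
    return {
        "type": "function",
        "name": fn["name"],
        "description": fn.get("description", ""),
        # Fall back to an empty object schema when no parameters are declared.
        "parameters": fn.get("parameters", {"type": "object", "properties": {}}),
    }
```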

Response structure:

{
  "output": [
    {"type": "reasoning", "content": [{"type": "reasoning_text", "text": "..."}]},
    {"type": "function_call", "name": "tool_name", "arguments": "{...}"},
    {"type": "message", "content": [{"type": "output_text", "text": "..."}]}
  ]
}
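To make the mapping from the response structure above to content blocks concrete, here is a minimal sketch. The block dictionaries ("thinking", "tool_call", "text") and the function name `to_content_blocks` are assumptions chosen for illustration; the actual conversion lives in _response_handling.py and may use different types.

```python
import json


def _joined_text(item: dict, part_type: str) -> str:
    """Concatenate the text of all content parts of the given type."""
    parts = item.get("content") or []
    return "".join(p.get("text", "") for p in parts if p.get("type") == part_type)


def to_content_blocks(response: dict) -> list[dict]:
    """Convert Responses API output items into simple content blocks."""
    blocks = []
    for item in response.get("output", []):
        kind = item.get("type")
        if kind == "reasoning":
            blocks.append({"type": "thinking", "text": _joined_text(item, "reasoning_text")})
        elif kind == "function_call":
            # Arguments arrive as a JSON-encoded string, not as an object.
            blocks.append({
                "type": "tool_call",
                "name": item["name"],
                "arguments": json.loads(item["arguments"]),
            })
        elif kind == "message":
            blocks.append({"type": "text", "text": _joined_text(item, "output_text")})
        # Unknown item types are skipped in this sketch.
    return blocks
```

Output order is preserved, so reasoning that precedes a tool call or message stays ahead of it in the resulting list.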

Debugging

Enable debug logging to see full request/response details:

config:
  debug: true        # Summary logging
  raw_debug: true    # Complete API I/O

Check logs:

# Find recent session
ls -lt ~/.amplifier/projects/*/sessions/*/events.jsonl | head -1

# View raw requests (--json-lines formats each matching line separately; Python 3.8+)
grep '"event":"llm:request:raw"' <log-file> | python3 -m json.tool --json-lines

# View raw responses
grep '"event":"llm:response:raw"' <log-file> | python3 -m json.tool --json-lines
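The same filtering can be done in Python, which avoids depending on the exact spacing of the serialized JSON. This sketch assumes only what the grep commands above imply: one JSON object per line with an "event" field. The function name `raw_llm_events` is hypothetical.

```python
import json
from pathlib import Path


def raw_llm_events(log_path, names=("llm:request:raw", "llm:response:raw")):
    """Yield parsed events from a JSONL log whose "event" field is in names."""
    with Path(log_path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip lines that are not valid JSON
            if isinstance(record, dict) and record.get("event") in names:
                yield record
```

For example, `for event in raw_llm_events(path): print(json.dumps(event, indent=2))` pretty-prints every raw request and response in the file at `path`.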

Troubleshooting

Connection refused

Problem: Cannot connect to vLLM server

Solution:

# Check vLLM service status
sudo systemctl status vllm

# Verify server is listening
curl http://192.168.128.5:8000/health

# Check logs
sudo journalctl -u vllm -n 50
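The curl check above can also be run from Python, for instance as a pre-flight test before creating a session. A minimal sketch using only the standard library; `server_is_healthy` is a hypothetical helper, and it takes the server root (without the /v1 suffix), matching the curl command.

```python
import urllib.error
import urllib.request


def server_is_healthy(server_url: str, timeout: float = 5.0) -> bool:
    """Return True if GET <server_url>/health answers with HTTP 200."""
    url = server_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or malformed URL.
        return False
```

A False result does not say why the check failed; the systemctl and journalctl commands above remain the place to look for the cause.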

Tool calling not working

Problem: Model responds with text instead of calling tools

Verification:

  • ✅ vLLM version ≥0.10.1
  • ✅ Using Responses API (not Chat Completions)
  • ✅ Tools defined in request

Note: Tool calling works via the Responses API without special vLLM flags. If it is not working, check that the model itself supports tool calling.

No reasoning blocks

Problem: Responses don't include reasoning/thinking

Check:

  • Is reasoning parameter set in config? (minimal|low|medium|high)
  • Is the model a reasoning model? (gpt-oss supports reasoning)
  • Check raw debug logs to see if reasoning is in API response

Token usage shows zeros

For GPT-OSS models: Token accounting is automatic but requires vocab files.

How it works:

  • First use: Automatically downloads vocab files to ~/.amplifier/cache/vocab/
  • Subsequent uses: Uses cached files
  • No manual setup needed if you have internet access

What's computed:

  • Input tokens: Accurate count using Harmony's tokenization (matches model training format)
  • Output tokens: Approximate count based on visible output text
  • Limitation: Output count doesn't include hidden reasoning channels (REST API limitation)

If auto-download fails (offline/air-gapped):

# Manual setup for offline environments
mkdir -p ~/.amplifier/cache/vocab

# Download vocab files (on a machine with internet)
curl -sS -o ~/.amplifier/cache/vocab/o200k_base.tiktoken \
  https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken

curl -sS -o ~/.amplifier/cache/vocab/cl100k_base.tiktoken \
  https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken

# Transfer ~/.amplifier/cache/vocab/ directory to offline machine
# Then set environment variable:
export TIKTOKEN_ENCODINGS_BASE="$HOME/.amplifier/cache/vocab"
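After transferring the files, it can help to confirm that both are present before starting a session. The sketch below checks for the two file names used in the curl commands above; `missing_vocab_files` is a hypothetical helper, and it assumes the directory is either passed explicitly or taken from TIKTOKEN_ENCODINGS_BASE, falling back to the default cache path.

```python
import os
from pathlib import Path

# File names taken from the manual download commands.
VOCAB_FILES = ("o200k_base.tiktoken", "cl100k_base.tiktoken")


def missing_vocab_files(vocab_dir=None) -> list[str]:
    """List expected vocab files that are absent or empty in vocab_dir."""
    if vocab_dir is None:
        vocab_dir = os.environ.get(
            "TIKTOKEN_ENCODINGS_BASE", "~/.amplifier/cache/vocab"
        )
    root = Path(vocab_dir).expanduser()
    missing = []
    for name in VOCAB_FILES:
        path = root / name
        # An empty file usually indicates an interrupted download.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

An empty list means both files exist and are non-empty; it does not verify their contents.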

Check logs for:

  • [TOKEN_ACCOUNTING] Downloading Harmony vocab files to ~/.amplifier/cache/vocab/... (first use)
  • [TOKEN_ACCOUNTING] Loaded Harmony GPT-OSS encoder (success)
  • [TOKEN_ACCOUNTING] Injected usage: input=X, output=Y (active)

Development

# Clone and install
git clone https://github.com/microsoft/amplifier-module-provider-vllm
cd amplifier-module-provider-vllm
uv pip install -e .

# Run tests
pytest tests/

# Check types and lint
make check

Testing

See ai_working/vllm-investigation/ for comprehensive test scripts:

  • test_provider_simple.py - Basic provider functionality test
  • 06_test_responses_correct_format.py - Responses API format validation
  • 04_test_tool_calling.py - Tool calling verification

License

MIT

Contributing

Note: This project is not currently accepting external contributions, but we're actively working toward opening this up. We value community input and look forward to collaborating in the future. For now, feel free to fork and experiment!

Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit Contributor License Agreements.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.