llm-d Inference Payload Processor
June 28, 2026
The Inference Payload Processor (IPP) is a pluggable framework for inspecting and mutating inference request and response payloads in the llm-d data plane. It runs as an External Processing (ext-proc) service alongside the inference gateway's Proxy, which streams each request and response to IPP for real-time, payload-aware processing.
Because IPP sees the full payload, it can shape requests and responses in arbitrary ways — any logic that benefits from reading or rewriting the body, headers, or trailers can be expressed as a plugin. Its flagship use is payload-aware routing: extracting signals from the request (such as the model name) and injecting headers so a single Gateway endpoint can front many models and LoRA adapters. This composes with the llm-d Router's Endpoint Picker (EPP) — IPP can decide which pool serves a request while the EPP decides which pod within that pool — but routing is one application of a general framework, not its limit.
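As a rough illustration of the payload-aware routing idea above, the sketch below reads the `model` field from an OpenAI-style JSON body and produces a header that a proxy could route on. It is not IPP's plugin API: the header name `x-example-model-name` and both function names are hypothetical, chosen only for this example.

```python
import json
from typing import Optional

# Hypothetical header name, used only for illustration.
MODEL_HEADER = "x-example-model-name"


def extract_model(body: bytes) -> Optional[str]:
    """Return the "model" field of a JSON request body, or None if unavailable."""
    try:
        payload = json.loads(body)
    except ValueError:
        # Covers malformed JSON and bodies that are not valid UTF-8.
        return None
    if not isinstance(payload, dict):
        return None
    model = payload.get("model")
    return model if isinstance(model, str) and model else None


def routing_headers(body: bytes) -> dict:
    """Headers to add to the request; empty when no model name can be read."""
    model = extract_model(body)
    return {MODEL_HEADER: model} if model else {}
```

Returning an empty mapping for unparseable bodies leaves the request untouched, so the proxy's default routing still applies in this sketch.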
Core Capabilities
- Request processing: Inspect and mutate request headers, body, or trailers before routing.
- Response processing: Inspect and mutate response headers, body, or trailers on the way back to the client.
- Payload-aware routing: Extract signals from the request body (e.g., the model name) and inject routing headers so the Proxy can select the correct destination (e.g., an InferencePool). This enables multi-pool routing, in which multiple base models and LoRA adapters are served behind one OpenAI-compatible endpoint.
- Model selection: A pluggable `Filter → Score → Pick` pipeline that chooses which model serves a request (e.g., for cost-aware or load-aware routing), adapting the upstream Scheduler Architecture pattern at the model level. See the ModelSelector proposal.
- Extensibility: All behavior is implemented as plugins configured via a YAML `PayloadProcessorConfig`. Add your own without forking the framework; see Creating a Plugin.
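The `Filter → Score → Pick` shape described in the list above can be sketched in a few lines. This is a minimal illustration of the pattern only, not the ModelSelector implementation: the `Candidate` fields, the weighted-sum scoring, and the `select_model` function are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    """A model that could serve the request (fields are illustrative)."""
    name: str
    cost_per_token: float
    in_flight: int


Filter = Callable[[Candidate], bool]
Scorer = Callable[[Candidate], float]


def select_model(
    candidates: List[Candidate],
    filters: List[Filter],
    scorers: List[Tuple[Scorer, float]],
) -> Optional[Candidate]:
    # Filter: keep only candidates accepted by every filter.
    remaining = [c for c in candidates if all(f(c) for f in filters)]
    if not remaining:
        return None

    # Score: weighted sum over all scorers (higher is better).
    def total(c: Candidate) -> float:
        return sum(weight * scorer(c) for scorer, weight in scorers)

    # Pick: highest total score; max() keeps the first candidate on ties.
    return max(remaining, key=total)
```

A caller might, for instance, filter out candidates above an in-flight request limit and then score the rest by negative cost and negative load, so that cheaper and less busy models rank higher.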
Modes of Operation
IPP is deployed once per Gateway as a standalone service and wired into the Proxy via ext-proc. The Helm chart provisions the provider-specific integration automatically:
- Istio: Installs an `EnvoyFilter` that inserts the ext-proc filter into the Gateway's filter chain.
- GKE: Installs a `GCPRoutingExtension` that registers IPP as a routing extension.
- None: Deploys the core IPP resources (Deployment, Service, config, RBAC) but no proxy integration; you wire that up yourself.
Documentation
| Document | Description |
|---|---|
| Architecture | How IPP works: ext-proc integration, the processing pipeline, profiles, model selection, and multi-pool routing. |
| Configuration | Full configuration reference: the PayloadProcessorConfig API, Helm values, env vars, CLI flags, ConfigMaps, and proxy integration. |
| Plugins | Reference for all in-tree plugins and how the pipeline composes them. |
| Creating a Plugin | Tutorial for writing and registering a custom plugin. |
| Metrics | Prometheus metrics exposed by IPP. |
| Helm Chart | Chart install reference and values table. |
| ModelSelector Proposal | Design of the model-selection framework. |
For end-to-end deployment, see the llm-d project documentation and guides.
Terminology
- IPP (Inference Payload Processor): This service. Inspects and mutates request/response payloads via ext-proc; among other things, it can contribute pool-level routing signals.
- Plugin: A user-configurable unit of behavior (request processor, response processor, model-selector Filter/Scorer/Picker, profile picker, or data-layer collector/extractor/datasource). Plugins are selected and ordered in the `PayloadProcessorConfig`.
- Profile: A named set of request and response plugins. A request executes exactly one profile, chosen by the profile picker.
- ModelSelector: The `Filter → Score → Pick` framework that selects a model (not an endpoint) for a request.
- Proxy: The L7 proxy (e.g., Envoy) that invokes IPP over ext-proc.
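To make the relationship between profiles, plugins, and the profile picker concrete, here is a small sketch under stated assumptions. The `Profile` class, the plugin signature, and `run_request` are hypothetical names for illustration and do not reflect IPP's actual types; the sketch only shows that one picker call selects one profile, whose request plugins then run in their configured order.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Headers = Dict[str, str]
# A request plugin takes (headers, body) and returns possibly modified copies.
RequestPlugin = Callable[[Headers, bytes], Tuple[Headers, bytes]]
# A profile picker maps a request to the name of exactly one profile.
ProfilePicker = Callable[[Headers, bytes], str]


@dataclass
class Profile:
    name: str
    request_plugins: List[RequestPlugin] = field(default_factory=list)


def run_request(
    profiles: Dict[str, Profile],
    pick: ProfilePicker,
    headers: Headers,
    body: bytes,
) -> Tuple[str, Headers, bytes]:
    # Exactly one profile is chosen per request.
    profile = profiles[pick(headers, body)]
    # Plugins run in the order they are listed in the profile.
    for plugin in profile.request_plugins:
        headers, body = plugin(headers, body)
    return profile.name, headers, body
```

Having each plugin return new values instead of mutating shared state keeps the ordering explicit, which is the property the sketch is meant to highlight.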
Contributing
Contributions are welcome — see CONTRIBUTING.md. Active docs work lands on the
docs branch first; branch from it and open PRs against it so the documentation can be reviewed and
assembled collaboratively before merging to main.