llm-d Inference Payload Processor

June 28, 2026

The Inference Payload Processor (IPP) is a pluggable framework for inspecting and mutating inference request and response payloads in the llm-d data plane. It runs as an External Processing (ext-proc) service alongside the inference gateway's Proxy, which streams each request and response to IPP for real-time, payload-aware processing.

Because IPP sees the full payload, any logic that benefits from reading or rewriting the headers, body, or trailers can be expressed as a plugin. Its flagship use is payload-aware routing: extracting signals from the request (such as the model name) and injecting headers so that a single Gateway endpoint can front many models and LoRA adapters. This composes with the llm-d Router's Endpoint Picker (EPP): IPP can decide which pool serves a request, while the EPP decides which pod within that pool. Routing is one application of a general framework, not its limit.
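
As a rough illustration of payload-aware routing, the sketch below reads the `model` field from an OpenAI-style JSON request body and returns a header to inject. It is not IPP's plugin API: the function name `routing_headers` and the header name `x-example-model-name` are hypothetical, chosen only for this example.

```python
import json

# Hypothetical header name, used for illustration only.
MODEL_HEADER = "x-example-model-name"


def routing_headers(body: bytes) -> dict[str, str]:
    """Return the headers to inject for an OpenAI-style JSON request body.

    An empty dict means "inject nothing", which leaves the request unchanged.
    """
    try:
        payload = json.loads(body)
    except ValueError:
        # Covers malformed JSON and undecodable bytes.
        return {}
    if not isinstance(payload, dict):
        return {}
    model = payload.get("model")
    if not isinstance(model, str) or not model:
        return {}
    return {MODEL_HEADER: model}


injected = routing_headers(b'{"model": "llama-3-8b", "messages": []}')
```

With a header like this in place, the Proxy's route rules could match on it to choose a pool, and the EPP would then choose a pod within that pool.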

[Figure: IPP request flow]

Core Capabilities

Request processing — Inspect and mutate request headers, body, or trailers before routing.
Response processing — Inspect and mutate response headers, body, or trailers on the way back to the client.
Payload-aware routing — Extract signals from the request body (e.g. the model name) and inject routing headers so the Proxy can select the correct destination (e.g., InferencePool). This powers multi-pool routing: serving multiple base models and LoRA adapters behind one OpenAI-compatible endpoint.
Model selection — A pluggable Filter → Score → Pick pipeline that chooses which model serves a request (e.g. for cost or load-aware routing), adapting the upstream Scheduler Architecture pattern at the model level. See the ModelSelector proposal.
Extensibility — All behavior is implemented as plugins configured via a YAML PayloadProcessorConfig. Add your own without forking the framework — see Creating a Plugin.
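
The Filter → Score → Pick pattern named above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the ModelSelector API: the `Candidate` fields, the weighted-sum scoring, and the helper names (`fits_context`, `cheaper`, `less_loaded`, `select_model`) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    cost_per_1k_tokens: float  # illustrative cost signal
    in_flight: int             # illustrative load signal
    max_context: int           # largest prompt the model accepts


Filter = Callable[[Candidate, dict], bool]
Scorer = Callable[[Candidate, dict], float]


def fits_context(c: Candidate, request: dict) -> bool:
    return request["prompt_tokens"] <= c.max_context


def cheaper(c: Candidate, request: dict) -> float:
    return 1.0 / (1.0 + c.cost_per_1k_tokens)


def less_loaded(c: Candidate, request: dict) -> float:
    return 1.0 / (1.0 + c.in_flight)


def select_model(
    candidates: list[Candidate],
    request: dict,
    filters: list[Filter],
    scorers: list[tuple[Scorer, float]],
) -> Optional[Candidate]:
    # Filter: drop candidates that any filter rejects.
    eligible = [c for c in candidates if all(f(c, request) for f in filters)]
    if not eligible:
        return None
    # Score: weighted sum of every scorer's output.
    totals = {
        c.name: sum(weight * scorer(c, request) for scorer, weight in scorers)
        for c in eligible
    }
    # Pick: highest total wins; ties are broken by name for determinism.
    return min(eligible, key=lambda c: (-totals[c.name], c.name))
```

Changing the scorer weights shifts the trade-off: weighting `cheaper` more heavily favors low-cost models, while weighting `less_loaded` more heavily favors idle ones.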

Modes of Operation

IPP is deployed once per Gateway as a standalone service and wired into the Proxy via ext-proc. The Helm chart provisions the provider-specific integration automatically:

Istio — Installs an EnvoyFilter that inserts the ext-proc filter into the Gateway's filter chain.
GKE — Installs a GCPRoutingExtension that registers IPP as a routing extension.
None — Deploys the core IPP resources (Deployment, Service, config, RBAC) but no proxy integration; you wire that up yourself.

Documentation

| Document | Description |
| --- | --- |
| Architecture | How IPP works: ext-proc integration, the processing pipeline, profiles, model selection, and multi-pool routing. |
| Configuration | Full configuration reference: the `PayloadProcessorConfig` API, Helm values, env vars, CLI flags, ConfigMaps, and proxy integration. |
| Plugins | Reference for all in-tree plugins and how the pipeline composes them. |
| Creating a Plugin | Tutorial for writing and registering a custom plugin. |
| Metrics | Prometheus metrics exposed by IPP. |
| Helm Chart | Chart install reference and values table. |
| ModelSelector Proposal | Design of the model-selection framework. |

For end-to-end deployment, see the llm-d project documentation and guides.

Terminology

IPP (Inference Payload Processor) — This service. Inspects and mutates request/response payloads via ext-proc; among other things, it can contribute pool-level routing signals.
Plugin — A user-configurable unit of behavior (request processor, response processor, model-selector Filter/Scorer/Picker, profile picker, or data-layer collector/extractor/datasource). Plugins are selected and ordered in the PayloadProcessorConfig.
Profile — A named set of request and response plugins. A request executes exactly one profile, chosen by the profile picker.
ModelSelector — The Filter → Score → Pick framework that selects a model (not an endpoint) for a request.
Proxy — The L7 proxy (e.g. Envoy) that invokes IPP over ext-proc.
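
To make the relationship between profiles, plugins, and the profile picker concrete, here is a minimal sketch in which a request runs exactly one profile's request plugins, in order. The types and the picking rule (a hypothetical `x-example-profile` header with a fallback to a default profile) are assumptions for illustration and do not reflect IPP's actual interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable

# A plugin here is any callable that may mutate headers and body in place.
Plugin = Callable[[dict, dict], None]


@dataclass
class Profile:
    name: str
    request_plugins: list[Plugin] = field(default_factory=list)
    response_plugins: list[Plugin] = field(default_factory=list)


def pick_profile(profiles: dict[str, Profile], headers: dict, default: str) -> Profile:
    # Hypothetical rule: honor an explicit header, else use the default profile.
    wanted = headers.get("x-example-profile", default)
    return profiles.get(wanted, profiles[default])


def process_request(
    profiles: dict[str, Profile], headers: dict, body: dict, default: str = "default"
) -> str:
    """Run the request plugins of exactly one profile and return its name."""
    profile = pick_profile(profiles, headers, default)
    for plugin in profile.request_plugins:
        plugin(headers, body)
    return profile.name


def tag(label: str) -> Plugin:
    """Build a toy plugin that appends a label to a trace header."""
    def plugin(headers: dict, body: dict) -> None:
        headers["x-example-trace"] = headers.get("x-example-trace", "") + label
    return plugin


profiles = {
    "default": Profile("default", request_plugins=[tag("a"), tag("b")]),
    "batch": Profile("batch", request_plugins=[tag("c")]),
}
```

Because only the picked profile's plugins run, the trace header records `ab` for the default profile and `c` for the `batch` profile, never a mix of the two.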

Contributing

Contributions are welcome — see CONTRIBUTING.md. Active docs work lands on the docs branch first; branch from it and open PRs against it so the documentation can be reviewed and assembled collaboratively before merging to main.