llm-d Inference Payload Processor

June 28, 2026 · View on GitHub

CI Go Reference License Join Slack

llm-d Inference Payload Processor

The Inference Payload Processor (IPP) is a pluggable framework for inspecting and mutating inference request and response payloads in the llm-d data plane. It runs as an External Processing (ext-proc) service alongside the inference gateway's Proxy, which streams each request and response to IPP for real-time, payload-aware processing.

Because IPP sees the full payload, it can shape requests and responses in arbitrary ways — any logic that benefits from reading or rewriting the body, headers, or trailers can be expressed as a plugin. Its flagship use is payload-aware routing: extracting signals from the request (such as the model name) and injecting headers so a single Gateway endpoint can front many models and LoRA adapters. This composes with the llm-d Router's Endpoint Picker (EPP) — IPP can decide which pool serves a request while the EPP decides which pod within that pool — but routing is one application of a general framework, not its limit.

IPP Request Flow

Core Capabilities

  • Request processing — Inspect and mutate request headers, body, or trailers before routing.
  • Response processing — Inspect and mutate response headers, body, or trailers on the way back to the client.
  • Payload-aware routing — Extract signals from the request body (e.g. the model name) and inject routing headers so the Proxy can select the correct destination (e.g., InferencePool). This powers multi-pool routing: serving multiple base models and LoRA adapters behind one OpenAI-compatible endpoint.
  • Model selection — A pluggable Filter → Score → Pick pipeline that chooses which model serves a request (e.g. for cost or load-aware routing), adapting the upstream Scheduler Architecture pattern at the model level. See the ModelSelector proposal.
  • Extensibility — All behavior is implemented as plugins configured via a YAML PayloadProcessorConfig. Add your own without forking the framework — see Creating a Plugin.

Modes of Operation

IPP is deployed once per Gateway as a standalone service and wired into the Proxy via ext-proc. The Helm chart provisions the provider-specific integration automatically:

  • Istio — Installs an EnvoyFilter that inserts the ext-proc filter into the Gateway's filter chain.
  • GKE — Installs a GCPRoutingExtension that registers IPP as a routing extension.
  • None — Deploys the core IPP resources (Deployment, Service, config, RBAC) but no proxy integration; you wire that up yourself.

Documentation

DocumentDescription
ArchitectureHow IPP works: ext-proc integration, the processing pipeline, profiles, model selection, and multi-pool routing.
ConfigurationFull configuration reference: the PayloadProcessorConfig API, Helm values, env vars, CLI flags, ConfigMaps, and proxy integration.
PluginsReference for all in-tree plugins and how the pipeline composes them.
Creating a PluginTutorial for writing and registering a custom plugin.
MetricsPrometheus metrics exposed by IPP.
Helm ChartChart install reference and values table.
ModelSelector ProposalDesign of the model-selection framework.

For end-to-end deployment, see the llm-d project documentation and guides.

Terminology

  • IPP (Inference Payload Processor) — This service. Inspects and mutates request/response payloads via ext-proc; among other things, it can contribute pool-level routing signals.
  • Plugin — A user-configurable unit of behavior (request processor, response processor, model-selector Filter/Scorer/Picker, profile picker, or data-layer collector/extractor/datasource). Plugins are selected and ordered in the PayloadProcessorConfig.
  • Profile — A named set of request and response plugins. A request executes exactly one profile, chosen by the profile picker.
  • ModelSelector — The Filter → Score → Pick framework that selects a model (not an endpoint) for a request.
  • Proxy — The L7 proxy (e.g. Envoy) that invokes IPP over ext-proc.

Contributing

Contributions are welcome — see CONTRIBUTING.md. Active docs work lands on the docs branch first; branch from it and open PRs against it so the documentation can be reviewed and assembled collaboratively before merging to main.