Agent Vector Protocol (AVP)
April 5, 2026
Multi-agent text handoffs discard KV-cache, embeddings, and attention state the previous agent already computed. AVP transfers that state directly — zero tokens between agents, 2-3x faster pipelines, same or better accuracy, across models and families.
Overview
Agent Vector Protocol (AVP) is a binary protocol for LLM agent communication via latent representations. When two agents run the same model, AVP lets them exchange hidden states and KV-cache directly, skipping autoregressive text generation entirely. When agents run different models -- same family or different families -- AVP uses vocabulary-mediated projection to bridge between their latent spaces with zero training. When no compatible projection path exists, agents fall back to JSON.
AVP is transport-agnostic -- it defines the binary format, handshake, and codec, not the transport. The reference implementation uses HTTP/2, but AVP messages can be carried over A2A, MCP, gRPC, WebSockets, or any channel that supports binary payloads. AVP handles the latent communication layer, not discovery or orchestration.
How It Works
- Handshake -- Agents exchange model identity (architecture, dimensions, weight hash, tokenizer hash)
- Resolve -- Same model: latent mode. Different models (same family or different families) with a compatible projection path: cross-model projection. Otherwise: JSON fallback.
- Communicate -- Latent mode: binary tensor payloads. Cross-model: projected hidden states. JSON mode: text messages.
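The three steps above can be sketched as a small decision function. This is a minimal illustration, not the SDK's API: `ModelIdentity`, `Mode`, and `resolve_mode` are hypothetical names, and whether a projection path exists is passed in as a flag because the text does not describe how AVP determines it.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    LATENT = "latent"
    CROSS_MODEL = "cross_model"
    JSON = "json"


@dataclass(frozen=True)
class ModelIdentity:
    # Fields mirror the handshake bullet; the field names are illustrative.
    architecture: str
    hidden_dim: int
    weight_hash: str
    tokenizer_hash: str


def resolve_mode(local: ModelIdentity, peer: ModelIdentity, has_projection: bool) -> Mode:
    """Pick a communication mode from two handshake identities."""
    # Identical identity (including weight and tokenizer hashes) means both
    # agents share one representation space, so raw latents can be exchanged.
    if local == peer:
        return Mode.LATENT
    # Different models can still exchange projected hidden states when a
    # projection path is available (assumed to be decided elsewhere).
    if has_projection:
        return Mode.CROSS_MODEL
    # No shared space and no projection: fall back to text.
    return Mode.JSON
```

Comparing the full identity, rather than the architecture name alone, keeps two fine-tunes of the same base model from being treated as the same model.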
What Latent Mode Skips
In a standard agent-to-agent exchange, each message requires full autoregressive generation (token-by-token decoding). For same-model agents, this is redundant -- the receiving agent already operates in the same representation space. AVP eliminates this step by transmitting intermediate hidden states and KV-cache directly.
Binary Format
AVP uses a compact 12-byte header followed by protobuf metadata and raw tensor bytes:
Bytes 0-1: Magic (0x4156 = "AV")
Byte 2: Version (0x01)
Byte 3: Flags (compressed, has_map, kv_cache)
Bytes 4-7: Payload length (uint32 LE)
Bytes 8-11: Metadata length (uint32 LE)
Bytes 12..N: Protobuf metadata
Bytes N..: Raw tensor bytes
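The layout above maps directly onto Python's `struct` module. The sketch below is illustrative only and rests on three assumptions the layout does not state: the flag bit positions, that the payload length counts every byte after the header (metadata plus tensor bytes), and that the magic is written as the ASCII bytes "AV" in that order. The protobuf metadata is treated as opaque bytes, and `pack_message` / `unpack_message` are hypothetical helper names.

```python
import struct
from enum import IntFlag

MAGIC = b"AV"        # 0x41 0x56
VERSION = 0x01
# "<" = little-endian, no padding: 2s + B + B + I + I = 12 bytes
HEADER = struct.Struct("<2sBBII")


class Flags(IntFlag):
    # Bit positions are ASSUMED for illustration; the layout only names the flags.
    COMPRESSED = 0x01
    HAS_MAP = 0x02
    KV_CACHE = 0x04


def pack_message(metadata: bytes, tensor: bytes, flags: Flags = Flags(0)) -> bytes:
    """Frame already-serialized metadata and raw tensor bytes."""
    # ASSUMPTION: payload length covers metadata + tensor bytes.
    payload_len = len(metadata) + len(tensor)
    header = HEADER.pack(MAGIC, VERSION, int(flags), payload_len, len(metadata))
    return header + metadata + tensor


def unpack_message(buf: bytes):
    """Split a framed message into (flags, metadata bytes, tensor bytes)."""
    if len(buf) < HEADER.size:
        raise ValueError("buffer shorter than the 12-byte header")
    magic, version, flags, payload_len, meta_len = HEADER.unpack_from(buf, 0)
    if magic != MAGIC:
        raise ValueError("bad magic")
    if version != VERSION:
        raise ValueError(f"unsupported version {version}")
    if meta_len > payload_len or len(buf) < HEADER.size + payload_len:
        raise ValueError("inconsistent or truncated lengths")
    meta_end = HEADER.size + meta_len
    metadata = buf[HEADER.size:meta_end]
    tensor = buf[meta_end:HEADER.size + payload_len]
    return Flags(flags), metadata, tensor
```

Validating both lengths before slicing means a truncated or corrupted message is rejected rather than silently yielding a short tensor.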
Documentation
Status
Version: 0.4
Current scope: same-model latent communication and cross-model communication via vocabulary-mediated projection (Rosetta Stone v2). Same-family models project through shared vocabulary; cross-family models project through overlapping BPE tokens (~85% overlap for Qwen/Llama). The core SDK depends on numpy, protobuf, and zstandard — torch and engine libraries are optional.
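To make "projection through overlapping tokens" concrete, here is a toy NumPy sketch of one plausible reading of the idea. It is an assumption, not the Rosetta Stone v2 algorithm: a source hidden state is scored against the source model's output embeddings for the shared tokens, turned into a probability distribution, and used to average the target model's input embeddings for the same tokens. The function names and the specific weighting are hypothetical.

```python
import numpy as np


def shared_token_ids(vocab_src: dict, vocab_tgt: dict):
    """Return aligned id arrays for token strings present in both vocabularies."""
    shared = sorted(vocab_src.keys() & vocab_tgt.keys())
    src_ids = np.array([vocab_src[t] for t in shared], dtype=np.int64)
    tgt_ids = np.array([vocab_tgt[t] for t in shared], dtype=np.int64)
    return src_ids, tgt_ids


def project_hidden(h_src, out_emb_src, in_emb_tgt, src_ids, tgt_ids):
    """Map a source hidden state (d_src,) to the target space (d_tgt,).

    ASSUMED mechanism: softmax over shared-token logits, then a
    probability-weighted average of target input embeddings.
    """
    logits = out_emb_src[src_ids] @ h_src   # (n_shared,)
    logits = logits - logits.max()          # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return probs @ in_emb_tgt[tgt_ids]      # (d_tgt,)
```

No parameters are learned here, which matches the zero-training claim: only the two models' existing embedding matrices and the token overlap are used. The hidden dimensions of the two models may differ, since the shared vocabulary is the only bridge between them.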
Implementation
- Python SDK -- pip install avp (v0.6.1). Easy API (think() / generate()), connector API (HuggingFaceConnector, LlamaCppConnector, OllamaConnector, VLLMConnector), cross-model via source= + cross_model=True, ContextStore, per-transfer quality gate, observability metrics, codec, handshake, session management, realignment, KV-cache serialization, Rosetta Stone cross-model projection, framework integrations (LangChain, CrewAI, AutoGen), HTTP/2 transport. Core depends on numpy, protobuf, and zstandard; engine backends are optional extras ([hf], [llamacpp], [ollama], [vllm]).
Ecosystem
AVP is complementary to existing agent protocols and inference engines:
- A2A -- AVP provides a transport binding for A2A via multipart/related with binary payloads
- MCP -- MCP handles tools and context; AVP handles tensor transfer between agents
- HuggingFace Transformers -- Full hidden state and KV-cache access for development and benchmarking (pip install avp[hf])
- vLLM -- Text generation via VLLMConnector; latent transfer via KVConnectorBase_V1 plugin and model plugins for 4 architectures (pip install avp[vllm])
- llama.cpp -- Full latent pipeline on GGUF-quantized models via embeddings API (pip install avp[llamacpp])
- Ollama -- Auto-resolves Ollama model names to GGUF, auto-unloads to free VRAM, inherits full latent pipeline (pip install avp[ollama])
- LangChain / CrewAI / AutoGen -- Framework integrations with latent think/generate roles
Research Foundation
Built on LatentMAS: Latent Collaboration in Multi-Agent Systems -- same-model latent communication via hidden state transfer and KV-cache sharing, with realignment for untied-weight models. Extended with cross-model vocabulary-mediated projection (novel -- zero training, works across model families).
Contributing
See CONTRIBUTING.md
License
Apache 2.0