Agent Vector Protocol (AVP) Specification
April 5, 2026
Version: 0.4 Status: Draft Last Updated: March 2026
Abstract
Agent Vector Protocol (AVP) is a binary protocol that enables LLM agents to communicate via latent representations (hidden states and KV-cache) instead of text. Same-model agents skip autoregressive generation entirely and exchange intermediate tensors directly. Cross-model agents -- same family or different families -- communicate via vocabulary-mediated projection with zero training. Models with no compatible projection path fall back to JSON text.
1. Introduction
1.1 Motivation
Current agent-to-agent communication requires each agent to:
1. Generate a full text response (autoregressive token-by-token)
2. Serialize the text (JSON, MessagePack, etc.)
3. Transmit over the network
4. Parse and re-encode text into the receiving model's embedding space
For same-model agents, this is wasteful -- the receiving agent already shares the same representation space. AVP lets these agents skip steps 1 and 4 by transmitting hidden states and KV-cache directly.
1.2 Design Goals
- Skip generation: Same-model agents bypass autoregressive decoding
- Graceful fallback: Incompatible models automatically fall back to JSON
- Transport-agnostic: AVP defines the binary format, handshake, and codec -- not the transport. The reference implementation uses HTTP/2, but AVP messages can be carried over any transport (A2A DataParts, gRPC, WebSockets, shared memory, etc.)
- Complementary: AVP is a latent communication layer, not an orchestration protocol. It works alongside A2A, MCP, or any agent framework.
- Engine-agnostic: Works with HuggingFace Transformers, vLLM, llama.cpp, Ollama, and other inference engines
- Extensible: Handshake carries enough structural info for cross-model communication
1.3 Scope
This specification covers same-model latent communication and cross-model communication via vocabulary-mediated projection (Rosetta Stone v2). Same-family models project through shared vocabulary; cross-family models project through overlapping BPE tokens. Both require zero training.
2. Protocol Overview
2.1 Communication Modes
AVP supports three communication modes, negotiated during handshake:
| Mode | When | What's transmitted |
|---|---|---|
| Latent | Same model (hash or structure match) | Hidden states, KV-cache |
| Latent (cross-model) | Same or different family | Vocabulary-mediated projected hidden states |
| JSON | Incompatible models | Plain text messages |
2.2 Message Flow
```
Agent A                          Agent B
|                                   |
|--[1] Handshake ------------------>|  (exchange model identity)
|<-[2] Identity response -----------|
|                                   |
|  [3] Resolve compatibility        |  (same hash -> latent, else -> json)
|                                   |
|--[4] AVP binary message --------->|  (hidden state or KV-cache)
|<-[5] Response --------------------|
|                                   |
|  ... or if JSON mode ...          |
|                                   |
|--[4] JSON text message ---------->|  (fallback)
|<-[5] Response --------------------|
```
The diagram above is transport-independent. The reference HTTP/2 binding maps these to POST /avp/v2/handshake, /avp/v2/transmit, and /avp/v2/text. Other transports (gRPC, A2A DataParts, shared memory) can carry the same messages.
2.3 Core Components
- Handshake: Model identity exchange and compatibility resolution
- Binary codec: Hidden states, KV-cache, and embeddings serialized with protobuf metadata
- Compression: Optional zstd (mainly useful for embeddings)
- Session management: Track active agent pairs with TTL
- Realignment: Project hidden states from output to input embedding space
- JSON fallback: Text communication for incompatible models
- Transport: Transport-agnostic; reference binding is HTTP/2
3. Handshake Protocol
3.1 Model Identity
Each agent advertises its model identity during handshake:
| Field | Type | Description |
|---|---|---|
| model_family | string | Architecture family (e.g. "llama", "qwen", "mistral") |
| model_id | string | Full model identifier (e.g. "meta-llama/Llama-2-7b") |
| model_hash | string | SHA-256 of sorted model config |
| hidden_dim | uint32 | Hidden state dimensionality |
| num_layers | uint32 | Number of transformer layers |
| num_kv_heads | uint32 | Number of key-value attention heads |
| head_dim | uint32 | Dimension per attention head |
| tokenizer_hash | string | SHA-256 of sorted tokenizer vocabulary (optional, enables cross-model projection) |
3.2 Compatibility Resolution
The resolver determines the communication mode by evaluating rules in priority order. The first matching rule wins:
1. Model hash matches -> Latent mode (identical models)
2. Same family + matching hidden_dim + num_layers -> Latent mode (structurally identical)
3. Shared tokenizer_hash -> Latent mode with `avp_map_id="vocab:{hash[:16]}"` (vocabulary-mediated cross-model projection)
4. Pre-calibrated .avp-map file exists -> Latent mode with `avp_map_id="{src_hash[:16]}_{tgt_hash[:16]}"` (pre-calibrated cross-model projection)
5. Sufficient vocabulary overlap (>= 100 tokens) -> Latent mode with `avp_map_id="vocab_overlap:{overlap_count}"` (vocabulary-overlap cross-model projection)
6. No match -> JSON fallback
Rules 1-2 resolve same-model communication (no projection needed). Rule 3 enables zero-parameter cross-model communication between same-family models that share a tokenizer (e.g. Qwen2.5-1.5B and Qwen2.5-0.5B). Rule 4 enables pre-calibrated cross-model communication via cached projection maps on disk. Rule 5 enables zero-parameter cross-family communication by projecting through the overlapping portion of two different BPE vocabularies (e.g. Qwen to Llama, ~85% token overlap). When avp_map_id is non-empty, the session requires a Rosetta Stone projection map (see Section 4.4).
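The priority order above can be sketched as a single function. This is a minimal illustration, not the reference resolver: `ModelIdentity`, `resolve_mode`, and the `has_map_file` / `overlap_count` parameters are hypothetical names, and the sketch assumes the hashes used in the rule 4 map ID are the two `model_hash` values.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

MIN_OVERLAP = 100  # minimum shared tokens for rule 5, as stated above


@dataclass(frozen=True)
class ModelIdentity:
    """Hypothetical container for the handshake fields in Section 3.1."""
    model_family: str
    model_id: str
    model_hash: str
    hidden_dim: int
    num_layers: int
    num_kv_heads: int
    head_dim: int
    tokenizer_hash: Optional[str] = None


def resolve_mode(src: ModelIdentity, tgt: ModelIdentity,
                 has_map_file: bool = False,
                 overlap_count: int = 0) -> Tuple[str, str]:
    """Return (mode, avp_map_id). The first matching rule wins."""
    # Rule 1: identical models
    if src.model_hash == tgt.model_hash:
        return "latent", ""
    # Rule 2: structurally identical models in the same family
    if (src.model_family == tgt.model_family
            and src.hidden_dim == tgt.hidden_dim
            and src.num_layers == tgt.num_layers):
        return "latent", ""
    # Rule 3: shared tokenizer -> vocabulary-mediated projection
    if src.tokenizer_hash and src.tokenizer_hash == tgt.tokenizer_hash:
        return "latent", f"vocab:{src.tokenizer_hash[:16]}"
    # Rule 4: pre-calibrated map on disk (assumes model_hash is the key)
    if has_map_file:
        return "latent", f"{src.model_hash[:16]}_{tgt.model_hash[:16]}"
    # Rule 5: enough overlapping tokens between the two vocabularies
    if overlap_count >= MIN_OVERLAP:
        return "latent", f"vocab_overlap:{overlap_count}"
    # Rule 6: nothing matched
    return "json", ""
```

Because the checks run top to bottom, a pair that shares a tokenizer hash never reaches the overlap rule, which matches the "first matching rule wins" behaviour described above.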
3.3 Session
A successful handshake creates a session with:
- Unique session_id
- Negotiated communication mode
- Both agent identities
- `avp_map_id` (non-empty if cross-model projection is required)
- TTL (default 1 hour)
Sessions expire automatically. The session manager handles cleanup.
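As an illustration of the session fields and TTL expiry described above, here is a minimal in-memory sketch. `Session` and `SessionManager` are hypothetical names, and lazy cleanup on lookup is an assumption; the specification does not say how the session manager performs cleanup.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional

DEFAULT_TTL_SECONDS = 3600.0  # "default 1 hour" from the list above


@dataclass
class Session:
    mode: str                      # negotiated communication mode
    local_identity: dict           # both agent identities (shape is illustrative)
    remote_identity: dict
    avp_map_id: str = ""           # non-empty if cross-model projection is required
    ttl_seconds: float = DEFAULT_TTL_SECONDS
    session_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    created_at: float = field(default_factory=time.monotonic)

    def is_expired(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        return now - self.created_at >= self.ttl_seconds


class SessionManager:
    """Keeps sessions in a dict and drops expired ones when they are looked up."""

    def __init__(self) -> None:
        self._sessions: Dict[str, Session] = {}

    def add(self, session: Session) -> str:
        self._sessions[session.session_id] = session
        return session.session_id

    def get(self, session_id: str, now: Optional[float] = None) -> Optional[Session]:
        session = self._sessions.get(session_id)
        if session is not None and session.is_expired(now):
            del self._sessions[session_id]  # lazy cleanup
            return None
        return session
```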
4. Latent Communication
4.1 Hidden States
Agents extract hidden states from intermediate transformer layers and transmit them as raw tensor bytes. The receiving agent injects these via inputs_embeds to continue generation without re-encoding from text.
Hidden states require realignment -- projection from the model's output space back to the input embedding space. This is computed from the model's embedding and language model head weights:
W_realign = (E_out^T E_out + lambda * I)^{-1} E_out^T E_in
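The formula is a ridge-regularized least-squares solve. A minimal NumPy sketch follows, assuming `E_in` and `E_out` are the `[vocab_size, hidden_dim]` input-embedding and output-head matrices and that `lam` is a small regularization constant; the function name and the default value of `lam` are illustrative, not taken from the reference implementation.

```python
import numpy as np


def realign_matrix(e_in: np.ndarray, e_out: np.ndarray, lam: float = 1e-5) -> np.ndarray:
    """Compute W_realign = (E_out^T E_out + lam * I)^-1 E_out^T E_in.

    e_in, e_out: [vocab_size, hidden_dim]. Returns [hidden_dim, hidden_dim].
    """
    d = e_out.shape[1]
    gram = e_out.T @ e_out + lam * np.eye(d, dtype=e_out.dtype)
    # Solving the linear system avoids forming the explicit inverse.
    return np.linalg.solve(gram, e_out.T @ e_in)
```

A hidden state `h` of shape `[1, hidden_dim]` would then be realigned as `h @ W_realign` before injection.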
Models with tied weights (tie_word_embeddings=True) do not need the W_realign projection. However, hidden states from the last transformer layer still have different directional structure than input embeddings (cosine similarity ~0.24). For these models, hidden states are projected through the vocabulary via softmax soft embedding:
```
logits = hidden @ W_embed^T    (project to vocabulary logits)
probs  = softmax(logits)       (probability distribution over tokens)
embed  = probs @ W_embed       (weighted average of embeddings)
```
This produces vectors with cosine similarity ~1.0 to the nearest input embedding.
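The three steps above translate directly to NumPy. This is a sketch under the assumption that `w_embed` is the tied `[vocab_size, hidden_dim]` embedding matrix; `soft_embed` is an illustrative name.

```python
import numpy as np


def soft_embed(hidden: np.ndarray, w_embed: np.ndarray) -> np.ndarray:
    """Project hidden states [B, D] through the vocabulary and back to [B, D]."""
    logits = hidden @ w_embed.T                          # [B, vocab_size]
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)           # softmax over tokens
    return probs @ w_embed                               # weighted average of embeddings
```

When the logits are sharply peaked on one token, the result is close to that token's input embedding, which is why the output lines up with the input embedding space.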
Realignment matrices are cached to disk (~/.avp/realign/{model_hash}.pt) since they only depend on the model weights.
4.2 KV-Cache Transfer
Agents can transfer attention key-value caches to share context without re-processing input tokens. The KV-cache is serialized as contiguous little-endian tensor bytes:
[K_layer0][V_layer0][K_layer1][V_layer1]...
A 17-byte header precedes the tensor data:
- num_layers (uint32)
- num_kv_heads (uint32)
- head_dim (uint32)
- seq_len (uint32)
- dtype (uint8)
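The header layout can be expressed with Python's `struct` module. This sketch assumes the fields appear in the order listed and that the header is little-endian like the tensor data; the dtype code used in the usage note is a placeholder, since the numeric dtype values are not defined in this section.

```python
import struct

# Four uint32 fields followed by one uint8; "<" means little-endian with no padding.
KV_HEADER = struct.Struct("<IIIIB")


def pack_kv_header(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_code: int) -> bytes:
    return KV_HEADER.pack(num_layers, num_kv_heads, head_dim, seq_len, dtype_code)


def unpack_kv_header(buf: bytes) -> tuple:
    """Return (num_layers, num_kv_heads, head_dim, seq_len, dtype_code)."""
    return KV_HEADER.unpack_from(buf, 0)
```

For example, `pack_kv_header(28, 4, 128, 200, 1)` yields 17 bytes (4 x 4 + 1), and the tensor data would start at offset `KV_HEADER.size`.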
4.3 Transfer Modes
KV-cache payloads are large. AVP defines multiple transfer modes so users can choose the right bandwidth/compute tradeoff for their environment. The transfer mode is selected by the sender and indicated in the message metadata.
Mode 1: Full KV-cache (default)
Transmit the complete KV-cache as contiguous fp16 tensor bytes. Lossless. No additional receiver compute. Best for same-host or high-bandwidth datacenter (>1 Gbps).
Mode 2: Quantized KV-cache (specified, not yet implemented)
Transmit KV-cache in int8 or int4 representation. Reduces payload by 2-4x with negligible quality impact (int8) or small quality impact (int4). No additional receiver compute. Best for moderate bandwidth (500 Mbps - 1 Gbps).
Mode 3: Hidden-state transfer
Transmit hidden state vectors only, without the full KV-cache. The receiver reconstructs context by running latent steps (forward passes with the injected hidden states). Reduces payload by 16x or more. Trades bandwidth for receiver compute. Best for lower bandwidth (<500 Mbps).
This is what LatentMAS uses for Agent 4 (Judger) -- it receives hidden states via inputs_embeds, not KV-cache.
Mode 4: Delta transfer (specified, not yet implemented)
Transmit only KV-cache entries beyond a shared prefix. When agents share a common system prompt, the KV-cache for that prefix is identical and does not need to be transferred. Lossless. Combinable with modes 1-3.
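Since delta transfer is specified but not yet implemented, the following is only a sketch of the idea. It assumes each layer's K and V are arrays shaped `[num_kv_heads, seq_len, head_dim]` and that both sides already know the shared prefix length; `kv_delta` and `kv_apply_delta` are hypothetical helpers.

```python
import numpy as np


def kv_delta(kv_layers, prefix_len: int):
    """Sender side: keep only the positions beyond the shared prefix."""
    return [(k[:, prefix_len:, :], v[:, prefix_len:, :]) for k, v in kv_layers]


def kv_apply_delta(prefix_layers, delta_layers):
    """Receiver side: append the delta to the locally held prefix cache."""
    return [
        (np.concatenate([pk, dk], axis=1), np.concatenate([pv, dv], axis=1))
        for (pk, pv), (dk, dv) in zip(prefix_layers, delta_layers)
    ]
```

The reconstruction is exact, which is what makes this mode lossless: no values are changed, only omitted and re-attached.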
Payload Size Reference
Representative KV-cache sizes per token (fp16):
| Model size | Per token | 200 tokens (full) | 200 tokens (int8) | 200 hidden states |
|---|---|---|---|---|
| 7B | ~256 KB | 50 MB | 25 MB | 1.6 MB |
| 14B | ~320 KB | 64 MB | 32 MB | 2.0 MB |
| 70B | ~640 KB | 128 MB | 64 MB | 3.2 MB |
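The hidden-state column can be approximated as tokens x hidden_dim x 2 bytes (fp16). The sketch below assumes hidden sizes of 4096, 5120, and 8192 for the three rows; those values are assumptions typical of models in these size classes, not figures from this specification. The per-token KV sizes are taken from the table.

```python
def hidden_state_bytes(num_tokens: int, hidden_dim: int, bytes_per_elem: int = 2) -> int:
    """Payload size when only hidden states are sent (fp16 by default)."""
    return num_tokens * hidden_dim * bytes_per_elem


def full_kv_bytes(num_tokens: int, per_token_kib: int) -> int:
    """Payload size for the full KV-cache, given a per-token size in KiB."""
    return num_tokens * per_token_kib * 1024


# (assumed hidden_dim, per-token KV size in KiB from the table)
ASSUMED = {"7B": (4096, 256), "14B": (5120, 320), "70B": (8192, 640)}

for name, (hidden_dim, per_token_kib) in ASSUMED.items():
    hs = hidden_state_bytes(200, hidden_dim)
    kv = full_kv_bytes(200, per_token_kib)
    print(f"{name}: hidden={hs / 1e6:.1f} MB, full KV={kv / 1e6:.1f} MB, ratio={kv / hs:.0f}x")
```

Under these assumptions the ratio between the full KV-cache and the hidden-state payload comes out well above 16x, in line with the "16x or more" figure given for Mode 3.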
Choosing a Transfer Mode
The choice depends on available bandwidth and acceptable receiver compute:
| Environment | Recommended mode | Transfer overhead | Rationale |
|---|---|---|---|
| Same process | In-memory (tensor reference) | ~5ms | No serialization needed |
| Same machine (multi-process) | Full KV-cache via shared memory | ~15-40ms | Local memory bandwidth (~5 GB/s) |
| Datacenter (>1 Gbps) | Full KV-cache or int8 | ~50-200ms | Network bandwidth is cheap |
| Cloud / cross-region (100 Mbps - 1 Gbps) | Quantized (int8/int4) + delta | ~0.5-2s | Balance bandwidth and quality |
| Edge / limited bandwidth (<100 Mbps) | Hidden-state transfer | ~0.04s network + latent step compute | Minimize payload, trade for compute |
Latent communication works well in two deployment scenarios:
- Local: Agents on the same machine (same process, multi-process, or containers). Transfer overhead is under 40ms even for 70B models. This is the simplest deployment and requires no network considerations.
- Datacenter: Agents on different machines with high-bandwidth interconnects (>500 Mbps). Transfer modes allow tuning the bandwidth/compute tradeoff.
Below ~50 Mbps over a network, JSON text mode is likely more practical unless hidden-state transfer mode is used.
4.4 Cross-Model Communication (Rosetta Stone)
When agents run different models -- same family or different families -- they can communicate via latent projection instead of JSON fallback. The handshake sets avp_map_id to indicate which projection method to use.
Vocabulary-Mediated Projection (avp_map_id = "vocab:...")
Same-family models share the same tokenizer -- same vocabulary, same token indices. The vocabulary serves as a natural shared coordinate system with dimensionality equal to the vocabulary size (e.g. 151K for Qwen2). The projection requires zero learned parameters and no calibration:
```
Source model (D_src):  hidden @ W_src^T   ->  logits [vocab_size]
                       softmax(logits)    ->  token probabilities [vocab_size]
                       probs @ W_tgt      ->  target embedding [D_tgt]
Target model (D_tgt):  inject via inputs_embeds
```
Where W_src is the source model's output head (lm_head) weights and W_tgt is the target model's input embedding weights. This is the cross-model generalization of the tied-weight soft embedding projection in Section 4.1.
The method is identified by avp_map_id starting with "vocab:", followed by the first 16 characters of the shared tokenizer hash.
Vocabulary-Overlap Projection (avp_map_id = "vocab_overlap:...")
Cross-family models have different tokenizers, but BPE tokenizers share many tokens (ASCII characters, common English words, punctuation). The vocabulary overlap bridge identifies shared tokens between the two vocabularies and projects through only the overlapping portion:
```
Source model (D_src):  hidden @ W_src^T             ->  full_logits [vocab_size_src]
                       full_logits[src_indices]     ->  shared_logits [N_shared]
                       softmax(shared_logits)       ->  shared_probs [N_shared] (renormalized)
                       shared_probs @ W_tgt_shared  ->  target embedding [D_tgt]
Target model (D_tgt):  inject via inputs_embeds
```
Where src_indices are the source token IDs for tokens that exist in both vocabularies, and W_tgt_shared are the corresponding target embedding rows. The softmax renormalization over shared tokens is semantically correct: "given the source model's beliefs about the next token, restricted to tokens both models understand, what's the expected target embedding?"
Typical overlap ratios: ~85% for Qwen/Llama, varying by family pair. Minimum overlap threshold: 100 shared tokens (below this, fall through to JSON fallback).
This is a strict generalization of vocabulary-mediated projection -- at 100% vocabulary overlap (same tokenizer), it produces identical results. The method requires zero learned parameters and no calibration.
The method is identified by avp_map_id starting with "vocab_overlap:", followed by the overlap count.
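The overlap projection described in this subsection can be sketched in NumPy as follows. The sketch assumes each vocabulary is available as a `{token_string: token_id}` dict and that `w_src` / `w_tgt` are the source output head and target input embedding matrices; all function names are illustrative, and the 100-token minimum is not enforced here.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def shared_token_indices(src_vocab: dict, tgt_vocab: dict):
    """Return aligned (src_indices, tgt_indices) for tokens present in both vocabularies."""
    shared = sorted(set(src_vocab) & set(tgt_vocab))
    src_idx = np.array([src_vocab[t] for t in shared], dtype=np.int64)
    tgt_idx = np.array([tgt_vocab[t] for t in shared], dtype=np.int64)
    return src_idx, tgt_idx


def project_overlap(hidden, w_src, w_tgt, src_idx, tgt_idx):
    """hidden [B, D_src] -> target embedding [B, D_tgt] via shared tokens only."""
    full_logits = hidden @ w_src.T            # [B, vocab_size_src]
    shared_logits = full_logits[:, src_idx]   # [B, N_shared]
    shared_probs = softmax(shared_logits)     # renormalized over shared tokens
    return shared_probs @ w_tgt[tgt_idx]      # [B, D_tgt]
```

With identical vocabularies the index arrays cover every token, so the result equals `softmax(hidden @ w_src.T) @ w_tgt`, the vocabulary-mediated projection.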
| Method | Cross-dim? | Training | Quality | Use case |
|---|---|---|---|---|
| Vocabulary-mediated | Yes | None (instant) | High (cos sim ~1.0) | Same-family, shared tokenizer |
| Vocabulary-overlap | Yes | None (instant) | High on structured tasks | Cross-family, different tokenizers |
Projection Method Enum
Maps carry a method field indicating the projection algorithm:
| Value | Description |
|---|---|
| `vocab_mediated` | Vocabulary-mediated projection, shared tokenizer (zero parameters) |
| `vocab_overlap` | Vocabulary-overlap projection, different tokenizers (zero parameters) |
Projection Validation
Before using a cross-model projection in production, implementations SHOULD validate projection quality. AVP defines a two-tier validation gate:
Tier 1: Cosine similarity (fast, ~1ms)
Project source hidden states through the projection and compare to target embeddings. For vocabulary-mediated projection, projected[i] should predict target_embed[token_ids[i+1]] (next-token prediction). Reject projections with cosine similarity below 0.5 (instant JSON fallback).
Tier 2: Pseudo-perplexity (~30ms, requires shared tokenizer)
Inject a single projected embedding to prime the target model's context, then feed actual text tokens and measure cross-entropy. This mirrors the actual pipeline behavior.
| Perplexity | Recommendation |
|---|---|
| ≤ 100 | LATENT — projection quality sufficient for direct use |
| > 100 | JSON — projection too lossy, fall back to text |
Thresholds are calibrated against real pipeline results: Qwen2.5-1.5B→0.5B achieves pseudo-perplexity of 25.8 and produces coherent output in the latent pipeline.
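The two tiers can be combined into a simple decision function. This sketch uses the thresholds stated above (cosine similarity 0.5, pseudo-perplexity 100); averaging the per-position cosine similarities and treating a missing perplexity as "tier 2 skipped" are assumptions, and the function names are illustrative.

```python
from typing import Optional

import numpy as np

COS_MIN = 0.5    # tier 1 rejection threshold
PPL_MAX = 100.0  # tier 2 threshold


def mean_next_token_cosine(projected: np.ndarray, target_embed: np.ndarray, token_ids) -> float:
    """Compare projected[i] with target_embed[token_ids[i + 1]] and average the cosines."""
    pred = projected[:-1]                          # the last position has no next token
    tgt = target_embed[np.asarray(token_ids)[1:]]
    num = (pred * tgt).sum(axis=1)
    den = np.linalg.norm(pred, axis=1) * np.linalg.norm(tgt, axis=1) + 1e-12
    return float((num / den).mean())


def validation_gate(cos_sim: float, pseudo_ppl: Optional[float] = None) -> str:
    """Return "latent" or "json" according to the two-tier gate."""
    if cos_sim < COS_MIN:
        return "json"       # tier 1 failed: immediate JSON fallback
    if pseudo_ppl is not None and pseudo_ppl > PPL_MAX:
        return "json"       # tier 2 failed: projection too lossy
    return "latent"
```

For the Qwen2.5-1.5B to 0.5B figure quoted above, `validation_gate(cos_sim, 25.8)` returns `"latent"` whenever the cosine check passes.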
Per-Transfer Quality Gate
Cross-model projection accuracy depends on prompt length. Single-embedding rosetta works well for short structured prompts but degrades for longer prompts:
| Prompt tokens | GSM8K cross-family | HumanEval same-family |
|---|---|---|
| < 300 | 65% | 61% |
| 300-500 | 41% | 40% |
| 500+ | — | 19% |
Implementations SHOULD provide an advisory quality gate that recommends latent vs JSON fallback based on prompt token count. The default threshold is 300 tokens. The gate is advisory — callers decide how to act on the recommendation.
```python
from avp.rosetta.quality import assess_transfer

# input_ids: tokenized prompt with shape [1, seq_len]
result = assess_transfer(prompt_tokens=len(input_ids[0]))
if result.recommend_latent:
    ...  # proceed with rosetta projection
else:
    ...  # fall back to JSON text transfer
```
4.5 Engine Support
AVP is engine-agnostic. The binary format, handshake, and codec do not depend on any specific inference engine. Implementations provide engine-specific connectors that handle hidden state extraction, KV-cache access, and embedding injection. The core SDK depends only on numpy, protobuf, and zstandard -- torch and other engine libraries are optional dependencies installed via extras (e.g. pip install avp[hf], pip install avp[llamacpp]).
Connector Interface
All engine connectors implement the EngineConnector abstract base class, which defines both a low-level interface (hidden state extraction, embedding injection, KV-cache access) and a high-level API for common integration patterns.
High-Level API
The high-level API reduces AVP integration from ~50 lines of boilerplate to ~5 lines:
| Method | Description | Returns |
|---|---|---|
| `think(prompt, steps, context, output)` | Run latent thinking steps. `output=OutputType.AUTO` (default) lets the system decide; `OutputType.KV_CACHE` returns full KV-cache; `OutputType.HIDDEN_STATE` returns only the last hidden state (KV freed). | `AVPContext` |
| `generate(prompt, context, source, cross_model, ...)` | Generate text, optionally conditioned on latent context from `think()`. `source=` + `cross_model=True` enables automatic cross-model projection (experimental). | `str` |
| `can_think` | Whether this connector supports `think()` (requires hidden state access) | `bool` |
AVPContext is a lightweight wrapper around a KV-cache or hidden state (tensor references, no copy) with metadata for compatibility checking:
- `past_key_values` -- DynamicCache or legacy tuple (None when `output=HIDDEN_STATE`)
- `last_hidden_state` -- final hidden state vector `[1, D]` from the last latent step of `think()` (used for cross-model projection)
- `payload_type` -- derived property: `KV_CACHE` if `past_key_values` is set, `HIDDEN_STATE` if only the hidden state is present
- `model_hash` -- SHA-256 of source model config (checked implicitly by `think()`/`generate()`)
- `num_steps` -- accumulated latent thinking steps
- `seq_len` -- current KV-cache sequence length
- `to_bytes()`/`from_bytes()` -- serialize to/from AVP wire format for cross-process transfer
Same-process usage never requires serialization -- AVPContext holds tensor references directly. to_bytes() invokes the standard AVP codec (Section 5) and stores model_hash, model_family, and num_steps in the protobuf metadata extra fields.
Cross-Model via source= + cross_model=True
When generate() receives source= + cross_model=True, it automatically:
- Detects model mismatch via `model_hash`
- Calibrates or loads a projection map (memory cache -> disk cache -> calibrate)
- Projects `context.last_hidden_state` through the Rosetta Stone projection
- Primes the target model's KV-cache with the projected embedding
- Generates text conditioned on the projected context
Cross-model projection is experimental — accuracy varies by task type (structured tasks work well, comprehension may degrade). Without cross_model=True, cross-model calls fall back to text-only generation with a UserWarning.
```python
researcher = HuggingFaceConnector.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
solver = HuggingFaceConnector.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

context = researcher.think(prompt, steps=10)
answer = solver.generate(prompt, context=context, source=researcher, cross_model=True)
```
Capability Discovery
Not all engines support all operations. Connectors advertise capabilities via properties:
| Capability | HuggingFace | vLLM SDK | llama.cpp | Ollama | Description |
|---|---|---|---|---|---|
| `can_think` | Yes | No | Yes | Yes | Latent thinking requires per-step hidden state access |
| `generate()` with context | Yes | No | Yes | Yes | KV-cache injection requires direct cache access |
| `generate()` without context | Yes | Yes | Yes | Yes | Text-only generation |
| Cross-model rosetta | Yes | No | Yes | Yes | Automatic projection via source= + cross_model=True |
Calling think() on a connector with can_think=False raises EngineNotAvailableError with a message guiding the user to the correct connector. Calling generate() with a context argument on a connector that doesn't support it raises the same error.
HuggingFace Transformers
Full hidden state and KV-cache access via the standard model() and model.generate() APIs. Supports output_hidden_states=True for hidden state extraction, past_key_values (DynamicCache or legacy tuple) for KV-cache injection, and inputs_embeds for embedding injection. This is the primary development and benchmarking backend. Requires pip install avp[hf] (torch + transformers >= 5.0).
The HuggingFace connector supports the full high-level API:
```python
connector = HuggingFaceConnector.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
context = connector.think("Analyze this problem: ...", steps=20)
answer = connector.generate("Solve it.", context=context)
```
from_pretrained() auto-detects device (CUDA > MPS > CPU) and dtype (bfloat16 > float16 > float32).
vLLM
Production serving integration via three components at different levels:
- SDK connector (`VLLMConnector`): Wraps the vLLM `LLM` engine for identity extraction, tokenization, text generation, and embedding injection via the `prompt_embeds` API. Supports `generate()` for text-only generation. Does not support `think()` or context injection -- vLLM is a serving engine that manages KV-cache internally and does not expose per-step hidden states. Requires `pip install avp[vllm]`.
- KV connector plugin (`AVPKVConnectorV1Dynamic`): Implements `KVConnectorBase_V1` for intercepting vLLM's attention pipeline. Uses file-based storage (`{request_id}.avp`) for KV-cache save/load between requests. Includes PagedAttention ↔ contiguous tensor conversion for bridging vLLM's paged memory layout with AVP's contiguous binary format. This is where latent transfer happens for vLLM -- transparently at the engine level, not at the SDK level.
- Model plugins (`AVPLatent{Qwen2,Llama,Mistral,Gemma}ForCausalLM`): Registered via the `vllm.general_plugins` entry point. Enable latent thinking and cross-model rosetta projection within vLLM's serving pipeline for 4 model architectures.
The KV connector plugin is loaded dynamically via vLLM's --kv-connector configuration -- no vLLM fork required.
The SDK connector and the KV connector plugin serve different roles: the SDK connector provides handshake, identity, and text generation for application code, while the KV connector plugin handles latent transfer between vLLM instances, transparent to the application.
llama.cpp
Full latent pipeline on GGUF-quantized models via the embeddings API and ctypes batch injection. Supports think() and generate() with same-model and cross-model rosetta. Requires pip install avp[llamacpp] (llama-cpp-python + gguf).
```python
connector = LlamaCppConnector.from_pretrained("model.gguf")
context = connector.think("Analyze this problem", steps=10)
answer = connector.generate("Solve it", context=context)
```
Key implementation details: Jinja2-based chat template rendering (model-agnostic), GGUF weight extraction for rosetta projection, dedicated llama_context per operation (thread-safe), weight caching for repeated use.
Ollama
Resolves Ollama model names (e.g. "qwen2.5:7b") to GGUF blob paths on disk, auto-unloads the model from the Ollama server (keep_alive=0) to free VRAM, then inherits the full latent pipeline from LlamaCppConnector. Requires pip install avp[ollama].
```python
connector = OllamaConnector.from_ollama("qwen2.5:7b")
context = connector.think("Analyze this problem", steps=10)
answer = connector.generate("Solve it", context=context)
```
4.6 Easy API
The easy API provides a zero-friction entry point for common use cases. It manages model loading, connector creation, handshake, and serialization internally.
Primary API (v0.6.1):
| Function | Description | Returns |
|---|---|---|
| `think(prompt, model, steps, output, ...)` | Load model, run latent thinking steps. `output=OutputType.AUTO` (default), `.KV_CACHE`, or `.HIDDEN_STATE`. | `ThinkResult` |
| `generate(content, model, steps, source_model, cross_model, store, store_key, ...)` | Think + generate in one call. `source_model=` + `cross_model=True` enables cross-model projection (experimental). | `str` |
Deprecated API (v0.2.x, still exported):
| Function | Description | Returns |
|---|---|---|
| `pack(content, model, think_steps, ...)` | Load model, run latent thinking steps, return a serializable message | `PackedMessage` |
| `unpack(data, model, ...)` | Receive a packed message, generate a text response | `str` |
```python
import avp

# Same-model
answer = avp.generate("Solve: 24 * 17 + 3", model="Qwen/Qwen2.5-7B-Instruct")

# Cross-model (automatic rosetta projection, experimental)
answer = avp.generate(
    "Solve: 24 * 17 + 3",
    model="meta-llama/Llama-3.2-3B-Instruct",
    source_model="Qwen/Qwen2.5-7B-Instruct",
    cross_model=True,
)
```
ContextStore
ContextStore is a thread-safe, TTL-backed store for AVPContext objects. It enables multi-turn latent conversations where each agent stores its context under a key and retrieves prior context from other agents:
```python
store = ContextStore(default_ttl=300)
result = avp.generate(content, model=model, store=store, store_key="agent_a", prior_key="agent_b")
```
4.7 Framework Integrations
AVP provides integration modules for popular agent frameworks, enabling latent communication within existing framework pipelines:
| Framework | Integration class | Description |
|---|---|---|
| LangChain | ChatAVP (BaseChatModel) | Drop-in LangChain LLM with think/generate roles |
| CrewAI | AVPLLM (BaseLLM) | CrewAI LLM backend with latent communication |
| AutoGen | AVPChatCompletionClient (ChatCompletionClient) | AutoGen chat completion client with latent support |
All framework integrations support same-model latent and cross-model rosetta projection. They require the HuggingFace engine backend (pip install avp[langchain], pip install avp[crewai], pip install avp[autogen]).
4.8 Observability
Implementations SHOULD provide timing metrics for key operations. The SDK exposes metrics via collect_metrics=True on think() and generate():
- `ThinkMetrics`: identity extraction time, think duration, total duration
- `GenerateMetrics`: decode duration, generate duration, total duration
- `TransferDiagnostics`: debug diagnostics (NaN/Inf detection, norm trajectory, projection metrics, quality gate result) via `debug=True`
5. Binary Format
6. Compression
7. Transport Layer
AVP is transport-agnostic. The binary format and handshake protocol do not depend on any specific transport. Implementations can carry AVP messages over HTTP/2, gRPC, WebSockets, A2A DataParts, shared memory, or any other channel that supports binary payloads.
The reference HTTP/2 transport binding is documented in protocol/transport.md.
8. Security Considerations
9. Integration with Agent Protocols
AVP is a latent communication layer, not an orchestration protocol. It is designed to work alongside any agent protocol that handles discovery, delegation, and task management.
A2A
Integration with A2A:
- AVP capabilities advertised via URI-namespaced A2A extensions
- Binary payloads transmitted as `multipart/related` HTTP parts with `cid:` URIs
- Handshake data carried in A2A DataParts
MCP
Agents connected via MCP (tool/resource access) can use AVP for latent communication when both agents run the same model. MCP handles tool invocation; AVP handles the tensor transfer.
Other Protocols
Any orchestration layer that can pass binary payloads between agents can use AVP. The binary format and handshake are self-contained and do not depend on the transport or orchestration protocol.
10. Versioning
AVP follows semantic versioning (MAJOR.MINOR.PATCH).
Current version: 0.6.1
11. References
- LatentMAS: Latent Collaboration in Multi-Agent Systems -- research foundation for same-model latent communication and realignment
- AVP Benchmark Results -- 7 benchmarks, 5 models, 2 families, same-model + cross-model (14-78% token savings, 2-4x faster, +8.6pp on code generation)
12. Authors
VectorArc Team
13. License
Apache 2.0 -- See LICENSE file