Engine & Framework Integration Guide
March 23, 2026 · View on GitHub
AVP works alongside your agent framework, not instead of it. Your framework handles routing, state, and agent lifecycle. AVP handles the LLM call.
Engines
HuggingFace – pip install avp[hf]
The reference implementation. Full latent pipeline with think/generate and cross-model rosetta.
from avp import HuggingFaceConnector
connector = HuggingFaceConnector.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
# Agent A thinks (builds KV-cache, no text output)
context = connector.think("Analyze this math problem: 24 * 17 + 3", steps=20)
# Agent B generates using Agent A's KV-cache
answer = connector.generate("Solve step by step: 24 * 17 + 3", context=context)
Cross-model:
researcher = HuggingFaceConnector.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
solver = HuggingFaceConnector.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
context = researcher.think("Analyze this problem", steps=20)
answer = solver.generate("Solve it", context=context, source=researcher, cross_model=True)
Calibration is automatic and one-time per model pair (~0.5–2s), cached to ~/.avp/maps/.
Ollama – pip install avp[ollama]
Uses Ollama's downloaded GGUF files. Auto-unloads the model from the Ollama server to free VRAM, then loads it via llama.cpp for latent communication.
from avp.connectors.ollama import OllamaConnector
connector = OllamaConnector.from_ollama("qwen2.5:7b")
context = connector.think("Analyze this problem", steps=10)
answer = connector.generate("Solve step by step", context=context)
Cross-model:
researcher = OllamaConnector.from_ollama("qwen2.5:7b")
solver = OllamaConnector.from_ollama("llama3.2:3b")
context = researcher.think("Analyze this", steps=10)
answer = solver.generate("Solve it", context=context, source=researcher, cross_model=True)
Any model you've pulled with ollama pull works. AVP resolves the model name to the GGUF blob on disk. No torch required.
llama.cpp – pip install avp[llamacpp]
Direct GGUF file loading. Runs on CPU or GPU. Uses llama.cpp's embeddings API for hidden state extraction. No forks or custom builds required.
from avp.connectors.llamacpp import LlamaCppConnector
connector = LlamaCppConnector.from_pretrained("Qwen2.5-7B-Instruct-Q4_K_M.gguf")
context = connector.think("Analyze this problem", steps=10)
answer = connector.generate("Solve step by step", context=context)
Cross-model:
researcher = LlamaCppConnector.from_pretrained("qwen2-7b.gguf")
solver = LlamaCppConnector.from_pretrained("llama3-3b.gguf")
context = researcher.think("Analyze this", steps=10)
answer = solver.generate("Solve it", context=context, source=researcher, cross_model=True)
No torch required. Projection math uses numpy only.
vLLM – pip install avp[vllm]
vLLM integration uses two engine plugins: a KV connector for multi-agent cache transfer and a model plugin for latent thinking steps during prefill. Supports Qwen2, Llama, Mistral, and Gemma architectures.
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig
ktc = KVTransferConfig(
kv_connector="AVPKVConnectorV1Dynamic",
kv_connector_module_path="avp.connectors.vllm_kv_connector",
kv_role="kv_both",
kv_connector_extra_config={
"avp_latent_steps": 20,
"avp_store_dir": "/tmp/avp_kv_store",
},
)
engine = LLM(
model="Qwen/Qwen2.5-7B-Instruct",
kv_transfer_config=ktc,
hf_overrides={"architectures": ["AVPLatentQwen2ForCausalLM"]},
)
params = SamplingParams(max_tokens=256, temperature=0.7)
outputs = engine.generate(["Solve step by step: 24 * 17 + 3"], params)
The KV connector auto-discovers cached state by prompt hash. Agent A's KV-cache is saved to the file store after generation; Agent B loads it automatically when it sees the same prompt prefix. For full configuration, cross-model rosetta, CLI usage, and architecture details, see vLLM Integration Guide.
Frameworks
AVP works as a sidecar. Your framework sees text in, text out. The KV-cache lives in a ContextStore on the GPU side. The framework's state carries only string reference keys.
┌─────────────────────────────────────────────────┐
│ Your Framework (LangGraph / CrewAI / any) │
│ │
│ Agent A node Agent B node │
│ │ │ │
│ │ "Research X" │ "Solve X" │
│ ▼ ▼ │
│ ┌──────────────────────────────────────┐ │
│ │ avp.generate() │ │
│ │ ContextStore (GPU-side, in-memory) │ │
│ │ KV-cache lives here, not in state │ │
│ └──────────────────────────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ text result text result │
│ (framework stores this) (framework stores) │
└─────────────────────────────────────────────────┘
LangChain – pip install avp[langchain]
ChatAVP is a LangChain BaseChatModel that uses AVP latent thinking under the hood.
from avp.integrations.langchain import ChatAVP
import avp
store = avp.ContextStore(default_ttl=300)
# Researcher thinks, solver generates (linked via store key)
researcher = ChatAVP(model="Qwen/Qwen2.5-7B-Instruct", role="think",
store=store, store_key="task-1")
solver = ChatAVP(model="Qwen/Qwen2.5-7B-Instruct", role="generate",
store=store, store_key="task-1")
# In a LangGraph workflow:
researcher.invoke("Analyze this math problem: 24 * 17 + 3")
answer = solver.invoke("Solve step by step: 24 * 17 + 3")
CrewAI – pip install avp[crewai]
AVPLLM is a CrewAI BaseLLM that uses AVP latent thinking.
from avp.integrations.crewai import AVPLLM
from crewai import Agent, Task, Crew
import avp
store = avp.ContextStore(default_ttl=300)
researcher = Agent(
role="Researcher",
goal="Analyze math problems",
llm=AVPLLM(model="Qwen/Qwen2.5-7B-Instruct", role="think",
store=store, store_key="task-1"),
)
solver = Agent(
role="Solver",
goal="Solve math problems step by step",
llm=AVPLLM(model="Qwen/Qwen2.5-7B-Instruct", role="generate",
store=store, store_key="task-1"),
)
AutoGen – pip install avp[autogen]
AVPChatCompletionClient is an AutoGen ChatCompletionClient that uses AVP latent thinking.
from avp.integrations.autogen import AVPChatCompletionClient
from autogen_agentchat.agents import AssistantAgent
import avp
store = avp.ContextStore(default_ttl=300)
researcher = AssistantAgent(
"researcher",
model_client=AVPChatCompletionClient(
model="Qwen/Qwen2.5-7B-Instruct", role="think",
store=store, store_key="task-1",
),
)
solver = AssistantAgent(
"solver",
model_client=AVPChatCompletionClient(
model="Qwen/Qwen2.5-7B-Instruct", role="generate",
store=store, store_key="task-1",
),
)
LangGraph (easy API pattern)
If you don't need the framework-specific integrations, AVP's easy API works with any framework:
from langgraph.graph import StateGraph
from typing import TypedDict
import avp
MODEL = "Qwen/Qwen2.5-7B-Instruct"
store = avp.ContextStore(default_ttl=300)
class State(TypedDict):
query: str
research: str
answer: str
def researcher(state: State) -> dict:
text = avp.generate(
f"Research this problem step by step: {state['query']}",
model=MODEL, store=store, store_key="researcher",
)
return {"research": text}
def solver(state: State) -> dict:
text = avp.generate(
f"Using this research, solve: {state['query']}\n\nResearch: {state['research']}",
model=MODEL, store=store, prior_key="researcher",
)
return {"answer": text}
graph = StateGraph(State)
graph.add_node("researcher", researcher)
graph.add_node("solver", solver)
graph.add_edge("researcher", "solver")
graph.set_entry_point("researcher")
graph.set_finish_point("solver")
app = graph.compile()
result = app.invoke({"query": "What is 24 * 17 + 3?"})
ContextStore
ContextStore is a thread-safe, TTL-aware dictionary of AVPContext objects. It holds KV-cache tensors in GPU memory so they can be passed between agents without serialization.
store = avp.ContextStore(default_ttl=300) # 5 min TTL
# Store after thinking
ctx = avp.think("Research this", model=MODEL)
store.store("agent-a", ctx)
# Retrieve in another agent
ctx = store.get("agent-a") # None after TTL expires
# Housekeeping
store.active_count # number of live entries
store.cleanup_expired() # remove expired entries (also happens automatically)
When used with avp.generate(store=, store_key=, prior_key=), storing and retrieving happens automatically.
Requirements
- Self-hosted models with GPU access. AVP needs KV-cache internals that cloud APIs (OpenAI, Anthropic, Google) don't expose.
- Same machine or datacenter. KV-cache is 28–130 MB per transfer. This is for co-located agents, not cross-internet.
- Supported engines: HuggingFace Transformers, llama.cpp, Ollama, or vLLM.