Architecture
May 31, 2026
This is a small pnpm monorepo with two apps. apps/api is the Fastify control
plane and the graph executor. apps/inspector is a Next.js UI that reads run
state and draws the agent graph. State lives in Postgres through Drizzle ORM,
and runs are distributed through a Redis BullMQ queue.
graph TD
    C[Client / curl] --> API[Fastify API]
    API --> Q[Redis BullMQ queue]
    Q --> W[Run worker]
    W --> EX[Graph executor]
    EX --> AG[Agent dispatch]
    AG --> LLM[LLM adapter]
    AG --> R[Tool registry]
    R --> BI[built-in tools]
    R --> MCP[MCP tools]
    EX --> B[RunBudget]
    EX --> S[(Postgres state)]
    EX --> OT[OpenTelemetry spans]
    S --> I[Inspector UI]
    classDef ext fill:#a78bfa,stroke:#a78bfa,color:#fff
    class LLM,OT ext
Run lifecycle
- A client calls POST /runs, naming a registered graph and an input payload.
- The API writes a runs row and enqueues the run. With REDIS_URL set the job goes to BullMQ; without it the run executes in-process.
- The executor walks the graph from its entry nodes. For each node it writes an enter trace event, dispatches to the agent kind, then writes an exit event and a checkpoint.
- Every LLM and tool call is charged against the run's RunBudget. A breach aborts the run through an AbortSignal and sets status budget_exceeded.
- When the walk completes the run lands in completed with its output. Failures land in failed with the error recorded on the run and in the trace.
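The terminal states in the steps above can be sketched as a small mapping from the outcome of a graph walk to a run status. This is only an illustration: settleRun and RunResult are hypothetical names, the queued and running statuses are assumed labels for the earlier phases, and the budget breach is detected by error name purely to keep the sketch self-contained.

```typescript
// Statuses named in the text: completed, failed, budget_exceeded.
// "queued" and "running" are assumed names for the earlier phases.
type RunStatus = "queued" | "running" | "completed" | "failed" | "budget_exceeded";

// Hypothetical result shape; the real runs row carries more columns.
interface RunResult {
  status: RunStatus;
  output?: unknown;
  error?: string;
}

// Maps the outcome of a graph walk onto a terminal status.
async function settleRun(walk: () => Promise<unknown>): Promise<RunResult> {
  try {
    const output = await walk();
    return { status: "completed", output };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    // Assumption: a budget breach is recognisable by its error name.
    if (err instanceof Error && err.name === "BudgetExceededError") {
      return { status: "budget_exceeded", error: message };
    }
    return { status: "failed", error: message };
  }
}
```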
Components
Graph DSL (src/graph/definition.ts)
A graph is nodes plus edges plus a budget, built with a small fluent API. Nodes
carry an agent kind, an optional LLM key, optional tool names, and optional
concurrency. Edges may carry a when predicate for conditional routing. The
toShape() method serialises the topology for the inspector.
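A minimal sketch of what such a fluent builder might look like follows. Only toShape(), the when predicate, and the three agent kinds come from this document; GraphBuilder, its method names, the field names, and the shape of the serialised output are assumptions made for illustration.

```typescript
type AgentKind = "pipeline" | "supervisor" | "swarm";
type Ctx = Record<string, unknown>; // assumed context type
type Predicate = (ctx: Ctx) => boolean;

interface NodeSpec {
  kind: AgentKind;
  llm?: string;         // optional LLM key
  tools?: string[];     // optional tool names
  concurrency?: number; // optional concurrency
}

interface Budget { tokens: number; toolCalls: number; wallSec: number; }

interface GraphShape {
  nodes: { id: string; kind: AgentKind }[];
  edges: { from: string; to: string; conditional: boolean }[];
  budget: Budget;
}

// Hypothetical builder; the real API in definition.ts may differ.
class GraphBuilder {
  private nodes: { id: string; spec: NodeSpec }[] = [];
  private edges: { from: string; to: string; when?: Predicate }[] = [];
  private limits: Budget = { tokens: 0, toolCalls: 0, wallSec: 0 };

  node(id: string, spec: NodeSpec): this {
    this.nodes.push({ id, spec });
    return this;
  }

  edge(from: string, to: string, when?: Predicate): this {
    this.edges.push({ from, to, when });
    return this;
  }

  budget(limits: Budget): this {
    this.limits = limits;
    return this;
  }

  // Predicates are functions and cannot be serialised, so each is reduced to a flag.
  toShape(): GraphShape {
    return {
      nodes: this.nodes.map((n) => ({ id: n.id, kind: n.spec.kind })),
      edges: this.edges.map((e) => ({ from: e.from, to: e.to, conditional: e.when !== undefined })),
      budget: this.limits,
    };
  }
}

const graph = new GraphBuilder()
  .node("plan", { kind: "supervisor", llm: "default" })
  .node("work", { kind: "pipeline", tools: ["search"] })
  .edge("plan", "work", (ctx) => ctx["decision"] === "go")
  .budget({ tokens: 10000, toolCalls: 20, wallSec: 60 });
```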
Executor (src/graph/executor.ts)
Walks the graph in dependency order, enqueuing a node only once all of its
upstream nodes have run and at least one incoming edge predicate is satisfied. It
threads one RunBudget through every node, checkpoints the context after each
node, and wraps the run and each node in OpenTelemetry spans. On a replay run it
rehydrates context from the latest checkpoint at or before the requested step and
resumes from the cut.
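The readiness rule can be sketched as a loop that runs a node once every upstream node has run and at least one incoming edge is satisfied. This is a simplified, synchronous illustration rather than the executor itself: walkOrder is a hypothetical name, edges without a when predicate are assumed to count as satisfied, nodes without incoming edges are treated as entry nodes, and budgets, checkpoints, and spans are left out.

```typescript
type Ctx = Record<string, unknown>; // assumed context type

interface Edge {
  from: string;
  to: string;
  when?: (ctx: Ctx) => boolean;
}

// Returns the order in which nodes ran. Nodes whose predicates never hold are skipped.
function walkOrder(
  nodes: string[],
  edges: Edge[],
  ctx: Ctx,
  run: (id: string, ctx: Ctx) => void,
): string[] {
  const done = new Set<string>();
  const order: string[] = [];
  let progressed = true;

  while (progressed) {
    progressed = false;
    for (const id of nodes) {
      if (done.has(id)) continue;
      const incoming = edges.filter((e) => e.to === id);

      // Every upstream node must have run first.
      if (!incoming.every((e) => done.has(e.from))) continue;

      // Entry nodes have no incoming edges; others need one satisfied edge.
      const routed =
        incoming.length === 0 ||
        incoming.some((e) => e.when === undefined || e.when(ctx));
      if (!routed) continue;

      run(id, ctx);
      done.add(id);
      order.push(id);
      progressed = true;
    }
  }
  return order;
}
```

The loop stops when a full pass makes no progress, so a node whose incoming predicates never hold is simply left out of the returned order.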
Agents (src/graph/agents.ts)
pipeline runs one LLM pass (when an LLM is set), then invokes each declared tool
in order. supervisor runs one LLM pass whose decision is exposed on the context
so conditional edges can branch on it. swarm runs parallel LLM passes, as many
as the node's concurrency setting, and merges the results.
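As an illustration of the swarm kind, the sketch below fans out a fixed number of LLM passes in parallel and collects their outputs. The LlmCall signature and runSwarm are hypothetical, and the merge strategy (collecting outputs into an array in pass order) is an assumption, since this document does not say how results are merged.

```typescript
// Hypothetical LLM adapter signature: one prompt in, one completion out.
type LlmCall = (prompt: string, passIndex: number) => Promise<string>;

// Starts `concurrency` passes at once and waits for all of them.
// Promise.all preserves input order, so results line up with pass indices.
async function runSwarm(call: LlmCall, prompt: string, concurrency: number): Promise<string[]> {
  const passes: Promise<string>[] = [];
  for (let i = 0; i < concurrency; i++) {
    passes.push(call(prompt, i));
  }
  return Promise.all(passes);
}
```

Promise.all rejects as soon as any pass rejects, which suits hard budget limits: a breach in one pass fails the whole node.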
Budgets (src/budgets)
RunBudget bundles the token, tool-call, and wall-clock limits and owns the
AbortController. TokenBudget and ToolBudget are the underlying counters;
ToolBudget also enforces per-tool caps. A breach throws BudgetExceededError,
which the executor maps to the budget_exceeded status.
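A minimal sketch of how a budget can own an AbortController and throw on a breach is shown below. It collapses TokenBudget and ToolBudget into two plain counters, leaves out per-tool caps and the wall-clock limit, and uses hypothetical method names (chargeTokens, chargeToolCall); only RunBudget, BudgetExceededError, and the AbortController ownership come from this document.

```typescript
type Limit = "tokens" | "toolCalls";

class BudgetExceededError extends Error {
  readonly limit: Limit;

  constructor(limit: Limit) {
    super(`${limit} budget exceeded`);
    this.name = "BudgetExceededError";
    this.limit = limit;
    // Keeps instanceof working when compiled to older targets.
    Object.setPrototypeOf(this, BudgetExceededError.prototype);
  }
}

class RunBudget {
  private readonly controller = new AbortController();
  private readonly maxTokens: number;
  private readonly maxToolCalls: number;
  private tokensUsed = 0;
  private toolCallsUsed = 0;

  constructor(maxTokens: number, maxToolCalls: number) {
    this.maxTokens = maxTokens;
    this.maxToolCalls = maxToolCalls;
  }

  // In-flight work can watch this signal and stop once the budget is breached.
  get signal(): AbortSignal {
    return this.controller.signal;
  }

  chargeTokens(count: number): void {
    this.tokensUsed += count;
    if (this.tokensUsed > this.maxTokens) this.breach("tokens");
  }

  chargeToolCall(): void {
    this.toolCallsUsed += 1;
    if (this.toolCallsUsed > this.maxToolCalls) this.breach("toolCalls");
  }

  // Aborts first so concurrent work sees the signal, then throws to the caller.
  private breach(limit: Limit): never {
    this.controller.abort();
    throw new BudgetExceededError(limit);
  }
}
```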
Tools (src/tools)
Before a handler runs, the registry validates the arguments against a Zod schema and charges the call against the run budget, so a breach aborts before any side effect. MCP tools are registered through the same registry and share that budget.
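The sketch below illustrates the ordering guarantee: validation and the budget charge both happen before the handler. To stay self-contained it does not import Zod. Schema is a structural stand-in that a Zod schema satisfies through its parse() method, which throws on invalid input. ToolRegistry's method names, the charge callback, and the relative order of validation and charging are assumptions, and handlers are synchronous here for brevity.

```typescript
// Structural stand-in for a schema; Zod schemas expose a compatible parse().
interface Schema<T> {
  parse(input: unknown): T;
}

class ToolRegistry {
  private readonly tools = new Map<string, (raw: unknown) => unknown>();
  private readonly charge: () => void;

  // `charge` is expected to throw when the run budget is breached.
  constructor(charge: () => void) {
    this.charge = charge;
  }

  register<T>(name: string, schema: Schema<T>, handler: (args: T) => unknown): void {
    this.tools.set(name, (raw) => {
      const args = schema.parse(raw); // throws on invalid arguments
      this.charge();                  // throws on a budget breach
      return handler(args);           // side effects start only here
    });
  }

  call(name: string, rawArgs: unknown): unknown {
    const tool = this.tools.get(name);
    if (tool === undefined) throw new Error(`unknown tool: ${name}`);
    return tool(rawArgs);
  }
}

// Hand-written schema used in place of a Zod object schema.
const countArgs: Schema<{ n: number }> = {
  parse(input: unknown): { n: number } {
    if (typeof input === "object" && input !== null) {
      const n = (input as { n?: unknown }).n;
      if (typeof n === "number") return { n };
    }
    throw new Error("invalid arguments");
  },
};
```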
State (src/state)
StateStore is the contract. PostgresStore is the durable default; MemoryStore
backs tests and offline development. The store is selected from DATABASE_URL at
start-up.
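A sketch of the contract and the start-up selection might look like the following. The method names on StateStore, the Checkpoint shape, and selectStore are hypothetical, the Postgres implementation is passed in as a factory so the example needs no database driver, and the methods are synchronous here although a durable store would be asynchronous. The lookup mirrors the replay rule of taking the latest checkpoint at or before a requested step.

```typescript
interface Checkpoint {
  step: number;
  state: unknown;
}

// Hypothetical slice of the store contract, limited to checkpoints.
interface StateStore {
  saveCheckpoint(runId: string, checkpoint: Checkpoint): void;
  latestCheckpoint(runId: string, atOrBefore: number): Checkpoint | undefined;
}

class MemoryStore implements StateStore {
  private readonly byRun = new Map<string, Checkpoint[]>();

  saveCheckpoint(runId: string, checkpoint: Checkpoint): void {
    const list = this.byRun.get(runId) ?? [];
    list.push(checkpoint);
    this.byRun.set(runId, list);
  }

  // Picks the highest step that does not exceed the requested one.
  latestCheckpoint(runId: string, atOrBefore: number): Checkpoint | undefined {
    let best: Checkpoint | undefined;
    for (const c of this.byRun.get(runId) ?? []) {
      if (c.step <= atOrBefore && (best === undefined || c.step > best.step)) {
        best = c;
      }
    }
    return best;
  }
}

// Chooses the durable store when a database URL is present, else the in-memory one.
function selectStore(
  databaseUrl: string | undefined,
  makePostgres: (url: string) => StateStore,
): StateStore {
  return databaseUrl ? makePostgres(databaseUrl) : new MemoryStore();
}
```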
Queue (src/queue)
enqueueRun adds a job to BullMQ with the retry policy applied, or runs in-process
when Redis is absent. startWorker processes jobs and retries failures with
exponential backoff.
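The two decisions described here, where a run goes and how long a retry waits, can be illustrated with a pair of small functions. Both are hypothetical helpers rather than the project's code: chooseTransport mirrors the Redis check, and backoffDelay shows a plain doubling schedule with an assumed base delay and cap. In practice BullMQ computes retry delays itself from the job's backoff options.

```typescript
type Transport = "bullmq" | "in-process";

// With a Redis URL the run is queued; without one it executes in-process.
function chooseTransport(redisUrl: string | undefined): Transport {
  return redisUrl ? "bullmq" : "in-process";
}

// Delay before retry number `attempt` (1-based): doubles from baseMs, capped at maxMs.
// The cap is an assumption added to keep long retry chains bounded.
function backoffDelay(attempt: number, baseMs: number, maxMs: number): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt - 1));
}
```

With a base of 1000 ms and a cap of 30000 ms, the first three retries wait 1000, 2000, and 4000 ms.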
Telemetry (src/telemetry)
startTelemetry boots the OpenTelemetry Node SDK and exports over OTLP/HTTP when
an endpoint is configured. tracer() returns the shared tracer used across the
executor and agents.
Database schema
erDiagram
    RUNS ||--o{ TRACES : has
    RUNS ||--o{ CHECKPOINTS : has
    RUNS {
        uuid id
        text graph
        text status
        jsonb input
        jsonb output
        timestamp started_at
        timestamp finished_at
        int budget_tokens
        int budget_tools
        int budget_wall_sec
        int tokens_used
        int tools_used
        uuid replay_of
        int replay_from_step
    }
    TRACES {
        uuid id
        uuid run_id
        int step
        text node
        text kind
        jsonb payload
        int duration_ms
    }
    CHECKPOINTS {
        uuid id
        uuid run_id
        int step
        jsonb state
        timestamp created_at
    }
Why these choices
Postgres plus Drizzle is the state store because it survives process crashes and
gives SQL access to inspect runs after the fact. Redis with BullMQ handles
queueing so workers can be restarted and scaled horizontally, and gives retry
policies for free. Trace events are written on every step because debugging an
agent failure hours later needs the full sequence of decisions. Budgets are hard
limits enforced through an AbortSignal because soft limits get ignored in
practice. OpenTelemetry is the tracing layer because it is the standard and lets
you ship spans to whatever backend you already run.