Cuba-Memorys

July 15, 2026

Long-term memory for AI coding agents. An MCP server that gives your agent a knowledge graph it can search, reason over, and be corrected by — so it stops forgetting your codebase between sessions.

Written in Rust. Backed by PostgreSQL + pgvector. 28 MCP tools (29 with CUBA_DOCS=1), 16 CLI commands, and every number below is measured on a benchmark that, as of v0.12, actually measures what it claims to. (The previous one did not. See Measured.)

cuba-memorys terminal demo — hybrid search, claim verification with an LLM judge, procedural memory, and the CLI

Install

pip install cuba-memorys        # or: npm install -g cuba-memorys
claude mcp add cuba-memorys -- cuba-memorys

That is the whole setup. On first run it provisions a PostgreSQL 18 + pgvector container via Docker and initializes the schema. Docker must be running.

Cursor / Windsurf / VS Code / Zed

{
  "mcpServers": {
    "cuba-memorys": {
      "command": "cuba-memorys"
    }
  }
}

No DATABASE_URL needed. Alternatively, run cuba-memorys setup, which writes the config for every client it finds. cuba-memorys setup check then audits those configs for disagreement, which is the failure that actually bites (two configs, two embedding dimensions, one silently broken search).

Bring your own PostgreSQL

{
  "mcpServers": {
    "cuba-memorys": {
      "command": "cuba-memorys",
      "env": { "DATABASE_URL": "postgresql://user:pass@localhost:5432/brain" }
    }
  }
}

Needs the vector and pg_trgm extensions. cuba-memorys doctor will tell you if anything is missing.

Semantic embeddings & models (recommended)

Without a model, embeddings are hash-based: deterministic, and semantically meaningless. Search still works through the lexical and BM25 branches, but nothing understands meaning.

One command installs the models and the ONNX runtime, on any OS — no shell scripts, no manual ORT_DYLIB_PATH:

cuba-memorys models all          # embeddings + NLI + reranker + runtime
cuba-memorys models embed        # just the embeddings model (~113 MB)
cuba-memorys models all --gpu    # GPU runtime, if you have one
cuba-memorys doctor              # confirms what loaded

Everything lands in ~/.cache/cuba-memorys/ and is found automatically. models downloads only when you run it — nothing is fetched behind your back.

bge-m3 (1024-d) is better than e5-small for Spanish, though the size of the gap is no longer claimed (the old +21 nDCG figure came from a broken benchmark). Switching needs a dimension migration (scripts/migrate-embedding-dim.sh 1024) and CUBA_EMBED_MODEL=bge-m3 CUBA_EMBEDDING_DIM=1024 CUBA_POOLING=cls.

Modes: local · red · completo

CUBA_MODE is a preset that sets the database, the models, and outbound network together, so you pick one name instead of lining up a dozen env vars:

`CUBA_MODE`	Database	Capabilities	Network out
`local` (default)	Docker on this machine	embeddings + NLI as installed	none
`red`	shared managed Postgres (set `DATABASE_URL` with `sslmode=require`)	+ provenance per node, real-time sync between machines	none
`completo`	whatever `DATABASE_URL` implies	+ reranker (GPU if present) + `cuba_docs`	`cuba_docs`

Two machines, one memory. Point both at the same managed Postgres (Neon or Supabase free tier both have pgvector and fit the 36 MB corpus many times over), give each a name with CUBA_NODE_NAME, and CUBA_MODE=red. What one writes, the other reads; every memory records which machine it came from (origin_node). Do not expose your own Postgres port to the internet — use a managed provider's TLS, or a private network like Tailscale.

Real isolation when you share. A shared database is where row-level security stops being decorative. Run cuba-memorys secure once (as the admin role) to create a non-superuser cuba_app with RLS and append-only audit actually enforced, then point the runtime at it with CUBA_SKIP_MIGRATIONS=1. cuba-memorys doctor reports whether the runtime role is a superuser (which bypasses all of it) or not.

Maximum capability. CUBA_MODE=completo turns on the cross-encoder reranker (+92% nDCG) and cuba_docs. On a GPU the reranker is instant; on CPU faro time-boxes it and falls back to the RRF ranking (CUBA_RERANK_TIMEOUT_SECS, default 20 s), so a slow machine still answers. GPU binaries ship with CUDA (NVIDIA) and, on Windows, DirectML (any GPU) — cuba-memorys models runtime --gpu fetches the accelerated runtime.

Individual env vars (CUBA_DOCS, CUBA_RERANKER_PATH, …) always override the preset.

What it actually does

Most memory servers are a key-value store with an embedding bolted on. This one models four kinds of memory, because the psychology literature says they are four different things and they decay differently:

Memory	What it holds	How it strengthens or decays
Semantic	Facts about entities — "all endpoints are async"	Access (Hebbian/BCM, Oja 1982)
Episodic	Events with actors and time — "we shipped v2 on Tuesday"	Power-law decay (Tulving 1972, Wixted 2004)
Procedural	How things are done here — recipes with a track record	Success, not access (ACT-R)
Working	Scratch notes bound to the current session	Cleared with the session

Procedural memory is a separate table rather than a ninth observation type for a specific reason: ACT-R separates declarative memory (reinforced by access) from procedural memory (reinforced by success). Stored as an observation, a recipe consulted constantly because it keeps failing would climb in importance. Recipes are instead ranked by the Wilson lower bound of their success rate (at 95% confidence), so 1/1 successes scores 0.21 and 47/50 scores 0.84: a lucky first try does not outrank a track record.
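The two scores quoted above can be reproduced with the standard Wilson score interval. This is a minimal sketch, not the server's code: the function name is made up for illustration, and z = 1.96 (a 95% interval) is an assumption that happens to match the quoted figures.

```python
import math

def wilson_lower_bound(successes: int, trials: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a success rate.

    z = 1.96 is an assumption (95% interval); it reproduces 0.21 and 0.84.
    """
    if trials == 0:
        return 0.0  # no track record yet, so rank at the bottom
    p = successes / trials
    z2 = z * z
    centre = p + z2 / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials))
    return (centre - margin) / (1 + z2 / trials)

lucky = wilson_lower_bound(1, 1)     # one attempt, one success
proven = wilson_lower_bound(47, 50)  # a real track record
print(f"{lucky:.2f} {proven:.2f}")   # prints "0.21 0.84"
```

A plain success ratio would rank the 1/1 recipe (1.00) above the 47/50 recipe (0.94); the lower bound reverses that order.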

Retrieval

Hybrid RRF fusion (k=60, Cormack 2009) over three signals — full-text, BM25 (ts_rank_cd), and pgvector HNSW — with entropy-routed weighting that shifts from keyword-heavy to semantic as the query's Shannon entropy rises.
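As a rough illustration of the fusion step, here is weighted Reciprocal Rank Fusion with k=60. The RRF formula is the published one; the entropy routing is a hypothetical stand-in, since the text does not give the actual weighting function. The names rrf_fuse, query_entropy, routed_weights and the max_bits constant are all assumptions.

```python
import math
from collections import Counter, defaultdict

def rrf_fuse(rankings, weights=None, k=60):
    """Weighted RRF: score(d) = sum over branches of w_b / (k + rank_b(d))."""
    weights = weights or {name: 1.0 for name in rankings}
    scores = defaultdict(float)
    for branch, ranked_ids in rankings.items():
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += weights[branch] / (k + rank)
    # Highest fused score first; ties broken by id so the order is stable.
    return sorted(scores, key=lambda d: (-scores[d], d))

def query_entropy(query):
    """Shannon entropy (bits) of the query's token distribution."""
    tokens = query.lower().split()
    total = len(tokens)
    if total == 0:
        return 0.0
    counts = Counter(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def routed_weights(query, max_bits=3.0):
    """Hypothetical routing: higher entropy moves weight to the vector branch."""
    semantic = min(query_entropy(query) / max_bits, 1.0)
    lexical = 1.0 - 0.5 * semantic
    return {"fulltext": lexical, "bm25": lexical, "vector": 0.5 + semantic}

rankings = {
    "fulltext": ["a", "b", "c"],
    "bm25": ["b", "a", "d"],
    "vector": ["d", "b", "e"],
}
fused = rrf_fuse(rankings)
print(fused)  # prints "['b', 'a', 'd', 'c', 'e']"
```

Document b wins because all three branches return it, even though only one ranks it first. That is the property RRF is chosen for: agreement across branches beats a single high rank.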

Answers arrive in compact format by default: abbreviated keys, content truncated at 1200 chars. That is 28% fewer tokens at identical nDCG (identical to four decimal places), because the response format cannot change which documents rank, only how they are printed. Pass "format": "verbose" for the full per-branch score breakdown.

Verification that actually verifies

cuba_faro mode=verify checks a claim against what is stored. It used to score claims by cosine similarity to the retrieved evidence, and that does not work — similarity measures what a text is about, not what it asserts. "cuba-memorys is written in Rust" and "…in Java" are nearly the same vector. Measured on the live corpus, the false claim scored 0.61 and the true one 0.59.

Entailment is a different question from similarity, and it needs something that reads. A local cross-encoder now judges each piece of evidence — supports / contradicts / unrelated — and confidence is derived from the verdicts, each weighted by that evidence's similarity. Same corpus, after:

Claim	Before (cosine)	Now
"written in Rust" (true)	0.59	0.995 · verified
"written in Java" (false)	0.61	0.00 · contradicted
"the best paella uses saffron" (unrelated)	0.45, with 10 "evidence" items	0.00 · unknown, no evidence

Being on-topic is not support, and unrelated counts for neither side.

The judge is mDeBERTa-v3-base-xnli running locally on ONNX: 100 languages, ~50 ms per verdict, no API key, no network, no cost. That matters here — about 75% of this corpus is Spanish, and the English-only NLI models everyone reaches for first would have silently failed on three memories out of four. Install it with cuba-memorys models nli; cuba-memorys doctor will tell you whether it loaded.

Without it, verification falls back to an LLM (your MCP client's own model via sampling, a local claude CLI, or the Anthropic API) — and with none of those, to an honest unknown rather than an invented verdict.

Two things it will not do. It will not confirm a claim on weak evidence: entailment must clear 0.80 while contradiction needs only 0.60, because confirming a false memory and doubting a true one are not errors of equal cost. And when it cannot tell, it says so instead of returning whichever number came out largest — an argmax over a 3-way head will happily publish supports for a claim that is flatly false, and did.
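The difference between the two decision rules fits in a few lines. This sketch uses the 0.80 and 0.60 thresholds from the text; the function names and the example probabilities are made up, and how the real judge combines verdicts across several pieces of evidence is not shown.

```python
ENTAIL_MIN = 0.80      # from the text: entailment must clear 0.80
CONTRADICT_MIN = 0.60  # from the text: contradiction needs only 0.60

def thresholded_verdict(p_entail, p_contradict, p_neutral):
    """Asymmetric rule: abstain unless one side clears its own bar."""
    if p_entail >= ENTAIL_MIN:
        return "supports"
    if p_contradict >= CONTRADICT_MIN:
        return "contradicts"
    return "unknown"

def argmax_verdict(p_entail, p_contradict, p_neutral):
    """Naive rule: publish whichever of the three heads is largest."""
    labels = {"supports": p_entail, "contradicts": p_contradict, "unrelated": p_neutral}
    return max(labels, key=labels.get)

probs = (0.45, 0.35, 0.20)  # illustrative numbers, not measured output
print(argmax_verdict(*probs), thresholded_verdict(*probs))  # prints "supports unknown"
```

Because the two thresholds sum to more than 1, both conditions can never hold at once, so the order of the checks does not matter.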

Calibrated abstention

The out-of-distribution gate rejects queries the corpus cannot answer. The threshold is not a magic constant: Ledoit-Wolf covariance shrinkage plus a conformal quantile, calibrated against your own corpus with cuba-memorys calibrate --apply and persisted. (The theoretical χ² threshold rejected 100% of answerable queries. Distribution-free calibration is not a nicety here.)
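The conformal part of that calibration is short enough to sketch. This assumes each calibration query already has an out-of-distribution score (for instance a Mahalanobis distance computed with a shrunk covariance, which is not shown here) and uses the standard split-conformal quantile. The function names and the alpha default are illustrative, not what cuba-memorys calibrate uses.

```python
import math

def conformal_threshold(calibration_scores, alpha=0.05):
    """Split-conformal quantile: the ceil((n+1)(1-alpha))-th smallest score.

    With exchangeable data, at most a fraction alpha of answerable queries
    score above this threshold.
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha: never abstain
    return sorted(calibration_scores)[rank - 1]

def should_abstain(score, threshold):
    """Reject a query whose score is more extreme than the calibrated cut."""
    return score > threshold

# Synthetic calibration scores 1.0 .. 99.0, standing in for real distances.
scores = [float(i) for i in range(1, 100)]
threshold = conformal_threshold(scores, alpha=0.05)
print(threshold, should_abstain(120.0, threshold), should_abstain(40.0, threshold))
# prints "95.0 True False"
```

Nothing here assumes a distribution for the scores, which is the point of the parenthetical above: the threshold comes from the corpus's own ranks rather than from a χ² table.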

And it tells you when it is broken

$ cuba-memorys doctor
[  ok  ] migrations           33 aplicadas, ninguna dirty
[  ok  ] embedding_dim        runtime 1024-d == columna vector(1024)
[  ok  ] runtime_role         'cuba_app' sin superuser — RLS y audit efectivos
[ warn ] binary_freshness     4 proceso(s) MCP corren un binario más viejo que el de disco

This exists because the failure mode of a hybrid search engine is not a crash — it is a vector branch dying and the search quietly becoming lexical, with no symptom. The server now refuses to start on an embedding-dimension mismatch, and search sets degraded: true in the response when a branch fails.
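A minimal sketch of the pattern, assuming each branch is a callable that may raise. Only the degraded field comes from the text; hybrid_search, failed_branches and the two toy branches are hypothetical.

```python
def hybrid_search(query, branches):
    """Run every retrieval branch and record failures instead of swallowing them."""
    results, failed = {}, []
    for name, run in branches.items():
        try:
            results[name] = run(query)
        except Exception as exc:
            # The silent version would just skip the branch here.
            failed.append({"branch": name, "error": str(exc)})
    return {"results": results, "degraded": bool(failed), "failed_branches": failed}

def lexical(query):
    return ["obs-12", "obs-7"]

def vector(query):
    raise RuntimeError("embedding dimension mismatch: 384 != 1024")

response = hybrid_search("async endpoints", {"lexical": lexical, "vector": vector})
print(response["degraded"], list(response["results"]))  # prints "True ['lexical']"
```

The search still answers from the lexical branch, but the caller can see that it did so without the vector branch.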

The CLI: your memory without an LLM in the middle

Sixteen commands, counting models and secure from the sections above. cuba-memorys --help lists them all.


Command	What it does
`search <query>` · `save` · `delete` · `export`	Read and write the brain from a shell
`dashboard`	A self-contained HTML view of what is in there
`doctor`	Health check: schema, dimensions, config coherence, stale processes
`recall`	Session-start context injection — wire it with `setup hook`
`reembed`	Re-encode what needs it (default: only stale rows, not all of them)
`calibrate`	Recompute the abstention threshold from your corpus
`link`	Auto-link entities by NPMI co-occurrence
`dedupe`	Entities that are the same thing under different names — see below
`skills <dir>`	Export procedures as Claude Code Skills
`eval`	Retrieval benchmark — nDCG@10 with confidence intervals, MRR, recall, token cost
`setup`	Wire this into your MCP clients; `setup check` audits them

`dedupe` — because a different string is a different entity

cuba_alma create inserts with ON CONFLICT (name). So one project fragments into Mapupita-Web, Mapupitta-Web (typo), Mapupita Web, mapupita… and searching one finds none of the others. On a real 266-entity graph, 158 of them (59%) had not a single relation — for PageRank and multi-hop retrieval, they did not exist.

What decides a merge is not the embedding centroid. That was the obvious idea and it is wrong: M-Codes Reference Guide and G-Codes Reference Guide sit at 0.811 cosine between centroids. On a corpus about one domain, centroid similarity measures the domain, not the entity — a 0.80 threshold would have merged two different CNC guides, irreversibly.

So --apply merges only what is provable (identical after normalizing case and separators). Typos and near-matches are shown, and judged one at a time with --judge. The old name is written to brain_entity_aliases, so nothing is lost: looking it up still resolves.
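What "provable" means can be sketched as grouping names by a normalized key. The exact set of separators the tool strips is an assumption (spaces, hyphens, underscores and dots here), and normalize and provable_duplicates are illustrative names.

```python
import re
from collections import defaultdict

def normalize(name):
    """Lowercase and drop separators (assumed: spaces, hyphens, underscores, dots)."""
    return re.sub(r"[\s\-_.]+", "", name.lower())

def provable_duplicates(names):
    """Group names that become identical after normalization."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]

names = [
    "Mapupita-Web", "Mapupitta-Web", "Mapupita Web", "mapupita",
    "M-Codes Reference Guide", "G-Codes Reference Guide",
]
print(provable_duplicates(names))  # prints "[['Mapupita-Web', 'Mapupita Web']]"
```

The typo (Mapupitta-Web) and the short form (mapupita) are left for review with --judge, and the two CNC guides are never grouped, whatever their centroids say.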

The 28 tools

Named after Cuban culture. cuba-memorys advertises all of them, or set CUBA_TOOL_PROFILE=lean to advertise only cuba_tools + cuba_call — 67% smaller tool catalogue, zero functions lost, schemas loaded on demand.
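The lean profile is a discover-then-dispatch pattern. The sketch below shows the idea with a toy registry; the argument shapes, return values and handlers are hypothetical and are not the real signatures of cuba_tools or cuba_call.

```python
# Toy registry standing in for the full catalogue (contents are made up).
REGISTRY = {
    "cuba_faro": {
        "schema": {"type": "object", "properties": {"query": {"type": "string"}}},
        "handler": lambda args: {"hits": [], "query": args["query"]},
    },
    "cuba_receta": {
        "schema": {"type": "object", "properties": {"name": {"type": "string"}}},
        "handler": lambda args: {"recipe": args["name"]},
    },
}

def cuba_tools(name=None):
    """Discover: list tool names, or return one tool's schema on demand."""
    if name is None:
        return {"tools": sorted(REGISTRY)}
    return {"name": name, "schema": REGISTRY[name]["schema"]}

def cuba_call(name, arguments):
    """Invoke any registered tool through a single dispatch entry point."""
    if name not in REGISTRY:
        return {"error": f"unknown tool: {name}"}
    return REGISTRY[name]["handler"](arguments)

print(cuba_tools())                               # prints "{'tools': ['cuba_faro', 'cuba_receta']}"
print(cuba_call("cuba_faro", {"query": "rust"}))  # prints "{'hits': [], 'query': 'rust'}"
```

Only two tool definitions are advertised up front; a schema reaches the agent's context when it asks for that tool, which is where the smaller catalogue comes from.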

Knowledge graph — cuba_alma (entities) · cuba_cronica (observations, episodes, timeline) · cuba_puente (typed relations, traversal, link prediction) · cuba_ingesta (bulk import)

Search — cuba_faro (hybrid RRF, verification, MMR diversification, OOD abstention)

Error memory — cuba_alarma (report) · cuba_remedio (resolve) · cuba_expediente (search past errors; warns if an approach failed before)

Sessions & decisions — cuba_jornada (session lifecycle, diff) · cuba_decreto (architecture decisions) · cuba_proyecto (per-project isolation) · cuba_pre_compact (survive /compact)

Procedural — cuba_receta (recipes ranked by Wilson lower bound)

Cognition — cuba_reflexion (gap detection) · cuba_hipotesis (abductive inference) · cuba_contradiccion (semantic conflicts) · cuba_juez (LLM judge) · cuba_centinela (prospective triggers) · cuba_calibrar (Bayesian calibration, source credibility)

Maintenance — cuba_zafra (decay, prune, merge, PageRank, Leiden communities) · cuba_eco (RLHF feedback) · cuba_vigia (health, drift, centrality) · cuba_forget (GDPR erasure) · cuba_archivo (CFR-21 hash-chain audit log) · cuba_pizarra (working memory) · cuba_sync (git-friendly export/import)

Meta — cuba_tools (discover) · cuba_call (invoke)

Configuration

Variable	Default	What it does
`CUBA_MODE`	`local`	`local` / `red` (shared cloud DB) / `completo` (everything + GPU). A preset for the rest.
`CUBA_NODE_NAME`	hostname	Names this machine in `origin_node` — which computer wrote each memory
`DATABASE_URL`	auto (Docker)	PostgreSQL connection. Set it (external + TLS) for `red` mode.
`ONNX_MODEL_PATH` + `ORT_DYLIB_PATH`	auto (`~/.cache`)	Semantic embeddings. `cuba-memorys models` sets these up for you.
`CUBA_EMBED_MODEL` · `CUBA_EMBEDDING_DIM` · `CUBA_POOLING`	`multilingual-e5-small` · `384` · `mean`	Set to `bge-m3` · `1024` · `cls` for the stronger Spanish model
`CUBA_TOOL_PROFILE`	`full`	`lean` → 2 tools, 67% smaller catalogue, nothing lost
`CUBA_JUDGE`	`auto`	`nli` / `mcp_sampling` / `claude_cli` / `anthropic_api` / `heuristic`
`CUBA_NLI_PATH`	`~/.cache/cuba-memorys/models-nli`	Local entailment model (`cuba-memorys models nli`)
`CUBA_NLI_ESCALATE`	off	Send claims the NLI could not decide to an LLM. Buys recall, costs ~12 s each
`CUBA_RERANKER_PATH` · `CUBA_RERANK_TIMEOUT_SECS`	`~/.cache/…/reranker` · `20`	Cross-encoder reranker (+92% nDCG); on CPU it falls back to RRF past the budget
`CUBA_DOCS`	off	`1` enables `cuba_docs`, the only tool that leaves your machine. Unset, it is not even advertised.
`CUBA_COMPACT_CHARS`	`1200`	Compact truncation (measured knee)
`CUBA_OOD_THRESHOLD`	calibrated	Override the abstention threshold
`CUBA_BITEMPORAL`	on	Mirror observations into `brain_facts`

Measured — and the benchmark that was lying

Until v0.12 this section carried a line reading "every number here is measured rather than assumed", and every number in it was wrong. The benchmark was broken in three ways, and finding out cost two published conclusions.

It had ten queries. At n=10 the 95% interval is roughly ±0.12, and the smallest effect it could detect was ~0.25 nDCG. Any claim about a smaller difference was noise wearing a decimal point.

Relevance was judged by substring match. A result counted as correct if its text merely contained a marker word — so every observation mentioning "postgres" scored as a right answer to any question about postgres, whether it answered anything or not. That measures keyword presence, not retrieval, and it tilts the whole benchmark toward the lexical branch and against the vector one.
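A toy comparison of the two judging rules, with made-up observations and ids:

```python
def relevant_by_substring(result_text, marker):
    """The old rule: relevant if the text merely contains the marker word."""
    return marker.lower() in result_text.lower()

def relevant_by_id(result_id, relevant_ids):
    """The new rule: relevant only if the id is in the labelled answer set."""
    return result_id in relevant_ids

# Question: "what is the Postgres connection pool limit?" (illustrative data)
results = [
    ("obs-1", "Postgres connection pool is capped at 20"),
    ("obs-2", "We migrated from MySQL to postgres in March"),
    ("obs-3", "The HNSW index was rebuilt after the migration"),
]
relevant_ids = {"obs-1"}

by_substring = [rid for rid, text in results if relevant_by_substring(text, "postgres")]
by_id = [rid for rid, _ in results if relevant_by_id(rid, relevant_ids)]
print(by_substring, by_id)  # prints "['obs-1', 'obs-2'] ['obs-1']"
```

obs-2 mentions postgres and answers nothing, yet the substring rule counts it as a hit. Any branch that matches keywords gets that credit for free.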

nDCG normalized against what was retrieved, not what exists. With 5 relevant documents in the corpus and 2 found, the "ideal" ranking was taken to be those 2, so a system that missed 60% of the answer scored a perfect 1.0. (And R@10 = 3.125 shipped in this file. Recall is a proportion; it cannot exceed 1.)
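The 5-relevant, 2-found case can be worked through directly. This sketch uses binary relevance and the usual log2 discount; the ideal_from_retrieved flag exists only to reproduce the bug described above, and the document ids are made up.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with the standard log2(rank + 1) discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(retrieved_ids, relevant_ids, k=10, ideal_from_retrieved=False):
    gains = [1.0 if doc in relevant_ids else 0.0 for doc in retrieved_ids[:k]]
    if ideal_from_retrieved:
        ideal = sorted(gains, reverse=True)        # the bug: ideal built from what was found
    else:
        ideal = [1.0] * min(len(relevant_ids), k)  # correct: ideal built from what exists
    best = dcg(ideal)
    return dcg(gains) / best if best > 0 else 0.0

def recall_at_k(retrieved_ids, relevant_ids, k=10):
    """Fraction of the relevant set found in the top k; always within [0, 1]."""
    found = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return found / len(relevant_ids)

relevant = {"d1", "d2", "d3", "d4", "d5"}
retrieved = ["d1", "d2", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"]

broken = ndcg_at_k(retrieved, relevant, ideal_from_retrieved=True)
correct = ndcg_at_k(retrieved, relevant)
recall = recall_at_k(retrieved, relevant)
print(f"{broken:.2f} {correct:.2f} {recall:.2f}")  # prints "1.00 0.55 0.40"
```

Same ranking, same documents: the only change is the denominator, and the score drops from a perfect 1.00 to 0.55.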

The real number is not 0.894. On 221 id-scored queries it is nDCG@10 = 0.50 [95% CI 0.44–0.56]. The system did not get worse. It was never 0.894.

What that cost

~~"The cross-encoder reranker earns nothing"~~ — it had never run. Three bugs in series: faro wrapped the call in if let Ok(..) and dropped the error; it fed token_type_ids to a model that is XLM-RoBERTa and has none; it read f16 logits as f32. The output was "bit for bit identical" to no reranking not because reranking changed nothing, but because it never happened. Fixed; being measured properly now.
Associative retrieval does degrade — but the old evidence (−0.03 at n=10) could not have shown it. On the new dataset with a paired bootstrap (the correct test: same queries in both arms), the interval is [−0.051, −0.018] and never touches zero. It improves 0 queries and hurts 23. The decision was right; the reasoning was not. The power was never in more data — it was in using the right test.
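A paired bootstrap resamples per-query differences, so each query acts as its own control. The sketch below runs on synthetic scores (221 queries, 23 of them made worse) and is not expected to reproduce the interval quoted above; the function name, the resample count and the seed are arbitrary choices.

```python
import random

def paired_bootstrap_ci(baseline, variant, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap CI for the mean per-query difference (variant - baseline)."""
    if len(baseline) != len(variant):
        raise ValueError("a paired test needs the same queries in both arms")
    diffs = [v - b for b, v in zip(baseline, variant)]
    rng = random.Random(seed)
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(diffs, k=n)  # resample queries with replacement
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = int((1 - confidence) / 2 * n_resamples)
    hi_idx = int((1 + confidence) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]

# Synthetic per-query nDCG: the variant hurts 23 queries and helps none.
baseline = [0.5] * 221
variant = [0.2 if i < 23 else 0.5 for i in range(221)]

improved = sum(v > b for b, v in zip(baseline, variant))
hurt = sum(v < b for b, v in zip(baseline, variant))
low, high = paired_bootstrap_ci(baseline, variant)
print(improved, hurt, high < 0)  # prints "0 23 True"
```

An unpaired comparison of the two means would also have to absorb the query-to-query variance, which is usually far larger than the effect. Pairing removes that variance before the interval is computed.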

What survives, re-measured honestly


Claim	Re-measured result
`compact` by default	−28% tokens at identical nDCG (paired difference: exactly 0.0000, because format cannot change which documents rank, only how they are shown). The old "−40%" came from the broken benchmark.
Conformal abstention	100% of out-of-distribution queries caught, 0% false abstentions.
`lean` tool profile	−67% catalogue, zero functions lost.
bge-m3 over e5-small	Direction almost certainly right; the +21.2 nDCG figure is withdrawn — it came from the broken benchmark and re-establishing it would mean re-embedding the corpus twice.
The benchmark itself	221 queries (was 10), relevance by document id, bootstrap confidence intervals, and the minimum detectable effect printed beside every result — so nobody reads a 3-point difference as a finding again.

Foundations

Algorithm	Reference
RRF fusion (k=60)	Cormack et al. (2009)
Hebbian + BCM metaplasticity	Oja (1982); Bienenstock, Cooper & Munro (1982)
Conformal prediction	Vovk (2005); Angelopoulos & Bates (2023)
Ledoit-Wolf covariance shrinkage	Ledoit & Wolf (2004)
Mahalanobis OOD detection	Lee et al. (NeurIPS 2018)
Wilson score interval	Wilson (1927)
Declarative vs procedural memory	Anderson & Lebiere (ACT-R)
Testing effect	Karpicke & Roediger (Science 2008)
Power-law forgetting	Wixted (2004)
Episodic vs semantic memory	Tulving (1972)
PageRank · Leiden · Brandes	Brin & Page (1998); Traag et al. (2019); Brandes (2001)
NPMI co-occurrence	Bouma (2009)
MMR diversification	Carbonell & Goldstein (1998)
Contextual Retrieval	Anthropic (2024)
Prompt-injection spotlighting	Hines et al. (2024)

Development

git clone https://github.com/LeandroPG19/cuba-memorys.git
cd cuba-memorys/rust && cargo build --release

./scripts/demo.sh          # runs on a throwaway Postgres it removes on exit
./scripts/merge-gate.sh    # fmt · clippy -D warnings · 223 tests · audit · integration

Publishing is tag-driven: v* triggers GitHub Release binaries (5 platforms), PyPI wheels, npm, and the MCP Registry. A test pins all four files that hold a version number to the same value, because they used to drift and nothing caught it.

License

AGPL-3.0 — free to use, modify and run, including inside a company. If you offer a modified version to others over a network, you have to publish your changes under the same license.

Author

Leandro Perez G. — @LeandroPG19