carbon_data Quickstart
May 6, 2026 · View on GitHub
Agent_module.carbon_data is the Agent Data Infra: one .carbondata file =
one self-contained knowledge base, simultaneously serving RAG semantic
search / long-term memory / structured queries / knowledge-graph traversal.
Companion runnable demo:
examples/carbondata_quickstart.py
1. Install & dependencies
The library itself only needs the Python standard library plus numpy.
Optional dependencies are loaded on demand:
| Dependency | Purpose | Required? |
|---|---|---|
| Python ≥ 3.10 | required | |
numpy | vector ops | required |
hnswlib | fast ANN for 10k+ vectors | optional (falls back to brute force when missing) |
pip install numpy
pip install hnswlib # optional
A .carbondata file is just a SQLite database, so you can crack it open
with any SQLite tool (sqlite3 kb.carbondata, DBeaver, DataGrip) for ad-hoc
debugging.
2. 30-second tour
import numpy as np
from Agent_module.carbon_data import create, LambdaEmbedder
# Your embedding function — the library does not bundle a model.
def embed(texts):
return np.random.randn(len(texts), 384).astype(np.float32)
embedder = LambdaEmbedder(embed, model="my-model", dim=384)
# 1) Create or open a knowledge base
store = create("kb.carbondata")
# 2) Feed it data
store.ingest_text(
"This is paragraph one.\n\nParagraph two.\n\nAnd paragraph three.",
id="doc-1",
embedder=embedder,
)
# 3) Semantic search
hits = store.search("paragraph one", embedder=embedder, top_k=3)
for h in hits:
print(h.score, h.chunk.content)
store.close()
3. The four core scenarios
3.1 RAG semantic search
# Vector retrieval (default)
hits = store.search("user login flow", embedder=embedder, top_k=5)
# BM25 keyword
hits = store.search("login", mode="keyword", top_k=5)
# Vector + keyword fused via Reciprocal Rank Fusion
hits = store.search("user login", mode="hybrid", embedder=embedder, top_k=5)
Every SearchHit carries chunk (the matched fragment), entity (its
parent), and score.
3.2 Long-term memory
# Write
store.remember(
"user prefers Vim for editing code",
session_id="sess-001",
actor="CodingAgent",
salience=0.8, # importance, 0..1
ttl=86400, # expires after 24h
embedder=embedder,
)
# Semantic recall (expired memories are excluded by default)
memories = store.recall(
"editor preferences",
session_id="sess-001",
min_salience=0.5,
embedder=embedder,
top_k=3,
)
for m in memories:
print(m.score, m.memory.content, m.memory.actor)
# Cleanup
store.forget(session_id="sess-001") # by session
store.forget_expired() # GC expired memories
3.3 Structured queries
# Ingest a table (list of dicts or a CSV path)
store.ingest_table(
[{"id": "u1", "name": "Alice", "team": "infra"},
{"id": "u2", "name": "Bob", "team": "ml"}],
table_name="users",
id_column="id",
embedder=embedder,
)
# Structured filter — note the JSON path goes through `columns.`
# because ingest_table stores rows as metadata = {table, columns}.
rows = store.query_entities(
kind="table_row",
where={"metadata.columns.team": "ml"},
)
# Metadata filters with operators
recent = store.query_entities(
kind="document",
where={"created_at": (">=", 1700000000.0)},
order_by="updated_at DESC",
limit=20,
)
3.4 Knowledge graph
# Add edges between existing entities
store.add_relation("doc:paper", "doc:guide", "references", weight=0.9)
store.add_relation("doc:paper", "doc:guide", "supersedes") # multiple kinds per pair
# 1-hop neighbors
ns = store.neighbors("doc:paper", direction="out", kind="references")
# Multi-hop traversal (recursive CTE, cycle-safe, bounded)
hits = store.traverse(
"doc:paper",
kind="references",
direction="out",
max_hops=3,
)
for h in hits:
print(h.entity.id, "at hop", h.hop)
# Subgraph extraction
sg = store.subgraph(seeds=["doc:paper"], max_hops=2, direction="both")
print(len(sg.entities), len(sg.relations))
4. Going further
4.1 Custom chunkers
from Agent_module.carbon_data import by_tokens, by_paragraph, by_sentence
# Sliding token window (with overlap)
chunker = by_tokens(max_tokens=256, overlap=32)
# By paragraph (default)
chunker = by_paragraph(min_chars=20)
# By sentence
chunker = by_sentence()
store.ingest_text(text, chunker=chunker, embedder=embedder)
You can roll your own: Chunker = Callable[[str], list[str]].
4.2 HNSW dispatch
When hnswlib is installed, the library automatically uses HNSW
acceleration whenever:
mode="vector"(including the vector leg ofmode="hybrid")metric="cosine"- no JSON
filters=are present - `namespace=$ \text{is} \text{applied} \text{as} \text{a} \text{post}-\text{filter} (\text{with} 3 \times \text{over}-\text{sampling})
\text{Explicit} \text{override}: $``python store.search(q, embedder=emb, use_hnsw="auto") # default store.search(q, embedder=emb, use_hnsw="off") # force brute force store.search(q, embedder=emb, use_hnsw="on") # force HNSW (raises if not eligible)
On the first search, HNSW is built from the entire `embedding` table and
persisted to a `vector_index` blob; later add/delete calls are detected via
a row-count comparison and trigger a lazy rebuild.
### 4.3 Namespace isolation
```python
store.put_entity(id="x", kind="document", namespace="tenant-a")
hits = store.search(q, embedder=emb, namespace="tenant-a")
Useful when multiple tenants or datasets share one file.
4.4 Transactions
with store.transaction():
store.put_entity(id="a", kind="document")
store.add_chunks("a", [...])
store.add_relation("a", "b", "ref")
# any exception inside the block → full rollback
ingest_text / ingest_table / remember are already wrapped in their own
transactions internally.
4.5 Admin & ops
# Consistency check (read-only)
report = store.validate()
if not report.ok:
for issue in report.issues:
print(issue)
# VACUUM + FTS rebuild + clear vector_index cache
sizes = store.compact()
print("freed:", sizes["size_before"] - sizes["size_after"], "bytes")
# One-file JSON backup (embeddings encoded as base64)
store.export("backup.json")
store.export("backup_no_vec.json", include_embeddings=False)
# Snapshot of current state
print(store.stats())
5. Debugging tips
- It really is SQLite.
sqlite3 kb.carbondatadrops you straight into the REPL where you can run raw SQL against any table. - Schema lives entirely in
carbon_data/schema.py, ~150 lines. - Test fixtures are good textbook material:
tests/carbon_data/test_m{1..9}_*.py— each file covers one milestone end-to-end. - Run the suite:
cd Agent_module .venv/bin/pytest tests/carbon_data/ -q
6. Next steps
Run the demo to see it all in action:
cd Agent_module
.venv/bin/python examples/carbondata_quickstart.py
For the full API surface, read the class CarbonStore docstring in
carbon_data/store.py.