carbon_data Quickstart

May 6, 2026

Agent_module.carbon_data is the Agent Data Infra layer: one .carbondata file is one self-contained knowledge base that simultaneously serves RAG semantic search, long-term memory, structured queries, and knowledge-graph traversal.

Companion runnable demo: examples/carbondata_quickstart.py


1. Install & dependencies

The library itself only needs the Python standard library plus numpy. Optional dependencies are loaded on demand:

| Dependency | Purpose | Required? |
| --- | --- | --- |
| Python ≥ 3.10 | | required |
| numpy | vector ops | required |
| hnswlib | fast ANN for 10k+ vectors | optional (falls back to brute force when missing) |

pip install numpy
pip install hnswlib   # optional

A .carbondata file is just a SQLite database, so you can crack it open with any SQLite tool (sqlite3 kb.carbondata, DBeaver, DataGrip) for ad-hoc debugging.
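
Because the file is plain SQLite, the same inspection can be scripted with Python's built-in sqlite3 module. The sketch below lists the tables in a file; the helper name list_tables is hypothetical and not part of carbon_data, and the file is opened read-only so that poking around cannot modify the knowledge base.

```python
import sqlite3
from pathlib import Path


def list_tables(path: str) -> list[str]:
    """Return the names of all tables in the SQLite file at `path`.

    Hypothetical helper for ad-hoc debugging; not part of carbon_data.
    """
    # Build a file: URI and request read-only mode so nothing gets written.
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        return [name for (name,) in rows]
    finally:
        conn.close()
```

Calling list_tables("kb.carbondata") on an existing knowledge base returns whatever tables the library created in it; the exact names depend on the schema version.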


2. 30-second tour

import numpy as np
from Agent_module.carbon_data import create, LambdaEmbedder

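# Your embedding function: the library does not bundle a model.
# Placeholder only: random vectors make similarity scores meaningless,
# so swap in a real embedding model before judging search quality.
def embed(texts):
    return np.random.randn(len(texts), 384).astype(np.float32)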

embedder = LambdaEmbedder(embed, model="my-model", dim=384)

# 1) Create or open a knowledge base
store = create("kb.carbondata")

# 2) Feed it data
store.ingest_text(
    "This is paragraph one.\n\nParagraph two.\n\nAnd paragraph three.",
    id="doc-1",
    embedder=embedder,
)

# 3) Semantic search
hits = store.search("paragraph one", embedder=embedder, top_k=3)
for h in hits:
    print(h.score, h.chunk.content)

store.close()

3. The four core scenarios

3.1 Semantic search

# Vector retrieval (default)
hits = store.search("user login flow", embedder=embedder, top_k=5)

# BM25 keyword
hits = store.search("login", mode="keyword", top_k=5)

# Vector + keyword fused via Reciprocal Rank Fusion
hits = store.search("user login", mode="hybrid", embedder=embedder, top_k=5)

Every SearchHit carries chunk (the matched fragment), entity (its parent), and score.
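
For readers unfamiliar with Reciprocal Rank Fusion, here is a minimal standalone sketch of the technique: each result list contributes 1 / (k + rank) to an item's score, and the summed scores decide the fused order. The function name rrf_fuse and the constant k=60 (the value commonly used in the RRF literature) are assumptions for illustration; the constant and tie-breaking that carbon_data uses internally may differ.

```python
from collections import defaultdict


def rrf_fuse(
    rankings: list[list[str]], k: int = 60, top_k: int = 5
) -> list[tuple[str, float]]:
    """Fuse several ranked ID lists with Reciprocal Rank Fusion.

    Illustrative only; not the library's implementation.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the best hit in each list contributes 1 / (k + 1).
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    fused = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return fused[:top_k]


vector_hits = ["c1", "c2", "c3"]   # hypothetical chunk IDs, best first
keyword_hits = ["c2", "c4", "c1"]
fused = rrf_fuse([vector_hits, keyword_hits])
print([item_id for item_id, _ in fused])  # ['c2', 'c1', 'c4', 'c3']
```

Items that appear in both lists (c1 and c2 here) rise above items found by only one retriever, which is the point of fusing rank positions rather than raw scores.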

3.2 Long-term memory

# Write
store.remember(
    "user prefers Vim for editing code",
    session_id="sess-001",
    actor="CodingAgent",
    salience=0.8,           # importance, 0..1
    ttl=86400,              # expires after 24h
    embedder=embedder,
)

# Semantic recall (expired memories are excluded by default)
memories = store.recall(
    "editor preferences",
    session_id="sess-001",
    min_salience=0.5,
    embedder=embedder,
    top_k=3,
)
for m in memories:
    print(m.score, m.memory.content, m.memory.actor)

# Cleanup
store.forget(session_id="sess-001")    # by session
store.forget_expired()                  # GC expired memories
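
The interplay of ttl and min_salience can be pictured with a small standalone sketch. The MemoryRecord class and the helper names below are hypothetical stand-ins, and the expiry rule (created_at + ttl has passed) is an assumption about how a TTL is typically evaluated, not a statement about the library's internals.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class MemoryRecord:  # hypothetical stand-in, not the library's class
    content: str
    salience: float              # importance, 0..1
    created_at: float            # Unix timestamp in seconds
    ttl: Optional[float] = None  # lifetime in seconds; None means no expiry


def is_expired(memory: MemoryRecord, now: float) -> bool:
    """Assumed rule: a memory expires once created_at + ttl has passed."""
    return memory.ttl is not None and now >= memory.created_at + memory.ttl


def eligible(
    memories: list[MemoryRecord],
    min_salience: float,
    now: Optional[float] = None,
) -> list[MemoryRecord]:
    """Keep memories that are unexpired and at least min_salience important."""
    now = time.time() if now is None else now
    return [
        m for m in memories
        if not is_expired(m, now) and m.salience >= min_salience
    ]
```

Under these assumptions, a memory written with ttl=86400 stops being recalled one day after it was written, even if forget_expired() has not yet removed it from the file.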

3.3 Structured queries

# Ingest a table (list of dicts or a CSV path)
store.ingest_table(
    [{"id": "u1", "name": "Alice", "team": "infra"},
     {"id": "u2", "name": "Bob",   "team": "ml"}],
    table_name="users",
    id_column="id",
    embedder=embedder,
)

# Structured filter — note the JSON path goes through `columns.`
# because ingest_table stores rows as metadata = {table, columns}.
rows = store.query_entities(
    kind="table_row",
    where={"metadata.columns.team": "ml"},
)

# Metadata filters with operators
recent = store.query_entities(
    kind="document",
    where={"created_at": (">=", 1700000000.0)},
    order_by="updated_at DESC",
    limit=20,
)
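
To make the where= semantics concrete, the following sketch evaluates the same kind of filter against a plain dict: a dotted path walks into nested metadata, a bare value means equality, and a (operator, value) tuple means a comparison. The helper names and the operator set are assumptions for illustration; only equality and ">=" appear in the examples above, and the library presumably compiles such filters to SQL rather than evaluating them in Python.

```python
import operator
from typing import Any

# Assumed operator set; the quickstart itself only demonstrates "==" and ">=".
_OPS = {
    "==": operator.eq, "!=": operator.ne,
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
}
_MISSING = object()


def get_path(record: dict, path: str) -> Any:
    """Resolve a dotted path such as 'metadata.columns.team'."""
    current: Any = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return _MISSING
        current = current[key]
    return current


def matches(record: dict, where: dict) -> bool:
    """True if the record satisfies every condition in `where`."""
    for path, cond in where.items():
        op, expected = cond if isinstance(cond, tuple) else ("==", cond)
        actual = get_path(record, path)
        if actual is _MISSING or not _OPS[op](actual, expected):
            return False
    return True


row = {
    "kind": "table_row",
    "created_at": 1700000500.0,
    "metadata": {"table": "users", "columns": {"name": "Bob", "team": "ml"}},
}
print(matches(row, {"metadata.columns.team": "ml"}))  # True
print(matches(row, {"metadata.team": "ml"}))          # False: path skips `columns.`
```

The second call shows why the `columns.` segment matters: the team value lives one level deeper than metadata itself.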

3.4 Knowledge graph

# Add edges between existing entities
store.add_relation("doc:paper", "doc:guide", "references", weight=0.9)
store.add_relation("doc:paper", "doc:guide", "supersedes")  # multiple kinds per pair

# 1-hop neighbors
ns = store.neighbors("doc:paper", direction="out", kind="references")

# Multi-hop traversal (recursive CTE, cycle-safe, bounded)
hits = store.traverse(
    "doc:paper",
    kind="references",
    direction="out",
    max_hops=3,
)
for h in hits:
    print(h.entity.id, "at hop", h.hop)

# Subgraph extraction
sg = store.subgraph(seeds=["doc:paper"], max_hops=2, direction="both")
print(len(sg.entities), len(sg.relations))
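
The phrase "recursive CTE, cycle-safe, bounded" can be illustrated with plain sqlite3. The sketch below assumes a hypothetical relation(src, dst, kind) table; the real table and column names live in carbon_data/schema.py and may differ. UNION (rather than UNION ALL) deduplicates rows, and the hop limit guarantees termination even when the graph contains a cycle.

```python
import sqlite3

# Hypothetical schema for illustration; not the library's actual tables.
TRAVERSE_SQL = """
WITH RECURSIVE walk(id, hop) AS (
    SELECT :start, 0
    UNION
    SELECT r.dst, w.hop + 1
    FROM walk AS w
    JOIN relation AS r ON r.src = w.id
    WHERE r.kind = :kind AND w.hop < :max_hops
)
SELECT id, MIN(hop) AS min_hop
FROM walk
WHERE id != :start
GROUP BY id
ORDER BY min_hop, id
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE relation (src TEXT, dst TEXT, kind TEXT)")
conn.executemany(
    "INSERT INTO relation VALUES (?, ?, ?)",
    [
        ("a", "b", "references"),
        ("b", "c", "references"),
        ("c", "a", "references"),   # cycle back to the start
        ("c", "d", "references"),
        ("a", "e", "supersedes"),   # different kind, ignored by the filter
    ],
)

reached = conn.execute(
    TRAVERSE_SQL, {"start": "a", "kind": "references", "max_hops": 3}
).fetchall()
print(reached)  # [('b', 1), ('c', 2), ('d', 3)]
```

Taking MIN(hop) per node reports the shortest distance at which each entity was reached, so the cycle through "a" does not produce duplicate or inflated entries.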

4. Going further

4.1 Custom chunkers

from Agent_module.carbon_data import by_tokens, by_paragraph, by_sentence

# Sliding token window (with overlap)
chunker = by_tokens(max_tokens=256, overlap=32)

# By paragraph (default)
chunker = by_paragraph(min_chars=20)

# By sentence
chunker = by_sentence()

store.ingest_text(text, chunker=chunker, embedder=embedder)

You can roll your own: Chunker = Callable[[str], list[str]].
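
As a sketch of such a custom chunker, the factory below returns a callable matching that signature. It splits on a fixed character window with overlap; the name by_chars is made up for this example and is not one of the library's built-in chunkers.

```python
from typing import Callable

Chunker = Callable[[str], list[str]]


def by_chars(max_chars: int = 200, overlap: int = 20) -> Chunker:
    """Build a chunker that emits fixed-size character windows with overlap."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")
    step = max_chars - overlap

    def chunk(text: str) -> list[str]:
        pieces: list[str] = []
        start = 0
        while start < len(text):
            piece = text[start:start + max_chars]
            if piece.strip():          # skip whitespace-only windows
                pieces.append(piece)
            if start + max_chars >= len(text):
                break                  # this window already reached the end
            start += step
        return pieces

    return chunk


print(by_chars(max_chars=4, overlap=1)("abcdefghij"))  # ['abcd', 'defg', 'ghij']
```

The resulting callable can be passed wherever a chunker is accepted, for example store.ingest_text(text, chunker=by_chars(500, 50), embedder=embedder).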

4.2 HNSW dispatch

When hnswlib is installed, the library automatically uses HNSW acceleration whenever:

  • mode="vector" (including the vector leg of mode="hybrid")
  • metric="cosine"
  • no JSON filters= are present
  • namespace= is applied as a post-filter (with 3× over-sampling)

Explicit override:

store.search(q, embedder=emb, use_hnsw="auto")  # default
store.search(q, embedder=emb, use_hnsw="off")   # force brute force
store.search(q, embedder=emb, use_hnsw="on")    # force HNSW (raises if not eligible)

On the first search, HNSW is built from the entire embedding table and
persisted to a vector_index blob; later add/delete calls are detected via
a row-count comparison and trigger a lazy rebuild.

4.3 Namespace isolation

store.put_entity(id="x", kind="document", namespace="tenant-a")
hits = store.search(q, embedder=emb, namespace="tenant-a")

Useful when multiple tenants or datasets share one file.
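
When a namespace is combined with an approximate index, section 4.2 notes that the namespace is applied as a post-filter with 3× over-sampling. A minimal sketch of that idea, assuming a generic ann_search callable and a chunk-to-namespace lookup (both hypothetical stand-ins, not library APIs):

```python
from typing import Callable

Candidate = tuple[str, float]  # (chunk_id, score), best first


def search_with_post_filter(
    ann_search: Callable[[int], list[Candidate]],
    namespace_of: dict[str, str],
    namespace: str,
    top_k: int,
    oversample: int = 3,
) -> list[Candidate]:
    """Ask the index for extra candidates, then keep only one namespace."""
    # Fetch more than top_k, since some candidates will be filtered out.
    candidates = ann_search(top_k * oversample)
    kept = [c for c in candidates if namespace_of.get(c[0]) == namespace]
    return kept[:top_k]


# Toy index: a pre-ranked list shared by two tenants.
ranked = [("c1", 0.9), ("c2", 0.8), ("c3", 0.7), ("c4", 0.6), ("c5", 0.5), ("c6", 0.4)]
namespaces = {"c1": "tenant-b", "c2": "tenant-a", "c3": "tenant-b",
              "c4": "tenant-a", "c5": "tenant-a", "c6": "tenant-b"}

hits = search_with_post_filter(lambda n: ranked[:n], namespaces, "tenant-a", top_k=2)
print(hits)  # [('c2', 0.8), ('c4', 0.6)]
```

A post-filter can return fewer than top_k hits when the requested namespace is a small fraction of the index, which is the trade-off that over-sampling is meant to soften.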

4.4 Transactions

with store.transaction():
    store.put_entity(id="a", kind="document")
    store.add_chunks("a", [...])
    store.add_relation("a", "b", "ref")
    # any exception inside the block → full rollback

ingest_text / ingest_table / remember are already wrapped in their own transactions internally.
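
The all-or-nothing behaviour can be reproduced with plain sqlite3 to see what a rollback means in practice. This is a sketch of the general pattern, not the library's code; the transaction helper and the entity table below are hypothetical.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def transaction(conn: sqlite3.Connection):
    """Commit on success, roll back on any exception (illustrative helper)."""
    conn.execute("BEGIN")
    try:
        yield conn
    except BaseException:
        conn.execute("ROLLBACK")
        raise
    else:
        conn.execute("COMMIT")


# isolation_level=None puts sqlite3 in autocommit mode, so BEGIN/COMMIT are ours.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE entity (id TEXT PRIMARY KEY, kind TEXT)")

try:
    with transaction(conn):
        conn.execute("INSERT INTO entity VALUES ('a', 'document')")
        raise RuntimeError("simulated failure")  # triggers the rollback
except RuntimeError:
    pass

after_failure = conn.execute("SELECT COUNT(*) FROM entity").fetchone()[0]

with transaction(conn):
    conn.execute("INSERT INTO entity VALUES ('b', 'document')")

after_success = conn.execute("SELECT COUNT(*) FROM entity").fetchone()[0]
print(after_failure, after_success)  # 0 1
```

The row inserted before the exception never becomes visible, which is the same guarantee the with store.transaction() block describes.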

4.5 Admin & ops

# Consistency check (read-only)
report = store.validate()
if not report.ok:
    for issue in report.issues:
        print(issue)

# VACUUM + FTS rebuild + clear vector_index cache
sizes = store.compact()
print("freed:", sizes["size_before"] - sizes["size_after"], "bytes")

# One-file JSON backup (embeddings encoded as base64)
store.export("backup.json")
store.export("backup_no_vec.json", include_embeddings=False)

# Snapshot of current state
print(store.stats())
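
JSON has no binary type, which is why the export encodes embeddings as base64. The round trip below sketches that idea using only the standard library; the 32-bit float layout in native byte order and the record shape are assumptions, so do not rely on them when parsing a real backup file.

```python
import base64
import json
from array import array


def encode_vector(values: list[float]) -> str:
    """Pack floats as 32-bit values and return a base64 string."""
    return base64.b64encode(array("f", values).tobytes()).decode("ascii")


def decode_vector(text: str) -> list[float]:
    """Inverse of encode_vector."""
    vec = array("f")
    vec.frombytes(base64.b64decode(text))
    return vec.tolist()


# Hypothetical record shape, chosen only to demonstrate the round trip.
record = {"id": "doc-1", "embedding": encode_vector([0.5, -1.25, 2.0])}
payload = json.dumps(record)

restored = decode_vector(json.loads(payload)["embedding"])
print(restored)  # [0.5, -1.25, 2.0]
```

Base64 inflates the vector bytes by roughly a third, which is one reason to pass include_embeddings=False when the vectors can be recomputed.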

5. Debugging tips

  • It really is SQLite. sqlite3 kb.carbondata drops you straight into the REPL where you can run raw SQL against any table.
  • Schema lives entirely in carbon_data/schema.py, ~150 lines.
  • Test fixtures are good textbook material: tests/carbon_data/test_m{1..9}_*.py — each file covers one milestone end-to-end.
  • Run the suite:
    cd Agent_module
    .venv/bin/pytest tests/carbon_data/ -q
    

6. Next steps

Run the demo to see it all in action:

cd Agent_module
.venv/bin/python examples/carbondata_quickstart.py

For the full API surface, read the class CarbonStore docstring in carbon_data/store.py.