ZettelForge Benchmark Report

April 17, 2026

Version: 2.2.0 | Last updated: 2026-04-16 | Author: Automated benchmark suite

Most raw numbers in this report were captured against v2.0.0–v2.1.1. They are retained as published so prior claims can be audited. The LOCOMO row was re-measured at v2.1.1 with Ollama cloud judges and the CTIBench ATE row was re-measured at v2.2.0 after fixing the ingestion pipeline and dropping the ICS matrix noise (see CHANGELOG).

Executive Summary

ZettelForge was evaluated across six benchmark suites: the five summarized in the table below, plus MemoryAgentBench (Section 6). The default configuration runs with zero external AI dependencies (fastembed for embeddings, llama-cpp-python for the LLM, SQLite for storage; TypeDB is an opt-in extension).

Benchmark	What it measures	Key result
CTI Retrieval	Real CTI queries (attribution, CVE linkage, tools)	75.0% accuracy
LOCOMO (ACL 2024)	Conversational memory recall	22.0% accuracy (v2.1.1, Ollama judge)
MemPalace comparison	Head-to-head on LOCOMO	MemPalace 26% vs ZettelForge 18% (v2.0.0 snapshot)
RAGAS	Retrieval quality metrics	78.1% keyword presence
CTIBench (NeurIPS 2024)	ATT&CK technique extraction	F1 = 0.146 (v2.2.0, fixed ingestion)

Key finding: ZettelForge scores 75% on its domain benchmark (CTI queries) and 22% on LOCOMO conversational memory. The gap is by design: the system is built for threat intelligence, not chatbot memory.

1. CTI Retrieval Benchmark (Domain Benchmark)

Date: 2026-04-10 | Corpus: 8 real-world-style CTI reports | Queries: 20

This is ZettelForge's home turf — the queries an analyst would actually ask.

Results by Category

Category	Queries	Accuracy	What it tests
Attribution	5	100%	"Who is attributed to MOIS?" → MuddyWater
Multi-hop	3	100%	"APT group using DROPBEAR + NATO?" → APT28
CVE linkage	4	75%	"Link CVE-2026-3055 to threat actor" → MuddyWater
Temporal	3	66.7%	"Is Server ALPHA currently secure?" → rebuilt, patched
Tool attribution	5	40%	"What tools does Turla use?" → Carbon, Kazuar, Snake
Overall	20	75.0%

p50 latency: 620ms | Notes: 8
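
The overall figure is the query-weighted mean of the category scores. A quick check using only the query counts and accuracies from the table above (the per-category scores are derived from those figures, not read from the raw results file):

```python
# Category -> (number of queries, reported accuracy in percent), from the table above.
categories = {
    "Attribution": (5, 100.0),
    "Multi-hop": (3, 100.0),
    "CVE linkage": (4, 75.0),
    "Temporal": (3, 66.7),
    "Tool attribution": (5, 40.0),
}

total_queries = sum(n for n, _ in categories.values())
# Accumulated score per category; partial credit means this need not be an integer.
total_score = sum(n * acc / 100.0 for n, acc in categories.values())
overall = 100.0 * total_score / total_queries

print(f"{total_queries} queries, overall accuracy {overall:.1f}%")
```

Because the categories have different sizes, an unweighted mean of the five percentages would give a different number; the weighted form is the one that reproduces 75.0%.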

Chunking Strategy Comparison

Tested whether 800-char chunking (like MemPalace) improves CTI accuracy:

Strategy	CTI Accuracy	p50 Latency	Notes
full_session (current)	75.0%	620ms	8
chunked_800	75.0%	706ms	8

Verdict: No improvement. CTI reports are already 500-900 chars. Chunking adds latency without benefit. Not merged.
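
The result follows from the sizes involved. A minimal sketch of fixed-size chunking (the `chunk_text` helper is hypothetical, not the function used by either system) shows that a report in the 500-900 character range yields at most two chunks:

```python
def chunk_text(text: str, size: int = 800) -> list[str]:
    """Split text into consecutive chunks of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

for length in (500, 800, 900):
    report = "x" * length  # stand-in for a CTI report of this length
    chunks = chunk_text(report)
    print(length, [len(c) for c in chunks])
```

With so few chunks per report, the retrieval units barely change, which is consistent with the identical accuracy in the table.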

Tool Attribution Gap Analysis

Tool attribution scores 40% even though queries like "What tools does APT28 use?" match the correct report. The keyword judge awards full credit only when every expected keyword of a multi-tool answer (e.g., "Cobalt Strike, DROPBEAR, SedUploader") appears in the retrieved context. When the report mentions tools across multiple sentences and only some of them are matched, the judge scores the partial match as 0.5 rather than 1.0.
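
The scoring rule described above can be sketched as follows. This is an illustrative reconstruction, not the benchmark's judge: the `keyword_score` name, the case-insensitive substring matching, and the 0.0 score for zero matches are assumptions.

```python
def keyword_score(expected_keywords: list[str], context: str) -> float:
    """Return 1.0 if every keyword appears, 0.5 if only some do, else 0.0."""
    text = context.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in text)
    if hits == len(expected_keywords):
        return 1.0
    return 0.5 if hits > 0 else 0.0

expected = ["Cobalt Strike", "DROPBEAR", "SedUploader"]
# Made-up retrieved context for illustration; SedUploader is absent.
context = "APT28 deployed DROPBEAR and later staged Cobalt Strike beacons."
print(keyword_score(expected, context))
```

Under a rule like this, a single missing tool name halves the score even when the right report was retrieved.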

2. LOCOMO Benchmark (Conversational Memory)

Source: LoCoMo (ACL 2024) | Dataset: 10 conversations, 5882 dialogue turns, 100 QA pairs | Judge: Keyword overlap

Version Progression

Category	v1.3.0	v1.5.0	v2.0.0 (with retrieval improvements)
single-hop	5.0%	10.0%	15.0%
multi-hop	0.0%	0.0%	0.0% (avg_score 0.15)
temporal	0.0%	0.0%	5.0%
open-domain	30.0%	30.0%	40.0%
adversarial	35.0%	35.0%	30.0%
Overall	14.0%	15.0%	18.0%
p50 latency	238ms	344ms	1,240ms
p95 latency	190,000ms	1,305ms	2,282ms
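
For reference, p50 and p95 are percentiles of the per-query latency distribution, so a small slow tail can dominate p95 while p50 barely moves. A sketch using the nearest-rank method (an assumption; the benchmark scripts may interpolate differently) on made-up latencies:

```python
import math

def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Illustrative latencies in ms (not benchmark data): 18 fast queries and two stalls.
latencies_ms = [200.0 + 5 * i for i in range(18)] + [190_000.0, 190_000.0]
print("p50:", percentile(latencies_ms, 50))
print("p95:", percentile(latencies_ms, 95))
```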

Why LOCOMO Scores Are Low

ZettelForge's entity extractor recognizes CTI entities (CVEs, APT groups, tools). LOCOMO uses conversational entities (person names, hobbies, life events). Graph traversal doesn't fire on conversational queries because no recognized entities appear.

Additionally, the supersession logic aggressively marks LOCOMO sessions as superseded (264/272) because conversational sessions share speakers. The benchmark now uses exclude_superseded=False to work around this.
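
To illustrate the failure mode, here is a toy model of entity-overlap supersession. It is not ZettelForge's `_check_supersession()` logic, which this report does not show; the `Note` class, the overlap rule, and the `recall` helper are hypothetical, and only the `exclude_superseded` flag name comes from the text above.

```python
from dataclasses import dataclass

@dataclass
class Note:
    note_id: int
    entities: frozenset[str]
    superseded: bool = False

def add_note(store: list[Note], note: Note) -> None:
    """Toy rule (assumption): a new note supersedes older notes sharing any entity."""
    for older in store:
        if older.entities & note.entities:
            older.superseded = True
    store.append(note)

def recall(store: list[Note], exclude_superseded: bool = True) -> list[Note]:
    """Return notes, optionally hiding the ones marked as superseded."""
    return [n for n in store if not (exclude_superseded and n.superseded)]

store: list[Note] = []
# Every session in a conversation mentions the same two speakers (names are made up).
for i in range(5):
    add_note(store, Note(i, frozenset({"speaker_a", "speaker_b", f"topic_{i}"})))

print(len(recall(store)), len(recall(store, exclude_superseded=False)))
```

In this toy version only the newest session survives the default filter, which mirrors the 264/272 pattern reported above; disabling the filter restores the full set.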

LOCOMO Leaderboard

System	Accuracy	p95 Latency	External Dependencies
Mem0g	68.5%	2.6s	Cloud API
Mem0	66.9%	1.4s	Cloud API
LangMem	58.1%	60s	Cloud API
OpenAI Memory	52.9%	0.9s	Cloud API
MemPalace	26.0%	170ms	None (ChromaDB)
ZettelForge 2.0.0	18.0%	2.3s	None (fastembed + GGUF)

3. MemPalace Comparison

Date: 2026-04-10 | Benchmark: LOCOMO (same dataset, same scoring)

Category	ZettelForge	MemPalace	Delta (MemPalace − ZettelForge)
single-hop	10.0%	15.0%	+5pp
multi-hop	0.0%	0.0%	0pp
temporal	0.0%	10.0%	+10pp
open-domain	35.0%	55.0%	+20pp
adversarial	30.0%	50.0%	+20pp
Overall	18.0%	26.0%	+8pp
p50 latency	1,240ms	130ms	MemPalace ~9.5x faster

Why MemPalace Wins on LOCOMO

Chunking: 800-char chunks vs ZettelForge's 4000-char full sessions. Smaller chunks produce more precise keyword matches.
No overhead: Pure ChromaDB vector search. No intent classification, graph traversal, or blending. For conversational data with no CTI entities, these stages add latency without improving accuracy.

Where ZettelForge Wins

ZettelForge scores 75% on CTI queries (attribution, CVE linkage, multi-hop reasoning). MemPalace has no knowledge graph, no STIX ontology, no entity extraction, and no typed relationships. On "What tools does MOIS use?" or "Link CVE-2026-3055 to Dindoor backdoor", ZettelForge's graph traversal and entity indexing outperform flat vector search.

4. RAGAS Retrieval Quality

Date: 2026-04-10 | Dataset: LOCOMO | Scoring: Manual fallback (SequenceMatcher + keyword presence)

Metric	v1.5.0	v2.0.0	Change
Keyword presence	75.9%	78.1%	+2.2pp
String similarity	17.7%	18.2%	+0.5pp
p50 latency	320ms	2,045ms	In-process LLM overhead

Retrieval quality improved slightly with fastembed embeddings. The high keyword presence (78%) indicates that the retrieved context contains relevant information, which suggests the accuracy gap on LOCOMO lies mainly in answer extraction rather than retrieval.
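
The two fallback metrics can be approximated with the standard library. This sketch assumes keyword presence is the fraction of expected-answer tokens found in the retrieved context and string similarity is `difflib.SequenceMatcher.ratio()`; the tokenization and function names are illustrative, not taken from `ragas_benchmark.py`.

```python
from difflib import SequenceMatcher

def keyword_presence(answer: str, context: str) -> float:
    """Fraction of answer tokens that occur as tokens in the context."""
    answer_tokens = answer.lower().split()
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 0.0
    return sum(t in context_tokens for t in answer_tokens) / len(answer_tokens)

def string_similarity(answer: str, context: str) -> float:
    """Character-level similarity in [0, 1] from difflib."""
    return SequenceMatcher(None, answer.lower(), context.lower()).ratio()

# Made-up example: a short answer embedded in a longer retrieved passage.
answer = "adopted a dog"
context = "last spring she adopted a dog from the local shelter"
print(keyword_presence(answer, context))
print(round(string_similarity(answer, context), 2))
```

A short answer inside a longer context scores high on keyword presence and low on string similarity, the same pattern as the two metrics in the table above.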

5. CTIBench (NeurIPS 2024)

Initial run: 2026-04-10 (v2.0.0), F1 = 0.000 | Fixed: 2026-04-16 (v2.2.0), F1 = 0.146 | Task: CTI-ATE (ATT&CK Technique Extraction)

Metric	v2.0.0	v2.2.0	Change
F1	0.000	0.146	+0.146
p50 latency	1,170 ms	1,120 ms	−50 ms (effectively unchanged)

What changed. The original v2.0.0 run returned F1 = 0 because CTI-ATE descriptions are natural-language paraphrases of ATT&CK techniques without T-codes (T1071, T1573, etc.) and the benchmark adapter only looked for T-code regex patterns in retrieved text.
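
The failure is easy to reproduce. Assuming a T-code pattern along these lines (the exact regex used by the adapter is not shown in this report, and both sample sentences are made up), a paraphrased description yields no matches, so every prediction set is empty:

```python
import re

# ATT&CK technique IDs: T + 4 digits, optional .NNN sub-technique suffix.
T_CODE = re.compile(r"\bT\d{4}(?:\.\d{3})?\b")

with_code = "Uses Application Layer Protocol (T1071) and Encrypted Channel (T1573)."
paraphrase = "The malware communicates with its controller over standard web protocols."

codes_found = T_CODE.findall(with_code)
codes_in_paraphrase = T_CODE.findall(paraphrase)
print(codes_found)
print(codes_in_paraphrase)
```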

v2.2.0 fixes two methodology issues:

Ingestion pipeline — the adapter now ingests the MITRE ATT&CK technique descriptions as cross-reference notes before scoring, so T-codes can be linked back to the paraphrased descriptions.
ICS matrix noise removed — the CTI-ATE split includes ATT&CK for ICS technique entries whose T-codes overlap with Enterprise IDs. Those rows were dropped from scoring.

Remaining gap. F1 = 0.146 is a lower bound, held down by low recall rather than low precision: many techniques map to multiple paraphrases, and the current scoring function rewards only exact T-code matches. A semantic matcher over technique descriptions should lift this further and is tracked as a follow-up.
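
For concreteness, exact-match scoring over T-code sets looks like the sketch below. The function name and the example prediction are illustrative, and whether the benchmark micro- or macro-averages across samples is not stated here.

```python
def f1_exact(predicted: set[str], gold: set[str]) -> tuple[float, float, float]:
    """Precision, recall, and F1 over exact T-code matches for one sample."""
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

gold = {"T1071", "T1573", "T1105", "T1059"}
predicted = {"T1071"}  # one correct technique, three missed
precision, recall, f1 = f1_exact(predicted, gold)
print(precision, recall, round(f1, 3))
```

A prediction that is entirely correct but incomplete still scores a low F1, which is the recall-bound behavior described above.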

Architecture Summary (v2.2.0)

Component	Technology	External Server?
Embeddings	fastembed (nomic-embed-text-v1.5-Q, 768-dim, ONNX)	No
LLM	llama-cpp-python (Qwen2.5-3B-Instruct Q4_K_M) or Ollama	No (local) / Yes (ollama)
Vector store	LanceDB (IVF_PQ, in-memory fallback)	No
Notes / KG / entity index	SQLite (WAL mode)	No
Ontology (optional)	TypeDB (STIX 2.1, Docker) — via `zettelforge-enterprise`	Yes (Docker, extension only)

Total external dependencies for the default community build: none. SQLite, LanceDB, and fastembed all run in-process. TypeDB ships only with the zettelforge-enterprise extension.
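
As an illustration of the in-process storage claim, SQLite's WAL mode is enabled with a single pragma from Python's standard library. The file name and table schema here are placeholders, not ZettelForge's actual schema.

```python
import os
import sqlite3
import tempfile

# WAL requires a file-backed database; in-memory databases report "memory" instead.
db_path = os.path.join(tempfile.mkdtemp(), "notes_example.db")
conn = sqlite3.connect(db_path)
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)

# Placeholder schema for illustration only.
conn.execute("CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO notes (body) VALUES (?)", ("example note",))
conn.commit()
count = conn.execute("SELECT COUNT(*) FROM notes").fetchone()[0]
conn.close()
```

No server process is involved at any point: the database is a local file opened by the same Python process that runs retrieval.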

Regression Root Causes Found and Fixed

During v2.0.0 benchmarking, three regressions were identified and fixed:

VectorRetriever LanceDB path: The rewritten retriever tried LanceDB first, got partial results with quantized embeddings, and did not fall back to the in-memory path. Fix: Force in-memory cosine similarity.
BlendedRetriever result dropping: Blending reduced the number of vector results when the graph returned nothing. Fix: Fall back to vector results when blending reduces the count.
Supersession on conversational data: _check_supersession() marked 264/272 LOCOMO notes as superseded because sessions share speakers. Fix: The LOCOMO benchmark uses exclude_superseded=False.
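
The BlendedRetriever fix amounts to a guard around the blending step. A minimal sketch, with a hypothetical `blend` function standing in for the real blending logic (which this report does not show):

```python
def blend(vector_hits: list[str], graph_hits: list[str], k: int) -> list[str]:
    """Toy blend (assumption): reserve half of the k slots for each source."""
    half = k // 2
    kept_vector = vector_hits[:half]
    kept_graph = [g for g in graph_hits[:k - half] if g not in kept_vector]
    return (kept_vector + kept_graph)[:k]

def retrieve(vector_hits: list[str], graph_hits: list[str], k: int = 4) -> list[str]:
    blended = blend(vector_hits, graph_hits, k)
    # Guard: never return fewer results than plain vector search would have.
    if len(blended) < min(k, len(vector_hits)):
        return vector_hits[:k]
    return blended

vector_hits = ["n1", "n2", "n3", "n4"]
print(retrieve(vector_hits, graph_hits=[]))            # graph empty, vector fallback
print(retrieve(vector_hits, graph_hits=["g1", "g2"]))  # both sources contribute
```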

Raw Data Files

File	Description	Date
`cti_retrieval_results.json`	CTI benchmark (75% accuracy)	2026-04-10
`locomo_results.json`	LOCOMO v2.0.0 (18% accuracy)	2026-04-10
`mempalace_results.json`	MemPalace comparison (26%)	2026-04-10
`ragas_results.json`	RAGAS retrieval quality	2026-04-10
`ctibench_results.json`	CTIBench ATE (v2.2.0, F1 = 0.146)	2026-04-16
`locomo_results_v1.3.0_baseline.json`	LOCOMO v1.3.0 (14%)	2026-04-09

6. MemoryAgentBench (ICLR 2026)

Date: 2026-04-10 | Model: nemotron-3-super:cloud via Ollama | Splits: AR (3 rows), CR (3 rows)

Results

Split	F1	EM	ROUGE-L	Queries	p50 Latency
Accurate Retrieval	0.328	0.157	0.332	300	333ms
Conflict Resolution	0.032	0.017	0.031	300	128ms
Overall	0.180	0.087	0.182	600
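
F1 and EM here are answer-level metrics. A common SQuAD-style formulation is sketched below; the whitespace tokenization and lowercasing are assumptions, and the MemoryAgentBench harness may normalize answers differently.

```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall, counting repeated tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Belgium", "belgium"))
print(token_f1("the answer is Belgium", "Belgium"))
```

Verbose answers are penalized by token precision even when they contain the right entity, which is one reason F1 can sit well above EM.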

Model Impact

Model	AR F1	CR F1	Overall F1
Qwen2.5-3B (local GGUF)	0.012	0.012	0.007
nemotron-3-super:cloud	0.328	0.032	0.180
Improvement	27x	2.7x	25x

Retrieval latency was unchanged between models (128-333ms), so the improvement comes from answer generation quality. This suggests that ZettelForge's retrieval pipeline was not the limiting factor on Accurate Retrieval: the local 3B model was the bottleneck.

CR Gap Analysis

Conflict Resolution scores remain low because the questions require multi-hop entity chain reasoning (e.g., "country of citizenship of the spouse of the author of Our Mutual Friend" → Belgium). This needs explicit graph traversal over entity relationships, not just LLM reasoning over retrieved text chunks.
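
A sketch of what such traversal could look like over a triple store. The relation names and the `follow` helper are hypothetical. The first two hops are ordinary facts about the novel, while the final triple reproduces the answer the benchmark expects in the example above.

```python
from typing import Optional

# (subject, relation) -> object. Relation names are illustrative.
triples = {
    ("Our Mutual Friend", "author"): "Charles Dickens",
    ("Charles Dickens", "spouse"): "Catherine Dickens",
    # Value taken from the benchmark example above, not from real-world data.
    ("Catherine Dickens", "country_of_citizenship"): "Belgium",
}

def follow(start: str, relations: list[str]) -> Optional[str]:
    """Walk one edge per relation; return None if any hop is missing."""
    node: Optional[str] = start
    for relation in relations:
        node = triples.get((node, relation))
        if node is None:
            return None
    return node

answer = follow("Our Mutual Friend", ["author", "spouse", "country_of_citizenship"])
print(answer)
```

Each hop is an exact lookup, so the chain either resolves fully or fails explicitly, unlike an LLM reading retrieved chunks that may hold only one of the three facts.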

Benchmark Scripts

Script	What it runs
`cti_retrieval_benchmark.py`	8 CTI reports, 20 queries, 5 categories
`locomo_benchmark.py`	LOCOMO 100 QA pairs across 5 categories
`mempalace_benchmark.py`	MemPalace on LOCOMO (ChromaDB)
`ragas_benchmark.py`	RAGAS retrieval quality metrics
`ctibench_benchmark.py`	CTIBench ATE adapter
`memoryagentbench.py`	MemoryAgentBench AR + CR + TTL + LRU