ZettelForge Benchmark Report

April 17, 2026 · View on GitHub

Version: 2.2.0 Last updated: 2026-04-16 Author: Automated benchmark suite

Most raw numbers in this report were captured against v2.0.0–v2.1.1. They are retained as published so prior claims can be audited. The LOCOMO row was re-measured at v2.1.1 with Ollama cloud judges and the CTIBench ATE row was re-measured at v2.2.0 after fixing the ingestion pipeline and dropping the ICS matrix noise (see CHANGELOG).


Executive Summary

ZettelForge was evaluated across five benchmark suites. The system runs with zero external AI dependencies on the default configuration (fastembed for embeddings, llama-cpp-python for LLM, SQLite for storage; TypeDB is an opt-in extension).

BenchmarkWhat it measuresKey result
CTI RetrievalReal CTI queries (attribution, CVE linkage, tools)75.0% accuracy
LOCOMO (ACL 2024)Conversational memory recall22.0% accuracy (v2.1.1, Ollama judge)
MemPalace comparisonHead-to-head on LOCOMOMemPalace 26% vs ZettelForge 18% (v2.0.0 snapshot)
RAGASRetrieval quality metrics78.1% keyword presence
CTIBench (NeurIPS 2024)ATT&CK technique extractionF1 = 0.146 (v2.2.0, fixed ingestion)

Key finding: ZettelForge scores 75% on its domain benchmark (CTI queries) and 22% on LOCOMO conversational memory. This is by design — the system is built for threat intelligence, not chatbot memory.


1. CTI Retrieval Benchmark (Domain Benchmark)

Date: 2026-04-10 | Corpus: 8 real-world-style CTI reports | Queries: 20

This is ZettelForge's home turf — the queries an analyst would actually ask.

Results by Category

CategoryQueriesAccuracyWhat it tests
Attribution5100%"Who is attributed to MOIS?" → MuddyWater
Multi-hop3100%"APT group using DROPBEAR + NATO?" → APT28
CVE linkage475%"Link CVE-2026-3055 to threat actor" → MuddyWater
Temporal366.7%"Is Server ALPHA currently secure?" → rebuilt, patched
Tool attribution540%"What tools does Turla use?" → Carbon, Kazuar, Snake
Overall2075.0%

p50 latency: 620ms | Notes: 8

Chunking Strategy Comparison

Tested whether 800-char chunking (like MemPalace) improves CTI accuracy:

StrategyCTI Accuracyp50 LatencyNotes
full_session (current)75.0%620ms8
chunked_80075.0%706ms8

Verdict: No improvement. CTI reports are already 500-900 chars. Chunking adds latency without benefit. Not merged.

Tool Attribution Gap Analysis

Tool attribution scores 40% because queries like "What tools does APT28 use?" match the correct report but keyword overlap on multi-tool answers (e.g., "Cobalt Strike, DROPBEAR, SedUploader") requires all keywords to appear in retrieved context. When the report mentions tools across multiple sentences, the keyword judge scores partial matches as 0.5 rather than 1.0.


2. LOCOMO Benchmark (Conversational Memory)

Source: LoCoMo (ACL 2024) Dataset: 10 conversations, 5882 dialogue turns, 100 QA pairs Judge: Keyword overlap

Version Progression

Categoryv1.3.0v1.5.0v2.0.0 (with retrieval improvements)
single-hop5.0%10.0%15.0%
multi-hop0.0%0.0%0.0% (avg_score 0.15)
temporal0.0%0.0%5.0%
open-domain30.0%30.0%40.0%
adversarial35.0%35.0%30.0%
Overall14.0%15.0%18.0%
p50 latency238ms344ms1,240ms
p95 latency190,000ms1,305ms2,282ms

Why LOCOMO Scores Are Low

ZettelForge's entity extractor recognizes CTI entities (CVEs, APT groups, tools). LOCOMO uses conversational entities (person names, hobbies, life events). Graph traversal doesn't fire on conversational queries because no recognized entities appear.

Additionally, the supersession logic aggressively marks LOCOMO sessions as superseded (264/272) because conversational sessions share speakers. The benchmark now uses exclude_superseded=False to work around this.

LOCOMO Leaderboard

SystemAccuracyp95 LatencyExternal Dependencies
Mem0g68.5%2.6sCloud API
Mem066.9%1.4sCloud API
LangMem58.1%60sCloud API
OpenAI Memory52.9%0.9sCloud API
MemPalace26.0%170msNone (ChromaDB)
ZettelForge 2.0.018.0%2.3sNone (fastembed + GGUF)

3. MemPalace Comparison

Date: 2026-04-10 | Benchmark: LOCOMO (same dataset, same scoring)

CategoryZettelForgeMemPalaceDelta
single-hop10.0%15.0%+5
multi-hop0.0%0.0%
temporal0.0%10.0%+10
open-domain35.0%55.0%+20
adversarial30.0%50.0%+20
Overall18.0%26.0%+8
p50 latency1,240ms130ms10x faster

Why MemPalace Wins on LOCOMO

  • Chunking: 800-char chunks vs ZettelForge's 4000-char full sessions. Smaller chunks produce more precise keyword matches.
  • No overhead: Pure ChromaDB vector search. No intent classification, graph traversal, or blending. For conversational data with no CTI entities, this overhead adds latency without accuracy.

Where ZettelForge Wins

ZettelForge scores 75% on CTI queries (attribution, CVE linkage, multi-hop reasoning). MemPalace has no knowledge graph, no STIX ontology, no entity extraction, and no typed relationships. On "What tools does MOIS use?" or "Link CVE-2026-3055 to Dindoor backdoor", ZettelForge's graph traversal and entity indexing outperform flat vector search.


4. RAGAS Retrieval Quality

Date: 2026-04-10 | Dataset: LOCOMO | Scoring: Manual fallback (SequenceMatcher + keyword presence)

Metricv1.5.0v2.0.0Change
Keyword presence75.9%78.1%+2.2pp
String similarity17.7%18.2%+0.5pp
p50 latency320ms2,045msIn-process LLM overhead

Retrieval quality slightly improved with fastembed embeddings. The high keyword presence (78%) indicates retrieved context contains relevant information — the accuracy gap on LOCOMO is in answer extraction, not retrieval.


5. CTIBench (NeurIPS 2024)

Initial run: 2026-04-10 (v2.0.0) — F1 = 0.000 Fixed: 2026-04-16 (v2.2.0) — F1 = 0.146 Task: CTI-ATE (ATT&CK Technique Extraction)

Metricv2.0.0v2.2.0Change
F10.0000.146+0.146
p50 latency1,170 ms1,120 msunchanged

What changed. The original v2.0.0 run returned F1 = 0 because CTI-ATE descriptions are natural-language paraphrases of ATT&CK techniques without T-codes (T1071, T1573, etc.) and the benchmark adapter only looked for T-code regex patterns in retrieved text.

v2.2.0 fixes two methodology issues:

  1. Ingestion pipeline — the adapter now ingests the MITRE ATT&CK technique descriptions as cross-reference notes before scoring, so T-codes can be linked back to the paraphrased descriptions.
  2. ICS matrix noise removed — the CTI-ATE split includes ATT&CK for ICS technique entries whose T-codes overlap with Enterprise IDs. Those rows were dropped from scoring.

Remaining gap. F1 = 0.146 is a lower bound driven by low recall rather than precision: many techniques map to multiple paraphrases and the current scoring function rewards exact T-code matches. A semantic matcher over technique descriptions would lift this further and is tracked as a follow-up.


Architecture Summary (v2.2.0)

ComponentTechnologyExternal Server?
Embeddingsfastembed (nomic-embed-text-v1.5-Q, 768-dim, ONNX)No
LLMllama-cpp-python (Qwen2.5-3B-Instruct Q4_K_M) or OllamaNo (local) / Yes (ollama)
Vector storeLanceDB (IVF_PQ, in-memory fallback)No
Notes / KG / entity indexSQLite (WAL mode)No
Ontology (optional)TypeDB (STIX 2.1, Docker) — via zettelforge-enterpriseYes (Docker, extension only)

Total external dependencies for the default community build: none. SQLite, LanceDB, and fastembed all run in-process. TypeDB ships only with the zettelforge-enterprise extension.


Regression Root Causes Found and Fixed

During v2.0.0 benchmarking, three regressions were identified and fixed:

  1. VectorRetriever LanceDB path — Rewritten retriever tried LanceDB first, got partial results with quantized embeddings, didn't fall back to in-memory. Fix: Force in-memory cosine similarity.

  2. BlendedRetriever result dropping — Blending reduced vector results when graph returned nothing. Fix: Fall back to vector when blending reduces count.

  3. Supersession on conversational data_check_supersession() marked 264/272 LOCOMO notes as superseded because sessions share speakers. Fix: LOCOMO benchmark uses exclude_superseded=False.


Raw Data Files

FileDescriptionDate
cti_retrieval_results.jsonCTI benchmark (75% accuracy)2026-04-10
locomo_results.jsonLOCOMO v2.0.0 (15% accuracy)2026-04-10
mempalace_results.jsonMemPalace comparison (26%)2026-04-10
ragas_results.jsonRAGAS retrieval quality2026-04-10
ctibench_results.jsonCTIBench ATE (v2.2.0, F1 = 0.146)2026-04-16
locomo_results_v1.3.0_baseline.jsonLOCOMO v1.3.0 (14%)2026-04-09

Benchmark Scripts

ScriptWhat it runs
cti_retrieval_benchmark.py8 CTI reports, 20 queries, 5 categories
locomo_benchmark.pyLOCOMO 100 QA pairs across 5 categories
mempalace_benchmark.pyMemPalace on LOCOMO (ChromaDB)
ragas_benchmark.pyRAGAS retrieval quality metrics
ctibench_benchmark.pyCTIBench ATE adapter

6. MemoryAgentBench (ICLR 2026)

Date: 2026-04-10 | Model: nemotron-3-super:cloud via Ollama | Splits: AR (3 rows), CR (3 rows)

Results

SplitF1EMROUGE-LQueriesp50 Latency
Accurate Retrieval0.3280.1570.332300333ms
Conflict Resolution0.0320.0170.031300128ms
Overall0.1800.0870.182600

Model Impact

ModelAR F1CR F1Overall F1
Qwen2.5-3B (local GGUF)0.0120.0120.007
nemotron-3-super:cloud0.3280.0320.180
Improvement27x2.7x25x

Retrieval latency was unchanged between models (128-333ms). The improvement is purely in answer generation quality. This validates that ZettelForge's retrieval pipeline works — the local 3B model was the bottleneck.

CR Gap Analysis

Conflict Resolution scores remain low because the questions require multi-hop entity chain reasoning (e.g., "country of citizenship of the spouse of the author of Our Mutual Friend" → Belgium). This needs explicit graph traversal over entity relationships, not just LLM reasoning over retrieved text chunks.


Benchmark Scripts

ScriptWhat it runs
cti_retrieval_benchmark.py8 CTI reports, 20 queries, 5 categories
locomo_benchmark.pyLOCOMO 100 QA pairs across 5 categories
mempalace_benchmark.pyMemPalace on LOCOMO (ChromaDB)
ragas_benchmark.pyRAGAS retrieval quality metrics
ctibench_benchmark.pyCTIBench ATE adapter
memoryagentbench.pyMemoryAgentBench AR + CR + TTL + LRU