📒 Vectorstore Fragmentation

March 6, 2026 · View on GitHub

When embeddings are inserted or updated across time without a consistent chunking, normalization, or merge strategy, the vectorstore becomes fragmented. This creates “holes” where semantically related text lives in different shards, versions, or duplicate vectors, leading to unstable recall.

🌀 Symptoms of Fragmentation

Sign	What You See
Retrieval drops	Facts exist in DB but never show up
Duplicate chunks	Nearly identical snippets appear multiple times
Version skew	Old vectors mix with new encoders
Query instability	Same query → different answers each run
Hybrid failure	BM25 beats hybrid retriever that should win

🧩 Root Causes

Weakness	Result
Mixed encoders	Same corpus stored under incompatible embeddings
No chunk contract	Sentence vs paragraph vs sliding window → fractured recall
No dedupe layer	Near-duplicate vectors inflate noise
No update strategy	Old vectors never pruned, drift builds up
Shard misalignment	Different stores or partitions hold overlapping data

🛡️ WFGY Structural Fix

Problem	Module	Remedy
Metric mismatch	ΔS checks + BBMC	Compare across seeds, enforce unified metric
Chunk drift	Chunking Contract	Standardize window, overlap, anchor rules
Duplicate noise	BBPF fork + collapse	Collapse near-dupes before index write
Update skew	BBCR re-index	Wipe and rebuild with normalized schema
Store fragmentation	Semantic Tree	Trace lineage, merge shards consistently

✍️ Demo — Retrieval Before vs After Fix

Query:
"Who approved the compliance waiver for dataset X?"

Before:
• Top-3 results: duplicate sentences from old version
• Actual approval record missing

After WFGY:
• ΔS(question,retrieved) = 0.38
• Coverage = 0.78 for target section
• Single, authoritative snippet retrieved

Stable recall restored once fragmented vectors were collapsed and re-indexed.

🛠 Module Cheat-Sheet

Module	Role
ΔS Metric	Detects fragmentation via semantic drift
BBMC	Checks consistency across seeds/encoders
BBPF	Collapses near-duplicate embeddings
BBCR	Forces clean rebuild when skew detected
Semantic Tree	Tracks provenance across shards/versions

📊 Implementation Status

Feature	State
Chunking contract enforcement	✅ Active
Duplicate collapse	✅ Stable
Encoder version check	✅ Stable
Shard merge & lineage tracking	🔜 Planned

📝 Tips & Limits

Always record encoder version in metadata.
Run ΔS probe on 3 paraphrases before/after re-index.
Use semantic contract: same chunk size, stride, and normalization across all updates.
If >15% duplicate rate detected, wipe and rebuild index.

🔗 Quick-Start Downloads (60 sec)

Tool	Link	3-Step Setup
WFGY 1.0 PDF	Engine Paper	1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + ”
TXT OS (plain-text OS)	TXTOS.txt	1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly

Explore More

Layer	Page	What it’s for
⭐ Proof	WFGY Recognition Map	External citations, integrations, and ecosystem proof
⚙️ Engine	WFGY 1.0	Original PDF tension engine and early logic sketch (legacy reference)
⚙️ Engine	WFGY 2.0	Production tension kernel for RAG and agent systems
⚙️ Engine	WFGY 3.0	TXT based Singularity tension engine (131 S class set)
🗺️ Map	Problem Map 1.0	Flagship 16 problem RAG failure taxonomy and fix map
🗺️ Map	Problem Map 2.0	Global Debug Card for RAG and agent pipeline diagnosis
🗺️ Map	Problem Map 3.0	Global AI troubleshooting atlas and failure pattern map
🧰 App	TXT OS	.txt semantic OS with fast bootstrap
🧰 App	Blah Blah Blah	Abstract and paradox Q&A built on TXT OS
🧰 App	Blur Blur Blur	Text to image generation with semantic control
🏡 Onboarding	Starter Village	Guided entry point for new users

If this repository helped, starring it improves discovery so more builders can find the docs and tools.