📒 Vectorstore Fragmentation

March 6, 2026

When embeddings are inserted or updated over time without a consistent chunking, normalization, or merge strategy, the vectorstore becomes fragmented. Semantically related text ends up split across shards, encoder versions, or duplicate vectors, which leaves “holes” in the index and makes recall unstable.


🌀 Symptoms of Fragmentation

| Sign | What You See |
| --- | --- |
| Retrieval drops | Facts exist in DB but never show up |
| Duplicate chunks | Nearly identical snippets appear multiple times |
| Version skew | Old vectors mix with new encoders |
| Query instability | Same query → different answers each run |
| Hybrid failure | BM25 beats hybrid retriever that should win |

🧩 Root Causes

| Weakness | Result |
| --- | --- |
| Mixed encoders | Same corpus stored under incompatible embeddings |
| No chunk contract | Sentence vs paragraph vs sliding window → fractured recall |
| No dedupe layer | Near-duplicate vectors inflate noise |
| No update strategy | Old vectors never pruned, drift builds up |
| Shard misalignment | Different stores or partitions hold overlapping data |
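
The "no chunk contract" row is the easiest cause to make concrete. Below is a minimal sketch of what such a contract could look like, assuming character-based windows. `ChunkContract`, `normalize`, `chunk`, the default sizes, and the `encoder-v1` label are all hypothetical names and values chosen for illustration, not part of WFGY.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkContract:
    """Hypothetical contract shared by every job that writes to the index."""
    window: int = 200   # chunk length in characters (illustrative value)
    stride: int = 150   # step between chunk starts; overlap = window - stride
    encoder_version: str = "encoder-v1"  # placeholder label, not a real model


def normalize(text: str) -> str:
    # NFKC folds compatibility characters; split/join collapses whitespace runs.
    return " ".join(unicodedata.normalize("NFKC", text).split())


def chunk(text: str, contract: ChunkContract) -> list[dict]:
    """Cut normalized text into fixed windows and stamp each with the contract."""
    clean = normalize(text)
    chunks = []
    for start in range(0, len(clean), contract.stride):
        chunks.append({
            "text": clean[start:start + contract.window],
            "start": start,
            "window": contract.window,
            "stride": contract.stride,
            "encoder_version": contract.encoder_version,
        })
        if start + contract.window >= len(clean):
            break  # this window already reached the end of the text
    return chunks
```

Because the contract values travel with every chunk as metadata, a later audit can spot vectors that were written under different settings instead of guessing from their content.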

🛡️ WFGY Structural Fix

| Problem | Module | Remedy |
| --- | --- | --- |
| Metric mismatch | ΔS checks + BBMC | Compare across seeds, enforce unified metric |
| Chunk drift | Chunking Contract | Standardize window, overlap, anchor rules |
| Duplicate noise | BBPF fork + collapse | Collapse near-dupes before index write |
| Update skew | BBCR re-index | Wipe and rebuild with normalized schema |
| Store fragmentation | Semantic Tree | Trace lineage, merge shards consistently |
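
To illustrate the "collapse near-dupes before index write" remedy, here is a rough sketch that uses a plain cosine-similarity threshold. It is not the BBPF module itself. The function names, the record layout, and the 0.95 threshold are assumptions for illustration, and the pairwise scan is quadratic, so it only suits small batches.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def collapse_near_duplicates(records, threshold=0.95):
    """Keep the first record of each near-duplicate group (hypothetical helper).

    A record is dropped when its vector is at least `threshold` similar
    to a record that has already been kept.
    """
    kept = []
    for rec in records:
        if all(cosine(rec["vector"], k["vector"]) < threshold for k in kept):
            kept.append(rec)
    return kept


batch = [
    {"id": "a", "vector": [1.0, 0.0]},
    {"id": "b", "vector": [0.99, 0.01]},  # almost the same direction as "a"
    {"id": "c", "vector": [0.0, 1.0]},
]
survivors = collapse_near_duplicates(batch)
print([r["id"] for r in survivors])  # ['a', 'c']
```

Running the collapse on each batch before it reaches the index keeps duplicates from ever being written, which is cheaper than pruning them afterwards.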

✍️ Demo — Retrieval Before vs After Fix

Query:
"Who approved the compliance waiver for dataset X?"

Before:
• Top-3 results: duplicate sentences from old version
• Actual approval record missing

After WFGY:
• ΔS(question,retrieved) = 0.38
• Coverage = 0.78 for target section
• Single, authoritative snippet retrieved

Stable recall restored once fragmented vectors were collapsed and re-indexed.
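
A ΔS reading like the one in the demo can be approximated as follows. This sketch assumes ΔS is one minus the cosine similarity between the question embedding and the retrieved-chunk embedding. That definition is an assumption here, as are the helper names and the toy two-dimensional vectors, which stand in for real encoder output.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def delta_s(query_vec, retrieved_vec):
    # Assumed definition: ΔS = 1 - cosine similarity (lower means closer).
    return 1.0 - cosine(query_vec, retrieved_vec)


def probe(paraphrase_vecs, retrieved_vec):
    """Score several paraphrases of one question against the same retrieved chunk."""
    scores = [delta_s(v, retrieved_vec) for v in paraphrase_vecs]
    return {
        "scores": scores,
        "worst": max(scores),
        "spread": max(scores) - min(scores),  # large spread hints at unstable recall
    }


# Toy vectors: two paraphrases agree with the chunk, one points elsewhere.
report = probe([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
print(report["worst"], report["spread"])
```

Comparing the `worst` and `spread` values from before and after a re-index gives a simple signal of whether recall has become more stable.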


🛠 Module Cheat-Sheet

| Module | Role |
| --- | --- |
| ΔS Metric | Detects fragmentation via semantic drift |
| BBMC | Checks consistency across seeds/encoders |
| BBPF | Collapses near-duplicate embeddings |
| BBCR | Forces clean rebuild when skew detected |
| Semantic Tree | Tracks provenance across shards/versions |

📊 Implementation Status

| Feature | State |
| --- | --- |
| Chunking contract enforcement | ✅ Active |
| Duplicate collapse | ✅ Stable |
| Encoder version check | ✅ Stable |
| Shard merge & lineage tracking | 🔜 Planned |

📝 Tips & Limits

  • Always record the encoder version in each vector's metadata.
  • Run a ΔS probe on 3 paraphrases before and after each re-index.
  • Use a semantic contract: same chunk size, stride, and normalization across all updates.
  • If the duplicate rate exceeds 15%, wipe and rebuild the index.
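
The encoder-version and duplicate-rate tips can be combined into one audit pass. The sketch below is a minimal version under stated assumptions: the `audit` function and the record fields `text` and `encoder_version` are hypothetical, and duplicates are counted by exact match on whitespace- and case-normalized text, which is only a cheap proxy for near-duplicate detection.

```python
from collections import Counter


def audit(records, expected_encoder, max_duplicate_rate=0.15):
    """Report encoder versions and duplicate rate, and suggest a rebuild if needed."""
    versions = Counter(r.get("encoder_version", "unknown") for r in records)

    seen = set()
    duplicates = 0
    for r in records:
        key = " ".join(r["text"].lower().split())  # cheap text normalization
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)

    rate = duplicates / len(records) if records else 0.0
    skewed = any(version != expected_encoder for version in versions)
    return {
        "versions": dict(versions),
        "duplicate_rate": rate,
        "rebuild": skewed or rate > max_duplicate_rate,
    }


index = [
    {"text": "Waiver approved by the board.", "encoder_version": "encoder-v1"},
    {"text": "waiver approved  by the board.", "encoder_version": "encoder-v1"},
    {"text": "Dataset X was ingested in May.", "encoder_version": "encoder-v1"},
    {"text": "Retention policy is 90 days.", "encoder_version": "encoder-v1"},
]
result = audit(index, expected_encoder="encoder-v1")
print(result["duplicate_rate"], result["rebuild"])  # 0.25 True
```

Here one of four records is a duplicate, so the 25% rate crosses the 15% limit and the audit suggests a rebuild even though every record uses the expected encoder.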

🔗 Quick-Start Downloads (60 sec)

| Tool | Link | 3-Step Setup |
| --- | --- | --- |
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” and the OS boots instantly |

Explore More

| Layer | Page | What it’s for |
| --- | --- | --- |
| ⭐ Proof | WFGY Recognition Map | External citations, integrations, and ecosystem proof |
| ⚙️ Engine | WFGY 1.0 | Original PDF tension engine and early logic sketch (legacy reference) |
| ⚙️ Engine | WFGY 2.0 | Production tension kernel for RAG and agent systems |
| ⚙️ Engine | WFGY 3.0 | TXT based Singularity tension engine (131 S class set) |
| 🗺️ Map | Problem Map 1.0 | Flagship 16 problem RAG failure taxonomy and fix map |
| 🗺️ Map | Problem Map 2.0 | Global Debug Card for RAG and agent pipeline diagnosis |
| 🗺️ Map | Problem Map 3.0 | Global AI troubleshooting atlas and failure pattern map |
| 🧰 App | TXT OS | .txt semantic OS with fast bootstrap |
| 🧰 App | Blah Blah Blah | Abstract and paradox Q&A built on TXT OS |
| 🧰 App | Blur Blur Blur | Text to image generation with semantic control |
| 🏡 Onboarding | Starter Village | Guided entry point for new users |

If this repository helped, starring it improves discovery so more builders can find the docs and tools.