Embeddings

March 6, 2026


A hub to stabilize the embedding layer before retrieval begins.
Use this folder if your vectors look fine at a glance but retrieval keeps drifting, coverage stays low, or store queries fail silently. No infra change needed.


Orientation: what each page covers

| Page | What it solves | Typical symptom |
|---|---|---|
| Metric Mismatch | Store metric (L2, cosine, dot) differs from model assumption | High similarity but wrong neighbors |
| Normalization & Scaling | Embeddings not normalized or scaled | Results unstable across runs |
| Tokenization & Casing | Tokenizer mismatch, casing differences | Same text gives different vectors |
| Chunking → Embedding Contract | Chunk cuts misaligned with semantic windows | Snippets cut mid-thought, anchors lost |
| Vectorstore Fragmentation | Index silently fragmented | Recall too low even with large k |
| Dimension Mismatch & Projection | Store dimension vs embedding dimension mismatch | Index errors or silent truncation |
| Update & Index Skew | Old vectors remain in index | Results point to stale data |
| Hybrid Retriever Weights | BM25 + ANN weights unbalanced | Hybrid worse than single retriever |
| Duplication & Near-Duplicate Collapse | Duplicate data overwhelms recall | Same doc retrieved repeatedly |
| Poisoning & Contamination | Embeddings polluted by adversarial/noisy vectors | Retrieval looks “randomized” |

When to use this folder

  • Retrieval looks fine by eye but metrics drift across runs.
  • Coverage stays low despite healthy-looking indexes.
  • Citations pull from stale or duplicated data.
  • Same query yields different answers depending on casing or seed.
  • Hybrid retrievers collapse into noise.

Acceptance targets

  • ΔS(question, retrieved) ≤ 0.45
  • Coverage ≥ 0.70 for target section
  • λ_observe convergent across 3 paraphrases and 2 seeds
  • No index skew between write/read
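
The first two targets can be turned into an automated gate. The sketch below is illustrative only: this page does not define ΔS or coverage, so it assumes ΔS is `1 - cosine similarity` between the question and retrieved embeddings, and that coverage is the fraction of the target section's chunk ids that show up in the retrieved set. The function names and the id-based coverage definition are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def delta_s(question_vec, retrieved_vec):
    # ASSUMPTION: ΔS is taken to be 1 - cosine similarity.
    return 1.0 - cosine_similarity(question_vec, retrieved_vec)

def coverage(retrieved_ids, target_ids):
    # ASSUMPTION: coverage = share of target-section chunk ids that were retrieved.
    target = set(target_ids)
    if not target:
        return 0.0
    return len(target & set(retrieved_ids)) / len(target)

def passes_targets(question_vec, retrieved_vec, retrieved_ids, target_ids,
                   max_delta_s=0.45, min_coverage=0.70):
    """True when both thresholds from the acceptance list are met."""
    return (delta_s(question_vec, retrieved_vec) <= max_delta_s
            and coverage(retrieved_ids, target_ids) >= min_coverage)
```

If your pipeline measures ΔS or coverage differently, swap in those definitions and keep the thresholds.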

60-second fix checklist

  1. Lock metrics
    One model family, one distance metric.
    Guide: Metric Mismatch

  2. Normalize
    L2-normalize embeddings at both write time and query time.
    Guide: Normalization & Scaling

  3. Unify tokenization
    Same tokenizer + casing across ingestion and query.
    Guide: Tokenization & Casing

  4. Audit chunking
    Verify semantic alignment, no mid-thought splits.
    Guide: Chunking → Embedding Contract

  5. Rebuild index if skewed
    Drop old embeddings, rebuild with correct dimension.
    Guide: Update & Index Skew
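
One way to enforce the checklist is to record the embedding settings used at ingestion and compare them with the settings used at query time. The sketch below is a minimal illustration, not part of any vector store API: `EmbeddingContract`, its fields, and the example values are all hypothetical names chosen for this page.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class EmbeddingContract:
    """Settings that must match between the write path and the query path."""
    model: str        # embedding model family
    dim: int          # vector dimension stored in the index
    metric: str       # "cosine", "l2", or "dot"
    normalized: bool  # whether vectors are L2-normalized
    tokenizer: str    # tokenizer identifier
    lowercase: bool   # whether text is lowercased before embedding

def contract_mismatches(write, query):
    """Return the names of fields that differ between the two contracts."""
    return [f.name for f in fields(write)
            if getattr(write, f.name) != getattr(query, f.name)]

# Hypothetical example values: the query path uses a different metric
# and skips normalization.
write_side = EmbeddingContract("model-a", 768, "cosine", True, "tok-a", False)
query_side = EmbeddingContract("model-a", 768, "l2", False, "tok-a", False)
print(contract_mismatches(write_side, query_side))
```

An empty list means the two paths agree. A mismatch on `dim` or `model` usually means the index needs a rebuild rather than a config change.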


FAQ for newcomers

Why is metric mismatch so common?
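Because vector stores are configured differently: FAISS examples and defaults typically use L2, Pinecone defaults to cosine, and Redis makes you choose among L2, inner product, and cosine when you create the index. If your embedding model expects cosine and the vectors are not unit-normalized, an L2 index ranks neighbors differently and recall degrades without raising any error.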
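
A small illustration of that silent failure, using made-up 2-D vectors rather than real embeddings: the same query picks a different nearest neighbor depending on the metric.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def l2_distance(a, b):
    return math.dist(a, b)

# Toy, unnormalized vectors (hypothetical values chosen for the demo).
query = [1.0, 0.0]
docs = {
    "long_same_direction": [10.0, 1.0],  # points almost the same way, large norm
    "short_off_direction": [0.9, 0.6],   # points elsewhere, but sits close in space
}

# Cosine prefers the vector with the closest direction.
best_by_cosine = max(docs, key=lambda k: cosine_similarity(query, docs[k]))
# L2 prefers the vector with the smallest straight-line distance.
best_by_l2 = min(docs, key=lambda k: l2_distance(query, docs[k]))

print(best_by_cosine, best_by_l2)
```

Cosine selects `long_same_direction` while L2 selects `short_off_direction`, and neither call reports a problem.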

Why normalize embeddings?
Without normalization, embeddings vary in magnitude, so L2 and dot-product scores mix vector length with direction and stop tracking semantic similarity.
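
A minimal sketch of L2 normalization in plain Python, assuming vectors are lists of floats. The helper name `l2_normalize` and the `eps` guard are illustrative choices. The point is to call the same function on the write path and on the query path.

```python
import math

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit length; reject zero vectors instead of dividing by ~0."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm < eps:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# After normalizing both sides, the dot product equals cosine similarity,
# so vector length no longer influences the score.
doc_vec = l2_normalize([3.0, 4.0])      # applied at write time
query_vec = l2_normalize([30.0, 40.0])  # applied at query time
score = dot(doc_vec, query_vec)
```

Here the two inputs differ only in length, so after normalization their score is 1 up to floating-point rounding.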

Why do tokenizers matter?
“Apple” vs “apple” may yield different vectors if one side lowercases and the other doesn’t.
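
One hedge against this is to route ingestion and query text through a single canonicalization function before it reaches the tokenizer. The sketch below uses only the standard library. The name `canonicalize` is hypothetical, and whether to lowercase is an assumption that depends on your model: cased models may need `lowercase=False` on both sides.

```python
import unicodedata

def canonicalize(text, lowercase=True):
    """Apply the same text cleanup at ingestion and at query time."""
    # Fold compatibility characters (e.g. non-breaking spaces, full-width forms).
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of whitespace and trim the ends.
    text = " ".join(text.split())
    # ASSUMPTION: lowercasing suits the embedding model in use.
    return text.lower() if lowercase else text

ingested = canonicalize("  Apple\u00a0Pie ")
queried = canonicalize("apple pie")
```

With the default settings both calls produce the same string, so the embedding model sees identical input on both paths.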

What if coverage stays low after all fixes?
Check for fragmentation and duplication collapse. The issue may not be the embedding model itself, but how the index is populated.
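
A quick way to check for duplication collapse is to count how many vectors survive a near-duplicate filter. This is a minimal sketch, assuming vectors fit in memory as lists of floats. The `0.98` threshold and the function name are hypothetical, and the pairwise loop is quadratic, so it suits spot checks on a sample rather than a full index.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def collapse_near_duplicates(vectors, threshold=0.98):
    """Greedy filter: keep a vector only if it is not too similar to one already kept.

    Returns the indices of the kept vectors, in input order.
    """
    kept = []
    for i, vec in enumerate(vectors):
        if all(cosine_similarity(vec, vectors[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Toy sample: the second vector is a near-copy of the first.
sample = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]]
kept_indices = collapse_near_duplicates(sample)
duplicate_ratio = 1 - len(kept_indices) / len(sample)
```

A high `duplicate_ratio` on a random sample suggests that near-copies are crowding the top-k results, which matches the "same doc retrieved repeatedly" symptom.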



🔗 Quick-Start Downloads (60 sec)

| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world”; the OS boots instantly |

Explore More

| Layer | Page | What it’s for |
|---|---|---|
| ⭐ Proof | WFGY Recognition Map | External citations, integrations, and ecosystem proof |
| ⚙️ Engine | WFGY 1.0 | Original PDF tension engine and early logic sketch (legacy reference) |
| ⚙️ Engine | WFGY 2.0 | Production tension kernel for RAG and agent systems |
| ⚙️ Engine | WFGY 3.0 | TXT based Singularity tension engine (131 S class set) |
| 🗺️ Map | Problem Map 1.0 | Flagship 16 problem RAG failure taxonomy and fix map |
| 🗺️ Map | Problem Map 2.0 | Global Debug Card for RAG and agent pipeline diagnosis |
| 🗺️ Map | Problem Map 3.0 | Global AI troubleshooting atlas and failure pattern map |
| 🧰 App | TXT OS | .txt semantic OS with fast bootstrap |
| 🧰 App | Blah Blah Blah | Abstract and paradox Q&A built on TXT OS |
| 🧰 App | Blur Blur Blur | Text to image generation with semantic control |
| 🏡 Onboarding | Starter Village | Guided entry point for new users |

If this repository helped, starring it improves discovery so more builders can find the docs and tools.