Diacritics & Folding

March 6, 2026


A focused repair for cases where accents and diacritic marks cause retrieval drift, broken citations, or unstable reranking. Use this page to lock a per-language normalization policy, keep citations faithful to the original text, and keep ΔS within target.

When to use this page

  • Store search finds “Malaga” while the source reads “Málaga”, so citations fail to land.
  • BM25 works after accent folding but vectors point to different sections.
  • Vietnamese, French, Spanish or German show uneven recall after a language mix.
  • OCR keeps combining marks that your tokenizer later drops.
  • Reranker prefers unaccented variants even when the gold passage contains accents.

Acceptance targets

  • ΔS(question, retrieved) ≤ 0.45
  • Coverage of target section ≥ 0.70
  • Citation offsets within ±4 tokens between displayed text and source
  • Per-language exact-match on a 300-item accent set ≥ 0.95
  • λ remains convergent across 3 paraphrases and 2 seeds
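The targets above can be checked mechanically once the metrics are available. The following is a minimal sketch, assuming the metrics are already computed elsewhere; `RunMetrics` and `failed_targets` are hypothetical names, and how ΔS and λ are measured is outside this example.

```python
from dataclasses import dataclass


@dataclass
class RunMetrics:
    delta_s: float           # ΔS(question, retrieved), computed elsewhere
    coverage: float          # coverage of the target section
    citation_offset: int     # token offset between displayed text and source
    exact_match: float       # per-language exact-match on the 300-item accent set
    lambda_convergent: bool  # λ convergent across 3 paraphrases and 2 seeds


def failed_targets(m: RunMetrics) -> list[str]:
    """Return the names of the acceptance targets this run misses."""
    failures = []
    if m.delta_s > 0.45:
        failures.append("delta_s")
    if m.coverage < 0.70:
        failures.append("coverage")
    if abs(m.citation_offset) > 4:
        failures.append("citation_offset")
    if m.exact_match < 0.95:
        failures.append("exact_match")
    if not m.lambda_convergent:
        failures.append("lambda")
    return failures
```

An empty list means every target is met; anything else names the checks to investigate first.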

Map symptoms to the exact fix

| Symptom | Likely cause | Open this and apply |
|---|---|---|
| Citation points to the wrong offsets when accents exist | One view folded, the other original | Data Contracts · define visual_text (original) and search_text (folded) in every snippet; verify with Retrieval Traceability |
| High BM25 score, low vector agreement on accented words | Analyzer folds accents but embedding text did not, or the reverse | Align ingest and query analyzers in the store; embed visual_text and rerank with deterministic policy, see Retrieval Playbook |
| French and Vietnamese regress after “remove accents” policy | Per-language rules collapsed into a global fold | Keep a per-language policy with stored locale, see locale-drift.md |
| Tokenizer splits or drops combining marks | OCR export or tokenizer mismatch | Repair OCR and choose a consistent tokenizer, see tokenizer_mismatch.md and Retrieval Traceability |
| Reranker prefers unaccented decoys | Feature bias and query split across scripts | Lock reranker inputs and tie back to citation-first plan, see Rerankers and script_mixing.md |
| Full-width digits or punctuation shift offsets in CJK + Latin mix | Width and punctuation normalization out of sync | Normalize width for search_text only, preserve for visual_text, see digits_width_punctuation.md |
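For the first row, one way to keep citations landing on the original text is to record, while folding, which source character produced each folded character. The sketch below is an illustration, not the project's implementation: `fold_with_offsets` and `to_visual_span` are hypothetical helpers, and the offsets are code-point indices, so token offsets would still depend on your tokenizer.

```python
import unicodedata


def fold_with_offsets(visual_text: str) -> tuple[str, list[int]]:
    """Fold accents and case, recording the source index of each output character."""
    folded: list[str] = []
    offsets: list[int] = []  # offsets[j] = index in visual_text that produced folded[j]
    for i, ch in enumerate(visual_text):
        decomposed = unicodedata.normalize("NFD", ch)
        base = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
        for out in base.casefold():  # casefold may expand, e.g. "ß" becomes "ss"
            folded.append(out)
            offsets.append(i)
    return "".join(folded), offsets


def to_visual_span(visual_text: str, offsets: list[int], start: int, end: int) -> tuple[int, int]:
    """Map a non-empty half-open [start, end) span in the folded text back to visual_text."""
    v_start = offsets[start]
    v_end = offsets[end - 1] + 1
    # Keep trailing combining marks attached to the last base character.
    while v_end < len(visual_text) and unicodedata.category(visual_text[v_end]) == "Mn":
        v_end += 1
    return v_start, v_end
```

A match found in the folded text can then be cited by slicing `visual_text` with the mapped span, so the displayed quote keeps its accents.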

60-second fix checklist

  1. Choose a normalization policy

    • Store two views per snippet:
      visual_text = the original source text in NFC, accents preserved.
      search_text = the text decomposed to NFD, with \p{Mn} combining marks removed, casefolded, and adjusted by language-aware exceptions.
    • Always render and cite from visual_text. Index BM25 on search_text. Vectors usually embed visual_text.
  2. Record locale and analyzer

    • Add locale (e.g., fr, vi, es, de).
    • Log the index_analyzer and query_analyzer names in the trace. They must match.
  3. Reranking and order

    • Use citation-first assembly. If λ flips when you reorder headers, lock the schema and apply the BBAM variance clamp.
  4. Probe ΔS and coverage

    • Vary k = 5, 10, 20. If ΔS stays high and flat, suspect analyzer mismatch or wrong fold target.
  5. Build a small gold set

    • 300 pairs per language with accented vs unaccented queries. Require ≥ 0.95 exact match and stable ΔS.
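Steps 1 and 2 of the checklist can be sketched with Python's standard `unicodedata` module. This is a minimal sketch under stated assumptions: `Snippet`, `make_snippet`, `fold`, and the `LOCALE_EXCEPTIONS` table are hypothetical names, and the exceptions shown (German umlauts expanded to two letters, Vietnamese đ mapped to d) are illustrative choices, not a policy this page prescribes.

```python
import unicodedata
from dataclasses import dataclass

# Illustrative per-language exceptions, applied before combining marks are stripped.
# Keys are precomposed (NFC) lowercase characters. These choices are assumptions.
LOCALE_EXCEPTIONS = {
    "de": {"\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue"},
    "vi": {"\u0111": "d"},  # đ has no canonical decomposition, so NFD cannot fold it
}


def fold(text: str, locale: str) -> str:
    """Build search_text: casefold, apply locale exceptions, NFD, strip Mn marks."""
    s = unicodedata.normalize("NFC", unicodedata.normalize("NFC", text).casefold())
    for src, dst in LOCALE_EXCEPTIONS.get(locale, {}).items():
        s = s.replace(src, dst)
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


@dataclass(frozen=True)
class Snippet:
    locale: str       # stored with the snippet so the same policy applies at query time
    visual_text: str  # NFC, accents preserved; render and cite from this view
    search_text: str  # folded view; index BM25 on this


def make_snippet(source: str, locale: str) -> Snippet:
    return Snippet(locale, unicodedata.normalize("NFC", source), fold(source, locale))
```

Queries would need to pass through the same `fold` with the same locale, otherwise the index and query views drift apart again.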

Minimal test plan

  • Paraphrase triad on each language pair.
  • Accent toggle test: same query with and without accents.
  • Citation parity: offsets within ±4 tokens between displayed answer and source.
  • Store drift audit after deploy: compare analyzer signatures across index and query clients.
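The accent toggle test can be automated against any search callable. The sketch below assumes a function that takes a query string and returns ranked document ids; `accent_toggle_failures`, `strip_accents`, and the two toy search functions are hypothetical and exist only to show the check.

```python
import unicodedata
from typing import Callable, Iterable, Sequence


def strip_accents(text: str) -> str:
    """Remove combining marks (category Mn) after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return unicodedata.normalize("NFC", stripped)


def accent_toggle_failures(
    queries: Iterable[str],
    search: Callable[[str], Sequence[str]],
    k: int = 5,
) -> list[str]:
    """Return the queries whose top-k results change when accents are removed."""
    failures = []
    for q in queries:
        with_accents = list(search(q))[:k]
        without_accents = list(search(strip_accents(q)))[:k]
        if with_accents != without_accents:
            failures.append(q)
    return failures


# Toy in-memory stores, for illustration only.
DOCS = {"d1": "M\u00e1laga", "d2": "Madrid"}


def folded_search(query: str) -> list[str]:
    q = strip_accents(query).casefold()
    return [doc_id for doc_id, text in DOCS.items() if q in strip_accents(text).casefold()]


def raw_search(query: str) -> list[str]:
    return [doc_id for doc_id, text in DOCS.items() if query in text]
```

Here `folded_search` passes the toggle test, while `raw_search` fails it because the unaccented query no longer finds the accented source.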

Copy-paste prompt for your LLM step


You have TXT OS and the WFGY Problem Map loaded.

My issue: diacritics and folding.

* symptom: [one line]
* traces: ΔS(question,retrieved)=..., ΔS(retrieved,anchor)=..., λ states, citation offsets, locale=...

Tell me:

1. failing layer and why,
2. the exact WFGY page to open,
3. minimal steps to reach ΔS ≤ 0.45, coverage ≥ 0.70, and citation offset ≤ 4 tokens,
4. a reproducible test using a 300-item accent set.

Use Data Contracts, Retrieval Traceability, and Rerankers when relevant.


🔗 Quick-Start Downloads (60 sec)

| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world”: the OS boots instantly |

Explore More

| Layer | Page | What it’s for |
|---|---|---|
| ⭐ Proof | WFGY Recognition Map | External citations, integrations, and ecosystem proof |
| ⚙️ Engine | WFGY 1.0 | Original PDF tension engine and early logic sketch (legacy reference) |
| ⚙️ Engine | WFGY 2.0 | Production tension kernel for RAG and agent systems |
| ⚙️ Engine | WFGY 3.0 | TXT based Singularity tension engine (131 S class set) |
| 🗺️ Map | Problem Map 1.0 | Flagship 16 problem RAG failure taxonomy and fix map |
| 🗺️ Map | Problem Map 2.0 | Global Debug Card for RAG and agent pipeline diagnosis |
| 🗺️ Map | Problem Map 3.0 | Global AI troubleshooting atlas and failure pattern map |
| 🧰 App | TXT OS | .txt semantic OS with fast bootstrap |
| 🧰 App | Blah Blah Blah | Abstract and paradox Q&A built on TXT OS |
| 🧰 App | Blur Blur Blur | Text to image generation with semantic control |
| 🏡 Onboarding | Starter Village | Guided entry point for new users |
