Unicode Normalization

September 5, 2025 · View on GitHub

🧭 Quick Return to Map

You are in a sub-page of LanguageLocale.
To reorient, go back here:

Think of this page as a desk within a ward.
If you need the full triage and all prescriptions, return to the Emergency Room lobby.

A focused fix for NFC vs NFD vs NFKC drift, width and compatibility forms, and double-normalized text that breaks retrieval, dedupe, and citation offsets.

When to use

  • Same word appears twice with different byte forms. Tokens do not match across pipelines.
  • High similarity yet wrong matches for accentized strings.
  • Citations point to wrong character offsets after rendering.
  • Mixed full-width and half-width forms, ZWJ or combining marks causing “near duplicates.”

Open these first

Acceptance targets

  • ΔS(question, retrieved) ≤ 0.45 across NFC and NFD variants of the same query.
  • Coverage of target section ≥ 0.70 after normalization pass.
  • λ remains convergent for three paraphrases and two seeds.
  • Offset accuracy ≥ 0.99 when mapping citations from normalized text back to raw text.

Fix checklist

  1. Pick one normalization form per layer

    • Storage and search keys: NFKC for maximal unification of width and compatibility forms.
    • Human-facing content and display: NFC to preserve intent while keeping codepoints stable.
    • Do not normalize code blocks or hashes. Tag code spans and skip them.
  2. Record the contract
    Add to your snippet schema:


norm\_form: NFC|NFD|NFKC|NFKD
norm\_version: ICU/Unicode version
offsets: {raw\_start, raw\_end, norm\_start, norm\_end}

See schema ideas in
Retrieval Traceability · Data Contracts

  1. Normalize on ingestion, not at query time only
  • Normalize and index once. Persist INDEX_HASH that includes the normalization form and version.
  • If you normalize only at query time, ΔS will drift for long contexts and k-ordering becomes unstable.
  1. Preserve both raw and normalized text
  • Store raw bytes for audit and exact reproduction.
  • Store normalized text for retrieval and rerank.
  • Maintain a fast map raw↔norm for citations.
  1. Unify width and compatibility sets
  • Collapse full-width ASCII, half-width katakana, superscripts, circled numbers under NFKC for keys.
  • Keep the visible form for UI by rendering raw text, not the NFKC key.
  1. Prevent double normalization
  • Mark a boolean is_normalized.
  • Gate every downstream step with a one-line check to avoid repeated passes that shift offsets.
  1. Chunk after normalization
  • Chunk boundaries must be computed on the normalized string to keep k-NN neighborhoods stable.
    Chunking Checklist

Verification

  • Twin query probe
    Run the same question in NFC and NFD. Top-k overlap ≥ 0.8 and ΔS difference ≤ 0.05.

  • Anchor triangulation
    Compare ΔS to the correct anchor and a decoy paragraph. After normalization, the correct anchor must win with margin ≥ 0.15.

  • Offset audit
    Pick ten citations with combining marks. Validate raw offsets and UI highlights match exactly.

  • k-ordering stability
    For k in {5, 10, 20}, the normalized index keeps the same top-3 order or differs by at most one position.


Minimal test set you can copy

  • “café” with precomposed é vs e + combining acute.
  • Full-width “ABC123” vs ASCII “ABC123”.
  • Half-width katakana vs standard katakana.
  • Arabic text with tatweel and optional diacritics.
  • Greek polytonic marks.
  • Hangul precomposed syllables vs Jamo sequences.

If any pair fails the twin query probe or offset audit, rebuild with the chosen normalization form and re-verify.


Escalate when

  • ΔS remains ≥ 0.60 after normalization and chunk rebuild
    Retrieval Playbook

  • High recall but messy k-ordering persists
    Rerankers


👑 Early Stargazers: See the Hall of Fame
Engineers, hackers, and open source builders who supported WFGY from day one.

GitHub starsWFGY Engine 2.0 is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the Unlock Board.

WFGY Main   TXT OS   Blah   Blot   Bloc   Blur   Blow