Locale Collation & Sorting

March 6, 2026

A focused guide to stop locale-specific sort/collation bugs that break retrieval order, deduping, join keys, and user-facing lists. Use this when lists look “random,” numbers sort as text (“v10” before “v2”), or non-English letters jump around across environments.

Open these first

Core acceptance targets

  • Deterministic sort across OS/runtime/DB: the same input yields identical order on two hosts.
  • Numeric natural order for mixed strings (“v2” < “v10”, “Chap 9” < “Chap 12”).
  • Locale-consistent letter order under the chosen policy (e.g., tr-TR “I/ı/İ/i”, sv-SE “Å/Ä/Ö”, da-DK “Æ/Ø/Å”, es-ES “ñ”, de-DE “ä/ö/ü/ß”, ja-JP kana).
  • Retrieval stability: top-k set and order do not change when locale/collation varies; ΔS(question, retrieved) ≤ 0.45 on three paraphrases.
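
One way to check the first target is to fingerprint the sorted output on each host and compare the digests. The sketch below is a minimal illustration, assuming the lists are small enough to hash in memory. `order_fingerprint` is a hypothetical helper name, not part of any WFGY tooling.

```python
import hashlib
import unicodedata


def order_fingerprint(items):
    """Return a SHA-256 hex digest that changes if item order or content changes.

    Items are NFC-normalized first, so canonically equivalent strings
    (for example "é" composed vs. decomposed) hash the same.
    """
    digest = hashlib.sha256()
    for item in items:
        data = unicodedata.normalize("NFC", item).encode("utf-8")
        # Length prefix keeps ["ab", "c"] distinct from ["a", "bc"].
        digest.update(len(data).to_bytes(4, "big"))
        digest.update(data)
    return digest.hexdigest()


if __name__ == "__main__":
    # Run the same sort on two hosts and compare the printed digests.
    print(order_fingerprint(sorted(["Chap 9", "Chap 12", "v2", "v10"])))
```

If two hosts print different digests for the same input and the same sort call, the order is not deterministic across them.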

Fast triage — what’s breaking?

| Symptom | Likely cause | What to check |
|---|---|---|
| “v10” sorted before “v2” | Lexicographic sort on strings | Enable natural sort with numeric keys or precomputed tokens. |
| Turkish “I/ı/İ/i” mis-ordered or casefold joins fail | Wrong casefold/collation (not tr-TR) | Ensure locale-aware casefold and ICU tr-TR collation. |
| Swedish “ÅÄÖ” placed near A/O | Default UCA/English collation | Use sv-SE collation; verify ICU rules. |
| German “ß” vs “ss” dedupe misses | Collation strength mismatch | Set strength=secondary/tertiary consistently; precompute collation keys. |
| Japanese kana mixed order, halfwidth/fullwidth split | Width/diacritic handling off | Normalize to NFC and enable ja collation with kana handling. NFC alone does not fold halfwidth/fullwidth variants; use NFKC or a width-insensitive strength for that. |
| CJK sort flips between Pinyin vs stroke | Different collation per host | Pin zh-u-co-pinyin (or stroke) everywhere; store the policy centrally. |
| RAG top-k changes across deploys | Store/retriever disagree on locale | Lock analyzer + collation; move ordering to reranker layer. |
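
The Turkish row in the table can be illustrated with a small sketch of why locale-agnostic lowercasing breaks join keys. `tr_casefold` is a hypothetical helper that hard-codes the two Turkic mappings for illustration. A production system would more likely use an ICU binding with the `tr` locale.

```python
import unicodedata


def tr_casefold(text: str) -> str:
    """Casefold with Turkish dotted/dotless I rules (illustrative only)."""
    # NFC first, so "I" + U+0307 (combining dot above) composes to U+0130.
    norm = unicodedata.normalize("NFC", text)
    # Turkish: U+0130 (İ) folds to "i", and "I" folds to U+0131 (ı).
    norm = norm.replace("\u0130", "i").replace("I", "\u0131")
    return norm.casefold()


if __name__ == "__main__":
    # Default lowercasing turns İ into "i" plus a combining dot,
    # so it no longer matches a plain "istanbul" key.
    print("\u0130stanbul".lower() == "istanbul")
    print(tr_casefold("\u0130stanbul") == "istanbul")
```

Apply a helper like this only to data known to be Turkish (or Azerbaijani); running it on other languages would map every “I” to “ı”.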

Fix in 60 seconds

  1. Pick one policy and write it down. Choose the business collation per language. Examples you can adopt as BCP-47/ICU tags:

    • Turkish: tr (case and dotted I rules)
    • Swedish: sv (Å/Ä/Ö order)
    • Danish: da (Æ/Ø/Å order)
    • Spanish: es (ñ, modern UCA rules)
    • German: de (ä/ö/ü/ß handling)
    • Japanese: ja with kana sensitivity
    • Chinese: zh-u-co-pinyin or zh-u-co-stroke (pick one and stick with it)
  2. Normalize before you sort

    • Apply Unicode NFC for storage and indexing.
    • Apply locale-aware casefold where required (not global lowercasing for tr-TR).
  3. Persist a collation key

    • Generate an ICU sort key per string at write time and sort by that key.
    • Keep the display text separate to avoid re-computing under mixed hosts.
  4. Enable natural sort for mixed numbers

    • Extract number runs and sort by (text_prefix, numeric_value, text_suffix) or pre-tokenize with zero-padded numeric keys.
  5. Move “human order” to the right layer

    • Retrieval index should stay analyzer-consistent; perform user-facing sort with the pinned collation or with a reranker when semantics must decide order.
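
Steps 2 and 4 can be sketched with the standard library alone. The example below is a minimal illustration, not a replacement for ICU collation: it normalizes to NFC, splits out number runs, and compares them numerically. `natural_key` is a hypothetical name, and the plain `casefold()` call is an assumption that does not apply Turkish rules.

```python
import re
import unicodedata

_DIGIT_RUN = re.compile(r"(\d+)")


def natural_key(text: str) -> tuple:
    """Build a sort key where number runs compare as integers.

    re.split with one capturing group alternates text and digit runs,
    so even positions are always text and odd positions always digits.
    """
    norm = unicodedata.normalize("NFC", text)
    parts = _DIGIT_RUN.split(norm)
    return tuple(
        int(part) if index % 2 else part.casefold()
        for index, part in enumerate(parts)
    )


if __name__ == "__main__":
    print(sorted(["v10", "v2", "Chap 12", "Chap 9"], key=natural_key))
```

Because text and number runs always land in matching positions, the tuples never compare a string with an integer. The text segments still sort by code point here; swap in the pinned collation key for those segments when letter order matters.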

Verify with: three paraphrases, two seeds, ΔS ≤ 0.45, top-k order unchanged. See Rerankers and Retrieval Playbook.


Engineering checklist (copy-paste)

  • Policy: Document target collation per language (sv-SE, tr-TR, zh-u-co-pinyin, …) and the ICU options: normalization=on, strength=secondary/tertiary, caseLevel as needed.
  • Ingest: Normalize to NFC, record a collation_key column, and a natural_key for number-aware ordering.
  • APIs: Expose sort_by=collation_key|natural_key flags; default to the business policy.
  • DB: Use ICU collations consistently across read/write paths; avoid OS-default discrepancies.
  • Search/RAG: Keep tokenizer/casing consistent with store; if locale differs, rerank to enforce the final order.
  • Tests: Gold lists for tr I/ı/İ/i, sv Å/Ä/Ö, da Æ/Ø/Å, de ß/ss, kana order, CJK Pinyin vs stroke; include mixed “v2/v10” and wide/narrow digits.
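
The Ingest and DB items could look roughly like the following sketch, which stores a zero-padded `natural_key` next to the display text and lets the database order by it. The table layout, `PAD_WIDTH`, and both function names are assumptions for illustration. A `collation_key` column holding an ICU sort key would sit beside it, and is omitted here because it needs an ICU binding.

```python
import re
import sqlite3
import unicodedata

PAD_WIDTH = 10  # assumption: no number run is longer than 10 digits


def padded_natural_key(text: str) -> str:
    """Zero-pad every number run so plain byte order matches numeric order."""
    norm = unicodedata.normalize("NFC", text).casefold()
    return re.sub(
        r"\d+", lambda m: str(int(m.group())).zfill(PAD_WIDTH), norm
    )


def ordered_display(items):
    """Write items with their key at ingest, then read them back in key order."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items (display TEXT NOT NULL, natural_key TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO items VALUES (?, ?)",
        [(s, padded_natural_key(s)) for s in items],
    )
    rows = conn.execute(
        "SELECT display FROM items ORDER BY natural_key, display"
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(ordered_display(["v10", "v2", "v1"]))
```

Computing the key once at write time means every reader sees the same order regardless of its own OS or runtime defaults.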

Deeper diagnostics

  • Strength probe: Re-sort once at primary strength (base letters only) and once at tertiary (accents and case also count). If the order flips, lock the strength in config and rebuild keys.
  • Width probe: Convert to halfwidth/fullwidth and verify order invariance; if not invariant, enable width-insensitive collation or normalize earlier.
  • CJK policy probe: Compare zh-u-co-pinyin vs stroke; choose one and rebuild collation keys cluster-wide.
  • RAG stability probe: If top-k changes when locale toggles, push final ordering to reranker and validate with Retrieval Traceability.
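
The width probe can be approximated without ICU by checking whether the sort order survives NFKC folding, which maps halfwidth and fullwidth forms to their standard counterparts. This is a rough sketch under that assumption, and `width_probe` is a hypothetical helper name.

```python
import unicodedata


def width_probe(items, key=None):
    """Return True if sorting is unaffected by folding width variants.

    `key` is the sort key under test; it defaults to raw code point order.
    """
    key = key or (lambda s: s)
    original = sorted(items, key=key)
    folded = sorted(
        items, key=lambda s: key(unicodedata.normalize("NFKC", s))
    )
    return original == folded


if __name__ == "__main__":
    sample = ["\uff22", "A", "C"]  # fullwidth B next to ASCII A and C
    print(width_probe(sample))
    print(width_probe(sample, key=lambda s: unicodedata.normalize("NFKC", s)))
```

A `False` result means width variants are changing the order, which points to normalizing earlier or using a width-insensitive collation strength.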

When to escalate


Copy-paste prompt for the AI

You have TXTOS and the WFGY Problem Map loaded.

My locale sorting issue:
- language(s): [...]
- observed order vs expected: [...]
- store/runtime: [DB/search/lib], ICU settings?, strength?, normalization?
- symptoms: top-k flip? numeric misorder? CJK strategy drift?

Tell me:
1) which layer is failing (normalization, collation, numeric natural sort, rerank),
2) the exact WFGY pages to open,
3) the minimal steps to pin the policy and rebuild keys,
4) a repeatable test to verify stability across 2 hosts and 2 seeds.
