Locale Collation & Sorting
March 6, 2026 · View on GitHub
🧭 Quick Return to Map
You are in a sub-page of LanguageLocale.
To reorient, go back here:
- LanguageLocale — localization, regional settings, and context adaptation
- WFGY Global Fix Map — main Emergency Room, 300+ structured fixes
- WFGY Problem Map 1.0 — 16 reproducible failure modes
Think of this page as a desk within a ward.
If you need the full triage and all prescriptions, return to the Emergency Room lobby.
A focused guide to stop locale-specific sort/collation bugs that break retrieval order, deduping, join keys, and user-facing lists. Use this when lists look “random,” numbers sort as text (“v10” before “v2”), or non-English letters jump around across environments.
Open these first
- Visual map and recovery: RAG Architecture & Recovery
- End-to-end retrieval knobs: Retrieval Playbook
- Traceability and snippet schema: Retrieval Traceability
- Ordering control at re-rank stage: Rerankers
- Wrong-meaning hits despite high similarity: Embedding ≠ Semantic
- Hybrid query instability: Query Parsing Split
Core acceptance targets
- Deterministic sort across OS/runtime/DB: the same input yields identical order on two hosts.
- Numeric natural order for mixed strings (“v2” < “v10”, “Chap 9” < “Chap 12”).
- Locale-consistent letter order under the chosen policy (e.g., tr-TR “I/ı/İ/i”, sv-SE “Å/Ä/Ö”, da-DK “Æ/Ø/Å”, es-ES “ñ”, de-DE “ä/ö/ü/ß”, ja-JP kana).
- Retrieval stability: top-k set and order do not change when locale/collation varies; ΔS(question, retrieved) ≤ 0.45 on three paraphrases.
Fast triage — what’s breaking?
| Symptom | Likely cause | What to check |
|---|---|---|
| “v10” sorted before “v2” | Lexicographic sort on strings | Enable natural sort with numeric keys or precomputed tokens. |
| Turkish “I/ı/İ/i” mis-ordered or casefold joins fail | Wrong casefold/collation (not tr-TR) | Ensure locale-aware casefold and ICU tr-TR collation. |
| Swedish “ÅÄÖ” placed near A/O | Default UCA/English collation | Use sv-SE collation; verify ICU rules. |
| German “ß” vs “ss” dedupe misses | Collation strength mismatch | Set strength=secondary/tertiary consistently; precompute collation keys. |
| Japanese kana mixed order, halfwidth/fullwidth split | Width/diacritic handling off | Normalize to NFC, enable ja collation with kana handling. |
| CJK sort flips between Pinyin vs stroke | Different collation per host | Pin zh-u-co-pinyin (or stroke) everywhere; store the policy centrally. |
| RAG top-k changes across deploys | Store/retriever disagree on locale | Lock analyzer + collation; move ordering to reranker layer. |
Fix in 60 seconds
-
Pick one policy and write it down Choose the business collation per language. Examples you can adopt as BCP-47/ICU tags:
- Turkish:
tr(case and dotted I rules) - Swedish:
sv(Å/Ä/Ö order) - Danish:
da(Æ/Ø/Å order) - Spanish:
es(ñ, modern UCA rules) - German:
de(ä/ö/ü/ß handling) - Japanese:
jawith kana sensitivity - Chinese:
zh-u-co-pinyinorzh-u-co-stroke(pick one and stick with it)
- Turkish:
-
Normalize before you sort
- Apply Unicode NFC for storage and indexing.
- Apply locale-aware casefold where required (not global lowercasing for tr-TR).
-
Persist a collation key
- Generate an ICU sort key per string at write time and sort by that key.
- Keep the display text separate to avoid re-computing under mixed hosts.
-
Enable natural sort for mixed numbers
- Extract number runs and sort by
(text_prefix, numeric_value, text_suffix)or pre-tokenize with zero-padded numeric keys.
- Extract number runs and sort by
-
Move “human order” to the right layer
- Retrieval index should stay analyzer-consistent; perform user-facing sort with the pinned collation or with a reranker when semantics must decide order.
Verify with: three paraphrases, two seeds, ΔS ≤ 0.45, top-k order unchanged. See Rerankers and Retrieval Playbook.
Engineering checklist (copy-paste)
- Policy: Document target collation per language (
sv-SE,tr-TR,zh-u-co-pinyin, …) and the ICU options: normalization=on, strength=secondary/tertiary, caseLevel as needed. - Ingest: Normalize to NFC, record a collation_key column, and a natural_key for number-aware ordering.
- APIs: Expose
sort_by=collation_key|natural_keyflags; default to the business policy. - DB: Use ICU collations consistently across read/write paths; avoid OS-default discrepancies.
- Search/RAG: Keep tokenizer/casing consistent with store; if locale differs, rerank to enforce the final order.
- Tests: Gold lists for
tr I/ı/İ/i,sv Å/Ä/Ö,da Æ/Ø/Å,de ß/ss, kana order, CJK Pinyin vs stroke; include mixed “v2/v10” and wide/narrow digits.
Deeper diagnostics
- Strength probe: Re-sort once at primary strength (base letters), once at tertiary (case/diacritics). If order flips, lock the strength in config and rebuild keys.
- Width probe: Convert to halfwidth/fullwidth and verify order invariance; if not invariant, enable width-insensitive collation or normalize earlier.
- CJK policy probe: Compare
zh-u-co-pinyinvs stroke; choose one and rebuild collation keys cluster-wide. - RAG stability probe: If top-k changes when locale toggles, push final ordering to reranker and validate with Retrieval Traceability.
When to escalate
- Order still unstable after policy pinning → treat as schema drift and lock payloads with Data Contracts.
- Wrong items despite perfect sorting → it’s not collation, it’s meaning. See Embedding ≠ Semantic and re-index via the Retrieval Playbook.
- Hybrid retrieval becomes worse than single → see Query Parsing Split and pin deterministic weights.
Copy-paste prompt for the AI
You have TXTOS and the WFGY Problem Map loaded.
My locale sorting issue:
- language(s): [...]
- observed order vs expected: [...]
- store/runtime: [DB/search/lib], ICU settings?, strength?, normalization?
- symptoms: top-k flip? numeric misorder? CJK strategy drift?
Tell me:
1) which layer is failing (normalization, collation, numeric natural sort, rerank),
2) the exact WFGY pages to open,
3) the minimal steps to pin the policy and rebuild keys,
4) a repeatable test to verify stability across 2 hosts and 2 seeds.
🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
Explore More
| Layer | Page | What it’s for |
|---|---|---|
| ⭐ Proof | WFGY Recognition Map | External citations, integrations, and ecosystem proof |
| ⚙️ Engine | WFGY 1.0 | Original PDF tension engine and early logic sketch (legacy reference) |
| ⚙️ Engine | WFGY 2.0 | Production tension kernel for RAG and agent systems |
| ⚙️ Engine | WFGY 3.0 | TXT based Singularity tension engine (131 S class set) |
| 🗺️ Map | Problem Map 1.0 | Flagship 16 problem RAG failure taxonomy and fix map |
| 🗺️ Map | Problem Map 2.0 | Global Debug Card for RAG and agent pipeline diagnosis |
| 🗺️ Map | Problem Map 3.0 | Global AI troubleshooting atlas and failure pattern map |
| 🧰 App | TXT OS | .txt semantic OS with fast bootstrap |
| 🧰 App | Blah Blah Blah | Abstract and paradox Q&A built on TXT OS |
| 🧰 App | Blur Blur Blur | Text to image generation with semantic control |
| 🏡 Onboarding | Starter Village | Guided entry point for new users |
If this repository helped, starring it improves discovery so more builders can find the docs and tools.