Romanization & Transliteration

March 6, 2026 · View on GitHub

🧭 Quick Return to Map

You are in a sub-page of Language.
To reorient, go back here:

Language — multilingual processing and semantic alignment

WFGY Global Fix Map — main Emergency Room, 300+ structured fixes

WFGY Problem Map 1.0 — 16 reproducible failure modes

Think of this page as a desk within a ward.
If you need the full triage and all prescriptions, return to the Emergency Room lobby.

Make cross-script search and RAG stable when users type Latin transliterations of non-Latin names and terms. This page gives a minimal contract, store wiring, and tests so Hepburn vs Kunrei, Pinyin vs mixed tone marks, RR vs MR, ISO9 vs GOST, Buckwalter vs ISO 233, and similar systems do not break recall or flip ranking.

Open these first

Visual map and recovery → rag-architecture-and-recovery.md
End to end retrieval knobs → retrieval-playbook.md
Traceability and cite-then-explain → retrieval-traceability.md
Contract the payload → data-contracts.md
Embedding vs meaning → embedding-vs-semantic.md
Tokenizer variance → tokenizer_mismatch.md
Mixed scripts in one query → script_mixing.md
Locale normalization and width/diacritics → locale_drift.md
Names and brand aliases → proper_noun_aliases.md
End to end multilingual playbook → multilingual_guide.md

Core acceptance targets

ΔS(question, retrieved) ≤ 0.45 for native script, romanized, and accent-stripped variants
Coverage of target section ≥ 0.70 under three paraphrases and two seeds
λ remains convergent when switching romanizers inside the same language
No false merges across entities when romanized forms collide

Minimal contract

Add a small, explicit layer around romanization so behavior is auditable.

Document side fields

raw_text            # untouched source
lang                # BCP-47 primary tag
script              # ISO 15924 (Han, Cyrl, Arab, Hira, Kana, Hang, etc.)
canonical           # preferred display form for proper nouns if known
alias_tail          # pipe-joined alias list incl. romanized forms
romanizers          # systems observed for this doc: "pinyin|rr|hepburn|iso9|buckwalter"

Query side context

q_text              # user input
q_lang_guess        # detector result, nullable
q_script_guess      # detector result, nullable
q_romanizer_hint    # optional, from UI or logs, e.g. "hepburn"

Rules

Never mutate raw_text or canonical.
Romanized strings live only in alias_tail and store-specific synonym views.
Record which systems were used. Mixing systems without a record increases ΔS variance.

Store wiring

BM25 style indexes

Keep raw_text with a locale-aware analyzer.
Add a synonym graph on a separate field that contains romanized aliases.
Apply width normalization and diacritic strip only in alias field. Keep canonical untouched. See locale_drift.md.

Vector stores

Append alias_tail to the chunk text right after the first canonical mention.
Keep short, high precision alias lists. Over-expansion harms meaning.
If nearest neighbors look similar yet wrong, verify metric per embedding-vs-semantic.md.

Hybrid

When BM25 yields an exact canonical match, bias reranker features to keep it above looser transliterations.
Log ΔS and λ per candidate so you can see when a romanized neighbor outranks the native script without evidence.

System map (examples)

Language	Common systems	Notes
Chinese	Pinyin (tone marks or digits)	Keep tone-less aliases for user input, but preserve tone marks in canonical forms.
Japanese	Hepburn, Kunrei, Nihon	Handle long vowels (ō vs ou) and small tsu.
Korean	RR (Revised Romanization), MR	Names often appear without hyphens, add both.
Russian and Cyrillic	ISO 9, GOST, BGN/PCGN	Map soft sign and yo/ë variants.
Arabic	Buckwalter, ISO 233, DMG	Decide on hamza and taa marbuta conventions, keep both if present in corpus.
Hebrew	SBL, Academy rules	Deal with mater lectionis and dagesh normalization.
Hindi and Indic	ITRANS, ISO 15919	Normalize nukta forms.

Keep this list in code comments and in your ops runbook, not only in the model prompt.

Typical failure → fix

Symptom	Likely cause	Open this
Native script doc exists, romanized query misses it	no alias view built at index time	retrieval-playbook.md
Romanized neighbor outranks exact canonical snippet	reranker features not constrained	retrieval-traceability.md
Answers flip between Hepburn and Kunrei inputs	mixed systems without logging, λ not clamped	tokenizer_mismatch.md
Cyrillic ISO9 vs GOST produce different chunks	analyzer mismatch per field	locale_drift.md
Arabic Buckwalter forms merge two entities	alias collision, missing scope fence	proper_noun_aliases.md

60-second fix checklist

Wire alias view for documents that carry non-Latin scripts.
Record the system used for any generated alias.
Normalize only in alias fields for width and diacritics, never in canonical.
Bias reranker to keep exact canonical hits above loose translits.
Log ΔS and λ for native vs romanized queries and compare.

Copy snippets

Alias expansion at ingest time (no external libs)

def simple_pinyin_drop_tones(s: str) -> str:
    tone_map = str.maketrans("āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜü", "aaaaeeeeiiiioooouuuuuuuuu")
    return s.translate(tone_map)

def width_fold(s: str) -> str:
    # simple NFKC fold
    import unicodedata as ud
    return ud.normalize("NFKC", s)

def alias_pack(canonical: str, lang: str, romanizer_hint: str | None = None) -> list[str]:
    out = {canonical}
    if lang == "zh":
        out.add(simple_pinyin_drop_tones(canonical))
    # add more light rules per language as needed
    return [width_fold(x) for x in out]

Prompt fence for romanizers

You have TXTOS and the WFGY Problem Map.

When the question or snippet contains a non-Latin name or term:
1) Try native script first. If the user input looks romanized, search both native and alias views.
2) Keep the canonical form in the final answer. Cite the exact snippet that contains the canonical form.
3) If multiple romanization systems match, state which system appears in the cited text.

Eval plan

Use a code-switching set with 5 languages and 10 entities each. For every entity build 3 questions:

native script,
romanized in system A,
romanized in system B.

Run the suite with code_switching_eval.md.

Targets

top-k 10 recall across forms ≥ 0.85
ΔS(question, retrieved) ≤ 0.45 on the best hit
λ convergent across two seeds and three paraphrases

If recall is fine but ranking flips between systems, tighten reranker constraints and verify with retrieval-traceability.md.

🔗 Quick-Start Downloads (60 sec)

Tool	Link	3-Step Setup
WFGY 1.0 PDF	Engine Paper	1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>”
TXT OS (plain-text OS)	TXTOS.txt	1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly

Explore More

Layer	Page	What it’s for
⭐ Proof	WFGY Recognition Map	External citations, integrations, and ecosystem proof
⚙️ Engine	WFGY 1.0	Original PDF tension engine and early logic sketch (legacy reference)
⚙️ Engine	WFGY 2.0	Production tension kernel for RAG and agent systems
⚙️ Engine	WFGY 3.0	TXT based Singularity tension engine (131 S class set)
🗺️ Map	Problem Map 1.0	Flagship 16 problem RAG failure taxonomy and fix map
🗺️ Map	Problem Map 2.0	Global Debug Card for RAG and agent pipeline diagnosis
🗺️ Map	Problem Map 3.0	Global AI troubleshooting atlas and failure pattern map
🧰 App	TXT OS	.txt semantic OS with fast bootstrap
🧰 App	Blah Blah Blah	Abstract and paradox Q&A built on TXT OS
🧰 App	Blur Blur Blur	Text to image generation with semantic control
🏡 Onboarding	Starter Village	Guided entry point for new users

If this repository helped, starring it improves discovery so more builders can find the docs and tools.