Romanization & Transliteration

March 6, 2026


Make cross-script search and RAG stable when users type Latin transliterations of non-Latin names and terms. This page gives a minimal contract, store wiring, and tests so that competing systems (Hepburn vs Kunrei, Pinyin with vs without tone marks, RR vs MR, ISO 9 vs GOST, Buckwalter vs ISO 233, and similar pairs) do not break recall or flip ranking.



Core acceptance targets

  • ΔS(question, retrieved) ≤ 0.45 for native script, romanized, and accent-stripped variants
  • Coverage of target section ≥ 0.70 under three paraphrases and two seeds
  • λ remains convergent when switching romanizers inside the same language
  • No false merges across entities when romanized forms collide
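The targets above can be collapsed into one gate that a test job calls per query variant. This is a minimal sketch, assuming ΔS, coverage, and the λ state are measured elsewhere and passed in. The names `Measurement`, `meets_targets`, `failing_variants`, and the string label "convergent" are hypothetical and not part of any WFGY API.

```python
from dataclasses import dataclass

# Thresholds copied from the acceptance targets above.
MAX_DELTA_S = 0.45
MIN_COVERAGE = 0.70

@dataclass(frozen=True)
class Measurement:
    variant: str        # "native", "romanized", or "accent_stripped"
    delta_s: float      # ΔS(question, retrieved), measured upstream
    coverage: float     # coverage of the target section, 0..1
    lambda_state: str   # assumed label, e.g. "convergent" or "divergent"
    false_merges: int   # entities wrongly merged through colliding aliases

def meets_targets(m: Measurement) -> bool:
    # All four targets must hold for the variant to pass.
    return (
        m.delta_s <= MAX_DELTA_S
        and m.coverage >= MIN_COVERAGE
        and m.lambda_state == "convergent"
        and m.false_merges == 0
    )

def failing_variants(measurements: list[Measurement]) -> list[str]:
    # Variants that miss at least one target, in input order.
    return [m.variant for m in measurements if not meets_targets(m)]
```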

Minimal contract

Add a small, explicit layer around romanization so behavior is auditable.

Document side fields

raw_text            # untouched source
lang                # BCP-47 primary tag
script              # ISO 15924 (Hani, Cyrl, Arab, Hira, Kana, Hang, etc.)
canonical           # preferred display form for proper nouns if known
alias_tail          # pipe-joined alias list incl. romanized forms
romanizers          # systems observed for this doc: "pinyin|rr|hepburn|iso9|buckwalter"

Query side context

q_text              # user input
q_lang_guess        # detector result, nullable
q_script_guess      # detector result, nullable
q_romanizer_hint    # optional, from UI or logs, e.g. "hepburn"

Rules

  • Never mutate raw_text or canonical.
  • Romanized strings live only in alias_tail and store-specific synonym views.
  • Record which systems were used. Mixing systems without a record increases ΔS variance.
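The three rules can be enforced in the record type itself. A minimal sketch, assuming a frozen dataclass is acceptable for ingest: `DocRecord`, `with_alias`, and `to_fields` are hypothetical names, and only the field names come from the contract above.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)  # frozen: raw_text and canonical cannot be reassigned
class DocRecord:
    raw_text: str
    lang: str
    script: str
    canonical: str
    aliases: tuple[str, ...] = ()
    romanizers: tuple[str, ...] = ()

    def with_alias(self, alias: str, system: str) -> "DocRecord":
        # Every alias must carry the system that produced it.
        if not system:
            raise ValueError("alias needs the romanization system that produced it")
        if "|" in alias:
            raise ValueError("alias must not contain the pipe separator")
        aliases = self.aliases if alias in self.aliases else self.aliases + (alias,)
        systems = self.romanizers if system in self.romanizers else self.romanizers + (system,)
        # Returns a new record; the original is left untouched.
        return replace(self, aliases=aliases, romanizers=systems)

    def to_fields(self) -> dict[str, str]:
        # Serialize to the document side fields, pipe-joining the two lists.
        return {
            "raw_text": self.raw_text,
            "lang": self.lang,
            "script": self.script,
            "canonical": self.canonical,
            "alias_tail": "|".join(self.aliases),
            "romanizers": "|".join(self.romanizers),
        }
```

Usage: `DocRecord(raw, "ja", "Jpan", "東京").with_alias("Tōkyō", "hepburn")` yields a new record whose `alias_tail` is `Tōkyō` and whose `romanizers` is `hepburn`, while `canonical` stays `東京`.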

Store wiring

BM25 style indexes

  • Keep raw_text with a locale-aware analyzer.
  • Add a synonym graph on a separate field that contains romanized aliases.
  • Apply width normalization and diacritic stripping only in the alias field. Keep canonical untouched. See locale_drift.md.
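The alias-only normalization can be sketched with the standard library. This assumes the alias strings are Latin-script romanizations: stripping combining marks from native-script text is destructive (it would turn Japanese が into か, for example), which is one more reason to keep it out of raw_text and canonical. `normalize_alias` and `build_index_fields` are hypothetical names, and the output dict stands in for whatever document shape your index expects.

```python
import unicodedata

def normalize_alias(s: str) -> str:
    # 1) NFKC folds full-width and compatibility forms to standard ones.
    folded = unicodedata.normalize("NFKC", s)
    # 2) Decompose, drop nonspacing combining marks (category Mn), recompose.
    decomposed = unicodedata.normalize("NFD", folded)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)

def build_index_fields(raw_text: str, canonical: str, aliases: list[str]) -> dict:
    return {
        "raw_text": raw_text,    # untouched, analyzed by the locale-aware analyzer
        "canonical": canonical,  # untouched
        # Normalized, deduplicated aliases for the separate synonym field.
        "alias": sorted({normalize_alias(a) for a in aliases}),
    }
```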

Vector stores

  • Append alias_tail to the chunk text right after the first canonical mention.
  • Keep alias lists short and high-precision. Over-expansion dilutes the meaning of the chunk embedding.
  • If nearest neighbors look similar yet wrong, verify the distance metric per embedding-vs-semantic.md.
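Appending the alias list after the first canonical mention could look like the sketch below. The parenthesized, pipe-joined layout and the `max_aliases` cap are assumptions chosen to keep the expansion short, and `inject_aliases` is a hypothetical helper. The result is the text sent to the embedder and is stored separately, so raw_text is never mutated.

```python
def inject_aliases(chunk: str, canonical: str, aliases: list[str], max_aliases: int = 4) -> str:
    # Locate the first canonical mention; leave the chunk alone if there is none.
    pos = chunk.find(canonical)
    if pos == -1:
        return chunk
    # Deduplicate in order, skip the canonical form itself, and cap the list.
    kept = [a for a in dict.fromkeys(aliases) if a != canonical][:max_aliases]
    if not kept:
        return chunk
    end = pos + len(canonical)
    tail = " (" + "|".join(kept) + ")"
    return chunk[:end] + tail + chunk[end:]
```

Only the first mention is expanded, so later mentions do not repeat the alias list and inflate the chunk.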

Hybrid

  • When BM25 yields an exact canonical match, bias reranker features to keep it above looser transliterations.
  • Log ΔS and λ per candidate so you can see when a romanized neighbor outranks the native script without evidence.

System map (examples)

| Language | Common systems | Notes |
| --- | --- | --- |
| Chinese | Pinyin (tone marks or digits) | Keep tone-less aliases for user input, but preserve tone marks in canonical forms. |
| Japanese | Hepburn, Kunrei, Nihon-shiki | Handle long vowels (ō vs ou) and small tsu. |
| Korean | RR (Revised Romanization), MR | Names often appear without hyphens; add both forms. |
| Russian and Cyrillic | ISO 9, GOST, BGN/PCGN | Map soft sign and yo/ë variants. |
| Arabic | Buckwalter, ISO 233, DMG | Decide on hamza and taa marbuta conventions; keep both if present in corpus. |
| Hebrew | SBL, Academy rules | Deal with mater lectionis and dagesh normalization. |
| Hindi and Indic | ITRANS, ISO 15919 | Normalize nukta forms. |

Keep this list in code comments and in your ops runbook, not only in the model prompt.


Typical failure → fix

| Symptom | Likely cause | Open this |
| --- | --- | --- |
| Native script doc exists, romanized query misses it | no alias view built at index time | retrieval-playbook.md |
| Romanized neighbor outranks exact canonical snippet | reranker features not constrained | retrieval-traceability.md |
| Answers flip between Hepburn and Kunrei inputs | mixed systems without logging, λ not clamped | tokenizer_mismatch.md |
| Cyrillic ISO9 vs GOST produce different chunks | analyzer mismatch per field | locale_drift.md |
| Arabic Buckwalter forms merge two entities | alias collision, missing scope fence | proper_noun_aliases.md |
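The alias collision in the last row can be caught at ingest time by inverting the alias lists. A minimal sketch with a hypothetical helper, `find_alias_collisions`; the example uses a well-known Chinese case rather than Arabic: tone-stripped Pinyin gives "Shanxi" for both 山西 and 陕西.

```python
from collections import defaultdict

def find_alias_collisions(entity_aliases: dict[str, list[str]]) -> dict[str, set[str]]:
    # Map each alias to the set of entities that claim it.
    owners: dict[str, set[str]] = defaultdict(set)
    for entity_id, aliases in entity_aliases.items():
        for alias in aliases:
            owners[alias].add(entity_id)
    # Keep only aliases shared by more than one entity.
    return {alias: ids for alias, ids in owners.items() if len(ids) > 1}

collisions = find_alias_collisions({
    "山西": ["Shānxī", "Shanxi"],
    "陕西": ["Shǎnxī", "Shanxi"],
})
# "Shanxi" is claimed by both entities and needs a scope fence or removal.
```

Run the check on normalized aliases, since collisions usually appear only after tone marks or diacritics are stripped.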

60-second fix checklist

  1. Wire alias view for documents that carry non-Latin scripts.
  2. Record the system used for any generated alias.
  3. Normalize only in alias fields for width and diacritics, never in canonical.
  4. Bias reranker to keep exact canonical hits above loose translits.
  5. Log ΔS and λ for native vs romanized queries and compare.

Copy snippets

Alias expansion at ingest time (no external libs)

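import unicodedata as ud

# Lowercase tone-marked pinyin vowels mapped to ASCII; ü and its toned forms fold to "u".
_TONE_MAP = str.maketrans("āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜü", "aaaaeeeeiiiioooouuuuuuuuu")

def simple_pinyin_drop_tones(s: str) -> str:
    # Uppercase tone-marked letters are not in the map and pass through unchanged.
    return s.translate(_TONE_MAP)

def width_fold(s: str) -> str:
    # NFKC folds full-width and compatibility forms to their standard equivalents.
    return ud.normalize("NFKC", s)

def alias_pack(canonical: str, lang: str, romanizer_hint: str | None = None) -> list[str]:
    # romanizer_hint is reserved for per-system rules and is unused here.
    out = [canonical]
    if lang == "zh":
        # Only has an effect when canonical is tone-marked pinyin; Han text passes through.
        out.append(simple_pinyin_drop_tones(canonical))
    # add more light rules per language as needed
    # Fold width, then dedupe in a stable order with the canonical form first.
    return list(dict.fromkeys(width_fold(x) for x in out))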
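Two more light rules in the same spirit, covering the Japanese long-vowel and Korean hyphen notes from the system map. Both are hypothetical helpers that deliberately over-generate: the "ou" spelling is not always the correct kana reading of ō, so all variants go into the alias field and none replaces the canonical form. Input is assumed to be NFC.

```python
# Three common ways users type Hepburn long vowels without macrons.
_LONG_VOWEL_SCHEMES = [
    {"ō": "o", "ū": "u", "Ō": "O", "Ū": "U"},      # macron dropped
    {"ō": "ou", "ū": "uu", "Ō": "Ou", "Ū": "Uu"},  # kana-style spelling
    {"ō": "oo", "ū": "uu", "Ō": "Oo", "Ū": "Uu"},  # doubled vowel
]

def hepburn_long_vowel_variants(s: str) -> list[str]:
    # Original form first, then one variant per scheme, deduplicated in order.
    out = [s]
    for scheme in _LONG_VOWEL_SCHEMES:
        out.append(s.translate(str.maketrans(scheme)))
    return list(dict.fromkeys(out))

def hyphen_variants(s: str) -> list[str]:
    # Hyphenated, joined, and space-separated forms of a romanized name.
    return list(dict.fromkeys([s, s.replace("-", ""), s.replace("-", " ")]))
```

For example, `hepburn_long_vowel_variants("Tōkyō")` yields `Tōkyō`, `Tokyo`, `Toukyou`, and `Tookyoo`.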

Prompt fence for romanizers

You have TXTOS and the WFGY Problem Map.

When the question or snippet contains a non-Latin name or term:
1) Try native script first. If the user input looks romanized, search both native and alias views.
2) Keep the canonical form in the final answer. Cite the exact snippet that contains the canonical form.
3) If multiple romanization systems match, state which system appears in the cited text.

Eval plan

Use a code-switching set with 5 languages and 10 entities each. For every entity build 3 questions:

  1. native script,
  2. romanized in system A,
  3. romanized in system B.

Run the suite with code_switching_eval.md.

Targets

  • recall@10 across forms ≥ 0.85
  • ΔS(question, retrieved) ≤ 0.45 on the best hit
  • λ convergent across two seeds and three paraphrases
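Recall per form and top-1 flips between forms can be computed from plain ranked lists. A sketch with assumed data shapes: `runs` maps `(entity, form)` to ranked doc ids, `gold` maps each entity to its target doc id, and the form labels and function names are hypothetical. ΔS and λ are left to your own measurement code.

```python
FORMS = ("native", "system_a", "system_b")

def recall_by_form(runs: dict[tuple[str, str], list[str]],
                   gold: dict[str, str], k: int = 10) -> dict[str, float]:
    # Fraction of entities whose gold doc appears in the top k, per query form.
    hits = {f: 0 for f in FORMS}
    totals = {f: 0 for f in FORMS}
    for (entity, form), ranked in runs.items():
        totals[form] += 1
        if gold[entity] in ranked[:k]:
            hits[form] += 1
    return {f: hits[f] / totals[f] for f in FORMS if totals[f]}

def ranking_flips(runs: dict[tuple[str, str], list[str]]) -> list[str]:
    # Entities whose top-1 result differs between any two forms.
    top1: dict[str, set[str | None]] = {}
    for (entity, _form), ranked in runs.items():
        top1.setdefault(entity, set()).add(ranked[0] if ranked else None)
    return sorted(e for e, tops in top1.items() if len(tops) > 1)
```

A form whose recall falls below 0.85 points at the alias view; a non-empty flip list with good recall points at the reranker.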

If recall is fine but ranking flips between systems, tighten reranker constraints and verify with retrieval-traceability.md.


🔗 Quick-Start Downloads (60 sec)

| Tool | Link | 3-Step Setup |
| --- | --- | --- |
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” and the OS boots instantly |

Explore More

| Layer | Page | What it’s for |
| --- | --- | --- |
| ⭐ Proof | WFGY Recognition Map | External citations, integrations, and ecosystem proof |
| ⚙️ Engine | WFGY 1.0 | Original PDF tension engine and early logic sketch (legacy reference) |
| ⚙️ Engine | WFGY 2.0 | Production tension kernel for RAG and agent systems |
| ⚙️ Engine | WFGY 3.0 | TXT based Singularity tension engine (131 S class set) |
| 🗺️ Map | Problem Map 1.0 | Flagship 16 problem RAG failure taxonomy and fix map |
| 🗺️ Map | Problem Map 2.0 | Global Debug Card for RAG and agent pipeline diagnosis |
| 🗺️ Map | Problem Map 3.0 | Global AI troubleshooting atlas and failure pattern map |
| 🧰 App | TXT OS | .txt semantic OS with fast bootstrap |
| 🧰 App | Blah Blah Blah | Abstract and paradox Q&A built on TXT OS |
| 🧰 App | Blur Blur Blur | Text to image generation with semantic control |
| 🏡 Onboarding | Starter Village | Guided entry point for new users |

If this repository helped, starring it improves discovery so more builders can find the docs and tools.