ARC Language Module

May 10, 2026

GitHub Sponsors · Python 3.10+ · SQLite backed · FastAPI API · CLI operator tooling · Production track

A governed multilingual backend foundation for future AI systems.

ARC Language Module is not just a translator. It is a language knowledge engine that helps an AI system know:

  • what languages it has data for
  • what scripts, variants, pronunciation hints, and lineage relationships exist
  • what it can actually translate right now
  • what still depends on external providers or corpora
  • what was seeded, imported, changed, or left unresolved

That makes it a better fit for serious AI infrastructure than projects that only expose a translation endpoint.

At-a-glance feature fit

This table shows the repo's niche at a glance: ARC Language Module fits best when you need a governed language backend, not just a translation endpoint.

| Capability / fit | ARC Language Module | Argos Translate | LibreTranslate | Firefox Translations / Bergamot | Unicode CLDR |
| --- | --- | --- | --- | --- | --- |
| Structured language graph | Yes (core strength) | Limited | Limited | No | Yes (locale/reference focused) |
| Runtime translation | Partial / routed | Yes (core strength) | Yes (core strength) | Yes (browser focused) | No |
| Offline / local-first operation | Yes | Yes | Yes | Yes | Data/library dependent |
| API surface | Yes | Limited / wrapper dependent | Yes (core strength) | No public ops API focus | No |
| CLI / operator workflows | Yes | Yes | Limited admin focus | No | Limited tooling focus |
| Coverage / readiness matrix | Yes (core strength) | No | No | No | Partial via locale coverage |
| Provenance / governed ingestion | Yes (core strength) | No | No | No | Contributor/repository process, not runtime governance |
| Release / evidence snapshots | Yes | No | No | No | No |
| Best used for | AI language substrate, multilingual control plane, governed routing | Offline translation library | Self-hosted translation API | Private browser/page translation | Locale data and internationalization reference |
| Stronger than ARC at | Auditability, routing, graph modeling | Raw offline MT packaging | Simple translation API deployment | Seamless in-browser page translation | Breadth of locale standards/reference data |
| Stronger than others at | Governed language infrastructure | Offline MT inference | Translation API simplicity | Browser-native private translation | Standards/reference ecosystem depth |

Quick read of the table

  • Choose ARC Language Module when you need to know what languages you support, how well you support them, what data you have, what runtime paths exist, and what changed over time.
  • Choose Argos Translate when you mainly want local/offline translation models.
  • Choose LibreTranslate when you mainly want a translation API you can self-host quickly.
  • Choose Firefox Translations / Bergamot when you mainly want private, on-device browser translation.
  • Choose Unicode CLDR when you mainly want locale/reference data for i18n and formatting.

What this repo is, in plain English

Think of this as the brain + filing system + traffic controller behind a multilingual AI stack.

It gives you:

  • a language graph stored in SQLite
  • a CLI and API for operators and applications
  • seeded language knowledge you can inspect and extend
  • runtime routing that separates “we know this language” from “we can translate or speak it right now”
  • coverage, readiness, and policy surfaces so unsupported paths are visible instead of hidden
  • evidence and release snapshots so the package can explain what it contains and what it claims

If you want a one-line summary:

ARC Language Module is a production-track substrate for AI systems that need structured multilingual knowledge, honest capability tracking, and controlled routing between data and runtime providers.


What it can do today

1) Store structured language knowledge

It keeps language records in a real database rather than loose notes or hardcoded conditionals.

That includes things like:

  • language records
  • aliases and alternate names
  • scripts
  • lineage / family relationships
  • variants (dialects, registers, orthographies, historical stages)
  • pronunciation profiles
  • broad phonology hints
  • transliteration profiles
  • seeded phrase translations
  • capability/readiness records
  • governed language graph surfaces for efficient downstream model/context use
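As a rough illustration of what "a real database rather than loose notes" can look like, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names are assumptions made up for this example; the package's actual schema lives under sql/ and may differ.

```python
import sqlite3

# Hypothetical schema for illustration only; not the package's real tables.
SCHEMA = """
CREATE TABLE languages (
    language_id TEXT PRIMARY KEY,   -- e.g. an ISO 639-3 code
    name        TEXT NOT NULL,
    family      TEXT
);
CREATE TABLE language_scripts (
    language_id TEXT NOT NULL REFERENCES languages(language_id),
    script_code TEXT NOT NULL,      -- e.g. an ISO 15924 code
    PRIMARY KEY (language_id, script_code)
);
CREATE TABLE language_aliases (
    language_id TEXT NOT NULL REFERENCES languages(language_id),
    alias       TEXT NOT NULL
);
"""


def build_demo_graph() -> sqlite3.Connection:
    """Create an in-memory database with a few made-up demo rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO languages VALUES (?, ?, ?)",
        [("spa", "Spanish", "Indo-European"), ("jpn", "Japanese", "Japonic")],
    )
    conn.executemany(
        "INSERT INTO language_scripts VALUES (?, ?)",
        [("spa", "Latn"), ("jpn", "Hani"), ("jpn", "Kana")],
    )
    conn.execute("INSERT INTO language_aliases VALUES (?, ?)", ("spa", "Castilian"))
    conn.commit()
    return conn


def scripts_for(conn: sqlite3.Connection, language_id: str) -> list[str]:
    """Return the script codes attached to one language, sorted by code."""
    rows = conn.execute(
        "SELECT script_code FROM language_scripts "
        "WHERE language_id = ? ORDER BY script_code",
        (language_id,),
    ).fetchall()
    return [row[0] for row in rows]


conn = build_demo_graph()
print(scripts_for(conn, "jpn"))
```

Keeping scripts and aliases in their own tables, rather than as columns on the language row, is one way to let a language carry several of each without schema changes.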

2) Tell you what the system actually knows

It can answer practical questions such as:

  • Which languages are loaded?
  • Which scripts are attached to each language?
  • Which languages have pronunciation or phonology profiles?
  • Which surfaces are seeded versus missing?
  • Which capabilities are production, reviewed, experimental, or absent?
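Questions like these map naturally onto SQL once the knowledge is in a database. The sketch below answers "which languages have no pronunciation profile" with a LEFT JOIN. The tables and rows are hypothetical stand-ins, not the package's schema or seed data.

```python
import sqlite3

# Hypothetical tables and demo rows; the real schema may differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE languages (language_id TEXT PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE pronunciation_profiles (language_id TEXT PRIMARY KEY);
INSERT INTO languages VALUES ('spa', 'Spanish'), ('chr', 'Cherokee'), ('nav', 'Navajo');
INSERT INTO pronunciation_profiles VALUES ('spa'), ('nav');
""")


def languages_missing_pronunciation(conn: sqlite3.Connection) -> list[str]:
    """Return IDs of languages with no row in pronunciation_profiles."""
    rows = conn.execute("""
        SELECT l.language_id
        FROM languages AS l
        LEFT JOIN pronunciation_profiles AS p
               ON p.language_id = l.language_id
        WHERE p.language_id IS NULL
        ORDER BY l.language_id
    """).fetchall()
    return [row[0] for row in rows]


print(languages_missing_pronunciation(conn))
```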

3) Route translation requests honestly

This repo does not pretend that every language is fully runtime-ready.

It can route a request through:

  • seeded local phrase support
  • optional local/runtime providers
  • external provider bridges
  • explicit “not ready” or “gap” states

That makes it a language operations layer, not just a translator wrapper.
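The routing order above can be sketched as a simple fall-through: try seeded phrases, then each registered provider, and return an explicit gap if nothing answers. Everything here (RouteResult, route_translation, the provider signature, the seeded phrase) is a hypothetical illustration, not the package's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical provider signature: (text, source, target) -> translation or None.
Provider = Callable[[str, str, str], Optional[str]]


@dataclass
class RouteResult:
    status: str            # "seeded", "provider", or "gap"
    text: Optional[str]    # translated text, or None for a gap
    via: Optional[str]     # which path produced the answer


# Made-up seeded phrase table keyed by (source, target, lowercased text).
SEEDED_PHRASES = {("eng", "spa", "hello"): "hola"}


def route_translation(
    text: str, source: str, target: str, providers: dict[str, Provider]
) -> RouteResult:
    """Try seeded phrases first, then providers in order, else report a gap."""
    seeded = SEEDED_PHRASES.get((source, target, text.lower()))
    if seeded is not None:
        return RouteResult("seeded", seeded, "seed")
    for name, provider in providers.items():
        output = provider(text, source, target)
        if output is not None:
            return RouteResult("provider", output, name)
    # No path could serve the request: say so instead of guessing.
    return RouteResult("gap", None, None)


def demo_provider(text: str, source: str, target: str) -> Optional[str]:
    """Stand-in for a local or external provider; handles one pair only."""
    return f"[{target}] {text}" if (source, target) == ("eng", "fra") else None


print(route_translation("Hello", "eng", "spa", {}))
print(route_translation("Hello", "eng", "fra", {"demo": demo_provider}))
print(route_translation("Hello", "eng", "jpn", {"demo": demo_provider}))
```

Returning a "gap" result as data, rather than raising or silently falling back, is what keeps unsupported paths visible to callers.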

4) Support operator workflows

The CLI/API surfaces can be used for:

  • coverage reports
  • implementation/readiness matrices
  • policy snapshots
  • acquisition workspace planning
  • import validation
  • evidence bundle exports
  • release integrity checks

5) Ingest and govern new language data

The package supports dry-run-safe ingestion and provenance-aware updates, so new datasets can be staged and checked instead of blindly merged.
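One common way to get dry-run-safe behavior on SQLite is to stage rows inside a transaction and roll it back unless the caller explicitly asks to commit. The sketch below assumes a hypothetical language_aliases table with a source column for provenance; the function name and report shape are also invented for illustration.

```python
import sqlite3


def ingest_aliases(conn, rows, source, dry_run=True):
    """Stage (language_id, alias) rows; commit only if not dry_run and all rows pass."""
    report = {"accepted": 0, "rejected": [], "committed": False}
    known = {row[0] for row in conn.execute("SELECT language_id FROM languages")}
    try:
        for language_id, alias in rows:
            if language_id not in known:
                report["rejected"].append((language_id, alias, "unknown language"))
                continue
            # Every staged row carries its provenance label.
            conn.execute(
                "INSERT INTO language_aliases (language_id, alias, source) "
                "VALUES (?, ?, ?)",
                (language_id, alias, source),
            )
            report["accepted"] += 1
        if dry_run or report["rejected"]:
            conn.rollback()   # nothing staged survives a dry run or a bad batch
        else:
            conn.commit()
            report["committed"] = True
    except sqlite3.Error:
        conn.rollback()
        raise
    return report


# Hypothetical demo tables and rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE languages (language_id TEXT PRIMARY KEY);
CREATE TABLE language_aliases (language_id TEXT, alias TEXT, source TEXT);
INSERT INTO languages VALUES ('spa');
""")

dry = ingest_aliases(conn, [("spa", "Castilian"), ("xxx", "Nope")], "demo-import")
print(dry)
real = ingest_aliases(conn, [("spa", "Castilian")], "demo-import", dry_run=False)
print(real)
```

Defaulting dry_run to True means a caller has to opt in to writes, which matches the "staged and checked instead of blindly merged" intent.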


What it is not

To keep claims honest, this package is not:

  • a universal best-in-class machine translation model
  • a finished speech/TTS stack
  • a complete transliteration engine for every script pair
  • a giant cloud service by itself

It is strongest when used as a multilingual control layer inside a larger AI product or research stack.


Why this matters for future AI

Most language projects specialize in one narrow slice:

  • translation only
  • locale/reference data only
  • browser translation only
  • API hosting only

Future AI systems need more than that.

They need to know:

  • what language knowledge they own
  • what runtime tools are available
  • which paths are trustworthy
  • what support is partial or missing
  • how to ingest better data without losing provenance
  • how to expose all of this to both humans and software

That is the lane ARC Language Module is trying to lead:

Not the “best translator in the world,” but the best governed language substrate for future AI systems that need multilingual memory, routing, readiness, and auditability.


Language graph and parameter efficiency

ARC Language Module is not a hidden dataset dump and it does not replace real training data. Its role is different: it gives ARC systems and compatible LLM stacks a structured language graph so the model does not have to relearn every language relationship only from stored examples.

Instead of treating each language as isolated text, the module stores language identity, script, family, branch, lineage, variants, phonology hints, pronunciation hints, transliteration hints, aliases, and custom lineage overlays. That gives future model training and retrieval systems a reusable linguistic scaffold.

flowchart TD
    ARC[ARC Language Module] --> IDS[Language IDs / ISO codes]
    ARC --> SCRIPT[Scripts + orthography]
    ARC --> LINEAGE[Family / branch / lineage graph]
    ARC --> PHONO[Phonology profiles]
    ARC --> PRON[Pronunciation hints]
    ARC --> TRANS[Transliteration hints]
    ARC --> VAR[Variants / dialect notes]
    ARC --> CUSTOM[Custom language + custom lineage intake]

    LINEAGE --> IE[Indo-European]
    LINEAGE --> SEM[Afro-Asiatic / Semitic]
    LINEAGE --> IA[Indo-Aryan]
    LINEAGE --> DRAV[Dravidian]
    LINEAGE --> SIN[Sino-Tibetan]
    LINEAGE --> JAPONIC[Japonic]
    LINEAGE --> KOREANIC[Koreanic]
    LINEAGE --> TURKIC[Turkic]
    LINEAGE --> NIGER[Niger-Congo]
    LINEAGE --> ATH[Athabaskan]
    LINEAGE --> IROQ[Iroquoian]
    LINEAGE --> ALG[Algonquian]

    SCRIPT --> LATN[Latin]
    SCRIPT --> CYRL[Cyrillic]
    SCRIPT --> ARAB[Arabic / Nastaliq]
    SCRIPT --> DEVA[Devanagari]
    SCRIPT --> HANI[Han]
    SCRIPT --> JP[Kanji / Kana]
    SCRIPT --> HANG[Hangul]
    SCRIPT --> ETH[Ge'ez]
    SCRIPT --> CANS[Canadian Aboriginal Syllabics]
    SCRIPT --> CHER[Cherokee Syllabary]

    PHONO --> SOUND[Sound-shape hints]
    PRON --> SOUND
    TRANS --> BRIDGE[Cross-script bridge]
    VAR --> BRIDGE
    CUSTOM --> ARC

    ARC --> LLM[LLM / ARC-Neuron / compatible model]
    ARC --> OMNI[Omnibinary Runtime]
    ARC --> RAR[Arc-RAR bundles]
    ARC --> STREAM[ARC-StreamMemory visual modules]

    LLM --> LOWER[Lower need to store every language relation as raw memorized dataset rows]
    LOWER --> PARAM[More efficient parameter use through structured linguistic priors]

Connected ARC ecosystem roles

The language graph is designed to plug into the wider ARC stack without pretending those systems are bundled into this package:

  • ARC-Neuron / LLMBuilder can use the module as a lexical/provenance scaffold for model-growth and candidate evaluation.
  • Omnibinary Runtime can preserve language graph events, hashes, and source-spine references as device-portable binary continuity.
  • Arc-RAR can package language manifests, graph snapshots, receipts, and rollback evidence into restorable archive bundles.
  • ARC-StreamMemory can attach visual/video memory modules to language-aware receipts and AI-readable observation trails.
  • ProtoSynth / Neural Synth can later visualize language lineage, scripts, variants, and time-to-space projections as navigable cognition maps.

Mathematical intuition

A normal model without a language graph has to infer language relationships mostly from raw examples:

language behavior ≈ memorized examples + learned statistics

ARC Language Module adds a structured prior:

language behavior ≈ examples + lineage graph + script map + phonology map + transliteration map + variant map

So the model does not need to encode every language relationship separately in its weights. It can reference a reusable graph.

Simplified:

Effective language coverage = model weights × structured language graph × verified examples

Or:

C_eff = W_model × G_language × E_verified

Where:

  • W_model = the actual model weights
  • G_language = the structured language graph from ARC Language Module
  • E_verified = verified examples, corrections, and future datasets

The product is a schematic intuition, not a measured quantity. The important point is that G_language raises the usefulness of the same model weights, because related languages can share structure through lineage, script, phonology, transliteration, and variants.
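To make "related languages can share structure" concrete, here is a small sketch that indexes languages by a shared attribute and looks up neighbors. The records, field names, and the related helper are illustrative assumptions; they are not the module's seed data or data model.

```python
from collections import defaultdict

# Illustrative records only; field names are assumptions for this sketch.
LANGUAGES = {
    "spa": {"family": "Indo-European", "branch": "Romance", "script": "Latn"},
    "por": {"family": "Indo-European", "branch": "Romance", "script": "Latn"},
    "hin": {"family": "Indo-European", "branch": "Indo-Aryan", "script": "Deva"},
    "jpn": {"family": "Japonic", "branch": "Japonic", "script": "Jpan"},
}


def build_index(languages: dict, key: str) -> dict:
    """Group language codes by the value of one attribute (family, branch, script)."""
    index = defaultdict(set)
    for code, record in languages.items():
        index[record[key]].add(code)
    return index


def related(code: str, languages: dict, key: str) -> list[str]:
    """Return other languages sharing the same value for `key`, sorted by code."""
    index = build_index(languages, key)
    return sorted(index[languages[code][key]] - {code})


print(related("spa", LANGUAGES, "branch"))
print(related("spa", LANGUAGES, "family"))
print(related("jpn", LANGUAGES, "family"))
```

A retrieval or training pipeline could use a lookup like this to find which existing languages a new example might borrow structure from, instead of treating each language as isolated text.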

This changes the “parameter bar” in a practical sense: the system is not relying only on raw stored examples. It has a retrievable, auditable language scaffold that helps future ARC-style systems align new examples against known language structure.

Current scope

The current seed graph includes 35 languages with supporting surfaces for:

  • language identity
  • aliases
  • scripts
  • family / branch lineage
  • variants
  • transliteration hints
  • pronunciation hints
  • phonology profiles
  • custom language submission
  • custom lineage overlays

This does not mean the system already speaks all 35 languages at full native quality. It means ARC has a structured foundation for organizing, comparing, extending, and verifying language knowledge.

Why this matters for future datasets

External datasets are still useful, but they become more efficient when they enter through the graph.

Instead of adding raw language data blindly:

dataset → model

ARC can do:

dataset → manifest → language graph alignment → lineage/script/phonology checks → candidate training/evaluation

This protects provenance and makes future dataset ingestion more controlled.
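The alignment step in that pipeline can be sketched as a check of each manifest entry against what the graph already knows. ManifestEntry, check_manifest, and the KNOWN_SCRIPTS lookup below are hypothetical names for illustration; the package's real manifest format is not shown here.

```python
from dataclasses import dataclass

# Hypothetical view of the graph: language ID -> script codes attached to it.
KNOWN_SCRIPTS = {"spa": {"Latn"}, "hin": {"Deva"}}


@dataclass(frozen=True)
class ManifestEntry:
    language_id: str
    script: str
    source: str   # provenance label for the dataset


def check_manifest(entries, known_scripts):
    """Split entries into accepted ones and (entry, reason) issues."""
    accepted, issues = [], []
    for entry in entries:
        if not entry.source:
            issues.append((entry, "missing provenance"))
        elif entry.language_id not in known_scripts:
            issues.append((entry, "language not in graph"))
        elif entry.script not in known_scripts[entry.language_id]:
            issues.append((entry, "script not attached to language"))
        else:
            accepted.append(entry)
    return accepted, issues


manifest = [
    ManifestEntry("spa", "Latn", "demo-corpus"),
    ManifestEntry("hin", "Latn", "demo-corpus"),   # romanized text, not in graph
    ManifestEntry("xxx", "Latn", "demo-corpus"),
    ManifestEntry("spa", "Latn", ""),
]
accepted, issues = check_manifest(manifest, KNOWN_SCRIPTS)
print(len(accepted), [reason for _, reason in issues])
```

Only entries that pass would move on to candidate training or evaluation; the issues list gives an operator something concrete to review.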

Custom language growth

The module can add custom languages or project-specific symbolic languages through governed intake:

new language
→ ID / aliases
→ script / orthography
→ phonology hints
→ lineage or custom lineage
→ variants
→ examples
→ review
→ approved graph entry

That lets ARC grow its language map without pretending every new language is already proven model knowledge.
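The intake sequence above can be sketched as an ordered checklist, where a submission is reported at the first stage it has not yet satisfied. The stage names follow the list above, but the field names, the intake_status helper, and the rule that every stage needs a non-empty field are assumptions made for this example.

```python
# Each stage is paired with a hypothetical submission field that must be non-empty.
STAGES = [
    ("identity", "language_id"),
    ("script", "scripts"),
    ("phonology", "phonology_hints"),
    ("lineage", "lineage"),
    ("variants", "variants"),
    ("examples", "examples"),
    ("review", "reviewed_by"),
]


def intake_status(submission: dict) -> str:
    """Return 'pending:<stage>' for the first unmet stage, else 'approved'."""
    for stage, field in STAGES:
        if not submission.get(field):
            return f"pending:{stage}"
    return "approved"


# Made-up submissions for a project-specific language.
partial = {"language_id": "x-demo", "scripts": ["Latn"]}
complete = {
    "language_id": "x-demo",
    "scripts": ["Latn"],
    "phonology_hints": ["five-vowel system"],
    "lineage": "custom:demo-family",
    "variants": ["standard"],
    "examples": ["sample phrase"],
    "reviewed_by": "maintainer",
}
print(intake_status(partial))
print(intake_status(complete))
```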

For a focused standalone version of this section, see docs/LANGUAGE_GRAPH_AND_PARAMETER_EFFICIENCY.md.


Where it sits compared to other projects

Different projects solve different problems well.

  • Argos Translate is strong for offline open-source translation packages.
  • LibreTranslate is strong for self-hosted translation APIs.
  • Firefox Translations / Bergamot is strong for local in-browser translation.
  • Unicode CLDR is strong for locale/reference data used across software ecosystems.
  • ARC Language Module is strongest as the governed orchestration layer that sits above or beside those kinds of tools.

Qualitative comparison by role

This is a role comparison, not a latency or BLEU benchmark.

[Figure: Language infrastructure comparison chart]

Comparison table

| Project | Primary strength | Best use case | What it does not focus on |
| --- | --- | --- | --- |
| ARC Language Module | Governed multilingual substrate | AI backends that need language knowledge + routing + readiness + auditability | Being a single best MT engine |
| Argos Translate | Offline open-source translation | Local translation packages and desktop/local workflows | Broader governance / language graph surfaces |
| LibreTranslate | Self-hosted translation API | Drop-in translation endpoints and private deployment | Rich language-knowledge modeling |
| Firefox Translations / Bergamot | Private on-device browser translation | Website translation inside the browser | Operator-facing language registry and ingestion governance |
| Unicode CLDR | Locale/reference data | Internationalization, formatting, display names, locale metadata | Runtime translation orchestration |

For a more explicit comparison, see docs/COMPETITOR_COMPARISON.md.


Seed and package snapshot

Current release-integrity snapshot from the repo's single-source version path:

| Surface | Count |
| --- | --- |
| Version | 0.27.0 |
| Languages | 35 |
| Phrase translations | 385 |
| Language variants | 104 |
| Language capabilities | 245 |
| Pronunciation profiles | 35 |
| Phonology profiles | 35 |
| Transliteration profiles | 21 |
| Semantic concepts | 30 |
| Concept links | 46 |

Provider support is intentionally modeled separately from core graph truth. Runtime provider availability depends on what is installed, registered, and enabled in the target environment.


Quick start

pip install -e .

PYTHONPATH=src python -m arc_lang.cli.main init-db
PYTHONPATH=src python -m arc_lang.cli.main seed-common-languages
PYTHONPATH=src python -m arc_lang.cli.main stats
PYTHONPATH=src python -m arc_lang.cli.main coverage-report
PYTHONPATH=src python -m arc_lang.cli.main system-status
PYTHONPATH=src python -m arc_lang.cli.main build-implementation-matrix
PYTHONPATH=src python -m arc_lang.cli.main release-snapshot

Example operator questions this repo can answer

  • What languages are in the graph right now?
  • Which ones are missing transliteration or pronunciation support?
  • Which variants exist for a given language?
  • What translation/assertion data came from which source?
  • Which capabilities are seeded, reviewed, experimental, or production?
  • What changed between releases?
  • Which providers are needed for a requested runtime path?

Architecture at a glance

The project is split into clear layers:

  • core/ — config, db, models
  • services/ — language logic, ingestion, routing, policy, evidence, coverage
  • api/ — FastAPI surface grouped by concern
  • cli/ — operator entrypoints and handlers
  • config/ — seed manifests and curated inputs
  • sql/ — schema and indexes
  • docs/ — architecture, runtime, policy, onboarding, and comparison docs

Deep dives:


Release integrity

PYTHONPATH=src python -m arc_lang.cli.main release-snapshot

This emits:

  • the package version
  • pyproject/version consistency checks
  • API health/version integrity checks
  • live graph counts for release verification
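The "live graph counts" idea can be illustrated by comparing expected counts from a release record against counts read from the database. This is a sketch under assumptions: the table names, the expected values, and both helper functions are hypothetical, and it does not reproduce what release-snapshot actually emits.

```python
import sqlite3


def live_counts(conn: sqlite3.Connection, tables: list[str]) -> dict[str, int]:
    """Count rows per table. Table names must come from a fixed allow-list."""
    counts = {}
    for table in tables:
        counts[table] = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return counts


def snapshot_mismatches(expected: dict, live: dict) -> dict:
    """Return {surface: (expected, live)} for every surface whose counts differ."""
    keys = sorted(set(expected) | set(live))
    return {
        key: (expected.get(key), live.get(key))
        for key in keys
        if expected.get(key) != live.get(key)
    }


# Hypothetical demo database and expected counts.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE languages (language_id TEXT PRIMARY KEY);
CREATE TABLE phrase_translations (phrase TEXT);
INSERT INTO languages VALUES ('spa'), ('jpn');
INSERT INTO phrase_translations VALUES ('hola');
""")
expected = {"languages": 2, "phrase_translations": 3}
live = live_counts(conn, ["languages", "phrase_translations"])
mismatches = snapshot_mismatches(expected, live)
print(mismatches)
```

An empty result means the database matches the release record; anything else names the surface that drifted.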

External dependencies and optional providers

This package can connect to or sit beside external tooling, but does not bundle all of them by default.

| Provider / source | Role |
| --- | --- |
| Argos Translate | Local neural MT option |
| NLLB-style external inference | Large-scale MT bridge path |
| PersonaPlex-style speech provider | Speech boundary surface |
| Glottolog | External genealogy/reference corpus |
| ISO 639-3 | Authoritative language identifiers |
| CLDR | Script/locale/reference data |

Repository metadata

Suggested GitHub topics

Use the most specific topics first so the repo lands in the right lane:

multilingual
translation
language-detection
transliteration
pronunciation
phonology
natural-language-processing
multilingual-nlp
knowledge-graph
language-technology
fastapi
sqlite
cli
api
governance
auditability
orchestration
local-first
offline-first
artificial-intelligence

Suggested GitHub About text

Governed multilingual language-ops substrate for AI systems: language knowledge, provider routing, auditability, readiness, CLI, and API.

Short promotional line

A control layer for multilingual AI systems, not just a translator.


Support the project

If this repo is useful to you:

  • Star the repository
  • Open issues for bugs, corpus gaps, or runtime/provider edge cases
  • Send pull requests for new language data, provider integrations, or hardening work
  • Share it with people building multilingual AI, localization systems, or language tools
  • Support development on GitHub Sponsors

Release and validation status

Current production-track validation for this codebase includes:

  • 336 passing tests
  • wheel and sdist build verification
  • installed-wheel smoke validation
  • FastAPI app-load verification
  • CLI help / release snapshot verification

These checks support the repo's current positioning as a production-track language infrastructure package, while real-world deployment quality still depends on the target environment, provider integrations, telemetry, and soak testing.


License

This project is intended to ship under the MIT License. Add a root LICENSE file in the public repository so the visible GitHub repo matches the package metadata.