DialectOS Roadmap
June 3, 2026 · View on GitHub
This document outlines the planned direction for DialectOS. It is a living document — priorities may shift based on community feedback and contributor interest.
🎯 Vision
Become the most accurate open-source engine for Spanish dialect detection and adaptation. We focus on depth over breadth: 25 Spanish variants with high-quality keyword detection, register classification, and vocabulary adaptation rules.
📅 Short Term (Next 1–3 Months)
Detection Accuracy
- Reduce cross-dialect false positives — "chévere", "pana", "guagua" appear in multiple dialects; weight context and co-occurrence signals
- Grammar feature detection —
Detect voseo ("vos" + verb conjugations), leísmo/laísmo/loísmo, yeísmo distinctions✅ v0.2.0 - Confidence calibration — Better score thresholds so low-confidence input falls back to neutral rather than guessing
- Labeled benchmark corpus — Hand-label 500+ text samples per dialect for accuracy testing
Register & Style
- Finer register granularity — Split "slang" into "street", "youth", "regional casual"; split "formal" into "business", "academic", "legal"
- Register-aware translations — Pass register signal to LLM providers (GPT-4, Claude) for tone-matched output
- Gender-neutral register — Formal register should default to inclusive language (elles, latine, -e/-x endings where appropriate)
Adaptation Rules
- Grammar adaptations in manage-variants — Vosotros→ustedes, voseo verb endings, subjunctive variations
- Regional phrase replacements — "Qué lo que" → "¿Qué tal?" for formal contexts, etc.
- Number/date/currency formatting — MM/DD/YYYY for es-US, DD/MM/YYYY for most others, currency symbols
📅 Medium Term (3–6 Months)
Provider Quality
- Dialect-specific prompts for LLM providers — Feed dialect keywords + register markers into GPT-4/Claude prompts for better translations
- Provider quality benchmarks per dialect — Measure which provider (DeepL, GPT-4, LibreTranslate) handles each dialect best
- Translation memory / caching — Cache previously translated strings keyed by (text, dialect, register) to reduce API costs
Community & Data
- Community keyword contributions — Accept PRs adding verified slang/formal terms per dialect
- Dialect research citations — Link each keyword to a source (academic paper, native speaker survey, regional dictionary)
- Coverage dashboard — Visual map showing which dialects have strong vs weak keyword coverage
📅 Long Term (6–12 Months)
Beyond Text
- Speech-to-text dialect detection — Classify Spanish dialect from audio samples
- Real-time chat adaptation — WebSocket pipeline that rewrites incoming Spanish messages to the recipient's preferred dialect
Quality & Trust
- Human evaluation pipeline — Native speakers rate translation quality per dialect; feed scores back into provider selection
- Adversarial robustness — Test detection against code-switching (Spanglish, Portuñol), typos, and social media shorthand
🗳️ How to Influence the Roadmap
- Vote with reactions — Add 👍 to issues you care about
- Open an issue — Describe the feature and why it matters
- Start a discussion — Share your use case in GitHub Discussions
- Contribute code — The fastest way to get a feature shipped
✅ Recently Completed
Unreleased (post v0.3.0)
- Dialectal dictionary overhaul — Stripped 500 padded entries, added 58 high-variation concepts with genuine per-dialect variants; 833 total concepts (333 with real variation)
- Optional variants —
DictionaryEntry.variantsnow optional; pan-Hispanic entries have no variants block - New semantic fields —
medicine_health,family,financefields added alongside existing 11 fields (14 total)
v0.3.0 (2026-04-27)
-
dialectos validate— Standalone translation validation with quality gates, semantic check, and output judge - Translation corpus — Dialect-indexed JSONL storage for translations and corrections with CLI commands
- Public benchmark suite — 205 adversarial samples across 25 dialects and 13 categories with benchmark runner
- GitHub Action — Composite CI action for PR translation validation with multi-dialect matrix
- Auto-glossary from corrections — Suggestion engine +
dialectos glossary suggest/diffcommands - Automated tests across 7 packages
v0.2.0 (2026-04-27)
- Grammar detection — Voseo, leísmo/laísmo/loísmo, yeísmo with exact bigram matching
- Semantic backstop — Negation preservation, keyword overlap, structural parity checking
- Translation memory — Persistent cache with TTL, LRU eviction, and atomic writes
- Adversarial hardening — 125 samples, torture tests, public demo hardening
- LLM primary provider — Dialect-specific prompts, Anthropic API, LM Studio certification
- Certification harness — Incremental runner, adversarial certification, customer reports
v0.1.0 (2026-04-20)
- 25 Spanish dialects — Expanded from 20 to 25 (added
es-GQ,es-US,es-PH,es-BZ,es-AD) - Formal vs slang register differentiation —
detectRegister(),--registerCLI flag, MCPregisterparam - Comprehensive dialect adaptations — 200+ vocabulary rules in
manage-variantscovering all 25 dialects - Adversarial fixture corpus + CI lane
- Provider capability negotiation
- Semantic drift quality gate
- Versioned checkpoint schema with migration
- Reliability telemetry + health reports
- Provider chaos harness
- Operator policy profiles (strict/balanced/permissive)
Last updated: April 2026