wiktionary-de-parser
May 29, 2026 · View on GitHub
A Python library (3.13+) that extracts structured data from German Wiktionary XML dumps: IPA, hyphenation, inflection tables, part-of-speech tags, lemma references, rhymes, and meanings.
Features
- Streams compressed XML dumps memory-efficiently.
- Yields one structured entry per language and part of speech (a single Wiktionary page often holds several).
- Optional multiprocessing mode for full-dump throughput.
Installation
pip install wiktionary-de-parser
The project uses uv for development; consumers can install it from PyPI
with any standard pip workflow.
Usage
Locating the dump file
from wiktionary_de_parser import WiktionaryDump
# Either point at an existing local file.
dump = WiktionaryDump(
    dump_file_path="path/to/dewiktionary-latest-pages-articles-multistream.xml.bz2"
)
# Or download into a directory on first call.
dump = WiktionaryDump(dump_dir_path="dumps/")
dump.download_dump()
Parsing entries (serial)
from wiktionary_de_parser import WiktionaryParser
parser = WiktionaryParser()
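for page in dump.pages():
    # Skip redirects and pages without wikitext.
    if page.redirect_to or not page.wikitext:
        continue
    # A single page can hold several entries (one per language and part of speech).
    for entry in parser.entries(page):
        parsed = parser.parse(entry)
        if parsed.page_name == "Abend":
            print(parsed)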
Parsing entries (parallel)
For full-dump runs, use iter_parsed. XML iteration stays in the main
process while parsing is sharded across a worker pool.
for parsed in dump.iter_parsed(workers=15):
    ...  # ParsedEntry instances yielded across all workers
workers defaults to os.cpu_count() - 1. Pass workers=1 to skip
multiprocessing entirely (useful with pdb).
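The worker-count behaviour described above can be illustrated with a small standalone sketch. This is not the library's implementation: `resolve_workers` and `parse_all` are hypothetical names, and clamping the default to at least 1 is an assumption of this example.

```python
import os
from multiprocessing import Pool
from typing import Callable, Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def resolve_workers(workers: Optional[int] = None) -> int:
    """Return an explicit worker count, defaulting to cpu_count() - 1."""
    if workers is not None:
        return max(1, workers)
    # os.cpu_count() may return None; the floor of 1 is an assumption of this sketch.
    return max(1, (os.cpu_count() or 2) - 1)


def parse_all(
    items: Iterable[T],
    parse_fn: Callable[[T], R],
    workers: Optional[int] = None,
) -> Iterator[R]:
    """Apply parse_fn to every item, serially or through a process pool."""
    n = resolve_workers(workers)
    if n == 1:
        # Serial path: no subprocesses, so breakpoints and pdb behave normally.
        for item in items:
            yield parse_fn(item)
        return
    # Parallel path: the iterable is consumed in the main process, while
    # parse_fn runs in the pool (it must be picklable, i.e. defined at top level).
    with Pool(n) as pool:
        yield from pool.imap(parse_fn, items, chunksize=64)
```

With `workers=1` the generator never creates a pool, which is why that setting is convenient for debugging.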
Output schema
ParsedEntry(
    page_name="Abend",
    page_id=2742,
    entry_index=0,
    language="Deutsch",
    language_code="de",
    lemma="Abend",
    reference=None,  # LemmaReference if the page is an inflected/variant form
    pos=[PosTag(pos="Substantiv", subtypes=())],
    inflection={
        "gender": "m",
        "nominative_singular": "Abend",
        "nominative_plural": "Abende",
        "genitive_singular": "Abends",
        "genitive_plural": "Abende",
        "dative_singular": "Abend",
        "dative_plural": "Abenden",
        "accusative_singular": "Abend",
        "accusative_plural": "Abende",
    },
    ipa=["ˈaːbn̩t", "ˈaːbm̩t"],
    hyphenation=["Abend"],
    rhymes=["aːbn̩t"],
    meanings=[Meaning(text="…", tags=["Astronomie"], raw_tags=[])],
)
All result containers are @dataclass(slots=True). The full schema
lives in wiktionary_de_parser/models.py.
Inflection keys
Inflection-table parameter names are translated token by token into
lowercase English, joined with underscores: "Nominativ Singular" →
"nominative_singular", "Präsens_er, sie, es" → "present_3sg". Unknown
tokens are kept verbatim (lowercased).
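To make the translation rule concrete, here is a minimal sketch of how such a key could be derived. It is an illustration only: `TOKEN_MAP`, `PHRASE_MAP`, and `translate_key` are hypothetical names, the tables cover just a handful of tokens, and the library's actual mapping may work differently.

```python
import re

# Hypothetical, deliberately tiny translation tables.
PHRASE_MAP = {
    "er, sie, es": "3sg",  # multi-word pronoun group
}
TOKEN_MAP = {
    "nominativ": "nominative",
    "genitiv": "genitive",
    "dativ": "dative",
    "akkusativ": "accusative",
    "präsens": "present",
}


def translate_key(raw: str) -> str:
    """Translate a German inflection parameter name into an English key."""
    text = raw.strip().lower()
    # Replace multi-word phrases first so they survive tokenization.
    for phrase, replacement in PHRASE_MAP.items():
        text = text.replace(phrase, replacement)
    # Split on whitespace and underscores, dropping empty tokens.
    tokens = [t for t in re.split(r"[\s_]+", text) if t]
    # Unknown tokens fall through unchanged (already lowercased).
    return "_".join(TOKEN_MAP.get(t, t) for t in tokens)
```

Tokens without a table entry, such as "singular" here, pass through in lowercase, which matches the fallback behaviour described above.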
Lemma references
If the entry is an inflected form or alternative spelling, lemma
holds the canonical target and reference records the type:
# "gehörte" → "gehören"
parsed.lemma == "gehören"
parsed.reference == LemmaReference(target="gehören", type=ReferenceType.INFLECTED)
# "Geografie" → "Geographie"
parsed.reference == LemmaReference(target="Geographie", type=ReferenceType.VARIANT)
Development
uv sync # install dependencies
uv run pytest # run the test suite
uv run ruff format
uv run ruff check
License
MIT © Gregor Weichbrodt