GBIF Name Parser

May 7, 2026 · View on GitHub

A library and command-line tool that parses scientific names — including the authorship, rank, hybrid markers and nomenclatural notes — into a structured ParsedName model.

Modules

ModulePurpose
name-parser-apiPure model + interface module: ParsedName, Authorship, Rank, NomCode, NameType, the NameParser interface, plus formatter / Unicode utilities. Depend on this if you only need the data model.
name-parserThe parser implementation. Single public entry point: org.gbif.nameparser.NameParserImpl.
name-parser-cliCommand-line tools (parse, compare, benchmark) wrapping the parser, packaged as an executable shaded jar.

Build everything with mvn install from the repo root.

Library use

<dependency>
  <groupId>org.gbif</groupId>
  <artifactId>name-parser</artifactId>
  <version>4.0.0-SNAPSHOT</version>
</dependency>
NameParser parser = new NameParserImpl();
ParsedName pn = parser.parse("Vulpes vulpes silaceus Miller, 1907", null, null, null);

Command-line interface

After mvn install, the executable jar is at name-parser-cli/target/name-parser-cli-<version>-shaded.jar.

java -jar name-parser-cli-<version>-shaded.jar <command> [options]
CommandWhat it does
parseStream a text file with one name per row through the parser and write a JSONL file (one JSON object per row).
compareStream two JSONL files in lockstep, report aggregate metrics and a per-row dump of every differing parsed value.
benchmarkMeasure parser throughput against a name-per-line input file (count, total / avg / min / p50 / p95 / max).

Run <command> --help for the full per-command option list.

All commands stream their input — memory use stays flat regardless of input size, so multi-million-row inputs are fine.

Bundled sample corpora

Sample inputs ship in name-parser-cli/data/:

  • benchmark-data.txt — ~8k mixed names (hand-picked + test-assertion inputs + random Catalogue of Life rows with authorship) used for throughput benchmarking. Top up with more random names anytime via:
    python3 name-parser-cli/scripts/append-colnames-sample.py [-n 2000] [--seed 17]
    
    The script reservoir-samples col-names.tsv in a single pass and appends rows as scientificName authorship — manual edits to the benchmark file are preserved.
  • col-names.tsv — the full Catalogue of Life names dump (~6.3M rows, ~340 MB, not tracked in git — drop your own copy here)

Each command's --input defaults assume you run it from the repo root.

parse

Usage: name-parser-cli parse [options]

Options:
  --input=PATH    source file (default: data/col-names.tsv; '-' = stdin)
  --output=PATH   target file (default: <input>.<format-ext>; '-' = stdout)
  --format=FMT    output format: jsonl (default), json, csv, tsv
                  csv / tsv produce a flat ColDP Name file with header
  --quiet         suppress progress output
  -h --help       print this message and exit

Use - as the input or output path to stream from stdin / to stdout — the command is fully unix-pipe friendly. Progress messages and the final summary are written to stderr so stdout stays a clean data stream:

cat names.txt | name-parser-cli parse --input=- --output=- --format=tsv | head
xz -dc col-names.tsv.xz | name-parser-cli parse --input=- --output=- --format=jsonl > col.jsonl

Input

The input format is auto-detected from the first non-blank, non-comment line:

  • ColDP Name file (TSV or CSV) — recognised when the header row contains any ColdpTerm property names (looked up via ColdpTerm.find). Only the columns the parser interface accepts are honoured: ID, scientificName, authorship, rank, code. Other columns are read but ignored.
  • Plain text — one name per line. If a line contains a tab, only the substring before the first tab is treated as the name (so col-names.tsv is usable both as ColDP-style TSV and as bare plain text).

Lines starting with # and blank lines are skipped.

Output formats

FormatDescription
jsonl (default)One self-contained JSON object per line; consumed by compare.
jsonSingle document containing a JSON array of all rows (streamed; not held in memory).
csv / tsvFlat ColDP Name file with header row.

JSON / JSONL rows look like:

{"line":42,"id":"42","input":"Felis catus","parsed":{ ...full ParsedName... }}
{"line":99,"id":"99","input":"Iridoviridae","error":{"type":"VIRUS","message":"..."}}

The id field is populated from the ColDP ID column when present; otherwise it is omitted.

ColDP CSV/TSV column mapping

Every structural ParsedName field maps to a ColDP column. Where the ColDP Name entity lacks a column but the NameUsage entity defines one, that NameUsage term is used (nameStatus, namePhrase, namePublishedInPage, provisional, extinct). Parser-only fields without a ColDP equivalent are written into custom columns prefixed with np: — strict ColDP readers ignore unknown columns, so the file stays valid ColDP.

Multi-value rules: author lists join with | (the ColDP convention); notho parts join with ,.

ParsedName fieldColDP column
id (from input)ID (falls back to verbatim scientificName when absent)
canonicalNameWithoutAuthorship() (Candidatus prefixed when applicable)scientificName
authorshipComplete()authorship
rank, coderank, code (lower-cased)
nomenclaturalNote (or manuscript flag)nameStatus
uninomial, genus, infragenericEpithet, specificEpithet, infraspecificEpithet, cultivarEpithetsame column names
notho (every flagged part, comma-joined)notho
originalSpellingoriginalSpelling
combinationAuthorship.{authors,exAuthors,year}combinationAuthorship, combinationExAuthorship, combinationAuthorshipYear (authors joined with |)
basionymAuthorship.{authors,exAuthors,year}basionymAuthorship, basionymExAuthorship, basionymAuthorshipYear (authors joined with |)
publishedIn (free text)namePublishedInPage
extinctextinct
phrasenamePhrase
doubtfulprovisional
type (when not SCIENTIFIC)np:type
sanctioningAuthornp:sanctioningAuthor
taxonomicNote (sensu)np:taxonomicNote
unparsednp:unparsed
warnings (joined with |)np:warnings
(parser failure message)np:error

Unparsable rows are still written: ID, scientificName (the verbatim input) and the np:type / np:error columns are populated.

compare

Usage: name-parser-cli compare [options] <a.jsonl> <b.jsonl> [diffs.txt]

Options:
  --a=PATH              first JSONL file (alt. to first positional arg)
  --b=PATH              second JSONL file (alt. to second positional arg)
  --output=PATH         write per-row diffs here (default: stdout)
  --ignore-whitespace   strip whitespace from string leaves before compare
  --max-diffs=N         cap per-row diff dump at N rows (default: 100)
  -h --help             print this message and exit

Both inputs are expected to come from the same source file (matching line numbers, same row order). The summary reports rows compared / identical / differing, status transitions (PARSED→ERROR, ERROR→PARSED, …) and the top differing field paths. Whitespace inside parsed string values is significant by default — pass --ignore-whitespace to suppress whitespace-only differences in parsed values (the JSON formatting itself is ignored either way).

benchmark

Usage: name-parser-cli benchmark [options]

Options:
  --input=PATH    source file (default: data/benchmark-data.txt)
  --warmup        do an extra untimed pass over the input first to warm the JIT
  -h --help       print this message and exit

Pure throughput measurement — every input row is parsed and timed. JIT warmup is opt-in via --warmup, in which case the input is streamed through the parser once without timing before the timed pass; on subsequent runs the HotSpot-warmed numbers tend to be ~10× lower. Nothing is written to disk; the report goes to stdout.

License

Apache 2.0.