jschardet performance

June 25, 2026

Benchmarked on 2026-05-26 against 2517 test files from the chardet test corpus. Methodology mirrors chardet/docs/rewrite_performance.md: each detector runs against the full corpus, equivalence rules (directional supersets, bidirectional groups, decoded-output equivalence) are applied identically across all detectors, and timing benchmarks take the median across 5 runs.

Three detectors are compared:

  • jschardet 3.1.4 — the previous JavaScript port
  • jschardet 4.0.0 — this package, a TypeScript ground-up port of chardet 7
  • chardet 7.4.3 — upstream Python, run via PYTHONPATH against the chardet/ submodule

Reproduce locally:

npm run benchmark:accuracy
npm run benchmark:throughput
npm run benchmark:coldstart
npm run benchmark:memory

Overall accuracy

| Detector | Correct | Accuracy |
| --- | --- | --- |
| jschardet 3.1.4 | 1057/2517 | 42.0% |
| jschardet 4.0.0 | 2496/2517 | 99.2% |
| chardet 7.4.3 | 2496/2517 | 99.2% |

jschardet 4 lifts accuracy by 57.2pp over jschardet 3 — the underlying chardet rewrite ships new bigram models, EBCDIC/DOS/Mac coverage, and magic-number plus markup-charset pipelines that v3 lacked.
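
The 57.2pp figure is a difference in percentage points, not a relative improvement. A minimal sketch of the arithmetic, using the counts from the table above (the function and variable names are illustrative, not part of the benchmark code):

```typescript
// Accuracy as a percentage, computed from raw counts.
function accuracyPct(correct: number, total: number): number {
  return (correct / total) * 100;
}

// Counts copied from the overall-accuracy table.
const v3Accuracy = accuracyPct(1057, 2517);
const v4Accuracy = accuracyPct(2496, 2517);

// Difference in percentage points (pp): a plain subtraction of the two rates.
const liftPp = v4Accuracy - v3Accuracy;

console.log(
  `${v3Accuracy.toFixed(1)}% -> ${v4Accuracy.toFixed(1)}% (+${liftPp.toFixed(1)}pp)`
); // 42.0% -> 99.2% (+57.2pp)
```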

Language detection accuracy

| Detector | Correct | Accuracy |
| --- | --- | --- |
| jschardet 3.1.4 | n/a | n/a |
| jschardet 4.0.0 | 2445/2509 | 97.4% |
| chardet 7.4.3 | 2445/2509 | 97.4% |

jschardet 3 does not return a language field. The 64 wrong-language cases under jschardet 4 are primarily confusable language pairs within the same script (Danish/Norwegian, Belarusian/Bulgarian for Cyrillic, etc.).

Throughput

Per-file detection latency over the full corpus, median of 5 in-process runs with per-file times aggregated element-wise. Chardet's upstream _run_timing_with_median isolates each pass in its own subprocess; we don't, to keep runtime reasonable.

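| Detector | Files/s | Mean | Median | p90 | p95 |
| --- | --- | --- | --- | --- | --- |
| jschardet 3.1.4 | 154 | 6.51 ms | 0.59 ms | 3.58 ms | 6.17 ms |
| jschardet 4.0.0 | 945 | 1.06 ms | 0.28 ms | 2.26 ms | 3.31 ms |
| chardet 7.4.3 | 187 | 5.36 ms | 1.89 ms | 12.77 ms | 16.22 ms |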

jschardet 4 processes about 6× as many files per second as jschardet 3 on this corpus (945 vs 154), and tail latency narrows too: p95 drops from 6.17 ms to 3.31 ms.
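
The element-wise aggregation described above can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's actual code: the function names are hypothetical, the percentile uses the nearest-rank method (the real script may interpolate), and files per second is derived as 1000 divided by the mean latency in milliseconds, which is consistent with the table but not confirmed by the source.

```typescript
// Median of a list (mean of the two middle values for even lengths).
function median(values: number[]): number {
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Nearest-rank percentile (assumption: the benchmark may use another method).
function percentile(values: number[], p: number): number {
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// runs[r][f] is the latency in ms of file f during pass r.
function summarize(runs: number[][]) {
  const fileCount = runs[0].length;
  const perFile: number[] = [];
  for (let f = 0; f < fileCount; f++) {
    // Element-wise step: one median per file, taken across the passes.
    perFile.push(median(runs.map((run) => run[f])));
  }
  const mean = perFile.reduce((sum, ms) => sum + ms, 0) / perFile.length;
  return {
    filesPerSecond: 1000 / mean, // assumed derivation, see lead-in
    mean,
    median: median(perFile),
    p90: percentile(perFile, 90),
    p95: percentile(perFile, 95),
  };
}

// Toy data: 3 passes over 4 files (not real measurements).
const summary = summarize([
  [1.0, 0.2, 4.0, 0.5],
  [1.2, 0.3, 3.0, 0.4],
  [0.8, 0.2, 9.0, 0.6],
]);
console.log(summary);
```

Taking the median per file before computing the summary statistics keeps a single slow pass (the 9.0 ms outlier in the toy data) from inflating the mean or the tail percentiles.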

Cold start

Import time and first detect() call latency in a fresh subprocess (median of 5 runs). Each measurement is isolated to avoid module-cache effects.

| Detector | Import | First detect | Total |
| --- | --- | --- | --- |
| jschardet 3.1.4 | 25.14 ms | 0.59 ms | 25.73 ms |
| jschardet 4.0.0 | 34.76 ms | 45.51 ms | 80.27 ms |
| chardet 7.4.3 | 29.81 ms | 65.08 ms | 94.90 ms |

jschardet 4 trades a heavier first-call cost for a lighter steady state: the bigram models ship zlib-compressed and decompress lazily on the first detect() call (see docs/model-compression.md), so cold-start latency is ~3× higher than jschardet 3 but every subsequent call is faster (see throughput above). jschardet 3 has no models to decompress, which is why its first detect is essentially free.
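
The first-call cost follows from the usual lazy-initialization pattern. A minimal sketch, assuming a hypothetical loader: `getModels` and `detectSketch` are illustrative names, and filling a buffer stands in for the zlib inflate step that the real package performs.

```typescript
// Wrap an expensive initializer so it runs at most once, on first use.
function lazy<T>(init: () => T): () => T {
  let cached: { value: T } | undefined;
  return () => {
    if (cached === undefined) {
      cached = { value: init() };
    }
    return cached.value;
  };
}

// Hypothetical stand-in for decompressing the models. We only count
// how often the expensive step runs.
let inflateCalls = 0;
const getModels = lazy(() => {
  inflateCalls++;
  return new Uint8Array(65536);
});

// Hypothetical detect(): the first call pays for loading, later calls reuse the cache.
function detectSketch(_input: Uint8Array): string {
  const models = getModels();
  return models.length > 0 ? "models ready" : "no models";
}

detectSketch(new Uint8Array([0x68, 0x69]));
detectSketch(new Uint8Array([0x68, 0x69]));
console.log(inflateCalls); // 1: the initializer ran only on the first call
```

Importing the module stays cheap because nothing is inflated until a caller actually needs the models, which matches the split between the Import and First detect columns.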

Memory

Peak resident-set size (RSS high-water mark since process start, read from getrusage(RUSAGE_SELF).ru_maxrss: process.resourceUsage().maxRSS in Node, resource.getrusage().ru_maxrss in Python) is sampled in a fresh subprocess at three points: before the detector is imported, after import, and after detect() has been called on every file in the corpus (median of 5 runs). Peak RSS is the apples-to-apples cross-language metric: V8 has no tracemalloc equivalent, and reading the same syscall on both sides keeps the numbers comparable.

| Detector | Baseline RSS | Import delta | Peak delta | Final RSS |
| --- | --- | --- | --- | --- |
| jschardet 3.1.4 | 106.7 MiB | 2.2 MiB | 751.4 MiB | 858.1 MiB |
| jschardet 4.0.0 | 106.6 MiB | 7.1 MiB | 84.5 MiB | 191.1 MiB |
| chardet 7.4.3 | 59.5 MiB | 4.5 MiB | 50.7 MiB | 110.2 MiB |

jschardet 4's peak RSS growth above baseline is ~9× lower than jschardet 3's (84.5 MiB vs 751.4 MiB). The chardet rewrite's dense bigram model format (one 64 KiB lookup table per language, loaded once and shared across calls) replaces the per-call sparse-map allocations that drive v3's high-water mark.
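
To make the dense format concrete: 256 × 256 possible byte pairs at one byte each is exactly 64 KiB. The sketch below assumes that layout and a simple sum-of-weights score. Both are illustrative guesses, since the actual table layout and scoring are not described here, and `buildTable` and `scoreBigrams` are hypothetical names.

```typescript
// One dense table per language: 256 * 256 one-byte weights = 65,536 bytes = 64 KiB.
const TABLE_SIZE = 256 * 256;

// Build a dense table from sparse (first byte, second byte, weight) triples.
function buildTable(entries: Array<[number, number, number]>): Uint8Array {
  const table = new Uint8Array(TABLE_SIZE);
  for (const [first, second, weight] of entries) {
    table[(first << 8) | second] = weight;
  }
  return table;
}

// Score input by summing the weight of every adjacent byte pair.
// The table is only read, so one instance can be shared by every call.
function scoreBigrams(input: Uint8Array, table: Uint8Array): number {
  let score = 0;
  for (let i = 0; i + 1 < input.length; i++) {
    score += table[(input[i] << 8) | input[i + 1]];
  }
  return score;
}

// Toy model: the pairs "th" and "he" carry weight in this made-up language.
const toyTable = buildTable([
  [0x74, 0x68, 200], // "th"
  [0x68, 0x65, 150], // "he"
]);
const sampleBytes = new Uint8Array([0x74, 0x68, 0x65]); // "the"

console.log(toyTable.byteLength, scoreBigrams(sampleBytes, toyTable)); // 65536 350
```

Scoring allocates nothing per call, in contrast to building a sparse map for every input.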

The baseline gap between Node (~107 MiB) and Python (~60 MiB) is the interpreter's own resident footprint plus the corpus bytes — both workers pre-load the full corpus into memory before measuring baseline, so the corpus shows up there rather than under the detector.
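
The table columns can be derived from the three samples as sketched below. `process.resourceUsage().maxRSS` is a real Node API that reports kilobytes. Everything else is illustrative: the `RssSamples` shape, the helper names, and the toy values are assumptions, and the peak delta is taken relative to baseline because that matches the table (baseline plus peak delta equals final RSS).

```typescript
// Peak RSS in KiB, or undefined when not running under Node.
function peakRssKiB(): number | undefined {
  const proc = (globalThis as unknown as {
    process?: { resourceUsage?: () => { maxRSS: number } };
  }).process;
  return proc?.resourceUsage?.().maxRSS;
}

// The three sampling points, in KiB (hypothetical shape).
interface RssSamples {
  baseline: number; // before the detector is imported
  afterImport: number; // after import
  afterCorpus: number; // after detect() has run on every file
}

// Convert the raw samples into the four table columns, in MiB.
function toReport(samples: RssSamples) {
  const mib = (kib: number) => kib / 1024;
  return {
    baselineMiB: mib(samples.baseline),
    importDeltaMiB: mib(samples.afterImport - samples.baseline),
    peakDeltaMiB: mib(samples.afterCorpus - samples.baseline),
    finalMiB: mib(samples.afterCorpus),
  };
}

// Toy values, not real measurements.
const report = toReport({ baseline: 102400, afterImport: 107520, afterCorpus: 184320 });
console.log(report); // baseline 100, import delta 5, peak delta 80, final 180 (MiB)
console.log(peakRssKiB()); // current high-water mark under Node
```

Because maxRSS is a high-water mark that never decreases, each measurement needs its own fresh process, otherwise an earlier detector's peak would leak into the next reading.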

Per-encoding accuracy

| Encoding | N | jschardet 3.1.4 | jschardet 4.0.0 | chardet 7.4.3 |
| --- | --- | --- | --- | --- |
| (binary) | 8 | 7/8 (87.5%) | 8/8 (100.0%) | 8/8 (100.0%) |
| ascii | 18 | 17/18 (94.4%) | 18/18 (100.0%) | 18/18 (100.0%) |
| big5 | 29 | 29/29 (100.0%) | 29/29 (100.0%) | 29/29 (100.0%) |
| cp037 | 28 | 0/28 (0.0%) | 28/28 (100.0%) | 28/28 (100.0%) |
| cp1006 | 3 | 0/3 (0.0%) | 3/3 (100.0%) | 3/3 (100.0%) |
| cp1026 | 3 | 0/3 (0.0%) | 3/3 (100.0%) | 3/3 (100.0%) |
| cp1125 | 3 | 0/3 (0.0%) | 3/3 (100.0%) | 3/3 (100.0%) |
| cp273 | 3 | 0/3 (0.0%) | 3/3 (100.0%) | 3/3 (100.0%) |
| cp424 | 4 | 0/4 (0.0%) | 4/4 (100.0%) | 4/4 (100.0%) |
| cp437 | 27 | 0/27 (0.0%) | 26/27 (96.3%) | 26/27 (96.3%) |
| cp500 | 24 | 0/24 (0.0%) | 23/24 (95.8%) | 23/24 (95.8%) |
| cp720 | 6 | 0/6 (0.0%) | 6/6 (100.0%) | 6/6 (100.0%) |
| cp737 | 1 | 0/1 (0.0%) | 1/1 (100.0%) | 1/1 (100.0%) |
| cp775 | 10 | 0/10 (0.0%) | 10/10 (100.0%) | 10/10 (100.0%) |
| cp850 | 37 | 0/37 (0.0%) | 34/37 (91.9%) | 34/37 (91.9%) |
| cp852 | 24 | 0/24 (0.0%) | 24/24 (100.0%) | 24/24 (100.0%) |
| cp855 | 39 | 39/39 (100.0%) | 39/39 (100.0%) | 39/39 (100.0%) |
| cp856 | 3 | 0/3 (0.0%) | 3/3 (100.0%) | 3/3 (100.0%) |
| cp857 | 4 | 0/4 (0.0%) | 4/4 (100.0%) | 4/4 (100.0%) |
| cp858 | 33 | 0/33 (0.0%) | 31/33 (93.9%) | 31/33 (93.9%) |
| cp860 | 3 | 0/3 (0.0%) | 3/3 (100.0%) | 3/3 (100.0%) |
| cp861 | 3 | 0/3 (0.0%) | 3/3 (100.0%) | 3/3 (100.0%) |
| cp862 | 3 | 0/3 (0.0%) | 3/3 (100.0%) | 3/3 (100.0%) |
| cp863 | 3 | 0/3 (0.0%) | 3/3 (100.0%) | 3/3 (100.0%) |
| cp864 | 1 | 0/1 (0.0%) | 1/1 (100.0%) | 1/1 (100.0%) |
| cp865 | 4 | 0/4 (0.0%) | 4/4 (100.0%) | 4/4 (100.0%) |
| cp866 | 37 | 37/37 (100.0%) | 37/37 (100.0%) | 37/37 (100.0%) |
| cp869 | 4 | 0/4 (0.0%) | 4/4 (100.0%) | 4/4 (100.0%) |
| cp874 | 2 | 0/2 (0.0%) | 0/2 (0.0%) | 0/2 (0.0%) |
| cp875 | 3 | 0/3 (0.0%) | 3/3 (100.0%) | 3/3 (100.0%) |
| cp932 | 5 | 0/5 (0.0%) | 4/5 (80.0%) | 4/5 (80.0%) |
| cp949 | 1 | 0/1 (0.0%) | 1/1 (100.0%) | 1/1 (100.0%) |
| euc-jp | 32 | 32/32 (100.0%) | 32/32 (100.0%) | 32/32 (100.0%) |
| euc-kr | 33 | 33/33 (100.0%) | 33/33 (100.0%) | 33/33 (100.0%) |
| gb18030 | 4 | 4/4 (100.0%) | 4/4 (100.0%) | 4/4 (100.0%) |
| gb2312 | 24 | 24/24 (100.0%) | 23/24 (95.8%) | 23/24 (95.8%) |
| hp-roman8 | 42 | 12/42 (28.6%) | 42/42 (100.0%) | 42/42 (100.0%) |
| hz-gb-2312 | 2 | 2/2 (100.0%) | 2/2 (100.0%) | 2/2 (100.0%) |
| iso-2022-jp | 3 | 3/3 (100.0%) | 3/3 (100.0%) | 3/3 (100.0%) |
| iso-2022-jp-2004 | 3 | 3/3 (100.0%) | 3/3 (100.0%) | 3/3 (100.0%) |
| iso-2022-jp-ext | 1 | 1/1 (100.0%) | 1/1 (100.0%) | 1/1 (100.0%) |
| iso-2022-kr | 5 | 5/5 (100.0%) | 5/5 (100.0%) | 5/5 (100.0%) |
| iso-8859-1 | 34 | 22/34 (64.7%) | 34/34 (100.0%) | 34/34 (100.0%) |
| iso-8859-10 | 6 | 3/6 (50.0%) | 6/6 (100.0%) | 6/6 (100.0%) |
| iso-8859-13 | 11 | 1/11 (9.1%) | 11/11 (100.0%) | 11/11 (100.0%) |
| iso-8859-14 | 10 | 7/10 (70.0%) | 10/10 (100.0%) | 10/10 (100.0%) |
| iso-8859-15 | 30 | 18/30 (60.0%) | 29/30 (96.7%) | 29/30 (96.7%) |
| iso-8859-16 | 18 | 0/18 (0.0%) | 16/18 (88.9%) | 16/18 (88.9%) |
| iso-8859-2 | 46 | 15/46 (32.6%) | 46/46 (100.0%) | 46/46 (100.0%) |
| iso-8859-3 | 11 | 0/11 (0.0%) | 11/11 (100.0%) | 11/11 (100.0%) |
| iso-8859-4 | 7 | 0/7 (0.0%) | 7/7 (100.0%) | 7/7 (100.0%) |
| iso-8859-5 | 51 | 51/51 (100.0%) | 51/51 (100.0%) | 51/51 (100.0%) |
| iso-8859-6 | 9 | 0/9 (0.0%) | 9/9 (100.0%) | 9/9 (100.0%) |
| iso-8859-7 | 17 | 10/17 (58.8%) | 17/17 (100.0%) | 17/17 (100.0%) |
| iso-8859-8 | 21 | 21/21 (100.0%) | 21/21 (100.0%) | 21/21 (100.0%) |
| iso-8859-9 | 10 | 0/10 (0.0%) | 10/10 (100.0%) | 10/10 (100.0%) |
| johab | 7 | 0/7 (0.0%) | 7/7 (100.0%) | 7/7 (100.0%) |
| koi8-r | 25 | 25/25 (100.0%) | 25/25 (100.0%) | 25/25 (100.0%) |
| koi8-t | 3 | 0/3 (0.0%) | 3/3 (100.0%) | 3/3 (100.0%) |
| koi8-u | 3 | 0/3 (0.0%) | 3/3 (100.0%) | 3/3 (100.0%) |
| kz1048 | 4 | 0/4 (0.0%) | 4/4 (100.0%) | 4/4 (100.0%) |
| maccyrillic | 38 | 34/38 (89.5%) | 38/38 (100.0%) | 38/38 (100.0%) |
| macgreek | 3 | 0/3 (0.0%) | 3/3 (100.0%) | 3/3 (100.0%) |
| maciceland | 3 | 0/3 (0.0%) | 3/3 (100.0%) | 3/3 (100.0%) |
| maclatin2 | 21 | 0/21 (0.0%) | 21/21 (100.0%) | 21/21 (100.0%) |
| macroman | 41 | 0/41 (0.0%) | 39/41 (95.1%) | 39/41 (95.1%) |
| macturkish | 3 | 0/3 (0.0%) | 3/3 (100.0%) | 3/3 (100.0%) |
| ptcp154 | 4 | 0/4 (0.0%) | 4/4 (100.0%) | 4/4 (100.0%) |
| shift-jis | 3 | 3/3 (100.0%) | 3/3 (100.0%) | 3/3 (100.0%) |
| shift_jis | 31 | 31/31 (100.0%) | 31/31 (100.0%) | 31/31 (100.0%) |
| tis-620 | 8 | 8/8 (100.0%) | 8/8 (100.0%) | 8/8 (100.0%) |
| utf-16 | 152 | 152/152 (100.0%) | 152/152 (100.0%) | 152/152 (100.0%) |
| utf-16be | 149 | 0/149 (0.0%) | 148/149 (99.3%) | 148/149 (99.3%) |
| utf-16le | 149 | 0/149 (0.0%) | 148/149 (99.3%) | 148/149 (99.3%) |
| utf-32 | 150 | 150/150 (100.0%) | 150/150 (100.0%) | 150/150 (100.0%) |
| utf-32be | 149 | 0/149 (0.0%) | 149/149 (100.0%) | 149/149 (100.0%) |
| utf-32le | 149 | 0/149 (0.0%) | 149/149 (100.0%) | 149/149 (100.0%) |
| utf-7 | 143 | 0/143 (0.0%) | 143/143 (100.0%) | 143/143 (100.0%) |
| utf-8 | 170 | 170/170 (100.0%) | 169/170 (99.4%) | 169/170 (99.4%) |
| utf-8-sig | 145 | 0/145 (0.0%) | 145/145 (100.0%) | 145/145 (100.0%) |
| windows-1250 | 37 | 3/37 (8.1%) | 37/37 (100.0%) | 37/37 (100.0%) |
| windows-1251 | 62 | 58/62 (93.5%) | 62/62 (100.0%) | 62/62 (100.0%) |
| windows-1252 | 31 | 20/31 (64.5%) | 30/31 (96.8%) | 30/31 (96.8%) |
| windows-1253 | 2 | 0/2 (0.0%) | 2/2 (100.0%) | 2/2 (100.0%) |
| windows-1254 | 1 | 0/1 (0.0%) | 1/1 (100.0%) | 1/1 (100.0%) |
| windows-1255 | 7 | 7/7 (100.0%) | 6/7 (85.7%) | 6/7 (85.7%) |
| windows-1256 | 9 | 0/9 (0.0%) | 9/9 (100.0%) | 9/9 (100.0%) |
| windows-1257 | 4 | 0/4 (0.0%) | 4/4 (100.0%) | 4/4 (100.0%) |
| windows-1258 | 5 | 0/5 (0.0%) | 5/5 (100.0%) | 5/5 (100.0%) |

The full list of known per-file failures is tracked in tests/accuracy.test.ts.

Methodology notes

  • Correctness rule — exact match, alias, directional superset (e.g. windows-1252 accepted when iso-8859-1 was expected), bidirectional pair (UTF-16/32 endian variants), or decoded-output equivalence (NFKD-normalized text matches under both encodings). Same rule used by tests/accuracy.test.ts and chardet's own benchmark pipeline.
  • Raw counts — no KNOWN_FAILURES allow-list is applied. This matches chardet's rewrite_performance.md methodology so the two documents can be read side by side.
  • Median of 5 runs — applied to throughput, cold start, and memory. For throughput, the median is taken element-wise across per-file times (each file's median is taken across the 5 passes, then the standard mean/median/p90/p95 are computed over those medians).
  • Per-detector isolation — chardet 7 runs in a Python subprocess via tests/benchmark/lib/*-worker-chardet7.py. For cold start and memory, every measurement is in a fresh subprocess to avoid module-cache or RSS-high-water-mark contamination.
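
The correctness rule from the first note can be sketched as a small predicate. The rule tables here are illustrative assumptions with one or two entries each; the real tables used by tests/accuracy.test.ts are not reproduced, and the decoded-output equivalence check is left out because it needs the actual file bytes and decoders.

```typescript
// Illustrative rule tables (assumptions; the benchmark's real tables are larger).
const ALIASES: Record<string, string> = { "shift-jis": "shift_jis" };
// expected encoding -> detected encodings accepted as supersets of it
const SUPERSETS: Record<string, string[]> = { "iso-8859-1": ["windows-1252"] };
// groups whose members are accepted for one another in either direction
const BIDIRECTIONAL: string[][] = [
  ["utf-16le", "utf-16be"],
  ["utf-32le", "utf-32be"],
];

function hasKey(table: object, key: string): boolean {
  return Object.prototype.hasOwnProperty.call(table, key);
}

// Lower-case the name and resolve it through the alias table.
function canonical(name: string): string {
  const lower = name.toLowerCase();
  return hasKey(ALIASES, lower) ? ALIASES[lower] : lower;
}

function isCorrect(expected: string, detected: string): boolean {
  const want = canonical(expected);
  const got = canonical(detected);
  if (want === got) return true; // exact match or alias
  const accepted = hasKey(SUPERSETS, want) ? SUPERSETS[want] : [];
  if (accepted.indexOf(got) !== -1) return true; // directional superset
  // bidirectional group: both names must belong to the same group
  return BIDIRECTIONAL.some(
    (group) => group.indexOf(want) !== -1 && group.indexOf(got) !== -1
  );
}

console.log(isCorrect("iso-8859-1", "windows-1252")); // true
console.log(isCorrect("windows-1252", "iso-8859-1")); // false: supersets are one-way
console.log(isCorrect("utf-16le", "utf-16be")); // true
```

Keeping the superset table directional is what makes the second call fail: a detector that answers iso-8859-1 for a windows-1252 file would decode some bytes incorrectly, so it is not given credit.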

Hardware

Node v22.22.2 on linux/x64
CPU: Intel(R) Xeon(R) Processor @ 2.80GHz (4 cores)
RAM: 15.7 GiB