UTF-8 Processing Benchmarks

January 18, 2026

Benchmarks for UTF-8 text processing, including whitespace and newline splitting across different languages and hardware platforms.

Tokenization

Different scripts stress UTF-8 processing in different ways:

  • Korean: 3-byte Hangul syllables with single-byte whitespace between words - representative of tokenization workloads
  • Chinese: 3-byte CJK characters with rare whitespace - tests raw byte throughput
  • Arabic: 2-byte Arabic script with regular punctuation - good for newline splitting benchmarks
  • French: Mixed 1-2 byte Latin with high diacritic density
  • English: Mostly 1-byte ASCII baseline
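The byte widths in the list above can be checked directly. The following is a minimal Python sketch; the sample characters are picked here for illustration and are not taken from the benchmark datasets, and `utf8_width` is a hypothetical helper name.

```python
# Sample characters chosen for illustration only (not from the benchmark corpora).
samples = {
    "Korean": "\ud55c",   # Hangul syllable HAN
    "Chinese": "\u4e2d",  # CJK ideograph "middle"
    "Arabic": "\u0639",   # Arabic letter AIN
    "French": "\u00e9",   # Latin small letter e with acute
    "English": "a",       # plain ASCII letter
}


def utf8_width(ch: str) -> int:
    """Number of bytes the character occupies once encoded as UTF-8."""
    return len(ch.encode("utf-8"))


widths = {language: utf8_width(ch) for language, ch in samples.items()}
for language, width in widths.items():
    print(f"{language}: {width} byte(s)")
```

Whitespace such as U+0020 stays one byte in every script, which is why Korean text mixes 3-byte syllables with 1-byte separators.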

AMD Zen5 Turin

On AMD Zen5 Turin CPUs, StringZilla delivers the following throughput when splitting text around whitespace and newline characters across five very different languages.

Split around 25 whitespace characters:

| Library | English | Chinese | Arabic | French | Korean |
| --- | --- | --- | --- | --- | --- |
| stringzilla::utf8_whitespace_splits | 0.82 GB/s | 2.40 GB/s | 2.40 GB/s | 0.92 GB/s | 1.88 GB/s |
| stdlib::split(char::is_whitespace) | 0.77 GB/s | 1.87 GB/s | 1.04 GB/s | 0.72 GB/s | 0.98 GB/s |
| icu::WhiteSpace | 0.11 GB/s | 0.16 GB/s | 0.15 GB/s | 0.12 GB/s | 0.15 GB/s |

Split around 8 newline combinations:

| Library | English | Chinese | Arabic | French | Korean |
| --- | --- | --- | --- | --- | --- |
| stringzilla::utf8_newline_splits | 15.45 GB/s | 16.65 GB/s | 18.34 GB/s | 14.52 GB/s | 16.71 GB/s |
| stdlib::split(char::is_unicode_newline) | 1.90 GB/s | 1.93 GB/s | 1.82 GB/s | 1.78 GB/s | 1.81 GB/s |
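To make the "25 whitespace characters" workload concrete, here is a small Python reference sketch. It assumes the 25 characters are the code points carrying the Unicode White_Space property, which is the set Rust's `char::is_whitespace` tests; the source only states the count. The function name `whitespace_splits` is hypothetical and is not the StringZilla API.

```python
import re

# The 25 code points with the Unicode White_Space property.
# Assumption: this is the set behind "25 whitespace characters" above.
WHITE_SPACE = (
    [chr(c) for c in range(0x0009, 0x000D + 1)]    # TAB, LF, VT, FF, CR
    + [" ", "\u0085", "\u00A0", "\u1680"]          # SPACE, NEL, NBSP, OGHAM SPACE MARK
    + [chr(c) for c in range(0x2000, 0x200A + 1)]  # EN QUAD through HAIR SPACE
    + ["\u2028", "\u2029", "\u202F", "\u205F", "\u3000"]
)
assert len(WHITE_SPACE) == 25

# One character class that matches exactly one whitespace code point.
_SPLITTER = re.compile("[" + "".join(map(re.escape, WHITE_SPACE)) + "]")


def whitespace_splits(text: str) -> list[str]:
    """Split at every single whitespace character, keeping empty pieces."""
    return _SPLITTER.split(text)


pieces = whitespace_splits("안녕하세요\u3000세계 foo\u00A0bar")
print(pieces)
```

Python's argument-free `str.split()` is not an exact stand-in: it collapses runs of separators and also treats U+001C through U+001F as whitespace, which the White_Space property does not include.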

Apple M2 Pro

Split around 25 whitespace characters:

| Library | English | Chinese | Arabic | French | Korean |
| --- | --- | --- | --- | --- | --- |
| stringzilla::utf8_whitespace_splits | 0.57 GB/s | 2.45 GB/s | 1.18 GB/s | 0.61 GB/s | 0.92 GB/s |
| stdlib::split(char::is_whitespace) | 0.59 GB/s | 1.16 GB/s | 0.99 GB/s | 0.63 GB/s | 0.89 GB/s |
| icu::WhiteSpace | 0.10 GB/s | 0.16 GB/s | 0.14 GB/s | 0.11 GB/s | 0.14 GB/s |

Split around 8 newline combinations:

| Library | English | Chinese | Arabic | French | Korean |
| --- | --- | --- | --- | --- | --- |
| stringzilla::utf8_newline_splits | 5.69 GB/s | 6.24 GB/s | 6.58 GB/s | 6.70 GB/s | 6.29 GB/s |
| stdlib::split(char::is_unicode_newline) | 1.12 GB/s | 1.11 GB/s | 1.11 GB/s | 1.11 GB/s | 1.13 GB/s |
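The newline rows count 8 newline combinations without listing them. The sketch below is one plausible reading, stated as an assumption: seven single code points (LF, VT, FF, CR, NEL, LINE SEPARATOR, PARAGRAPH SEPARATOR) plus the two-character CR LF pair. The function name `newline_splits` is hypothetical and is not the StringZilla API.

```python
import re

# Assumed set of 8 newline forms: seven single code points plus the CR LF pair.
# The source only gives the count; this particular list is an assumption.
# CR LF is listed first so the pair is consumed as a single line break.
_NEWLINE = re.compile("\r\n|[\n\v\f\r\u0085\u2028\u2029]")


def newline_splits(text: str) -> list[str]:
    """Split on any of the assumed newline forms, keeping empty lines."""
    return _NEWLINE.split(text)


lines = newline_splits("one\r\ntwo\u2028three\nfour\rfive")
print(lines)
```

Handling CR LF as one unit is what separates a "combination" from a plain character set: a per-character splitter would report an empty line between the CR and the LF.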

Case Folding

AMD Zen5 Turin

| Language | Standard 🦀 | StringZilla 🦀 | Standard 🐍 | StringZilla 🐍 |
| --- | --- | --- | --- | --- |
| English 🇬🇧 | 482 MB/s | 7.53 GB/s (16x) | 257 MB/s | 3.14 GB/s (12x) |
| German 🇩🇪 | 432 MB/s | 2.59 GB/s (6x) | 260 MB/s | 1.81 GB/s (7x) |
| Russian 🇷🇺 | 217 MB/s | 2.20 GB/s (10x) | 470 MB/s | 1.56 GB/s (3x) |
| French 🇫🇷 | 346 MB/s | 1.84 GB/s (5x) | 274 MB/s | 1.37 GB/s (5x) |
| Greek 🇬🇷 | 220 MB/s | 1.00 GB/s (5x) | 431 MB/s | 779 MB/s (2x) |
| Armenian 🇦🇲 | 223 MB/s | 908 MB/s (4x) | 470 MB/s | 746 MB/s (2x) |
| Vietnamese 🇻🇳 | 265 MB/s | 352 MB/s (1x) | 340 MB/s | 291 MB/s (1x) |
| Arabic 🇸🇦 | 232 MB/s | 1004 MB/s (4x) | 467 MB/s | 1.80 GB/s (4x) |
| Bengali 🇧🇩 | 314 MB/s | 6.17 GB/s (20x) | 694 MB/s | 2.91 GB/s (4x) |
| Chinese 🇨🇳 | 325 MB/s | 1.21 GB/s (4x) | 697 MB/s | 886 MB/s (1x) |
| Czech 🇨🇿 | 322 MB/s | 827 MB/s (3x) | 292 MB/s | 688 MB/s (2x) |
| Dutch 🇳🇱 | 471 MB/s | 4.73 GB/s (10x) | 262 MB/s | 2.97 GB/s (11x) |
| Farsi 🇮🇷 | 235 MB/s | 858 MB/s (4x) | 475 MB/s | 1.42 GB/s (3x) |
| Georgian 🇬🇪 | 294 MB/s | 192 MB/s (1x) | 689 MB/s | 488 MB/s (1x) |
| Hebrew 🇮🇱 | 233 MB/s | 1.01 GB/s (4x) | 473 MB/s | 1.86 GB/s (4x) |
| Italian 🇮🇹 | 439 MB/s | 2.29 GB/s (5x) | 268 MB/s | 1.93 GB/s (7x) |
| Japanese 🇯🇵 | 330 MB/s | 3.51 GB/s (11x) | 726 MB/s | 2.00 GB/s (3x) |
| Korean 🇰🇷 | 314 MB/s | 861 MB/s (3x) | 623 MB/s | 2.80 GB/s (4x) |
| Lithuanian 🇱🇹 | 352 MB/s | 864 MB/s (2x) | 274 MB/s | 728 MB/s (3x) |
| Polish 🇵🇱 | 364 MB/s | 939 MB/s (3x) | 277 MB/s | 786 MB/s (3x) |
| Portuguese 🇧🇷 | 395 MB/s | 2.38 GB/s (6x) | 270 MB/s | 1.79 GB/s (7x) |
| Spanish 🇪🇸 | 414 MB/s | 2.38 GB/s (6x) | 272 MB/s | 1.80 GB/s (7x) |
| Tamil 🇮🇳 | 306 MB/s | 6.05 GB/s (20x) | 712 MB/s | 3.03 GB/s (4x) |
| Turkish 🇹🇷 | 326 MB/s | 852 MB/s (3x) | 284 MB/s | 706 MB/s (2x) |
| Ukrainian 🇺🇦 | 217 MB/s | 2.09 GB/s (10x) | 476 MB/s | 1.58 GB/s (3x) |
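For readers unfamiliar with the operation being measured, the sketch below shows what Unicode case folding does and how it differs from lowercasing. It uses Python's built-in `str.casefold()`; that this is the baseline behind the standard Python column is an assumption, since the table does not name the function. `caseless_equal` is a hypothetical helper.

```python
def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings after full Unicode case folding."""
    return a.casefold() == b.casefold()


print("Straße".lower())     # straße  (lowercasing keeps the sharp s)
print("Straße".casefold())  # strasse (folding expands it to "ss")
print(caseless_equal("STRASSE", "Straße"))  # True
```

Folding can change the number of code points in a string, so a folding routine cannot always rewrite its input in place.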

To rerun the case-folding benchmarks for all languages, in both Rust and Python:

RUSTFLAGS="-C target-cpu=native" cargo build --release --bench bench_unicode --features bench_unicode
bin=$(find target/release/deps -name 'bench_unicode-*' -executable -type f | head -1)

for f in leipzig*.txt; do
  [ -f "$f" ] || continue
  echo "=== $f ==="
  STRINGWARS_DATASET="$f" STRINGWARS_TOKENS=file STRINGWARS_FILTER="case-fold" "$bin"
  STRINGWARS_DATASET="$f" STRINGWARS_TOKENS=file STRINGWARS_FILTER="case-fold/" uv run bench_unicode.py
done

| Language | Standard 🦀 | StringZilla 🦀 | Standard 🐍 | StringZilla 🐍 |
| --- | --- | --- | --- | --- |
| Arabic 🇸🇦 | 200 MB/s | 38.55 GB/s (193x) | 3.01 GB/s | 14.78 GB/s (5x) |
| Armenian 🇦🇲 | 190 MB/s | 980 MB/s (5x) | 2.07 GB/s | 860 MB/s (0x) |
| Bengali 🇧🇩 | 300 MB/s | 28.20 GB/s (94x) | 4.51 GB/s | 21.19 GB/s (5x) |
| Chinese 🇨🇳 | 240 MB/s | 25.65 GB/s (107x) | 5.40 GB/s | 13.94 GB/s (3x) |
| Czech 🇨🇿 | 90 MB/s | 7.41 GB/s (82x) | 1.38 GB/s | 6.36 GB/s (5x) |
| Dutch 🇳🇱 | 90 MB/s | 12.61 GB/s (140x) | 860 MB/s | 7.99 GB/s (9x) |
| English 🇬🇧 | 80 MB/s | 12.79 GB/s (160x) | 770 MB/s | 5.61 GB/s (7x) |
| Farsi 🇮🇷 | 190 MB/s | 26.22 GB/s (138x) | 2.36 GB/s | 10.70 GB/s (5x) |
| French 🇫🇷 | 90 MB/s | 10.77 GB/s (120x) | 1.10 GB/s | 6.83 GB/s (6x) |
| Georgian 🇬🇪 | 190 MB/s | 1.03 GB/s (5x) | 3.20 GB/s | 620 MB/s (0x) |
| German 🇩🇪 | 80 MB/s | 10.67 GB/s (133x) | 900 MB/s | 6.08 GB/s (7x) |
| Greek 🇬🇷 | 130 MB/s | 2.57 GB/s (20x) | 1.38 GB/s | 2.48 GB/s (2x) |
| Hebrew 🇮🇱 | 190 MB/s | 34.54 GB/s (182x) | 2.92 GB/s | 15.72 GB/s (5x) |
| Italian 🇮🇹 | 80 MB/s | 12.99 GB/s (162x) | 970 MB/s | 8.87 GB/s (9x) |
| Japanese 🇯🇵 | 220 MB/s | 21.71 GB/s (99x) | 4.88 GB/s | 13.17 GB/s (3x) |
| Korean 🇰🇷 | 230 MB/s | 35.10 GB/s (153x) | 4.59 GB/s | 20.05 GB/s (4x) |
| Polish 🇵🇱 | 90 MB/s | 10.50 GB/s (117x) | 1.29 GB/s | 8.02 GB/s (6x) |
| Portuguese 🇧🇷 | 90 MB/s | 10.72 GB/s (119x) | 1.10 GB/s | 8.12 GB/s (7x) |
| Russian 🇷🇺 | 140 MB/s | 7.12 GB/s (51x) | 2.30 GB/s | 5.70 GB/s (2x) |
| Spanish 🇪🇸 | 90 MB/s | 11.62 GB/s (129x) | 1.02 GB/s | 6.33 GB/s (6x) |
| Tamil 🇮🇳 | 270 MB/s | 29.53 GB/s (109x) | 5.81 GB/s | 23.11 GB/s (4x) |
| Turkish 🇹🇷 | 90 MB/s | 8.18 GB/s (91x) | 1.49 GB/s | 5.25 GB/s (4x) |
| Ukrainian 🇺🇦 | 140 MB/s | 8.88 GB/s (63x) | 2.26 GB/s | 5.35 GB/s (2x) |
| Vietnamese 🇻🇳 | 110 MB/s | 4.25 GB/s (39x) | 1.07 GB/s | 1.12 GB/s (1x) |
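The table directly above corresponds to the `case-insensitive-find` filter used in the rerun command that follows it. As a point of reference, a naive fold-then-search baseline can be sketched in Python as below. This is an illustration of the task, not a description of how either benchmarked implementation works, and `find_caseless` is a hypothetical name.

```python
def find_caseless(haystack: str, needle: str) -> int:
    """Index of the first caseless match within the folded haystack, or -1.

    The index refers to the case-folded haystack; because folding can change
    string length, it may not line up with offsets in the original text.
    """
    return haystack.casefold().find(needle.casefold())


print(find_caseless("Die STRASSE ist lang", "straße"))  # 4
print(find_caseless("Die Straße ist lang", "STRASSE"))  # 4
print(find_caseless("abc", "xyz"))                      # -1
```

Folding the whole haystack before every search allocates and rewrites the full input, which is the cost a dedicated case-insensitive search routine would try to avoid.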

To rerun the case-insensitive search benchmarks for all languages, reusing the $bin binary built above:

for f in leipzig*.txt; do
  [ -f "$f" ] || continue
  echo "=== $f ==="
  STRINGWARS_DATASET="$f" STRINGWARS_TOKENS=words STRINGWARS_FILTER="case-insensitive-find" STRINGWARS_UNIQUE=1 "$bin"
done

See README.md for dataset information and replication instructions.