UTF-8 Tokenization & Iteration Benchmarks

June 18, 2026

Benchmarks for UTF-8 segmentation and codepoint iteration — whitespace, newline, and TR29 word splitting, UTF-8 character counting and decoding, and locating the Nth codepoint — across different languages and hardware platforms.

Tokenization

Different scripts stress UTF-8 processing in different ways:

  • Korean: 3-byte Hangul syllables with single-byte whitespace between words - representative of tokenization workloads
  • Chinese: 3-byte CJK characters with rare whitespace - tests raw byte throughput
  • Arabic: 2-byte Arabic script with regular punctuation - good for newline splitting benchmarks
  • French: Mixed 1-2 byte Latin with high diacritic density
  • English: Mostly 1-byte ASCII baseline
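To make the byte widths in the list above concrete, here is a minimal sketch that measures bytes per codepoint for a few short strings. The sample strings and the helper name `utf8_density` are illustrative assumptions, not part of the benchmark suite or its corpora.

```rust
/// Hypothetical helper: returns (byte length, codepoint count) of a UTF-8 string.
fn utf8_density(text: &str) -> (usize, usize) {
    (text.len(), text.chars().count())
}

fn main() {
    // Illustrative samples only; the benchmarks use full Leipzig corpora.
    let samples = [
        ("English", "hello world"),
        ("Arabic", "مرحبا بالعالم"),
        ("Chinese", "你好世界"),
        ("Korean", "안녕하세요 세계"),
    ];
    for (name, text) in samples {
        let (bytes, chars) = utf8_density(text);
        // The ratio approaches 1.0 for ASCII, 2.0 for Arabic, 3.0 for CJK and Hangul.
        println!(
            "{name}: {bytes} bytes / {chars} codepoints = {:.2} bytes per codepoint",
            bytes as f64 / chars as f64
        );
    }
}
```

The single-byte spaces pull the Korean and Arabic ratios slightly below 3.0 and 2.0, which is the mix of widths a tokenizer has to handle.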

Intel Xeon4 Sapphire Rapids

| Library | English | Chinese | Arabic | French | Korean |
| --- | --- | --- | --- | --- | --- |
| Split around 25 whitespace characters: | | | | | |
| stringzilla::utf8_whitespace_splits | 0.44 GB/s | 1.10 GB/s | 0.66 GB/s | 0.43 GB/s | 0.70 GB/s |
| std::split<is_whitespace> | 0.27 GB/s | 0.59 GB/s | 0.35 GB/s | 0.26 GB/s | 0.42 GB/s |
| icu::WhiteSpace | 0.05 GB/s | 0.15 GB/s | 0.10 GB/s | 0.06 GB/s | 0.20 GB/s |
| Split around 8 newline combinations: | | | | | |
| stringzilla::utf8_newline_splits | 1.90 GB/s | 1.64 GB/s | 2.39 GB/s | 1.72 GB/s | 3.18 GB/s |
| std::split<is_unicode_newline> | 0.44 GB/s | 0.75 GB/s | 0.40 GB/s | 0.39 GB/s | 0.70 GB/s |
| TR29 word segmentation: | | | | | |
| stringzilla::utf8_word_splits | 0.07 GB/s | 0.11 GB/s | 0.06 GB/s | 0.06 GB/s | 0.19 GB/s |
| unicode-segmentation::unicode_words | 0.04 GB/s | 0.05 GB/s | 0.06 GB/s | 0.04 GB/s | 0.15 GB/s |
| unicode-segmentation::split_word_bounds | 0.04 GB/s | 0.08 GB/s | 0.07 GB/s | 0.04 GB/s | 0.17 GB/s |
| icu::WordSegmenter | 0.07 GB/s | 0.01 GB/s | 0.14 GB/s | 0.08 GB/s | 0.18 GB/s |

Measured June 17, 2026 on an Intel Xeon4 Sapphire Rapids.
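For orientation, the standard-library baselines in the whitespace and newline rows can be approximated as below. This is a sketch under stated assumptions, not the benchmark's actual code: the function names are hypothetical, and the set of eight newline forms (LF, VT, FF, CR, NEL, LS, PS, plus the two-codepoint CR LF) is an assumed reading of "8 newline combinations". Rust's `char::is_whitespace` follows the Unicode White_Space property, which currently covers 25 codepoints.

```rust
/// Hypothetical baseline: split on any Unicode White_Space codepoint, dropping empty tokens.
fn whitespace_tokens(text: &str) -> Vec<&str> {
    text.split(char::is_whitespace)
        .filter(|token| !token.is_empty())
        .collect()
}

/// Assumed set of single-codepoint line terminators: LF, VT, FF, CR, NEL, LS, PS.
fn is_unicode_newline(c: char) -> bool {
    matches!(
        c,
        '\n' | '\u{0B}' | '\u{0C}' | '\r' | '\u{85}' | '\u{2028}' | '\u{2029}'
    )
}

/// Hypothetical baseline: split into lines, treating CR LF as one terminator.
fn split_lines(text: &str) -> Vec<&str> {
    let mut lines = Vec::new();
    let mut start = 0;
    let mut iter = text.char_indices().peekable();
    while let Some((i, c)) = iter.next() {
        if !is_unicode_newline(c) {
            continue;
        }
        lines.push(&text[start..i]);
        let mut end = i + c.len_utf8();
        // Consume the LF of a CR LF pair so it does not yield an empty line.
        if c == '\r' && iter.next_if(|&(_, next)| next == '\n').is_some() {
            end += 1;
        }
        start = end;
    }
    lines.push(&text[start..]);
    lines
}

fn main() {
    println!("{:?}", whitespace_tokens("안녕 세계\u{3000}ok"));
    println!("{:?}", split_lines("a\r\nb\u{2028}c\n"));
}
```

Both functions decode one codepoint at a time, which is the cost that byte-level SIMD scanning avoids.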

AMD Zen5 Turin

The table below reports throughput on AMD Zen5 Turin CPUs for splitting around whitespace and newline characters across the five language datasets.

| Library | English | Chinese | Arabic | French | Korean |
| --- | --- | --- | --- | --- | --- |
| Split around 25 whitespace characters: | | | | | |
| stringzilla::utf8_whitespace_splits | 0.82 GB/s | 2.40 GB/s | 2.40 GB/s | 0.92 GB/s | 1.88 GB/s |
| std::split<is_whitespace> | 0.77 GB/s | 1.87 GB/s | 1.04 GB/s | 0.72 GB/s | 0.98 GB/s |
| icu::WhiteSpace | 0.11 GB/s | 0.16 GB/s | 0.15 GB/s | 0.12 GB/s | 0.15 GB/s |
| Split around 8 newline combinations: | | | | | |
| stringzilla::utf8_newline_splits | 15.45 GB/s | 16.65 GB/s | 18.34 GB/s | 14.52 GB/s | 16.71 GB/s |
| std::split<is_unicode_newline> | 1.90 GB/s | 1.93 GB/s | 1.82 GB/s | 1.78 GB/s | 1.81 GB/s |

Apple M2 Pro

| Library | English | Chinese | Arabic | French | Korean |
| --- | --- | --- | --- | --- | --- |
| Split around 25 whitespace characters: | | | | | |
| stringzilla::utf8_whitespace_splits | 0.57 GB/s | 2.45 GB/s | 1.18 GB/s | 0.61 GB/s | 0.92 GB/s |
| std::split<is_whitespace> | 0.59 GB/s | 1.16 GB/s | 0.99 GB/s | 0.63 GB/s | 0.89 GB/s |
| icu::WhiteSpace | 0.10 GB/s | 0.16 GB/s | 0.14 GB/s | 0.11 GB/s | 0.14 GB/s |
| Split around 8 newline combinations: | | | | | |
| stringzilla::utf8_newline_splits | 5.69 GB/s | 6.24 GB/s | 6.58 GB/s | 6.70 GB/s | 6.29 GB/s |
| std::split<is_unicode_newline> | 1.12 GB/s | 1.11 GB/s | 1.11 GB/s | 1.11 GB/s | 1.13 GB/s |

Codepoint Operations

Counting codepoints (count_utf8) and locating the byte offset of the Nth codepoint (find_nth_utf8), measured on the full Leipzig corpora, single-threaded, in decimal GB/s. find_nth_utf8 targets the last codepoint, so every implementation scans the whole buffer. Counting is memory-bandwidth-bound, so the implementations land close together, between roughly 6.5 and 14 GB/s, with simdutf edging ahead on the multi-byte scripts (Chinese, Arabic, Korean). The decisive win is find_nth_utf8, where StringZilla is roughly 8 to 17 times faster than the standard library.
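Both operations reduce to one observation: every codepoint starts with a byte that is not a UTF-8 continuation byte (bit pattern 10xxxxxx). The scalar sketch below illustrates that idea alongside the standard-library calls named in the table. The function names `count_codepoints_scalar` and `nth_codepoint_offset_scalar` are hypothetical, and this is not how StringZilla or simdutf are implemented; SIMD libraries apply the same test to many bytes at once.

```rust
/// True for bytes that begin a codepoint, i.e. anything except 0b10xxxxxx.
fn is_leading_byte(byte: u8) -> bool {
    (byte & 0xC0) != 0x80
}

/// Hypothetical scalar counter: one codepoint per leading byte (valid UTF-8 assumed).
fn count_codepoints_scalar(bytes: &[u8]) -> usize {
    bytes.iter().filter(|&&b| is_leading_byte(b)).count()
}

/// Hypothetical scalar lookup: byte offset of the n-th codepoint (0-based), if any.
fn nth_codepoint_offset_scalar(bytes: &[u8], n: usize) -> Option<usize> {
    bytes
        .iter()
        .enumerate()
        .filter(|&(_, &b)| is_leading_byte(b))
        .nth(n)
        .map(|(offset, _)| offset)
}

fn main() {
    let text = "h\u{e9}llo 세계"; // 8 codepoints, 13 bytes
    let last = text.chars().count() - 1;

    // Standard-library equivalents, as named in the benchmark table.
    let std_count = text.chars().count();
    let std_offset = text.char_indices().nth(last).map(|(offset, _)| offset);

    assert_eq!(count_codepoints_scalar(text.as_bytes()), std_count);
    assert_eq!(nth_codepoint_offset_scalar(text.as_bytes(), last), std_offset);
    println!("codepoints: {std_count}, offset of last: {std_offset:?}");
}
```

Targeting the last codepoint, as the benchmark does, forces a scan of the entire buffer, so the measurement reflects sustained scanning speed rather than early-exit luck.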

Intel Xeon4 Sapphire Rapids

| Library | English | Chinese | Arabic | French | Korean |
| --- | --- | --- | --- | --- | --- |
| Count UTF-8 codepoints: | | | | | |
| stringzilla::count_utf8 | 7.24 GB/s | 6.93 GB/s | 7.23 GB/s | 7.23 GB/s | 13.28 GB/s |
| simdutf::count_utf8 | 7.06 GB/s | 7.67 GB/s | 8.15 GB/s | 7.08 GB/s | 14.05 GB/s |
| std::chars.count | 6.59 GB/s | 6.85 GB/s | 7.07 GB/s | 6.54 GB/s | 8.37 GB/s |
| Byte offset of the Nth codepoint: | | | | | |
| stringzilla::find_nth_utf8 | 7.14 GB/s | 7.57 GB/s | 8.78 GB/s | 7.62 GB/s | 12.14 GB/s |
| std::char_indices.nth | 0.83 GB/s | 0.97 GB/s | 0.57 GB/s | 0.67 GB/s | 0.72 GB/s |

Measured June 17, 2026 on an Intel Xeon4 Sapphire Rapids.
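The figures are in decimal GB/s, meaning 10^9 bytes per second rather than 2^30. A minimal sketch of how such a figure can be derived from a timed run follows; the helper name `decimal_gb_per_s` and the synthetic input are assumptions for illustration, and the actual benchmark harness may measure differently (for example with warm-up and repeated samples).

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical helper: decimal gigabytes (1e9 bytes) processed per second.
fn decimal_gb_per_s(bytes: usize, elapsed: Duration) -> f64 {
    bytes as f64 / 1e9 / elapsed.as_secs_f64()
}

fn main() {
    // Synthetic input standing in for a corpus file.
    let text = "안녕하세요 세계 ".repeat(1 << 20);

    let start = Instant::now();
    // black_box keeps the compiler from optimizing the measured work away.
    let count = black_box(text.chars().count());
    let elapsed = start.elapsed();

    println!(
        "{count} codepoints over {} bytes: {:.2} GB/s",
        text.len(),
        decimal_gb_per_s(text.len(), elapsed)
    );
}
```

A single timed pass like this is noisy; it only shows the unit conversion behind the tables.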

To rerun the benchmarks for all languages:

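# Build the benchmark binary with optimizations for the host CPU.
RUSTFLAGS="-C target-cpu=native" cargo build --release --bench bench_tokenization --features bench_tokenization
# Locate the compiled benchmark executable.
bin=$(find target/release/deps -name 'bench_tokenization-*' -executable -type f | head -1)

# Run the benchmark once per Leipzig corpus file in the current directory.
for f in leipzig*.txt; do
  [ -f "$f" ] || continue
  echo "=== $f ==="
  STRINGWARS_DATASET="$f" STRINGWARS_TOKENS=file STRINGWARS_FILTER="tokenize" "$bin"
done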

See README.md for dataset information and replication instructions.