Case Folding & Normalization Benchmarks

June 18, 2026

Benchmarks for Unicode case-insensitive operations and normalization — case folding, case-insensitive comparison and substring search, and NFC/NFD/NFKC/NFKD normalization — across different languages and hardware platforms.
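For readers unfamiliar with the four normalization forms named above, here is a minimal sketch using only Python's built-in `unicodedata` module. It is not the benchmarked code and does not use StringZilla; the helper name `normalize_all` is hypothetical and introduced purely for illustration.

```python
import unicodedata

# "é" written two ways: one precomposed code point vs. "e" plus a combining acute accent.
composed = "\u00e9"
decomposed = "e\u0301"


def normalize_all(text: str) -> dict:
    """Return the text under each of the four Unicode normalization forms."""
    return {
        form: unicodedata.normalize(form, text)
        for form in ("NFC", "NFD", "NFKC", "NFKD")
    }


forms = normalize_all(decomposed)

# NFC composes, NFD decomposes; the K forms additionally replace
# compatibility characters such as the "fi" ligature (U+FB01).
ligature = "\ufb01"
ligature_nfc = unicodedata.normalize("NFC", ligature)
ligature_nfkc = unicodedata.normalize("NFKC", ligature)
```

The canonical forms (NFC/NFD) keep the ligature intact, while the compatibility forms (NFKC/NFKD) expand it to plain letters, which is why the choice of form matters for comparison and search.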

Case Folding

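Measured on the full Leipzig corpora (STRINGWARS_TOKENS=file), single-threaded. "Standard" is Rust's standard-library str::to_lowercase (🦀) and Python's str.casefold() (🐍); "StringZilla" is utf8_uncased_fold. The multiplier after each StringZilla figure is its speedup over the corresponding Standard column.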
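As a rough illustration of what the Python baseline computes, the sketch below contrasts `str.lower()` with `str.casefold()` on a word where the two differ. It uses only Python built-ins; the wrapper name `fold` is hypothetical, and nothing here reflects how StringZilla implements folding.

```python
def fold(text: str) -> str:
    """Case-fold text for caseless matching using Python's built-in str.casefold()."""
    return text.casefold()


word = "Straße"

# Lowercasing keeps the sharp s; case folding expands it to "ss".
lowered = word.lower()
folded = fold(word)

# After folding, the two spellings compare equal; after lowercasing they do not.
matches_when_folded = fold("STRASSE") == folded
matches_when_lowered = "STRASSE".lower() == lowered
```

Because folding can change the length of a string, as it does here, throughput is reported in terms of input bytes rather than output bytes being necessarily equal.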

Intel Xeon4 Sapphire Rapids

| Language | Standard 🦀 | StringZilla 🦀 | Standard 🐍 | StringZilla 🐍 |
| --- | --- | --- | --- | --- |
| Arabic 🇸🇦 | 88 MB/s | 2.25 GB/s (26x) | 188 MB/s | 302 MB/s (2x) |
| Armenian 🇦🇲 | 72 MB/s | 373 MB/s (5x) | 209 MB/s | 180 MB/s (1x) |
| Bengali 🇧🇩 | 99 MB/s | 2.42 GB/s (24x) | 303 MB/s | 380 MB/s (1x) |
| Chinese 🇨🇳 | 96 MB/s | 414 MB/s (4x) | 239 MB/s | 190 MB/s (1x) |
| Czech 🇨🇿 | 96 MB/s | 1.41 GB/s (15x) | 125 MB/s | 228 MB/s (2x) |
| Dutch 🇳🇱 | 136 MB/s | 4.71 GB/s (35x) | 303 MB/s | 331 MB/s (1x) |
| English 🇬🇧 | 135 MB/s | 5.07 GB/s (37x) | 361 MB/s | 393 MB/s (1x) |
| Farsi 🇮🇷 | 84 MB/s | 1.15 GB/s (14x) | 219 MB/s | 271 MB/s (1x) |
| French 🇫🇷 | 125 MB/s | 1.71 GB/s (14x) | 122 MB/s | 238 MB/s (2x) |
| German 🇩🇪 | 133 MB/s | 2.05 GB/s (15x) | 124 MB/s | 281 MB/s (2x) |
| Greek 🇬🇷 | 69 MB/s | 1.26 GB/s (18x) | 194 MB/s | 275 MB/s (1x) |
| Hebrew 🇮🇱 | 74 MB/s | 2.37 GB/s (32x) | 180 MB/s | 260 MB/s (1x) |
| Hindi 🇮🇳 | 98 MB/s | 2.45 GB/s (25x) | 291 MB/s | 367 MB/s (1x) |
| Italian 🇮🇹 | 140 MB/s | 3.07 GB/s (22x) | 152 MB/s | 343 MB/s (2x) |
| Japanese 🇯🇵 | 97 MB/s | 1.17 GB/s (12x) | 242 MB/s | 267 MB/s (1x) |
| Korean 🇰🇷 | 149 MB/s | 2.24 GB/s (15x) | 241 MB/s | 286 MB/s (1x) |
| Polish 🇵🇱 | 117 MB/s | 1.12 GB/s (10x) | 110 MB/s | 195 MB/s (2x) |
| Portuguese 🇧🇷 | 133 MB/s | 2.30 GB/s (17x) | 114 MB/s | 265 MB/s (2x) |
| Russian 🇷🇺 | 69 MB/s | 1.31 GB/s (19x) | 199 MB/s | 288 MB/s (1x) |
| Spanish 🇪🇸 | 130 MB/s | 2.17 GB/s (17x) | 109 MB/s | 280 MB/s (3x) |
| Tamil 🇮🇳 | 113 MB/s | 2.40 GB/s (21x) | 319 MB/s | 394 MB/s (1x) |
| Turkish 🇹🇷 | 106 MB/s | 1.12 GB/s (11x) | 124 MB/s | 228 MB/s (2x) |
| Ukrainian 🇺🇦 | 69 MB/s | 1.23 GB/s (18x) | 203 MB/s | 283 MB/s (1x) |
| Vietnamese 🇻🇳 | 86 MB/s | 1.25 GB/s (15x) | 155 MB/s | 255 MB/s (2x) |

Measured June 17, 2026 on an Intel Xeon4 Sapphire Rapids.
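The multipliers in the table above look like the StringZilla throughput divided by the Standard throughput, rounded to a whole number. The sketch below recomputes them from the printed figures. It assumes decimal units (1 GB/s = 1000 MB/s), and the helper names `to_mb_per_s` and `speedup` are hypothetical. Since the printed throughputs are themselves rounded, a recomputed ratio can differ by one from the published multiplier.

```python
# Assumption: decimal units, i.e. 1 GB/s = 1000 MB/s.
UNITS_IN_MB_PER_S = {"MB/s": 1.0, "GB/s": 1000.0}


def to_mb_per_s(cell: str) -> float:
    """Parse a throughput cell such as '2.25 GB/s' into MB/s."""
    value, unit = cell.split()
    return float(value) * UNITS_IN_MB_PER_S[unit]


def speedup(baseline: str, candidate: str) -> int:
    """Ratio of candidate to baseline throughput, rounded to the nearest integer."""
    return round(to_mb_per_s(candidate) / to_mb_per_s(baseline))


# Arabic row: Rust columns, then Python columns.
arabic_rust = speedup("88 MB/s", "2.25 GB/s")
arabic_python = speedup("188 MB/s", "302 MB/s")
```

Note that a multiplier of 1x can hide a regression: in the Armenian row the Python StringZilla figure (180 MB/s) is below the Standard one (209 MB/s), yet the ratio still rounds to 1.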

AMD Zen5 Turin

| Language | Standard 🦀 | StringZilla 🦀 | Standard 🐍 | StringZilla 🐍 |
| --- | --- | --- | --- | --- |
| English 🇬🇧 | 482 MB/s | 7.53 GB/s (16x) | 257 MB/s | 3.14 GB/s (12x) |
| German 🇩🇪 | 432 MB/s | 2.59 GB/s (6x) | 260 MB/s | 1.81 GB/s (7x) |
| Russian 🇷🇺 | 217 MB/s | 2.20 GB/s (10x) | 470 MB/s | 1.56 GB/s (3x) |
| French 🇫🇷 | 346 MB/s | 1.84 GB/s (5x) | 274 MB/s | 1.37 GB/s (5x) |
| Greek 🇬🇷 | 220 MB/s | 1.00 GB/s (5x) | 431 MB/s | 779 MB/s (2x) |
| Armenian 🇦🇲 | 223 MB/s | 908 MB/s (4x) | 470 MB/s | 746 MB/s (2x) |
| Vietnamese 🇻🇳 | 265 MB/s | 352 MB/s (1x) | 340 MB/s | 291 MB/s (1x) |
| Arabic 🇸🇦 | 232 MB/s | 1004 MB/s (4x) | 467 MB/s | 1.80 GB/s (4x) |
| Bengali 🇧🇩 | 314 MB/s | 6.17 GB/s (20x) | 694 MB/s | 2.91 GB/s (4x) |
| Chinese 🇨🇳 | 325 MB/s | 1.21 GB/s (4x) | 697 MB/s | 886 MB/s (1x) |
| Czech 🇨🇿 | 322 MB/s | 827 MB/s (3x) | 292 MB/s | 688 MB/s (2x) |
| Dutch 🇳🇱 | 471 MB/s | 4.73 GB/s (10x) | 262 MB/s | 2.97 GB/s (11x) |
| Farsi 🇮🇷 | 235 MB/s | 858 MB/s (4x) | 475 MB/s | 1.42 GB/s (3x) |
| Georgian 🇬🇪 | 294 MB/s | 192 MB/s (1x) | 689 MB/s | 488 MB/s (1x) |
| Hebrew 🇮🇱 | 233 MB/s | 1.01 GB/s (4x) | 473 MB/s | 1.86 GB/s (4x) |
| Italian 🇮🇹 | 439 MB/s | 2.29 GB/s (5x) | 268 MB/s | 1.93 GB/s (7x) |
| Japanese 🇯🇵 | 330 MB/s | 3.51 GB/s (11x) | 726 MB/s | 2.00 GB/s (3x) |
| Korean 🇰🇷 | 314 MB/s | 861 MB/s (3x) | 623 MB/s | 2.80 GB/s (4x) |
| Lithuanian 🇱🇹 | 352 MB/s | 864 MB/s (2x) | 274 MB/s | 728 MB/s (3x) |
| Polish 🇵🇱 | 364 MB/s | 939 MB/s (3x) | 277 MB/s | 786 MB/s (3x) |
| Portuguese 🇧🇷 | 395 MB/s | 2.38 GB/s (6x) | 270 MB/s | 1.79 GB/s (7x) |
| Spanish 🇪🇸 | 414 MB/s | 2.38 GB/s (6x) | 272 MB/s | 1.80 GB/s (7x) |
| Tamil 🇮🇳 | 306 MB/s | 6.05 GB/s (20x) | 712 MB/s | 3.03 GB/s (4x) |
| Turkish 🇹🇷 | 326 MB/s | 852 MB/s (3x) | 284 MB/s | 706 MB/s (2x) |
| Ukrainian 🇺🇦 | 217 MB/s | 2.09 GB/s (10x) | 476 MB/s | 1.58 GB/s (3x) |

To rerun the benchmarks for all languages:

RUSTFLAGS="-C target-cpu=native" cargo build --release --bench bench_normalization --features bench_normalization
bin=$(find target/release/deps -name 'bench_normalization-*' -executable -type f | head -1)

for f in leipzig*.txt; do
  [ -f "$f" ] || continue
  echo "=== $f ==="
  STRINGWARS_DATASET="$f" STRINGWARS_TOKENS=file STRINGWARS_FILTER="case-fold" "$bin"
  STRINGWARS_DATASET="$f" STRINGWARS_TOKENS=file STRINGWARS_FILTER="case-fold/" uv run normalization/bench.py
done

Case-Insensitive Substring Search

| Language | Standard 🦀 | StringZilla 🦀 | Standard 🐍 | StringZilla 🐍 |
| --- | --- | --- | --- | --- |
| Arabic 🇸🇦 | 103 MB/s | 7.24 GB/s (70x) | 3.01 GB/s | 14.78 GB/s (5x) |
| Armenian 🇦🇲 | 135 MB/s | 272 MB/s (2x) | 2.07 GB/s | 860 MB/s (0x) |
| Bengali 🇧🇩 | 191 MB/s | 6.97 GB/s (37x) | 4.51 GB/s | 21.19 GB/s (5x) |
| Chinese 🇨🇳 | 104 MB/s | 8.72 GB/s (84x) | 5.40 GB/s | 13.94 GB/s (3x) |
| Czech 🇨🇿 | 40 MB/s | 5.33 GB/s (132x) | 1.38 GB/s | 6.36 GB/s (5x) |
| Dutch 🇳🇱 | 41 MB/s | 4.33 GB/s (107x) | 860 MB/s | 7.99 GB/s (9x) |
| English 🇬🇧 | 43 MB/s | 4.91 GB/s (115x) | 770 MB/s | 5.61 GB/s (7x) |
| Farsi 🇮🇷 | 127 MB/s | 6.63 GB/s (52x) | 2.36 GB/s | 10.70 GB/s (5x) |
| French 🇫🇷 | 62 MB/s | 5.36 GB/s (86x) | 1.10 GB/s | 6.83 GB/s (6x) |
| Georgian 🇬🇪 | 190 MB/s | 1.03 GB/s (5x) | 3.20 GB/s | 620 MB/s (0x) |
| German 🇩🇪 | 47 MB/s | 4.47 GB/s (95x) | 900 MB/s | 6.08 GB/s (7x) |
| Greek 🇬🇷 | 56 MB/s | 1.66 GB/s (30x) | 1.38 GB/s | 2.48 GB/s (2x) |
| Hebrew 🇮🇱 | 77 MB/s | 6.86 GB/s (89x) | 2.92 GB/s | 15.72 GB/s (5x) |
| Italian 🇮🇹 | 62 MB/s | 5.03 GB/s (81x) | 970 MB/s | 8.87 GB/s (9x) |
| Japanese 🇯🇵 | 106 MB/s | 9.41 GB/s (89x) | 4.88 GB/s | 13.17 GB/s (3x) |
| Korean 🇰🇷 | 154 MB/s | 9.94 GB/s (65x) | 4.59 GB/s | 20.05 GB/s (4x) |
| Polish 🇵🇱 | 42 MB/s | 4.43 GB/s (105x) | 1.29 GB/s | 8.02 GB/s (6x) |
| Portuguese 🇧🇷 | 41 MB/s | 4.93 GB/s (121x) | 1.10 GB/s | 8.12 GB/s (7x) |
| Russian 🇷🇺 | 60 MB/s | 3.54 GB/s (59x) | 2.30 GB/s | 5.70 GB/s (2x) |
| Spanish 🇪🇸 | 64 MB/s | 4.88 GB/s (76x) | 1.02 GB/s | 6.33 GB/s (6x) |
| Tamil 🇮🇳 | 116 MB/s | 6.98 GB/s (60x) | 5.81 GB/s | 23.11 GB/s (4x) |
| Turkish 🇹🇷 | 62 MB/s | 4.12 GB/s (66x) | 1.49 GB/s | 5.25 GB/s (4x) |
| Ukrainian 🇺🇦 | 97 MB/s | 2.97 GB/s (31x) | 2.26 GB/s | 5.35 GB/s (2x) |
| Vietnamese 🇻🇳 | 76 MB/s | 5.06 GB/s (67x) | 1.07 GB/s | 1.12 GB/s (1x) |

Measured June 17, 2026 on an Intel Xeon4 Sapphire Rapids.

To rerun the benchmarks for all languages:

for f in leipzig*.txt; do
  [ -f "$f" ] || continue
  echo "=== $f ==="
  STRINGWARS_DATASET="$f" STRINGWARS_TOKENS=words STRINGWARS_FILTER="case-insensitive-find" STRINGWARS_UNIQUE=1 "$bin"
done
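
For context on what a case-insensitive substring search involves, here is a naive sketch in plain Python that folds both strings and then searches. This is only an illustration of the general technique, not the baseline or StringZilla code measured above; the function name `find_uncased` is hypothetical.

```python
def find_uncased(haystack: str, needle: str) -> int:
    """Return the offset of needle within the case-folded haystack, or -1 if absent.

    Caveat: the offset indexes the folded haystack. Folding can change string
    length (for example "ß" becomes "ss"), so the offset may not map directly
    back onto the original haystack.
    """
    return haystack.casefold().find(needle.casefold())


# "STRASSE" and "straße" fold to the same sequence, so the search succeeds.
offset = find_uncased("Die STRASSE", "straße")
missing = find_uncased("Die STRASSE", "weg")
```

Folding the whole haystack up front allocates a second copy of the text on every call, which is one reason a naive approach like this tends to be slow on large inputs.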

See README.md for dataset information and replication instructions.