Fingerprinting and Sketching Benchmarks

January 18, 2026

Benchmarks for byte-level fingerprinting and sketching algorithms across CPU and GPU implementations.

Overview

In large-scale retrieval workloads, a common technique is to convert messy variable-length strings into fixed-length representations. These are often called "fingerprints" or "sketches" and are produced by algorithms such as "Min-Hashing" or "Count-Min-Sketching". Many variations of these algorithms exist, each with a different speed-vs-accuracy tradeoff.
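To make the idea concrete, here is a minimal Min-Hash sketch over byte n-grams. It is an illustration under assumptions, not the implementation used by any of the benchmarked libraries: the names `byte_ngrams`, `min_hash`, and `estimated_jaccard`, the 4-byte window, the 64 dimensions, and the use of seeded BLAKE2b as the hash family are all choices made for this example.

```python
import hashlib


def byte_ngrams(data: bytes, width: int = 4) -> set[bytes]:
    """All distinct byte windows of the given width (the whole input if it is shorter)."""
    if len(data) <= width:
        return {data}
    return {data[i:i + width] for i in range(len(data) - width + 1)}


def min_hash(data: bytes, dimensions: int = 64, width: int = 4) -> list[int]:
    """Fixed-length fingerprint: for each seeded hash, keep the minimum over all n-grams."""
    grams = byte_ngrams(data, width)
    fingerprint = []
    for seed in range(dimensions):
        # Assumption: a salted BLAKE2b stands in for one member of the hash family.
        salt = seed.to_bytes(8, "little")
        fingerprint.append(min(
            int.from_bytes(hashlib.blake2b(g, digest_size=8, salt=salt).digest(), "little")
            for g in grams
        ))
    return fingerprint


def estimated_jaccard(a: list[int], b: list[int]) -> float:
    """Share of matching slots, which estimates the Jaccard similarity of the n-gram sets."""
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

Inputs of any length map to the same number of integers, so fingerprints can be compared slot by slot regardless of how long the original strings were. A cryptographic hash is far slower than what a throughput-oriented library would use; it is chosen here only because it is available in the standard library and accepts a salt.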

Two approximate quality measures worth considering are:

  • The number of collisions of produced individual hashes within fingerprints
  • The bit-distribution entropy of the produced fingerprints
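The two measures above can be sketched as follows. The exact definitions used by these benchmarks are not stated here, so the code assumes one plausible reading: the collision rate is the share of hash values that repeat an earlier value inside the same fingerprint, and the entropy is the binary Shannon entropy of each bit position, averaged over all positions and therefore normalized to the range 0 to 1. The function names and the 32-bit default are hypothetical.

```python
import math


def collision_rate(fingerprints: list[list[int]]) -> float:
    """Assumed definition: share of slots whose value repeats within its own fingerprint."""
    slots = sum(len(fp) for fp in fingerprints)
    if slots == 0:
        return 0.0
    repeats = sum(len(fp) - len(set(fp)) for fp in fingerprints)
    return repeats / slots


def bit_entropy(fingerprints: list[list[int]], bits: int = 32) -> float:
    """Assumed definition: mean binary entropy per bit position across all hash values."""
    values = [value for fp in fingerprints for value in fp]
    if not values:
        return 0.0
    total = 0.0
    for bit in range(bits):
        p = sum((value >> bit) & 1 for value in values) / len(values)
        if 0.0 < p < 1.0:  # a constant bit contributes zero entropy
            total -= p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p)
    return total / bits
```

Under these definitions, lower collision rates and entropy values closer to 1 both indicate that the fingerprint slots carry more independent information.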

With all implementations adjusted to the same tokenization scheme, one may observe the following numbers:

Performance and Quality Metrics

Library                                   ~100-byte lines     ~1,000-byte lines
serial <ByteGrams> on 1x SPR              0.44 MB/s           0.47 MB/s
                                          92.81% collisions   94.58% collisions
                                          0.8528 entropy      0.7979 entropy
pc::MinHash<ByteGrams> on 1x SPR          2.41 MB/s           3.16 MB/s
                                          91.80% collisions   93.17% collisions
                                          0.9343 entropy      0.8779 entropy
stringzillas::Fingerprints on 1x SPR      0.56 MB/s           0.51 MB/s
stringzillas::Fingerprints on 16x SPR     6.62 MB/s           8.03 MB/s
stringzillas::Fingerprints on 384x GNR    231.13 MB/s         302.30 MB/s
stringzillas::Fingerprints on RTX6000     138 MB/s            162.99 MB/s
stringzillas::Fingerprints on H100        102.07 MB/s         392.37 MB/s
                                          86.80% collisions   93.21% collisions
                                          0.9992 entropy      0.9967 entropy

Quality Analysis

The trickiest part, however, is analyzing the retrieval quality of those fingerprints and comparing them to other approaches. How many bits per fingerprint are needed to achieve a specific recall rate for a given dataset? How does the average Levenshtein distance among the top-k nearest neighbors change with the fingerprint size? It should decrease as fingerprints grow, but how fast, and how does that compare to the ground truth?
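One way to frame the recall question is to rank a corpus twice, once by exact Levenshtein distance and once by fingerprint agreement, and measure the overlap of the two top-k lists. The sketch below does that under assumptions: it is not the methodology of the HashEvals repository, the helper names are hypothetical, and the fingerprint is a simple Min-Hash over 3-byte windows with salted BLAKE2b.

```python
import hashlib


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]


def fingerprint(text: str, dimensions: int, width: int = 3) -> list[int]:
    """Illustrative Min-Hash over byte n-grams; not any benchmarked library's algorithm."""
    data = text.encode()
    grams = {data[i:i + width] for i in range(max(1, len(data) - width + 1))}
    return [
        min(int.from_bytes(
                hashlib.blake2b(g, digest_size=8, salt=seed.to_bytes(8, "little")).digest(),
                "little")
            for g in grams)
        for seed in range(dimensions)
    ]


def recall_at_k(query: str, corpus: list[str], dimensions: int, k: int) -> float:
    """Share of the exact top-k (by Levenshtein) recovered by the fingerprint top-k."""
    indices = range(len(corpus))
    exact = sorted(indices, key=lambda i: levenshtein(query, corpus[i]))[:k]
    q = fingerprint(query, dimensions)
    matches = [sum(x == y for x, y in zip(q, fingerprint(entry, dimensions)))
               for entry in corpus]
    approximate = sorted(indices, key=lambda i: -matches[i])[:k]
    return len(set(exact) & set(approximate)) / k
```

Sweeping `dimensions` over a range of values and averaging `recall_at_k` across many queries would give a recall-versus-fingerprint-size curve for a dataset. The quadratic-time `levenshtein` is only practical for small corpora.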

For detailed quality analysis, please check out the HashEvals repository.


See README.md for dataset information and replication instructions.