Kmer sets benchmark

March 20, 2026

This is a benchmark of data structures for kmer dictionaries, i.e., data structures that represent a set of kmers and support, at least, exact membership queries. The goal of the benchmark is to test the data structures using the same benchmarking code and methodology (as well as, obviously, the same datasets and queries).

The scripts to run the benchmark are in the script folder.

Clone and compile as follows.

git clone --recursive https://github.com/jermp/kmer_sets_benchmark.git
cd kmer_sets_benchmark/
mkdir build
cd build/
cmake .. -DUSE_MAX_KMER_LENGTH_63=Off
make -j

Optionally, compile with cmake .. -DUSE_MAX_KMER_LENGTH_63=On to use k=63 in the benchmark.

Tested dictionaries

The dictionaries benchmarked here are SSHash, SBWT, and FMSI.

All C++ implementations are by the respective authors.

Datasets

For these benchmarks we used the datasets available at https://zenodo.org/records/17582116: the *.eulertigs.fa.gz files were used as input to build the dictionaries, and the *.fastq.gz files were used as queries.

| Collection | Num. distinct 31-mers | Num. distinct 63-mers |
|---|---|---|
| Cod | 502,465,200 | 556,585,658 |
| Kestrel | 1,150,399,205 | 1,155,250,667 |
| Human | 2,505,678,680 | 2,771,316,093 |
| NCBI-virus | 376,205,185 | 412,515,880 |
| SE | 894,310,084 | 1,524,904,156 |
| HPRC | 3,718,120,949 | 5,926,785,469 |

Methodology

The dictionaries were built using at most 16 GB of RAM and 64 threads. All queries were run using a single thread.

For SSHash and SBWT, the building time reported in the tables is the time taken to index the eulertigs.fa.gz files. FMSI first requires the computation of masked superstrings from the eulertigs.fa.gz files. We excluded this time from the building time and report only the time FMSI takes to index its computed superstrings.

The space reported is the space taken by the dictionaries on disk.
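The two space columns in the result tables are related by simple arithmetic. The sketch below shows the conversion, assuming that 1 GB means 10^9 bytes (an assumption, the source does not define the unit); the function names are illustrative only.

```python
def total_gb(num_kmers: int, bits_per_kmer: float) -> float:
    """Total size in GB for a given bits/kmer (assumes 1 GB = 10**9 bytes)."""
    return num_kmers * bits_per_kmer / 8 / 1e9


def bits_per_kmer(size_bytes: int, num_kmers: int) -> float:
    """Average number of bits spent per kmer for an index of size_bytes on disk."""
    return 8 * size_bytes / num_kmers


# Cod, k = 31: 502,465,200 distinct kmers at 9.01 bits/kmer (SSHash, Tab. 1).
cod_gb = total_gb(502_465_200, 9.01)
print(f"{cod_gb:.2f} GB")  # prints "0.57 GB", matching the table
```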

Positive random lookup time was measured by querying 1 million kmers that appear in the dictionaries, half of which were reverse complemented to test the dictionaries in the most general case.
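A positive query workload of this kind can be sketched as follows. This is a toy illustration, not the benchmark's actual code: the Python set stands in for a real dictionary, sampling with replacement is an assumption, and all names are hypothetical.

```python
import random

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Return the reverse complement of a DNA string over {A,C,G,T}."""
    return kmer.translate(_COMPLEMENT)[::-1]


def make_positive_queries(indexed_kmers, num_queries, seed=0):
    """Sample kmers that are in the set and reverse-complement every other one."""
    rng = random.Random(seed)
    queries = []
    for i in range(num_queries):
        kmer = rng.choice(indexed_kmers)  # sampling with replacement (assumption)
        queries.append(reverse_complement(kmer) if i % 2 == 1 else kmer)
    return queries


def contains(kmer_set, kmer):
    """Orientation-agnostic membership: match the kmer or its reverse complement."""
    return kmer in kmer_set or reverse_complement(kmer) in kmer_set


indexed = ["ACGTT", "GGGAC", "TTTAA"]  # toy data with k = 5
queries = make_positive_queries(indexed, num_queries=10)
hits = sum(contains(set(indexed), q) for q in queries)
print(hits, "of", len(queries), "queries found")
```

Because half of the queries arrive in the opposite orientation, a dictionary only answers all of them positively if it treats a kmer and its reverse complement as the same item.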

For negative random lookups, the queries were random kmers, with each nucleotide sampled uniformly from {A,C,G,T}.
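Generating such queries is straightforward; a minimal sketch with hypothetical function names is given below. For k = 31 there are 4^31 (about 4.6 × 10^18) possible kmers, far more than the few billion kmers in the largest collection, so a uniformly random kmer is expected to be absent with very high probability (this reasoning is ours, not a statement from the benchmark).

```python
import random


def random_kmer(k: int, rng: random.Random) -> str:
    """Draw each of the k nucleotides uniformly from {A,C,G,T}."""
    return "".join(rng.choice("ACGT") for _ in range(k))


def make_negative_queries(k: int, num_queries: int, seed: int = 0):
    """Generate num_queries random kmers of length k."""
    rng = random.Random(seed)
    return [random_kmer(k, rng) for _ in range(num_queries)]


negative_queries = make_negative_queries(k=31, num_queries=1000)
print(negative_queries[0])
```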

For random access, we uniformly generated 1 million ranks and retrieved the corresponding kmers.
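As a rough illustration of this workload, the sketch below draws uniform ranks and retrieves the kmers stored at those ranks. The sorted Python list is only a stand-in: each real dictionary defines its own kmer order, and the 0-based rank range [0, n) is an assumption.

```python
import random


def make_random_ranks(num_kmers: int, num_queries: int, seed: int = 0):
    """Draw ranks uniformly from [0, num_kmers)."""
    rng = random.Random(seed)
    return [rng.randrange(num_kmers) for _ in range(num_queries)]


# Toy dictionary: a sorted list, where the rank of a kmer is its position.
toy_dictionary = sorted(["ACGTT", "GGGAC", "TTTAA", "CCATG"])
ranks = make_random_ranks(len(toy_dictionary), num_queries=8)
retrieved = [toy_dictionary[r] for r in ranks]
print(list(zip(ranks, retrieved)))
```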

Lastly, for streaming queries, we queried the dictionaries using FASTQ reads. For each dictionary, a FASTQ readset was chosen to give a high-hit workload (i.e., most kmers of the reads are found in the dictionary). See the script folder for details.
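In a streaming workload, consecutive queries are the overlapping kmers of a read, so each query shares k-1 characters with the previous one. The sketch below enumerates those kmers and computes the hit rate against a toy set; it is a simplified illustration with hypothetical names, and it does no FASTQ parsing and does not handle non-ACGT characters.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Return the reverse complement of a DNA string over {A,C,G,T}."""
    return kmer.translate(_COMPLEMENT)[::-1]


def kmers_of(read: str, k: int):
    """Yield the overlapping kmers of a read, from left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def hit_rate(reads, kmer_set, k: int) -> float:
    """Fraction of read kmers found in the set, in either orientation."""
    total = 0
    hits = 0
    for read in reads:
        for kmer in kmers_of(read, k):
            total += 1
            if kmer in kmer_set or reverse_complement(kmer) in kmer_set:
                hits += 1
    return hits / total if total else 0.0


toy_set = {"ACG", "CGT"}  # toy data with k = 3
rate = hit_rate(["ACGTA"], toy_set, k=3)
print(f"hit rate: {rate:.2f}")
```

A high-hit readset in the sense used above is one for which this fraction is close to 1.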

Results

These are the results obtained in November 2025 (see logs here) on a machine equipped with an AMD Ryzen Threadripper PRO 7985WX processor clocked at 5.40 GHz. The code was compiled with gcc 13.3.0.

SSHash indexes reported here were built with option --canonical, using the indicated value for the m parameter (minimizer length). All results are available at https://github.com/jermp/sshash/tree/bench/benchmarks.

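| k | Collection | m | Space (bits/kmer) | Space (total GB) | Building time (m:ss) | Positive random lookup (µs/kmer) | Negative random lookup (µs/kmer) | Random Access (µs/kmer) | Streaming Lookup high-hit (ns/kmer) |
|---|---|---|---|---|---|---|---|---|---|
| 31 | Cod | 20 | 9.01 | 0.57 | 0:26 | 0.44 | 0.37 | 0.28 | 26 |
| 31 | Kestrel | 20 | 8.67 | 1.25 | 1:06 | 0.44 | 0.40 | 0.28 | 46 |
| 31 | Human | 21 | 10.01 | 3.14 | 3:10 | 0.61 | 0.42 | 0.35 | 74 |
| 31 | NCBI-virus | 19 | 8.48 | 0.40 | 0:16 | 0.41 | 0.36 | 0.26 | 29 |
| 31 | SE | 21 | 11.51 | 1.29 | 1:06 | 0.63 | 0.40 | 0.36 | 186 |
| 31 | HPRC | 21 | 11.93 | 5.54 | 4:45 | 0.71 | 0.46 | 0.54 | 93 |
| 63 | Cod | 24 | 4.9 | 0.35 | 0:15 | 0.56 | 0.45 | 0.29 | 60 |
| 63 | Kestrel | 24 | 4.22 | 0.61 | 0:19 | 0.54 | 0.48 | 0.33 | 66 |
| 63 | Human | 25 | 5.31 | 1.84 | 1:09 | 0.69 | 0.52 | 0.36 | 146 |
| 63 | NCBI-virus | 23 | 4.46 | 0.23 | 0:07 | 0.52 | 0.44 | 0.28 | 72 |
| 63 | SE | 31 | 7.77 | 1.48 | 0:58 | 1.00 | 0.51 | 0.41 | 400 |
| 63 | HPRC | 31 | 8.14 | 6.03 | 4:13 | 1.00 | 0.58 | 0.64 | 181 |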

Tab. 1 SSHash results

SBWT indexes were all built using the "plain-matrix" variant, with option --add-reverse-complements so that queries return the same results as for the other indexes. The indexes make use of the LCP array to speed up streaming queries.

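| k | Collection | Space (bits/kmer) | Space (total GB) | Building time (m:ss) | Positive random Lookup (µs/kmer) | Negative random Lookup (µs/kmer) | Random Access (µs/kmer) | Streaming Lookup high-hit (ns/kmer) |
|---|---|---|---|---|---|---|---|---|
| 31 | Cod | 10.52 | 0.66 | 03:34 | 2.72 | 0.91 | 7.71 | 62 |
| 31 | Kestrel | 10.52 | 1.51 | 07:57 | 2.87 | 0.96 | 9.48 | 287 |
| 31 | Human | 10.50 | 3.29 | 17:56 | 2.97 | 1.07 | 10.81 | 266 |
| 31 | NCBI-virus | 10.53 | 0.50 | 02:44 | 2.71 | 0.89 | 6.96 | 139 |
| 31 | SE | 10.72 | 1.20 | 06:54 | 2.83 | 0.97 | 8.97 | 189 |
| 31 | HPRC | 10.50 | 4.88 | 29:24 | 3.12 | 1.16 | 11.45 | 263 |
| 63 | Cod | 10.52 | 0.73 | 06:13 | 6.59 | 0.92 | 16.05 | 118 |
| 63 | Kestrel | 10.55 | 1.52 | 14:38 | 6.91 | 0.97 | 19.87 | 435 |
| 63 | Human | 10.50 | 3.64 | 55:38 | 7.23 | 1.09 | 22.73 | 768 |
| 63 | NCBI-virus | 10.55 | 0.54 | 04:42 | 6.63 | 0.90 | 14.93 | 187 |
| 63 | SE | 10.93 | 2.08 | 21:37 | 6.95 | 1.00 | 20.87 | 290 |
| 63 | HPRC | 10.50 | 7.78 | 180:02 | 8.07 | 1.24 | 25.20 | 835 |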

Tab. 2 SBWT results

FMSI indexes make use of the LCP array to speed up streaming queries.

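| k | Collection | Space (bits/kmer) | Space (total GB) | Building time (m:ss) | Positive random Lookup (µs/kmer) | Negative random Lookup (µs/kmer) | Random Access (µs/kmer) | Streaming Lookup high-hit (ns/kmer) |
|---|---|---|---|---|---|---|---|---|
| 31 | Cod | 3.37 | 0.21 | 02:12 | 5.70 | 1.70 | 14.84 | 275 |
| 31 | Kestrel | 3.16 | 0.45 | 05:18 | 6.20 | 1.90 | 17.83 | 983 |
| 31 | Human | 3.31 | 1.04 | 14:33 | 6.60 | 2.16 | 18.62 | 1176 |
| 31 | NCBI-virus | 3.50 | 0.16 | 01:39 | 5.51 | 1.69 | 13.61 | 736 |
| 31 | SE | 4.39 | 0.49 | 05:08 | 6.50 | 2.04 | 17.80 | 1018 |
| 31 | HPRC | 4.26 | 1.98 | 27:47 | 6.99 | 2.37 | 18.04 | 1370 |
| 63 | Cod | 3.33 | 0.23 | 02:38 | 12.66 | 1.77 | 31.18 | 375 |
| 63 | Kestrel | 3.17 | 0.46 | 05:41 | 13.74 | 1.91 | 37.45 | 1143 |
| 63 | Human | 3.22 | 1.11 | 17:35 | 14.64 | 2.16 | 40.06 | 1642 |
| 63 | NCBI-virus | 3.52 | 0.18 | 02:09 | 12.34 | 1.74 | 29.98 | 870 |
| 63 | SE | 4.82 | 0.92 | 12:29 | 15.09 | 2.20 | 39.65 | 1419 |
| 63 | HPRC | 4.95 | 3.67 | 61:41 | 15.99 | 2.48 | 37.09 | 2078 |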

Tab. 3 FMSI results