Strobemers in Go

April 15, 2021 ยท View on GitHub

GoDoc Go Report Card

Introduction

This is a Go implementation of the strobemers (minstrobes and randstrobes), with some differences.

The implementation of Randstrobes has a not-bad performance (2-3X slower) compared to regular k-mer, while it's 10-20X slower than ntHash. Besides, Randstrobes is only slightly slower than MinStrobes (see benchmark).

Attention

The current implementation only computes strobemers of the positive strand, because the strobes are asymmetrical and the location matters.

Installation

go get github.com/shenwei356/strobemers

Quick Start

We followed the code style of ntHash.

n := 2
l := 3
w_min := 3
w_max := 5
rs, err := strobemers.NewRandStrobes(seq, n, l, w_min, w_max)
checkError(err)

var hash uint64
var ok bool
var i int  // 0-based index
var positions []int // 0-based indexes of all strobes

rs.SetWindowShrink(true)
for {
    hash, ok = rs.Next()
    if !ok {
        break
    }

    i = rs.Index()
    positions = rs.Indexes()
}

Differences

Here are some differences compared to the original implementation, see discussion: #1, #2.

itemorginalthiscomment
window rangew_min < w_maxw_min <= w_maxallow a fixed position
shrinking windowall w_min and w_maxoptional shrinking last w_maxsee figures below
number of strobemerslen(seq)-n*l+1len(seq)-n*l+1-(n-1)*lwindow shrinked
number of strobemerslen(seq)-n*l+1-(n-1)*(l+w_min-1)window not shrinked
choice of min hash(h(m)+h(mj))%q(h(m)+h(mj))&q& is faster than %
final hash value (n=2)h(m1)-h(m2)h(m1)/2+h(m2)/3keep asymmetry and avoid uint64 overflow
final hash value (n=3)h(m1)-h(m2)+2*h(m3)h(m1)/3+h(m2)/4+h(m3)/5~

Benchmark

methodtimerelative_time
ntHashKmers(30)85901
Kmers(30)555796
MinStrobes(2,15,20,30)10452012
MinStrobes(3,10,20,30)11166213
RandStrobes(2,15,20,30)9343611
RandStrobes(3,10,20,30)15246118
$ go test . -bench=Benchmark* -benchmem \
    | grep Bench \
    | perl -pe 's/\s\s+/\t/g' \
    | csvtk cut -Ht -f 1,3-5 \
    | csvtk add-header -t -n test,time,memory,allocs \
    | csvtk pretty -t -r

                                 test           time       memory        allocs
-------------------------------------   ------------   ----------   -----------
           BenchmarkNTHash/1.00_KB-16     8590 ns/op      48 B/op   1 allocs/op
            BenchmarkKmers/1.00_KB-16    55579 ns/op      32 B/op   1 allocs/op
 BenchmarkMinStrobesOrder2/1.00_KB-16   104520 ns/op   25064 B/op   7 allocs/op
 BenchmarkMinStrobesOrder3/1.00_KB-16   111662 ns/op   25064 B/op   7 allocs/op
BenchmarkRandStrobesOrder2/1.00_KB-16    93436 ns/op    8432 B/op   3 allocs/op
BenchmarkRandStrobesOrder3/1.00_KB-16   152461 ns/op    8432 B/op   3 allocs/op

Similar Projects

References