About this small project

April 23, 2021

What is a strobemer? See https://github.com/ksahlin/strobemers

  1. This project implements a C++ version of the strobemer s(n,k,w_min,w_max), with one small difference: the window does not shrink at the end of the sequence.

Features:

  • any n>1 is supported.
  • randstrobes are supported.
  • minstrobes are supported.
  • hybridstrobes are supported.
  2. Some small benchmarks are attached. (Ignore them if you don't care.)
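To make the difference between minstrobes and randstrobes concrete, here is a minimal sketch for n=2. It follows the general idea described in the strobemers repository linked above and is not the code in strobemer.cpp. The helper names (`kmer_hash`, `second_minstrobe`, `second_randstrobe`), the use of `std::hash`, the link function `(h1 + hj) % p` and the prime `p` are all assumptions made for illustration. Windows are taken as k-mer start positions in [i+w_min, i+w_max], and positions without a full window are rejected, which mirrors the "window does not shrink" behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>

// Hypothetical helper: hash of the k-mer that starts at position pos.
inline std::uint64_t kmer_hash(const std::string& seq, std::size_t pos, std::size_t k) {
    return std::hash<std::string>{}(seq.substr(pos, k));
}

// minstrobe sketch (n=2): the second strobe is the k-mer with the smallest
// hash in the window [i+w_min, i+w_max]; the first strobe plays no role.
inline std::size_t second_minstrobe(const std::string& seq, std::size_t i,
                                    std::size_t k, std::size_t w_min, std::size_t w_max) {
    assert(w_min <= w_max && i + w_max + k <= seq.size());  // full window required
    std::size_t best = i + w_min;
    std::uint64_t best_hash = kmer_hash(seq, best, k);
    for (std::size_t j = best + 1; j <= i + w_max; ++j) {
        const std::uint64_t h = kmer_hash(seq, j, k);
        if (h < best_hash) { best_hash = h; best = j; }
    }
    return best;
}

// randstrobe sketch (n=2): the second strobe minimises a link value that
// also depends on the first strobe, here assumed to be (h1 + hj) % p.
inline std::size_t second_randstrobe(const std::string& seq, std::size_t i,
                                     std::size_t k, std::size_t w_min, std::size_t w_max,
                                     std::uint64_t p = 1000003) {
    assert(w_min <= w_max && i + w_max + k <= seq.size());  // full window required
    const std::uint64_t h1 = kmer_hash(seq, i, k) % p;
    std::size_t best = i + w_min;
    std::uint64_t best_link = (h1 + kmer_hash(seq, best, k) % p) % p;
    for (std::size_t j = best + 1; j <= i + w_max; ++j) {
        const std::uint64_t link = (h1 + kmer_hash(seq, j, k) % p) % p;
        if (link < best_link) { best_link = link; best = j; }
    }
    return best;
}
```

Both functions return the start position of the second strobe for the strobemer that begins at position i. In the minstrobe case, neighbouring start positions often share the same second strobe; the randstrobe link value changes with the first strobe, so the choices are less correlated.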

The implementation of strobemer

How to use the C++ implementation of strobemer? See toy_example.

  • first copy strobemer.h and strobemer.cpp into your project.

  • then write your code like this:

#include <iostream>
#include "strobemer.h"

int main() {
     char seq[101]="ATGGGCAGAGTTTGACGTAGTCAATGCTTATGAACGAACGCTCCAATATGAATCAGCTCGTGATTTTTGCTGTAAAAATCGTAGCATACTGTTTGATAAA";
     //strobemer::init(3,13,13,20,strobemer_type::minstrobe); // n=3,k=13,w_min=13,w_max=20,type=minstrobe
     //strobemer::init(3,13,13,21,strobemer_type::hybridstrobe); // n=3,k=13,w_min=13,w_max=21,type=hybridstrobe
     strobemer::init(3,13,13,20,strobemer_type::randstrobe);  // n=3,k=13,w_min=13,w_max=20,type=randstrobe
     int number = 100-strobemer::strobmer_span()+1;
     strobemer * buff = new strobemer[number];
     strobemer::chop_strobemer(seq,100,buff);
     for(int i = 0 ; i< number ; i++ ){
         if(buff[i].valid)
             std::cout<<buff[i].to_string()<<'\n'; // or do whatever you want ...
     }
     delete [] buff;
     return 0;
}

Then compile and link it like this:

g++  -std=c++11 -c strobemer.cpp -o strobemer.o
g++  -std=c++11 example.cpp strobemer.o -o example
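The toy example sizes its buffer as `100 - strobemer::strobmer_span() + 1`, which only makes sense when the sequence is at least one span long. Below is a small sketch of that arithmetic with a guard for short sequences. The span formula `k + (n-1)*w_max` is an assumption inferred from the marker counts in the benchmarks; the value actually returned by `strobmer_span()` may differ. Both function names are hypothetical.

```cpp
#include <cstddef>

// Assumed span of s(n,k,w_min,w_max): first strobe plus (n-1) windows of w_max.
// This is a guess for illustration, not taken from strobemer.cpp.
constexpr std::size_t assumed_span(std::size_t n, std::size_t k, std::size_t w_max) {
    return k + (n - 1) * w_max;
}

// Number of full-length windows of `span` bases in a sequence of `len` bases.
// Returns 0 instead of underflowing when the sequence is shorter than one span.
constexpr std::size_t window_count(std::size_t len, std::size_t span) {
    return len >= span ? len - span + 1 : 0;
}
```

Under that assumption, the toy example's s(3,13,13,20) on 100 nt would need `window_count(100, assumed_span(3, 13, 20))` buffer slots. Checking for 0 before `new strobemer[number]` avoids allocating with a negative size on short input.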

The benchmarks

The test data

  • random sequence with length=100,000 nt.
  • mutation rates are 0.01, 0.05 and 0.1.
  • random mutations.
  • equal probability of sub/ins/del.
  • each test is run 100 times.
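A test set like the one described above could be generated roughly as follows. This is a sketch only, not the benchmark code shipped with this project. `random_sequence` and `mutate` are hypothetical names, and the per-base mutation model (each base mutated independently with probability `rate`, insertions placed after the current base) is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <string>

// Hypothetical helper: uniform random DNA sequence of the given length.
inline std::string random_sequence(std::size_t len, std::mt19937& rng) {
    static const char bases[] = {'A', 'C', 'G', 'T'};
    std::uniform_int_distribution<int> pick_base(0, 3);
    std::string seq(len, 'A');
    for (char& c : seq) c = bases[pick_base(rng)];
    return seq;
}

// Hypothetical helper: mutate each base with probability `rate`, choosing
// substitution, insertion or deletion with equal probability.
inline std::string mutate(const std::string& seq, double rate, std::mt19937& rng) {
    assert(rate >= 0.0 && rate <= 1.0);
    static const char bases[] = {'A', 'C', 'G', 'T'};
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    std::uniform_int_distribution<int> pick_base(0, 3);
    std::uniform_int_distribution<int> pick_type(0, 2);
    std::string out;
    out.reserve(seq.size());
    for (char c : seq) {
        if (coin(rng) >= rate) { out.push_back(c); continue; }  // base left unchanged
        switch (pick_type(rng)) {
            case 0: {  // substitution: replace with a different base
                char b;
                do { b = bases[pick_base(rng)]; } while (b == c);
                out.push_back(b);
                break;
            }
            case 1:  // insertion: keep the base and add a random one after it
                out.push_back(c);
                out.push_back(bases[pick_base(rng)]);
                break;
            default:  // deletion: drop the base
                break;
        }
    }
    return out;
}
```

For one test round you would create a `std::mt19937`, call `random_sequence(100000, rng)` once, then call `mutate` with rate 0.01, 0.05 or 0.1 and compare the k-mers/strobemers of the two sequences.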

benchmark_SIM-R-match-only

  • compare the number of matches over all k-mers/strobemers.
  • compare k30, randstrobe(2,15,50) and minstrobe(2,15,50).

Results of the "average match of all kmers/strobemers (%) " for different error rates and different methods:

| Method | 0.01 | 0.05 | 0.1 |
|---|---|---|---|
| Kmer(30) | 74.6 | 22.5 | 4.7 |
| minstrobe(2,15,15,50) | 70.1 | 17.5 | 3.2 |
| randstrobe(2,15,15,50) | 70.1 | 17.5 | 3.2 |

benchmark_SIM-R-snp30

  • compare the number of matches over all k-mer/strobemer SNP markers.
  • assign one SNP per 1,000 bp (99 SNPs in total for 100,000 bp, because the SNP at the sequence end is dropped).
  • compare k30, randstrobe(2,15,50) and minstrobe(2,15,50).
  • for k30, each SNP generates 30 k-mer markers.
  • for strobemer(2,15,50), each SNP generates 65 strobemer markers.
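The marker counts above fix the maximum possible number of matches, which helps when reading the averages reported for this benchmark. A tiny sketch of that arithmetic follows. It assumes the total is simply SNPs times markers per SNP; `total_markers` and `match_percent` are hypothetical names.

```cpp
#include <cstddef>

// Total number of SNP markers, assuming every SNP yields the same number of markers.
constexpr std::size_t total_markers(std::size_t snps, std::size_t markers_per_snp) {
    return snps * markers_per_snp;
}

// Share of markers that still match after mutation, in percent.
constexpr double match_percent(double matched, std::size_t total) {
    return 100.0 * matched / static_cast<double>(total);
}
```

With 99 SNPs this gives `total_markers(99, 30)` = 2970 k-mer markers and `total_markers(99, 65)` = 6435 strobemer markers. Dividing an average match number by the corresponding total gives a rate that can be compared with the percentages in benchmark_SIM-R-match-only.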

Results of the "average match number of all snp markers" for different error rates and different methods:

| Method | 0.01 | 0.05 | 0.1 |
|---|---|---|---|
| Kmer(30) | 2210 | 665 | 143 |
| minstrobe(2,15,15,50) | 4511 | 1100 | 208 |
| randstrobe(2,15,15,50) | 4502 | 1134 | 210 |

Results of the "#detected snp " for different error rates and different methods:

| Method | 0.01 | 0.05 | 0.1 |
|---|---|---|---|
| Kmer(30) | 94.5 | 55.5 | 18 |
| minstrobe(2,15,15,50) | 98 | 68.5 | 26.3 |
| randstrobe(2,15,15,50) | 99 | 92.3 | 52.3 |

benchmark_SIM-R-snp20

  • compare the number of matches over all k-mer/strobemer SNP markers.
  • assign one SNP per 1,000 bp (99 SNPs in total for 100,000 bp, because the SNP at the sequence end is dropped).
  • compare k20, k40, randstrobe(2,10,30) and minstrobe(2,10,30).
  • for k20, each SNP generates 20 k-mer markers.
  • for k40, each SNP generates 40 k-mer markers.
  • for strobemer(2,10,30), each SNP generates 40 strobemer markers.

Results of the "average match number of all snp markers" for different error rates and different methods:

| Method | 0.01 | 0.05 | 0.1 |
|---|---|---|---|
| Kmer(20) | 1631.5 | 748.93 | 260.72 |
| Kmer(40) | 2682.38 | 540.22 | 62.83 |
| minstrobe(2,10,10,30) | 3171.38 | 1296.55 | 411.71 |
| randstrobe(2,10,10,30) | 3164.28 | 1292.72 | 417.33 |

Results of the "#detected snp " for different error rates and different methods:

0.010.050.1
Kmer(20)96.7971.0537.79
Kmer(40)92.8139.988.01
minstrobe(2,10,10,30)98.5485.0551.82
randstrobe(2,10,10,30)98.9995.2473.39

Enjoy ~~