May 23, 2026
LS-GKM: A new gkm-SVM software for large-scale datasets
gkm-SVM, a sequence-based method for predicting regulatory DNA elements,
is a useful tool for studying gene regulatory mechanisms.
LS-GKM is new software introduced as part of a continuing effort to improve
the method. It scales much better to large datasets and provides additional
advanced gapped k-mer based kernel functions. As a result, LS-GKM
achieves considerably higher accuracy than the original gkm-SVM.
Citation
Please cite the following paper if you use LS-GKM in your research:
- Ghandi, M.†, Lee, D.†, Mohammad-Noori, M. & Beer, M. A. Enhanced Regulatory Sequence Prediction Using Gapped k-mer Features. PLoS Comput Biol 10, e1003711 (2014). doi:10.1371/journal.pcbi.1003711 († Co-first authors)
- Lee, D. LS-GKM: A new gkm-SVM for large-scale datasets. Bioinformatics btw142 (2016). doi:10.1093/bioinformatics/btw142
Installation
After downloading and extracting the source code, type:
$ cd src
$ make
If the build succeeds, you should find the following executables in the current (src) directory:
gkmtrain
gkmpredict
gkmtrain-svr
gkmmatrix
make install simply copies these executables to the ../bin directory.
Tutorial
This tutorial walks through the basic workflow of LS-GKM. For more detailed
information on each program, refer to its help message, which is printed when you run
the program without any arguments.
Training of LS-GKM
You train an SVM classifier using gkmtrain. It takes three arguments:
a positive sequence file, a negative sequence file, and an output prefix.
Usage: gkmtrain [options] <posfile> <negfile> <outprefix>
train gkm-SVM using libSVM
Arguments:
posfile: positive sequence file (FASTA format)
negfile: negative sequence file (FASTA format)
outprefix: prefix of output file(s) <outprefix>.model.txt
(or <outprefix>.model.txt.gz with -z) or
<outprefix>.cvpred.txt
Options:
-t <0 ~ 5> set kernel function (default: 2 gkm)
NOTE: RBF kernels (3 and 5) work best with -c 10 -g 2
0 -- gapped-kmer
1 -- estimated l-mer with full filter
2 -- estimated l-mer with truncated filter (gkm)
3 -- gkm + RBF (gkmrbf)
4 -- gkm + center weighted (wgkm)
[weight = max(M, floor(M*exp(-ln(2)*D/H)+1))]
5 -- gkm + center weighted + RBF (wgkmrbf)
-l <int> set word length, 3<=l<=12 (default: 11)
-k <int> set number of informative column, k<=l (default: 7)
-d <int> set maximum number of mismatches to consider, d<=4 (default: 3)
-g <float> set gamma for RBF kernel. -t 3 or 5 only (default: 1.0)
-M <int> set the initial value (M) of the exponential decay function
for wgkm-kernels. max=255, -t 4 or 5 only (default: 50)
-H <float> set the half-life parameter (H) that is the distance (D) required
to fall to half of its initial value in the exponential decay
function for wgkm-kernels. -t 4 or 5 only (default: 50)
-R if set, reverse-complement is not considered as the same feature
-c <float> set the regularization parameter SVM-C (default: 1.0)
-e <float> set the precision parameter epsilon (default: 0.001)
-w <float> set the parameter SVM-C to w*C for the positive set (default: 1.0)
-m <float> set cache memory size in MB (default: 100.0)
NOTE: Large cache significantly reduces runtime. >4Gb is recommended
-s if set, use the shrinking heuristics
-x <int> set N-fold cross validation mode (default: no cross validation)
-i <int> run i-th cross validation only 1<=i<=ncv (default: all)
-r <int> set random seed for shuffling in cross validation mode (default: 1)
-v <0 ~ 4> set the level of verbosity (default: 2)
0 -- error msgs only (ERROR)
1 -- warning msgs (WARN)
2 -- progress msgs at coarse-grained level (INFO)
3 -- progress msgs at fine-grained level (DEBUG)
4 -- progress msgs at finer-grained level (TRACE)
-T <1|4|16> set the number of threads for parallel calculation, 1, 4, or 16
(default: 1)
-z write the model as gzip-compressed (<outprefix>.model.txt.gz)
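The exponential term in the wgkm weight formula can be checked numerically. The sketch below evaluates only the inner expression floor(M*exp(-ln(2)*D/H)+1) from the help text, using the default values M=50 and H=50. It does not apply the outer max(M, ...) and is not taken from the LS-GKM source; the function names are hypothetical.

```python
import math

def decay(distance, m=50, h=50.0):
    """M * exp(-ln(2) * D / H), written as M * 0.5 ** (D / H) (the same value)."""
    return m * 0.5 ** (distance / h)

def floored_term(distance, m=50, h=50.0):
    """floor(M * exp(-ln(2) * D / H) + 1): the inner expression of the help-text formula."""
    return math.floor(decay(distance, m, h) + 1)

if __name__ == "__main__":
    # Print the decay value and the floored term at a few distances.
    for d in (0, 25, 50, 100):
        print(d, round(decay(d), 2), floored_term(d))
```

With H=50 the decay term halves every 50 positions: 50 at D=0, 25 at D=50, and 12.5 at D=100, which is what "half-life" means for -H.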
First, try training a model on the simple test files. Type the following command in the tests/ directory:
$ ../bin/gkmtrain wgEncodeSydhTfbsGm12878Nfe2hStdAlnRep0.tr.fa wgEncodeSydhTfbsGm12878Nfe2hStdAlnRep0.neg.tr.fa test_gkmtrain
It will generate test_gkmtrain.model.txt, which is then used for scoring
DNA sequences as described below. The result should be identical to wgEncodeSydhTfbsGm12878Nfe2hStdAlnRep0.model.txt.
You can also perform cross-validation (CV) analysis with the -x <N> option. For example,
the following command performs 5-fold CV.
$ ../bin/gkmtrain -x 5 wgEncodeSydhTfbsGm12878Nfe2hStdAlnRep0.tr.fa wgEncodeSydhTfbsGm12878Nfe2hStdAlnRep0.neg.tr.fa test_gkmtrain
The result will be stored in test_gkmtrain.cvpred.txt, and it should be identical to
wgEncodeSydhTfbsGm12878Nfe2hStdAlnRep0.cvpred.txt.
Please note that this runs SVM training N times, which can take a long time if the training
sets are large. In that case, you can run a single fold with the -i <I> option
and launch the folds as separate parallel jobs. The output of each run is <outprefix>.cvpred.<I>.txt.
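As a rough sketch of the per-fold approach, the helper below builds one gkmtrain command per fold from the -x and -i options, together with the output file name each run is expected to write. The function name, the default binary path, and the pos.fa/neg.fa file names are placeholders for illustration.

```python
def fold_commands(posfile, negfile, outprefix, nfold, gkmtrain="../bin/gkmtrain"):
    """Return one (argv, expected_output) pair per CV fold, using -x and -i."""
    jobs = []
    for i in range(1, nfold + 1):  # -i is 1-based: 1 <= i <= ncv
        argv = [gkmtrain, "-x", str(nfold), "-i", str(i),
                posfile, negfile, outprefix]
        jobs.append((argv, f"{outprefix}.cvpred.{i}.txt"))
    return jobs

if __name__ == "__main__":
    # pos.fa and neg.fa are placeholder file names.
    for argv, out in fold_commands("pos.fa", "neg.fa", "test_gkmtrain", 5):
        print(" ".join(argv), "->", out)
```

Each argument list can be passed to subprocess.run or handed to a job scheduler, so the folds run independently of each other.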
The format of the cvpred file is as follows:
[sequenceid] [SVM score] [label] [CV-set]
...
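A cvpred file in this four-column layout can be summarized with a few lines of Python. The sketch below parses the records and computes the area under the ROC curve. It assumes that columns are separated by whitespace and that positive sequences carry a label greater than zero; neither detail is stated above, so adjust as needed. The names CvPred, parse_cvpred, and auc are hypothetical.

```python
from collections import namedtuple

CvPred = namedtuple("CvPred", "seqid score label cvset")

def parse_cvpred(lines):
    """Parse lines of the form: [sequenceid] [SVM score] [label] [CV-set]."""
    records = []
    for line in lines:
        fields = line.split()  # assumption: whitespace-separated columns
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        seqid, score, label, cvset = fields
        records.append(CvPred(seqid, float(score), float(label), cvset))
    return records

def auc(records):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [r.score for r in records if r.label > 0]  # assumption: label > 0 is positive
    neg = [r.score for r in records if r.label <= 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative record")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

if __name__ == "__main__":
    example = [  # made-up values for illustration only
        "seq1\t1.2\t1\t0",
        "seq2\t0.3\t1\t1",
        "seq3\t0.5\t-1\t0",
        "seq4\t-0.9\t-1\t1",
    ]
    print(auc(parse_cvpred(example)))
```

The pairwise comparison is quadratic in the number of sequences, which is fine for small sets; a rank-based formula would be preferable for large ones.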
Scoring DNA sequences using gkm-SVM
Use gkmpredict to score any set of sequences.
Usage: gkmpredict [options] <test_seqfile> <model_file> <output_file>
score test sequences using trained gkm-SVM
Arguments:
test_seqfile: sequence file for test (fasta format)
model_file: output of gkmtrain (plain text or gzip-compressed; auto-detected)
output_file: name of output file
Options:
-v <0|1|2|3|4> set the level of verbosity (default: 2)
0 -- error msgs only (ERROR)
1 -- warning msgs (WARN)
2 -- progress msgs at coarse-grained level (INFO)
3 -- progress msgs at fine-grained level (DEBUG)
4 -- progress msgs at finer-grained level (TRACE)
-T <1|4|16> set the number of threads for parallel calculation, 1, 4, or 16
(default: 1)
Now score the positive and the negative test sequences. Type:
$ ../bin/gkmpredict wgEncodeSydhTfbsGm12878Nfe2hStdAlnRep0.test.fa wgEncodeSydhTfbsGm12878Nfe2hStdAlnRep0.model.txt test_gkmpredict.txt
$ ../bin/gkmpredict wgEncodeSydhTfbsGm12878Nfe2hStdAlnRep0.neg.test.fa wgEncodeSydhTfbsGm12878Nfe2hStdAlnRep0.model.txt test_gkmpredict.neg.txt
Training a regression model (gkmtrain-svr)
If your training data has continuous scores rather than positive/negative labels,
use gkmtrain-svr to train a support vector regression (SVR) model. The input is a
tab-delimited file with two columns: sequence and score.
Usage: gkmtrain-svr [options] <datafile> <outprefix>
train support vector regression (SVR) using gkm-kernel and libSVM
Arguments:
datafile: tab-delimited data file. The 1st column is sequence and
the 2nd column is score.
outprefix: prefix of output file(s) <outprefix>.model.txt
(or <outprefix>.model.txt.gz with -z) or
<outprefix>.cvpred.txt
It accepts the same kernel/training options as gkmtrain (-t, -l, -k, -d,
-g, -M, -H, -R, -c, -e, -m, -s, -x, -i, -r, -v, -T, -z),
plus -p <float> to set the epsilon parameter in the SVR loss function (default: 0.1).
The default regularization is -c 0.1 (compared to -c 1.0 for gkmtrain).
The resulting <outprefix>.model.txt can be scored by gkmpredict exactly like a
classifier model.
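The two-column input for gkmtrain-svr can be produced from any list of (sequence, score) pairs. The following is a minimal sketch with a hypothetical function name and file name. It writes no header line and rejects sequences containing characters other than A, C, G, and T. Both choices are assumptions made here for safety, not documented requirements of gkmtrain-svr.

```python
import csv

def write_svr_data(pairs, path):
    """Write (sequence, score) pairs as a two-column, tab-delimited file."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for sequence, score in pairs:
            # Assumption: restrict sequences to plain A/C/G/T.
            if not sequence or set(sequence.upper()) - set("ACGT"):
                raise ValueError(f"unexpected characters in sequence: {sequence!r}")
            writer.writerow([sequence, score])  # 1st column sequence, 2nd column score

if __name__ == "__main__":
    # Made-up sequences and scores for illustration only.
    write_svr_data([("ACGTACGTAC", 1.5), ("TTGACCAGTA", -0.3)], "train_svr.tsv")
```

The resulting file would then be passed as the datafile argument, for example gkmtrain-svr train_svr.tsv followed by an output prefix of your choice.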
Computing the kernel matrix directly (gkmmatrix)
gkmmatrix computes the pairwise gkm kernel matrix for a set of sequences without
training an SVM. This is useful for diagnostics or for plugging the kernel into
external tools.
Usage: gkmmatrix [options] <pos_seqfile> <neg_seqfile> <output_kernel>
build kernel matrix of gkm-SVM
Arguments:
pos_seqfile: positive sequence file (fasta format)
neg_seqfile: negative sequence file (fasta format)
output_kernel: output kernel file
Options:
-t <0 ~ 5> set type of kernel function (default: 2)
-l <int> set word length, 3<=l<=12 (default: 10)
-k <int> set number of informative column, k<=l (default: 6)
-d <int> set maximum number of mismatches to consider, d<=4 (default: 3)
-g <float> set gamma for RBF kernel (-t 3 or 5) (default: 1.0)
-L write only the last row of the kernel matrix (all sequences
vs. the final sequence) on a single line instead of the full
lower-triangular matrix
-v <0 ~ 4> set the level of verbosity (default: 2)
-T <1|4|16> set the number of threads for parallel calculation, 1, 4, or 16
(default: 1)
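External tools usually expect a full symmetric matrix rather than a lower-triangular one. The sketch below expands lower-triangular rows into a full matrix. It assumes that row i of the output file holds i+1 whitespace-separated values, diagonal included; the exact file layout is not described above, so check it against your own output first. The function name is hypothetical.

```python
def read_lower_triangular(lines):
    """Expand lower-triangular rows (row i holds i+1 values) into a full symmetric matrix."""
    # Assumption: values on each line are separated by whitespace.
    rows = [[float(x) for x in line.split()] for line in lines if line.strip()]
    n = len(rows)
    for i, row in enumerate(rows):
        if len(row) != i + 1:
            raise ValueError(f"row {i} has {len(row)} values, expected {i + 1}")
    # K[i][j] == K[j][i], so read each entry from the stored lower half.
    return [[rows[max(i, j)][min(i, j)] for j in range(n)] for i in range(n)]

if __name__ == "__main__":
    example = ["1.0", "0.2 1.0", "0.1 0.3 1.0"]  # made-up values
    for row in read_lower_triangular(example):
        print(row)
```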
Generating weight files for deltaSVM
You need to generate all possible non-redundant k-mers using the Python script
scripts/nrkmers.py. Then, you score them using gkmpredict as described above.
The output of gkmpredict can be directly used by the deltaSVM script deltasvm.pl
available from our deltasvm website.
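To illustrate what a non-redundant k-mer set looks like, the sketch below enumerates all k-mers and keeps one representative per reverse-complement pair. It assumes that "non-redundant" means collapsing reverse complements, in line with the default treatment of reverse complements as the same feature. It is not the logic of scripts/nrkmers.py, whose options and output format are not covered here, and the function names are hypothetical.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def nonredundant_kmers(k):
    """Yield each k-mer once per reverse-complement pair (the lexicographically smaller one)."""
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        if kmer <= revcomp(kmer):  # palindromic k-mers equal their own reverse complement
            yield kmer

if __name__ == "__main__":
    kmers = list(nonredundant_kmers(3))
    print(len(kmers), kmers[:5])
```

For odd k no k-mer is its own reverse complement, so exactly half of the 4^k k-mers remain.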
Please email Dongwon Lee (dongwon.lee AT childrens DOT harvard DOT edu) if you have any questions.