Sequence Feature Constructor

January 3, 2020 ยท View on GitHub

#=============================

Sequence Feature Constructor

A3M_TGT_Gen.sh (v1.07)

#=============================

Abstract: Construct sequence features in TGT format via the search of homology protein sequences for a query sequence in FASTA format. Also generate the Multiple Sequence Alignment (MSA) in A3M format.

Author: Sheng Wang

Email: realbigws@gmail.com

#=============

Publication:

#=============

[1] RaptorX-Property: a Web Server for Protein Structure Property Prediction Sheng Wang#, Wei Li, Shiwang Liu, Jinbo Xu# Nucleic Acids Research, 2016

[2] Protein Secondary Structure Prediction Using Deep Convolutional Neural Fields Sheng Wang#, Jian Peng, Jianzhu Ma, Jinbo Xu# Scientific Reports, 2016

[3] PredMP: a web server for de novo prediction and visualization of membrane proteins Sheng Wang#, Shiyang Fei, Zongan Wang, Yu Li, Jinbo Xu, Feng Zhao#, Xin Gao# Bioinformatics, 2018

#=========

Install:

#=========

  1. download the package

git clone https://github.com/realbigws/TGT_Package


  1. compile

cd source_code/ make cd ../


  1. jackhmmer (optional)

To re-compile the executables, run the following commands:

cd jackhmm/ ./install cd ../


  1. blastpgp (optional)

To compile the executables in 'util/', run the following commands:

cd buildali2/source_code/ make cd ../../

#==========

Database:

#==========

  1. if databases/ not exist, create it by

mkdir -p databases/

  1. download the UniProt20 database from the following link:

http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/old-releases/uniprot20_2016_02.tgz

uncompressed it, move all files to databases/, and rename these files with prefix 'uniprot20'.

  1. if other version of UniProt20 (or, UniClust30) is applied, then use '-d uniprot20_XXXX_YY' option in ./A3M_TGT_Gen.sh

    for example, the new version UniClust30 could be downloaded from the below link: http://wwwuser.gwdg.de/~compbiol/uniclust/2017_10/uniclust30_2017_10_hhsuite.tar.gz

Again, you may rename these files with prefix 'uniclust30'.

  1. if users want to run other packages, such as jackhmm or buildali2, please install the below databases:

(a) jackhmm : databases/uniref90.fasta

this database could be downloaded from the following link:
    ftp://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.fasta.gz
[note]: you may rename this file (say, uniref90.fasta) with respect to the current date (e.g., uniref90_2018_06.fasta),
        and them symbol link it to uniref90.fasta

(b) buildali2 : databases/nr_databases (must contain nr90 and nr70)

these databases could be downloaded from the following link:
    http://raptorx.uchicago.edu/download/

#=======

Usage:

#=======

A3M_TGT_Gen v1.00 [Dec-06-2019] Generate A3M and TGT file from a given sequence in FASTA format.

USAGE: ./A3M_TGT_Gen.sh <-i input_fasta> [-o out_root] [-c CPU_num] [-m memory] [-h package] [-d database] [-n iteration] [-e evalue] [-E neff] [-C coverage] [-M min_cut] [-N max_num] [-A addi_meff] [-V addi_eval] [-D addi_db] [-K remove_tmp] [-f force] [-H home] Options:

***** required arguments ***** -i input_fasta : Query protein sequence in FASTA format.

***** optional arguments ***** #--| misc parameters -o out_root : Default output would the current directory. [default = './${input_name}_A3MTGT']

-c CPU_num : Number of processors. [default = 4]

-m memory : Maximal allowed memory (for hhsuite2 or hhsuite3 only). [default = 3.0 (G)]

#--| search engine and database -h package : The selected package to generate A3M file. [default = hhsuite2] users may use other packages: hhsuite3, jackhmm, or buildali2.

-d database : The selected database for sequence search. [default = uniprot20_2016_02] users may use other uniprot20 databases to run hhsuite2 or hhsuite3, or use uniref90 for jackhmm, and nr_databases for buildali2.

#--| search strategy -n iteration : Maximal iteration to run the seleced package. [default = 2]

-e evalue : E-value cutoff for the selected package. [default = 0.001]

-E neff : Neff cutoff for threading purpose (i.e., -C -2). [default = 7] (for hhsuite only)

-C coverage : Coverage for hhsuite only. [default = -2 (i.e., NOT use -cov in HHblits)] if set to -1, then automatically determine coverage value. if set to any other positive value, then use this -cov in HHblits.

#--| filter strategy -M min_cut : Minimal coverage of sequences in the generated MSA. [default = -1] -1 indicates that we DON'T perform any filtering. Please set from 50 to 70.

-N max_num : Maximal number of sequences in the generated MSA. [default = -1] -1 indicates that we DON'T perform any filtering. For example, set 20000 here.

#--| additional A3M -A addi_meff : run additional A3M only if the previous ln(meff) is lower than this. [default = -1] -1 indicates that we DON'T search for additional A3M.

-V addi_eval : run additional A3M with a given e-value. [default = 0.001]

-D addi_db : run additional A3M using a given database. [default = metaclust50]

#--| other options -K remove_tmp : Remove temporary folder or not. [default = 1 to remove]

-f force : If specificied, then FORCE overwrite existing files. [default = 0 NOT to]

***** home relevant ********** -H home : home directory of TGT_Package. [default = .]

#=================

Running example:

#=================

#-------------- part I: generate A3M without filtering and metagenomics ------------------#

#-> 1. by default, we run HHblits to generate A3M ./A3M_TGT_Gen.sh -c 12 -i example/1pazA.fasta -o 1pazA_out -d uniprot20

#-> 2. we may also use JackHMM to produce A3M ./A3M_TGT_Gen.sh -c 12 -i example/1pazA.fasta -o 1pazA_out -h jackhmm -d uniref90 -n 3

#-> 3. for legacy purpose, we allow BLAST wrapped in BuildAli2 ./A3M_TGT_Gen.sh -c 12 -i example/1pazA.fasta -o 1pazA_out -h buildali2 -d nr_databases -n 5

#-------------- part II: generate A3M with filtering and metagenomics --------------------#

./A3M_TGT_Gen.sh -i example/T1001-D1.fasta -d uniprot20 -N 5000 -A 6 #-> [note]: here 5000 is the maximal number of sequences in the generated A3M.

6 is ln(meff) threshold to run additional A3M.

#-------------- part III: generate a variety of A3Ms with user-defined strategies --------#

./SEQ_to_A3Ms.sh -i example/1pazA.fasta -x '1e-3:3:-1|1:3:-1' -y 'null' -z 'null' -X uniprot20 -c 12 #-> [note]: here we only use the HHblits strategies on uniprot20 database.

#============

References:

#============

[blastpgp]: a) legacy version: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/legacy.NOTSUPPORTED/2.2.26/

b) current version:
    ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.2.29/