Sequence Feature Constructor
January 3, 2020 ยท View on GitHub
#=============================
Sequence Feature Constructor
A3M_TGT_Gen.sh (v1.07)
#=============================
Abstract: Construct sequence features in TGT format via the search of homology protein sequences for a query sequence in FASTA format. Also generate the Multiple Sequence Alignment (MSA) in A3M format.
Author: Sheng Wang
Email: realbigws@gmail.com
#=============
Publication:
#=============
[1] RaptorX-Property: a Web Server for Protein Structure Property Prediction Sheng Wang#, Wei Li, Shiwang Liu, Jinbo Xu# Nucleic Acids Research, 2016
[2] Protein Secondary Structure Prediction Using Deep Convolutional Neural Fields Sheng Wang#, Jian Peng, Jianzhu Ma, Jinbo Xu# Scientific Reports, 2016
[3] PredMP: a web server for de novo prediction and visualization of membrane proteins Sheng Wang#, Shiyang Fei, Zongan Wang, Yu Li, Jinbo Xu, Feng Zhao#, Xin Gao# Bioinformatics, 2018
#=========
Install:
#=========
- download the package
git clone https://github.com/realbigws/TGT_Package
- compile
cd source_code/ make cd ../
- jackhmmer (optional)
To re-compile the executables, run the following commands:
cd jackhmm/ ./install cd ../
- blastpgp (optional)
To compile the executables in 'util/', run the following commands:
cd buildali2/source_code/ make cd ../../
#==========
Database:
#==========
- if databases/ not exist, create it by
mkdir -p databases/
- download the UniProt20 database from the following link:
uncompressed it, move all files to databases/, and rename these files with prefix 'uniprot20'.
-
if other version of UniProt20 (or, UniClust30) is applied, then use '-d uniprot20_XXXX_YY' option in ./A3M_TGT_Gen.sh
for example, the new version UniClust30 could be downloaded from the below link: http://wwwuser.gwdg.de/~compbiol/uniclust/2017_10/uniclust30_2017_10_hhsuite.tar.gz
Again, you may rename these files with prefix 'uniclust30'.
- if users want to run other packages, such as jackhmm or buildali2, please install the below databases:
(a) jackhmm : databases/uniref90.fasta
this database could be downloaded from the following link:
ftp://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.fasta.gz
[note]: you may rename this file (say, uniref90.fasta) with respect to the current date (e.g., uniref90_2018_06.fasta),
and them symbol link it to uniref90.fasta
(b) buildali2 : databases/nr_databases (must contain nr90 and nr70)
these databases could be downloaded from the following link:
http://raptorx.uchicago.edu/download/
#=======
Usage:
#=======
A3M_TGT_Gen v1.00 [Dec-06-2019] Generate A3M and TGT file from a given sequence in FASTA format.
USAGE: ./A3M_TGT_Gen.sh <-i input_fasta> [-o out_root] [-c CPU_num] [-m memory] [-h package] [-d database] [-n iteration] [-e evalue] [-E neff] [-C coverage] [-M min_cut] [-N max_num] [-A addi_meff] [-V addi_eval] [-D addi_db] [-K remove_tmp] [-f force] [-H home] Options:
***** required arguments ***** -i input_fasta : Query protein sequence in FASTA format.
***** optional arguments ***** #--| misc parameters -o out_root : Default output would the current directory. [default = './${input_name}_A3MTGT']
-c CPU_num : Number of processors. [default = 4]
-m memory : Maximal allowed memory (for hhsuite2 or hhsuite3 only). [default = 3.0 (G)]
#--| search engine and database -h package : The selected package to generate A3M file. [default = hhsuite2] users may use other packages: hhsuite3, jackhmm, or buildali2.
-d database : The selected database for sequence search. [default = uniprot20_2016_02] users may use other uniprot20 databases to run hhsuite2 or hhsuite3, or use uniref90 for jackhmm, and nr_databases for buildali2.
#--| search strategy -n iteration : Maximal iteration to run the seleced package. [default = 2]
-e evalue : E-value cutoff for the selected package. [default = 0.001]
-E neff : Neff cutoff for threading purpose (i.e., -C -2). [default = 7] (for hhsuite only)
-C coverage : Coverage for hhsuite only. [default = -2 (i.e., NOT use -cov in HHblits)] if set to -1, then automatically determine coverage value. if set to any other positive value, then use this -cov in HHblits.
#--| filter strategy -M min_cut : Minimal coverage of sequences in the generated MSA. [default = -1] -1 indicates that we DON'T perform any filtering. Please set from 50 to 70.
-N max_num : Maximal number of sequences in the generated MSA. [default = -1] -1 indicates that we DON'T perform any filtering. For example, set 20000 here.
#--| additional A3M -A addi_meff : run additional A3M only if the previous ln(meff) is lower than this. [default = -1] -1 indicates that we DON'T search for additional A3M.
-V addi_eval : run additional A3M with a given e-value. [default = 0.001]
-D addi_db : run additional A3M using a given database. [default = metaclust50]
#--| other options -K remove_tmp : Remove temporary folder or not. [default = 1 to remove]
-f force : If specificied, then FORCE overwrite existing files. [default = 0 NOT to]
***** home relevant ********** -H home : home directory of TGT_Package. [default = .]
#=================
Running example:
#=================
#-------------- part I: generate A3M without filtering and metagenomics ------------------#
#-> 1. by default, we run HHblits to generate A3M ./A3M_TGT_Gen.sh -c 12 -i example/1pazA.fasta -o 1pazA_out -d uniprot20
#-> 2. we may also use JackHMM to produce A3M ./A3M_TGT_Gen.sh -c 12 -i example/1pazA.fasta -o 1pazA_out -h jackhmm -d uniref90 -n 3
#-> 3. for legacy purpose, we allow BLAST wrapped in BuildAli2 ./A3M_TGT_Gen.sh -c 12 -i example/1pazA.fasta -o 1pazA_out -h buildali2 -d nr_databases -n 5
#-------------- part II: generate A3M with filtering and metagenomics --------------------#
./A3M_TGT_Gen.sh -i example/T1001-D1.fasta -d uniprot20 -N 5000 -A 6 #-> [note]: here 5000 is the maximal number of sequences in the generated A3M.
6 is ln(meff) threshold to run additional A3M.
#-------------- part III: generate a variety of A3Ms with user-defined strategies --------#
./SEQ_to_A3Ms.sh -i example/1pazA.fasta -x '1e-3:3:-1|1:3:-1' -y 'null' -z 'null' -X uniprot20 -c 12 #-> [note]: here we only use the HHblits strategies on uniprot20 database.
#============
References:
#============
[blastpgp]: a) legacy version: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/legacy.NOTSUPPORTED/2.2.26/
b) current version:
ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.2.29/