v-annotate.pl example usage, command-line options and alert information

September 2, 2025 ยท View on GitHub


v-annotate.pl example usage

v-annotate.pl uses previously created VADR models from v-build.pl to analyze and annotate sequences in an input sequence file. As part of the analysis of the sequences, more than 70 types of unexpected characteristics, or alerts are detected and reported in the output. Many of the alerts are fatal in that if a sequence has one or more fatal alerts, they will be designated as failing sequences. Sequences with zero fatal alerts are designated as passing sequences. The types of alerts are described further below.

NOTE: the examples below are for norovirus and demonstrate the typical usage of vadr. For examples specific to other viruses (SARS-CoV-2, mpox, and RSV) see:

https://github.com/ncbi/vadr/wiki/Coronavirus-annotation

https://github.com/ncbi/vadr/wiki/Mpox-virus-annotation

https://github.com/ncbi/vadr/wiki/RSV-annotation

To determine the command-line usage of v-annotate.pl (or any VADR script), use the -h option, like this:

v-annotate.pl -h 

You'll see something like the following output:

# v-annotate.pl :: classify and annotate sequences using a model library
# VADR 1.6 (Nov 2023)
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# date:    Wed Nov  8 10:39:19 2023
#
Usage: v-annotate.pl [-options] <fasta file to annotate> <output directory to create>

The first few lines are the banner which show the name of the VADR script being run along with the version and release date. This is followed by the time and date the command was executed. The Usage: line details the expected command line arguments. v-annotate.pl takes as input two command line arguments, a fasta file with sequences to analyze and annotate (<fasta file to annotate>) and and the name of the output directory you want it to create (<output directory to create>) and populate with output files.

After that comes a list of all available command-line options. These are explained in more detail below.

Here is an example v-annotate.pl command using the fasta file vadr/documentation/annotate-files/noro.9.fa with 9 norovirus sequences, and creating an output directory with va-noro.9:

v-annotate.pl $VADRSCRIPTSDIR/documentation/annotate-files/noro.9.fa va-noro.9

The standard output of v-annotate.pl that is printed to the screen (which is also output to the .log output file) begins with the banner and date again followed by a list of relevant environment variables, the command line arguments used and any command line options used:

# date:              Wed Nov  8 10:40:06 2023
# $VADRBIOEASELDIR:  /home/nawrocki/vadr-install-dir/Bio-Easel
# $VADRBLASTDIR:     /home/nawrocki/vadr-install-dir/ncbi-blast/bin
# $VADREASELDIR:     /home/nawrocki/vadr-install-dir/infernal/binaries
# $VADRINFERNALDIR:  /home/nawrocki/vadr-install-dir/infernal/binaries
# $VADRMODELDIR:     /home/nawrocki/vadr-install-dir/vadr-models-calici
# $VADRSCRIPTSDIR:   /home/nawrocki/vadr-install-dir/vadr
#
# sequence file:     /home/nawrocki/vadr-install-dir/vadr/documentation/annotate-files/noro.9.fa
# output directory:  va-noro.9

No command line options were used in our example output, but if they were information on them would have appeared after the output directory line.

Next, information is output about each step the script is proceeding through. When each step is completed, the elapsed time for that step is output.

v-annotate.pl will use the default VADR model library (CM file $VADRMODELDIR/vadr.cm, model info file $VADRMODELDIR/vadr.minfo and BLAST DBs in $VADRMODELDIR/) to analyze the sequences in noro.9.fa, and will create an output directory named va-noro.9 and populate it with many output files.

There are four main stages to v-annotate.pl

  1. Classification: each sequence S in the input file is compared against all models in the model library to determine the best-matching (highest-scoring) model M(S) for each sequence.

  2. Coverage determination: each sequence S is compared again to M(S) to determine the coverage of S, which is basically the fraction of nucleotides in S that appear to be homologous (sufficiently similar) to M(S).

  3. Alignment and feature mapping: each sequence S is aligned to M(S) and based on that alignment, features of M(S) are mapped onto S.

  4. Protein validation of CDS features: for each sequence S that has 1 or more predicted CDS features, blastx or hmmsearch is used to compare the predicted CDS and the full sequence S against the VADR library BLAST or HMM DB.

The output of v-annotate.pl lists one or more steps per stage. The first two steps are:

# Validating input                                                                        ... done. [    0.3 seconds]
# Classifying sequences (9 seqs)                                                          ... done. [   31.9 seconds]

The first step validates that the VADR library .minfo file being used corresponds to the VADR library .cm file. Next, the classification stage is performed. After that, each model that is M(S) for at least one S is run separately through the coverage determination stage for all of its sequences:

# Determining sequence coverage (NC_001959: 1 seq)                                        ... done. [    0.8 seconds]
# Determining sequence coverage (NC_008311: 2 seqs)                                       ... done. [    3.0 seconds]
# Determining sequence coverage (NC_029645: 2 seqs)                                       ... done. [    1.0 seconds]
# Determining sequence coverage (NC_039477: 2 seqs)                                       ... done. [    3.1 seconds]
# Determining sequence coverage (NC_044854: 2 seqs)                                       ... done. [    0.9 seconds]

Next, the alignments are performed for each model, and used to map feature annotation:

# Aligning sequences (NC_001959: 1 seq)                                                   ... done. [    0.7 seconds]
# Aligning sequences (NC_008311: 2 seqs)                                                  ... done. [   13.5 seconds]
# Aligning sequences (NC_029645: 2 seqs)                                                  ... done. [    1.5 seconds]
# Aligning sequences (NC_039477: 2 seqs)                                                  ... done. [   14.2 seconds]
# Aligning sequences (NC_044854: 2 seqs)                                                  ... done. [    0.9 seconds]
# Determining annotation                                                                  ... done. [    0.3 seconds]

The classification and alignment stages are typically the slowest. The protein validation stage is usually relatively fast:

# Validating proteins with blastx (NC_001959: 1 seq)                                      ... done. [    1.0 seconds]
# Validating proteins with blastx (NC_008311: 2 seqs)                                     ... done. [    0.8 seconds]
# Validating proteins with blastx (NC_029645: 2 seqs)                                     ... done. [    0.6 seconds]
# Validating proteins with blastx (NC_039477: 2 seqs)                                     ... done. [    0.7 seconds]
# Validating proteins with blastx (NC_044854: 2 seqs)                                     ... done. [    0.5 seconds]

The only remaining steps are to create the output files:

# Generating feature table output                                                         ... done. [    0.0 seconds]
# Generating tabular output                                                               ... done. [    0.0 seconds]

After the output files are generated, a summary of the results is output, starting with a list of all models that were the best matching model to one or more sequences:

# Summary of classified sequences:
#
#                                      num   num   num
#idx  model      group      subgroup  seqs  pass  fail
#---  ---------  ---------  --------  ----  ----  ----
1     NC_008311  Norovirus  GV           2     1     1
2     NC_029645  Norovirus  GIII         2     2     0
3     NC_039477  Norovirus  GII          2     2     0
4     NC_044854  Norovirus  GI           2     2     0
5     NC_001959  Norovirus  GI           1     1     0
#---  ---------  ---------  --------  ----  ----  ----
-     *all*      -          -            9     8     1
-     *none*     -          -            0     0     0
#---  ---------  ---------  --------  ----  ----  ----

The above table is also saved to a file with the suffix .mdl, and the format is described in more detail here.

The nine input sequences collectively had five different best-matching models, all Norovirus models with the subgroups shown above. (The groups and subgroups shown above were input as command-line options to v-build.pl when the models were built. An example is here. Eight of the nine sequences passed and one failed.

Next, the alerts reported for the sequences are summarized:

# Summary of reported alerts:
#
#     alert     causes   short                            per    num   num  long
#idx  code      failure  description                     type  cases  seqs  description
#---  --------  -------  ---------------------------  -------  -----  ----  -----------
1     mutendcd  yes      MUTATION_AT_END              feature      1     1  expected stop codon could not be identified, predicted CDS stop by homology is invalid
2     cdsstopn  yes      CDS_HAS_STOP_CODON           feature      1     1  in-frame stop codon exists 5' of stop position predicted by homology to reference
3     cdsstopp  yes      CDS_HAS_STOP_CODON           feature      1     1  stop codon in protein-based alignment
4     indf5pst  yes      INDEFINITE_ANNOTATION_START  feature      1     1  protein-based alignment does not extend close enough to nucleotide-based alignment 5' endpoint
5     indf3pst  yes      INDEFINITE_ANNOTATION_END    feature      1     1  protein-based alignment does not extend close enough to nucleotide-based alignment 3' endpoint
#---  --------  -------  ---------------------------  -------  -----  ----  -----------

For the nine sequences, there were five different alert codes reported, each with one case for a single sequence. This tabular summary does not indicate to which sequence(s) each alert pertains. We can find that out from other output files as discussed further below.

Next, the list of output files created by v-annotate.pl is printed, along with elapsed time:

# Output printed to screen saved in:                              va-noro.9.vadr.log
# List of executed commands saved in:                             va-noro.9.vadr.cmd
# List and description of all output files saved in:              va-noro.9.vadr.filelist
# esl-seqstat -a output for input fasta file saved in:            va-noro.9.vadr.seqstat
# 5 column feature table output for passing sequences saved in:   va-noro.9.vadr.pass.tbl
# 5 column feature table output for failing sequences saved in:   va-noro.9.vadr.fail.tbl
# list of passing sequences saved in:                             va-noro.9.vadr.pass.list
# list of failing sequences saved in:                             va-noro.9.vadr.fail.list
# list of alerts in the feature tables saved in:                  va-noro.9.vadr.alt.list
# fasta file with passing sequences saved in:                     va-noro.9.vadr.pass.fa
# fasta file with failing sequences saved in:                     va-noro.9.vadr.fail.fa
# per-sequence tabular annotation summary file saved in:          va-noro.9.vadr.sqa
# per-sequence tabular classification summary file saved in:      va-noro.9.vadr.sqc
# per-feature tabular summary file saved in:                      va-noro.9.vadr.ftr
# per-model-segment tabular summary file saved in:                va-noro.9.vadr.sgm
# per-model tabular summary file saved in:                        va-noro.9.vadr.mdl
# per-alert tabular summary file saved in:                        va-noro.9.vadr.alt
# alert count tabular summary file saved in:                      va-noro.9.vadr.alc
# alignment doctoring tabular summary file saved in:              va-noro.9.vadr.dcr
#
# All output files created in directory ./va-noro.9/
#
# Elapsed time:  00:01:16.72
#                hh:mm:ss
# 
[ok]

All of these files were created in the newly created directory va-noro.9. What follows is a brief discussion of some of these output file types. More information on these files and their formats can be found here.

The first three files are the .log file, which is the same as the standard output printed to the screen currently being discussed, the .cmd file, and the .filelist file which lists the output files created by v-annotate.pl. Next comes a .seqstat file with lengths for each sequence in the input file.

v-annotate.pl also creates 5-column tab-delimited feature table files that end with the suffix .tbl. There is a separate file for passing (va-noro9.vadr.pass.tbl) and failing (va-noro9.vadr.fail.tbl) sequences. The format of the .tbl files is described here: https://www.ncbi.nlm.nih.gov/genbank/feature_table/ These files contain some of the same information as the .ftr files discussed above, with some additional information on qualifiers for features read from the model info file. From the va-noro9.vadr.pass.tbl file, the feature table for KY887602 is:

>Feature KY887602.1
<1	5083	gene
			gene	ORF1
<1	5083	CDS
			product	nonstructural polyprotein
			codon_start	2
			protein_id	KY887602.1_1
<1	979	mat_peptide
			product	p48
			protein_id	KY887602.1_1
980	2077	mat_peptide
			product	NTPase
			protein_id	KY887602.1_1
2078	2608	mat_peptide
			product	p22
			protein_id	KY887602.1_1
2609	3007	mat_peptide
			product	VPg
			protein_id	KY887602.1_1
3008	3550	mat_peptide
			product	Pro
			protein_id	KY887602.1_1
3551	5080	mat_peptide
			product	RdRp
			protein_id	KY887602.1_1
5064	6686	gene
			gene	ORF2
5064	6686	CDS
			product	VP1
			protein_id	KY887602.1_2
6686	7492	gene
			gene	ORF3
6686	7492	CDS
			product	VP2
			protein_id	KY887602.1_3

In the .fail.tbl file, each sequence's feature table contains a special section at the end of the table that lists errors due to reported fatal alerts. For example, at the end of va-noro9.vadr.fail.tbl:

Additional note(s) to submitter:
ERROR: CDS_HAS_STOP_CODON: (CDS:VF1) in-frame stop codon exists 5' of stop position predicted by homology to reference [TGA, shifted S:408,M:408]; seq-coords:5275..5277:+; mdl-coords:5300..5302:+; mdl:NC_008311;
ERROR: CDS_HAS_STOP_CODON: (CDS:VF1) stop codon in protein-based alignment [-]; seq-coords:5275..5277:+; mdl-coords:5300..5302:+; mdl:NC_008311;
ERROR: INDEFINITE_ANNOTATION_END: (CDS:VF1) protein-based alignment does not extend close enough to nucleotide-based alignment 3' endpoint [36>5, no valid stop codon in nucleotide-based prediction]; seq-coords:5650..5685:+; mdl-coords:5710..5710:+; mdl:NC_008311;
ERROR: INDEFINITE_ANNOTATION_START: (CDS:VP2) protein-based alignment does not extend close enough to nucleotide-based alignment 5' endpoint [54>5]; seq-coords:6656..6709:+; mdl-coords:6681..6681:+; mdl:NC_008311;

Note that this file lists only four ERRORs while the .alt output file below lists five alerts. Not all fatal alerts will be printed to this .fail.tbl file, because when specific pairs of alerts occur, only one is output to reduce the number of overlapping or redundant problems reported to the submitter/user. In this case the mutendcd (MUTATION_AT_END) alert is omitted in the .fail.tbl file because it occurs in combination with a cdsstopn (CDS_HAS_STOP_CODON) alert. But this is rare, as only one alert (mutendcd) can possibly be omitted. See the far right column of the this table for which alerts, when present in combination with mutendcd causes it to be omitted. Only alerts output to the .fail.tbl table are output to the .alt.list file.

The next three output files all end in .list. The .pass.list and .fail.list files are simply lists of all passing and failing sequences, respectively, with each sequence listed on a separate line. The .alt.list file lists all alerts that cause errors in the .fail.tbl file in a four field tab-delimited format described here. The va-noro9.vadr.alt.list file lists the four alerts/errors in the .fail.tbl file shown above:

#sequence	model	feature-type	feature-name	error	seq-coords	mdl-coords	error-description
JN975492.1	NC_008311	CDS	VF1	CDS_HAS_STOP_CODON	5275..5277:+	5300..5302:+	in-frame stop codon exists 5' of stop position predicted by homology to reference [TGA, shifted S:408,M:408]
JN975492.1	NC_008311	CDS	VF1	CDS_HAS_STOP_CODON	5275..5277:+	5300..5302:+	stop codon in protein-based alignment [-]
JN975492.1	NC_008311	CDS	VF1	INDEFINITE_ANNOTATION_END	5650..5685:+	5710..5710:+	protein-based alignment does not extend close enough to nucleotide-based alignment 3' endpoint [36>5, no valid stop codon in nucleotide-based prediction]
JN975492.1	NC_008311	CDS	VP2	INDEFINITE_ANNOTATION_START	6656..6709:+	6681..6681:+	protein-based alignment does not extend close enough to nucleotide-based alignment 5' endpoint [54>5]

After that are two FASTA-formatted sequence files. One of these files includes all passing sequences (va-noro.9.vadr.pass.fa) and the other includes all failing sequences (va-noro.9.vadr.fail.fa).

After these two FASTA files are eight tabular summary files that end with three letter suffixes:

suffixdescriptionreference
.alcper-alert code information (counts)description of format
.altper-alert instance informationdescription of format
.ftrper-feature informationdescription of format
.mdlper-model informationdescription of format
.sgmper-segment informationdescription of format
.sqaper-sequence annotation informationdescription of format
.sqcper-sequence classification informationdescription of format
.dcralignment doctoring informationdescription of format

The contents of the .mdl and .alc files were already output by v-annotate.pl as covered above. To get more information on each sequence, see the .sqa and .sqc files. The .sqc file (va-noro.9.vadr.sqc) includes information on the classification of each sequence:

#seq  seq          seq                                   sub                    seq    mdl         num                             sub     score  diff/  seq   
#idx  name         len  p/f   ant  model1     grp1       grp1   score  sc/nt    cov    cov  bias  hits  str  model2     grp2       grp2     diff     nt  alerts
#---  ----------  ----  ----  ---  ---------  ---------  ----  ------  -----  -----  -----  ----  ----  ---  ---------  ---------  -----  ------  -----  ------
1     KY887602.1  7547  PASS  yes  NC_039477  Norovirus  GII   8142.8  1.079  1.000  0.997  12.5     1    +  NC_044046  Norovirus  GVIII  4348.2  0.576  -     
2     KT818729.1   243  PASS  yes  NC_044854  Norovirus  GI     170.4  0.701  0.996  0.031     0     1    +  NC_044932  Norovirus  GII      90.0  0.370  -     
3     EU437710.1   291  PASS  yes  NC_001959  Norovirus  GI     249.4  0.857  1.000  0.038     0     1    +  NC_044855  Norovirus  GIV     161.5  0.555  -     
4     DQ288307.1  1094  PASS  yes  NC_029645  Norovirus  GIII   973.4  0.890  1.000  0.150   6.2     1    +  NC_001959  Norovirus  GI      671.1  0.613  -     
5     AY237464.1   255  PASS  yes  NC_039477  Norovirus  GII    221.1  0.867  1.000  0.034     0     1    +  NC_044047  Norovirus  GVII    113.5  0.445  -     
6     KF475958.1   275  PASS  yes  NC_044854  Norovirus  GI     248.0  0.902  1.000  0.036     0     1    +  NC_044932  Norovirus  GII     164.0  0.596  -     
7     AB713840.1   347  PASS  yes  NC_008311  Norovirus  GV     330.6  0.953  1.000  0.047   0.2     1    +  NC_040876  Norovirus  GII     239.5  0.690  -     
8     JN585032.1   286  PASS  yes  NC_029645  Norovirus  GIII   242.3  0.847  0.997  0.039   0.1     1    +  NC_039897  Norovirus  GI      154.8  0.541  -     
9     JN975492.1  7286  FAIL  yes  NC_008311  Norovirus  GV    4666.2  0.640  1.000  0.987  16.8     1    +  NC_044047  Norovirus  GVII   3382.8  0.464  -     

This file includes per-sequence information on whether each sequence passed or failed, its best and second-best matching model, and scores and coverage. The difference in score between the best and second-best model gives an indication of how confidently classified the sequence is to the best-matching model. (The qstgroup and qstsbgrp alerts are reported if these scores are not sufficiently far apart to alert the user that a sequence is not confidently classified.) Note that the sequence JN975492.1 is the only sequence that failed. It has no seq alerts listed in the final field, so the fatal alerts that caused it to fail must have been per-feature alerts. These can be seen in the .ftr and .alt files. An explanation of all fields in the .sqc file type is here.

Next, take a look at the first few lines of the .ftr file (va-noro.9.vadr.ftr):

#     seq          seq                   ftr          ftr                         ftr  ftr  par                                                                                                        seq         model  ftr   
#idx  name         len  p/f   model      type         name                        len  idx  idx  str  n_from  n_to  n_instp  trc    5'N  3'N  p_from  p_to           p_instp  p_sc  nsa  nsn        coords        coords  alerts
#---  ----------  ----  ----  ---------  -----------  -------------------------  ----  ---  ---  ---  ------  ----  -------  -----  ---  ---  ------  ----  ----------------  ----  ---  ---  ------------  ------------  ------
1.1   KY887602.1  7547  PASS  NC_039477  gene         ORF1                       5083    1   -1    +       1  5083        -  5'       0    0       -     -                 -     -    1    0     1..5083:+    22..5104:+  -     
1.2   KY887602.1  7547  PASS  NC_039477  CDS          nonstructural_polyprotein  5083    2   -1    +       1  5083        -  5'       0    0       2  5080                 -  9094    1    0     1..5083:+    22..5104:+  -     
1.3   KY887602.1  7547  PASS  NC_039477  gene         ORF2                       1623    3   -1    +    5064  6686        -  no       0    0       -     -                 -     -    1    0  5064..6686:+  5085..6707:+  -     
1.4   KY887602.1  7547  PASS  NC_039477  CDS          VP1                        1623    4   -1    +    5064  6686        -  no       0    0    5064  6683                 -  2878    1    0  5064..6686:+  5085..6707:+  -     
1.5   KY887602.1  7547  PASS  NC_039477  gene         ORF3                        807    5   -1    +    6686  7492        -  no       0    0       -     -                 -     -    1    0  6686..7492:+  6707..7513:+  -     
1.6   KY887602.1  7547  PASS  NC_039477  CDS          VP2                         807    6   -1    +    6686  7492        -  no       0    0    6686  7486                 -  1393    1    0  6686..7492:+  6707..7513:+  -     
1.7   KY887602.1  7547  PASS  NC_039477  mat_peptide  p48                         979    7    2    +       1   979        -  5'       0    0       -     -                 -     -    1    0      1..979:+    22..1000:+  -     
1.8   KY887602.1  7547  PASS  NC_039477  mat_peptide  NTPase                     1098    8    2    +     980  2077        -  no       0    0       -     -                 -     -    1    0   980..2077:+  1001..2098:+  -     
1.9   KY887602.1  7547  PASS  NC_039477  mat_peptide  p22                         531    9    2    +    2078  2608        -  no       0    0       -     -                 -     -    1    0  2078..2608:+  2099..2629:+  -     
1.10  KY887602.1  7547  PASS  NC_039477  mat_peptide  VPg                         399   10    2    +    2609  3007        -  no       0    0       -     -                 -     -    1    0  2609..3007:+  2630..3028:+  -     
1.11  KY887602.1  7547  PASS  NC_039477  mat_peptide  Pro                         543   11    2    +    3008  3550        -  no       0    0       -     -                 -     -    1    0  3008..3550:+  3029..3571:+  -     
1.12  KY887602.1  7547  PASS  NC_039477  mat_peptide  RdRp                       1530   12    2    +    3551  5080        -  no       0    0       -     -                 -     -    1    0  3551..5080:+  3572..5101:+  -     
#

This file includes information on each annotated feature, organized by sequence. The lines above show the annotated features for the first sequence KY887602.1 and include information on the feature length, strand, positions in the sequence and in the reference model, whether it is truncated or not, and where the blastx protein validation stage predicted coordinates are. The seq coords and mdl coords lines near the end show the sequence and model coordinates of each feature in the VADR coordinate string format, described here. The final column lists any alerts pertaining to each feature. For the features in this sequence there are no alerts, but for the final sequence, some alerts are listed for the second CDS. An explanation of all fields in the .ftr file type is here.

Another important file is the .alt output file (va-noro9.vadr.alt) which includes one line per alert reported:

#      seq                    ftr   ftr   ftr  alert           alert                                 seq  seq           mdl  mdl  alert 
#idx   name        model      type  name  idx  code      fail  description                        coords  len        coords  len  detail
#----  ----------  ---------  ----  ----  ---  --------  ----  ---------------------------  ------------  ---  ------------  ---  ------
9.1.1  JN975492.1  NC_008311  CDS   VF1     6  mutendcd  yes   MUTATION_AT_END              5683..5685:+    3  5708..5710:+    3  expected stop codon could not be identified, predicted CDS stop by homology is invalid [TCA]
9.1.2  JN975492.1  NC_008311  CDS   VF1     6  cdsstopn  yes   CDS_HAS_STOP_CODON           5275..5277:+    3  5300..5302:+    3  in-frame stop codon exists 5' of stop position predicted by homology to reference [TGA, shifted S:408,M:408]
9.1.3  JN975492.1  NC_008311  CDS   VF1     6  cdsstopp  yes   CDS_HAS_STOP_CODON           5275..5277:+    3  5300..5302:+    3  stop codon in protein-based alignment [-]
9.1.4  JN975492.1  NC_008311  CDS   VF1     6  indf3pst  yes   INDEFINITE_ANNOTATION_END    5650..5685:+   36  5710..5710:+    1  protein-based alignment does not extend close enough to nucleotide-based alignment 3' endpoint [36>5, no valid stop codon in nucleotide-based prediction]
9.2.1  JN975492.1  NC_008311  CDS   VP2     8  indf5pst  yes   INDEFINITE_ANNOTATION_START  6656..6709:+   54  6681..6681:+    1  protein-based alignment does not extend close enough to nucleotide-based alignment 5' endpoint [54>5]

All alerts are for the JN975492.1 sequence. Four are for the VF1 CDS and 1 is for the VP2 CDS. The alert codes are listed in the seventh column, along with a brief description in the eigth column. Then the sequence and model coordinates pertaining to the alert and the lengths of those regions are listed in columns 9 to 12. A more detailed description of the problem can be found in the final column. All possible alerts are listed in the alert table. For some examples of different types of alerts see here.

Example of using the v-annotate.pl --alt_pass and --alt_fail to change alerts from fatal to non-fatal and vice versa

One way to change the behavior of v-annotate.pl is to change which alerts are fatal or non-fatal. Most alerts are fatal by default, but some are not, as shown in the alert table. Some alerts are always fatal in that they cannot be changed, but all others can be toggled between fatal or non-fatal using the --alt_pass and --alt_fail options.

For example, we can make the JN976492.1 sequence above pass by making the five observed alerts (mutendcd, cdsstopn, cdsstopp, indf3pst, and indf5pst) by rerunning the v-annotate.pl above with this command:

v-annotate.pl --alt_pass mutendcd,cdsstopn,cdsstopp,indf3pst,indf5pst $VADRSCRIPTSDIR/documentation/annotate-files/noro.9.fa va-pass-noro.9

To supply multiple alerts with --alt_pass or --alt_fail, separate them by a , without any whitespace, like above.

The output will look very similar to the earlier run, but the summary information printed at the end will show that no sequences fail this time, despite the same alerts being reported.

# Summary of classified sequences:
#
#                                      num   num   num
#idx  model      group      subgroup  seqs  pass  fail
#---  ---------  ---------  --------  ----  ----  ----
1     NC_008311  Norovirus  GV           2     2     0
2     NC_039477  Norovirus  GII          2     2     0
3     NC_044854  Norovirus  GI           2     2     0
4     NC_029645  Norovirus  GIII         2     2     0
5     NC_001959  Norovirus  GI           1     1     0
#---  ---------  ---------  --------  ----  ----  ----
-     *all*      -          -            9     9     0
-     *none*     -          -            0     0     0
#---  ---------  ---------  --------  ----  ----  ----
#
# Summary of reported alerts:
#
#     alert     causes   short                            per    num   num  long
#idx  code      failure  description                     type  cases  seqs  description
#---  --------  -------  ---------------------------  -------  -----  ----  -----------
1     mutendcd  no       MUTATION_AT_END              feature      1     1  expected stop codon could not be identified, predicted CDS stop by homology is invalid
2     cdsstopn  no       CDS_HAS_STOP_CODON           feature      1     1  in-frame stop codon exists 5' of stop position predicted by homology to reference
3     cdsstopp  no       CDS_HAS_STOP_CODON           feature      1     1  stop codon in protein-based alignment
4     indf5pst  no       INDEFINITE_ANNOTATION_START  feature      1     1  protein-based alignment does not extend close enough to nucleotide-based alignment 5' endpoint
5     indf3pst  no       INDEFINITE_ANNOTATION_END    feature      1     1  protein-based alignment does not extend close enough to nucleotide-based alignment 3' endpoint
#---  --------  -------  ---------------------------  -------  -----  ----  -----------

The --alt_fail <s> option works the same way as alt_pass <s> but alert codes in <s> should be non-fatal by default.

Alternatively, if you want to relax the stringency of alerts only for specific features you can do that by modifying the modelinfo input file as explained below.


Running v-annotate.pl inside the v-scan.pl wrapper](#scan)

The v-scan.pl script can be used as a simple wrapper for v-annotate.pl if you want to use multiple model libraries. v-scan.pl reads a list of model libraries and command-line options for each from an input config file, determines which model library to use for each sequence in the input fasta file and runs v-annotate.pl using the options specified in the config file.

For more information see this page


v-annotate.pl command-line options

To get a list of command-line options, execute:

v-annotate.pl -h

This will output the usage and available command-line options. Each option has a short description, but additional information on some of these options can be found below. For v-annotate.pl the available options are split into nine different categories, each explained in their own subsection below.

In the tables describing options below, <s> represents a string, <x> indicates a floating point number and <n> represents an integer.

v-annotate.pl basic options

......option....explanation
-fif <output directory> already exists, then using this option will cause it to be overwritten, otherwise the progam exits in error
-vverbose mode: all commands will be output to standard output as they are run
--atgonlyonly consider ATG as a valid start codon, regardless of model's translation table
--minpvlen <n>set the minimum length in nucleotides for CDS/mat_peptide/gene features to be output to feature tables and for protein validation analysis to <n>, default <n> is 30
--nkb <n>set the target number of Kb of sequence for each alignment job and/or chunk (with --split) to <n> Kb (thousand nucleotides), default <n> is 300
--keepkeep additional output files that are normally removed

v-annotate.pl options for specifying expected sequence classification

..........option..........explanation
--group <s>specify that the expected classification of all sequences is group <s>, sequences determined to not be in this group will trigger an incgroup alert
--subgroup <s2>specify that the expected classification of all sequences is subgroup <s> within group <s2> from --group <s2>, sequences determined to not be in this group will trigger an incsubgrp alert; requires --group

v-annotate.pl options for controlling which alerts are fatal and cause a sequence to FAIL

............option............explanation
--alt_listoutput summary of all alerts and then exit
--alt_pass <s>specify that alert codes in comma-separated string <s> are non-fatal (do not cause a sequence to fail), all alert codes listed must be fatal by default
--alt_fail <s>specify that alert codes in comma-separated string <s> are fatal (cause a sequence to fail), all alert codes listed must be non-fatal by default
--alt_mnf_yes <s>specify that alert codes in comma-separated string <s> for 'misc_not_failure' features cause misc_feature-ization, not failure as explained more here
--alt_mnf_no <s>specify that alert codes in comma-separated string <s> for 'misc_not_failure' features cause failure, not misc-feature-ization as explained more here

v-annotate.pl options for ignoring specific keys in the input model info (.minfo) file

............option............explanation
--ignore_mnfignore non-zero 'misc_not_feature' values in modelinfo file, set to 0 for all features/models
--ignore_isdelignore non-zero 'is_deletable' values in modelinfo file, set to 0 for all features/models
--ignore_afsetignore non-zero 'alternative_ftr_set' and 'alternative_ftr_set_subn' values in modelinfo file
--ignore_afsetsubnignore non-zero 'alternative_ftr_set_subn' values in modelinfo file
--ignore_canonssignore non-zero 'canon_splice_sites' values in modelinfo file
--force_canonssforce 'canon_splice_sites' value is 1 for all CDS with qualifying introns (gaps between segments >= <n> nucleotides from --intlen option, by default <n> is 40), this will force a check for GT/AG splice sites in all introns
--ignore_excdo not allow any exceptions, ignoring all exception keys (*_exc) in the model info file
--ignore_oftforce all feature types to output to feature table, ignoring any omit_from_tbl keys in the model info file
.......option.......explanation
-m <s>use the CM file <s>, instead of the default CM file ($VADRMODELDIR/vadr.cm)
-a <s>use HMM file <s> instead of the default HMM file ($VADRMODELDIR/vadr.hmm)
-i <s>use the VADR model info file <s>, instead of the default model info file ($VADRMODELDIR/vadr.minfo)
-n <s>use the blastn DB file <s> when necessary, instead of the default blastn DB file ($VADRMODELDIR/vadr.fa), only used if -s or -r is also used
-x <s>specify that the blastx database files to use for protein validation are in dir <s>, instead of the default directory ($VADRMODELDIR)
--mkey <s>specify that .cm, .minfo, and blastn .fa files in $VADRMODELDIR start with key <s>, not 'vadr'
--mdir <s>specify that all model files to use are in the directory <s>, not in $VADRMODELDIR
--mlist <s>specify that only the subset of models listed in the file <s> be used

v-annotate.pl options for controlling output feature table

.......option.......explanation
--nomiscin feature table, never change feature to misc_feature
--notrim in feature table, do not trim coordinate start and stops due to Ns at beginning or end of features for all feature types
--noftrtrim <s>in feature table, do not trim coordinate start and stops due to Ns at beginning or end of features for feature types listed in the comma-delimited string <s> (no spaces)
--noprotidin feature table, don't add protein_id for CDS and mat_peptide features
--forceprotidin feature table, force protein_id value to be sequence name, then idx
--forcegenein feature table, add 'gene' qualifiers from model info file for CDS and mat_peptide features, normally these are omitted
--forcequal <s>in feature table, add qualifiers from model info file listed in comma-delimited string

In the table below, <n> represents a positive interger argument and <x> represents a positive floating-point argument.

...........option...........relevant alert code(s)relevant error(s)default value that triggers alertexplanation
--lowsc <x>lowscoreLOW_SCORE< 0.3 set bits per nt threshold for alert to <x>
--indefclass <x>indfclasINDEFINITE_CLASSIFICATION< 0.03 set bits per nt difference threshold for alert between top two models (not in same subgroup) to <x>
--incspec <x>incgroup, incsubgrpINCORRECT_SPECIFIED_GROUP, INCORRECT_SPECIFIED_SUBGROUP< 0.2 set bits per nt difference threshold for alert between best-matching model <m> and highest-scoring model in specified group <s1> (from --group <s1>) or subgroup <s2> (from --subgroup <s2>), where <m> is not in group/subgroup <s1>/<s2> to <x>
--lowcov <x>lowcovrgLOW_COVERAGE< 0.9 set fractional coverage threshold for alert to <x>
--dupregolp <n>dupreginDUPLICATE_REGIONS>= 20set min number of model position overlap for alert to <n> positions
--dupregsc <x>dupreginDUPLICATE_REGIONS>= 10.0 set min bit score of weaker overlapping hit to <x> bits
--indefstr <x>indfstrnINDEFINITE_STRAND>= 25.0 set bit score of weaker strand hit for alert to <x>
--lowsim5seq <n>lowsim5sLOW_SIMILARITY_START>= 15 set length (nt) threshold for alert to <n>
--lowsim3seq <n>lowsim3sLOW_SIMILARITY_END>= 15 set length (nt) threshold for alert to <n>
--lowsimiseq <n>lowsimisLOW_SIMILARITY>= 1 set length (nt) threshold for alert to <n>
--lowsim5ftr <n>lowsim5c, lowsim5nLOW_FEATURE_SIMILARITY_START>= 5 set length (nt) threshold for alert to <n>
--lowsim3ftr <n>lowsim3c, lowsim3nLOW_FEATURE_SIMILARITY_END>= 5 set length (nt) threshold for alert to <n>
--lowsimiftr <n>lowsimic, lowsiminLOW_FEATURE_SIMILARITY>= 1 set length (nt) threshold for alert to <n>
--lowsim5lftr <n>lowsim5lLOW_FEATURE_SIMILARITY_START>= 30 set length (nt) threshold for alert to <n>
--lowsim3lftr <n>lowsim3lLOW_FEATURE_SIMILARITY_END>= 30 set length (nt) threshold for alert to <n>
--lowsimilftr <n>lowsimilLOW_FEATURE_SIMILARITY>= 30 set length (nt) threshold for alert to <n>
--extrant5 <n>extrant5EXTRA_SEQUENCE_START>= 1 set length (nt) threshold for alert to <n>
--extrant3 <n>extrant3EXTRA_SEQUENCE_END>= 1 set length (nt) threshold for alert to <n>
--biasfrac <x>biasdseqBIASED_SEQUENCE>= 0.25 set fractional bit score threshold for biased score/total score for alert to <x>
--nmiscftrthr <n>nmiscftrTOO_MANY_MISC_FEATURES>= 4 set minimum number of misc_features per sequence for alert to <n>
--indefann <x>indf5lcc, indf5lcn, indf3lcc, indf3lcnINDEFINITE_ANNOTATION_START, INDEFINITE_ANNOTATION_END< 0.8 set posterior probability threshold for non-mat_peptide features for alert to <x>
--indefann_mp <x>indf5lcc, indf5lcn, indf3lcc, indf3lcnINDEFINITE_ANNOTATION_START, INDEFINITE_ANNOTATION_END< 0.6 set posterior probability threshold for mat_peptide features for alert to <x>
--fstminntt <n>fsthicft, fstlocft, fstukct5POSSIBLE_FRAMESHIFT_HIGH_CONF, POSSIBLE_FRAMESHIFT_LO_CONF, POSSIBLE_FRAMESHIFT>= 4 set maximum allowed length of aligned region in different frame in which frame is not restored before CDS end to <n>
--fstminnti <n>fsthicfi, fstlocfi, fstukcfiPOSSIBLE_FRAMESHIFT_HIGH_CONF, POSSIBLE_FRAMESHIFT_LO_CONF, POSSIBLE_FRAMESHIFT>= 6 set maximum allowed length of aligned region in different frame in which frame is restored before CDS end to <n>
--fsthighthr <x>fsthicnfPOSSIBLE_FRAMESHIFT_HIGH_CONF>= 0.8 set average posterior probability threshold for potentially frameshifted region for high confidence alert to <x>
--fstlowthr <x>fstlocnfPOSSIBLE_FRAMESHIFT_LOW_CONF>= 0.0 set average posterior probability threshold for potentially frameshifted region for low confidence alert to <x>
--xalntol <n>indf5pst, indf3pstINDEFINITE_ANNOTATION_START, INDEFINITE_ANNOTATION_END> 5 set maximum allowed difference in nucleotides between predicted blastx and CM start/end without alert to <n> (blastx coordinates must be internal to CM coordinates)
--xmaxins <n>insertnpINSERTION_OF_NT> 27 set maximum allowed nucleotide insertion length in blastx validation alignment without alert to <n>
--xmaxdel <n>deletinpDELETION_OF_NT> 27 set maximum allowed nucleotide deletion length in blastx validation alignment without alert to <n>
--nmaxins <n>insertnnINSERTION_OF_NT> 27 set maximum allowed nucleotide insertion length in CDS nt alignment without alert to <n>
--nmaxdel <n>deletinnDELETION_OF_NT> 27 set maximum allowed nucleotide deletion length in CDS nt alignment without alert to <n>
--xlonescore <n>indfantpINDEFINITE_ANNOTATION>= 80 set minimum blastx raw score for a lone blastx hit not supported by CM analysis for alert to <n>
--hlonescore <n>indfantpINDEFINITE_ANNOTATION>= 10 set minimum hmmer bit score for a lone hmmsearch hit not supported by CM analysis for alert to <n>

v-annotate.pl options for controlling cmalign alignment stage

Several options exist for controlling the command-line options that will be passed to Infernal's cmalign program in the alignment stage. For more information on these options and how they control cmalign, see the Infernal User's Guide manual page for cmalign (section 8 of http://eddylab.org/infernal/Userguide.pdf)

.........option.........explanation
--mxsize <n>set maximum allowed cmalign DP matrix size to <n> Mb before triggering an unexpdivg alert, default <n> is 16000
--tau <x>set the initial tau (probability loss) value to <x> (sets the cmalign --tau option), default <x> is 0.001
--nofixedtaudo not fix the tau value, allow it to increase if necessary (removes the cmalign --fixedtau option), default is to fix tau with cmalign --fixedtau
--nosubuse alternative alignment strategy for truncated sequences (removes the cmalign --sub --notrunc options), default is use sub-CM alignment strategy with cmalign --sub --notrunc
--noglocalrun in local mode instead of glocal mode (removes the cmalign -g option), default is to use glocal mode with cmalign -g
--cmindiforce cmalign to align one sequence at a time, mainly useful for debugging
--noflankdo not use flank* options to improve alignments at 5' and 3' ends
--flanktoins <x>set CM transition probabilities to ROOT_IL and ROOT_IR that insert before first and after final reference position to <x>, default <x> is 0.1
--flankselfins <x>set CM self-transition probabilities in ROOT_IL and ROOT_IR that insert before first and after final reference position to <x>, default <x> is 0.8

v-annotate.pl options for controlling glsearch alignment stage as alternative to cmalign

The glsearch program from the FASTA package can be used as an alternative to the cmalign program. For more information on these options and how they control glsearch, see the FASTA documentation (https://fasta.bioch.virginia.edu/wrp_fasta/fasta_guide.pdf).

.........option.........explanation
--glsearchalign with glsearch instead of cmalign
--gls_match <n>set glsearch match score to <n> > 0 (-r option in glsearch), default is `5'
--gls_mismatch <n>set glsearch mismatch score to <n> < 0 (-r option in glsearch), default is -3
--gls_gapopen <n>set glsearch gap open score to <n> < 0 (-f option in glsearch), default is -17
--gls_gapextend <n>set glsearch gap extend score to <n> < 0 (-g option in glsearch), default is -4

v-annotate.pl options for controlling blastx protein validation stage

Below is a list of options for controlling the blastx protein validaation stage. Several of these control command-line options that will be passed to blastx. For more information on these options and how they control blastx, see the NCBI BLAST documentation (tables C1 and C4 of https://www.ncbi.nlm.nih.gov/books/NBK279684/).

.........option.........explanation
--xmatrix <s>use the substitution matrix <s> (sets the blastx -matrix <s> option), default is to use the default blastx matrix
--xdrop <n>set the xdrop options to <n> (sets the blastx -xdrop_ungap <n>, -xdrop_gap <n> and -xdrop_gap_final <n> with the same <n>), default is to use default blastx values
--xnumali <n>specify that the top <n> alignments are output by blastx (sets the blastx -num_alignments <n> option), default <n> is 20
--xnolongestdo not consider the longest blastx alignment of those returned (controlled by --xnumali <n>), default is to consider both the longest or the highest scoring alignment, and use the one that results in the fewest alerts
--xnocompdo not use composition-based statistics for blastx
--xwordsize <n>set the blastx word size value to <n> (<n> must be in the range [2..7])

v-annotate.pl options for using hmmer instead of blastx for protein validation

Optionally, HMMER's hmmsearch program can be used instead of blastx for the protein validation stage. CAUTION: This feature is relatively new and untested. Several of these control command-line options that will be passed to blastx. For more information on HMMER, see the HMMER user's guide (http://eddylab.org/software/hmmer/Userguide.pdf).

......option......explanation
--pv_hmmeruse hmmer instead of blastx for protein validation
--h_maxuse the --max option with hmmsearch
--h_minbit <x>set the minimum hmmsearch bit score threshold to <x>, the default <x> is -10.

The -s option turns on an acceleration heuristic based on a first-pass blastn alignment of each input sequence. With -s, blastn is used instead of cmscan for sequence classification, and the largest ungapped alignment region, called the 'seed', is extracted from the top hit blastn hit, and fixed for the alignment stage, such that only the sequence before and after the fixed seed is aligned with cmalign. This option was originally developed for SARS-CoV-2, for which it offers significant acceleration for many sequences which are highly similar to the SARS-CoV-2 RefSeq model. Seeds can also be derived by the very fast minimap2 program instead of blast using the --minimap2 option (see more related options here The minimap2-derived seeds tend to be longer than blastn-seeds, at least for monkeypox virus (mpxv) sequences for which --minimap2 often results in significant acceleration.

When -s option is used, an additional output file with suffix .sda is created, with format described here.

.........option.........explanation
-sturn on the seed acceleration heuristic: use the max length ungapped region from blastn to seed the alignment
--s_blastnws <n>for -s, set the blastn -word_size parameter to <n>, the default value for <n> is 7
--s_blastnrw <n>for -s, set the blastn -reward parameter to <n>, the default value for <n> is 1
--s_blastnpn <n>for -s, set the blastn -penalty parameter to <n>, the default value for <n> is -2
--s_blastngo <n>for -s, set the blastn -gapopen parameter to <n>, the default value for <n> is 2
--s_blastnge <n>for -s, set the blastn -gapextend parameter to <n>, the default value for <n> is 1
--s_blastndffor -s, do not use -gapopen/-gapextend options with blastn, use default values for gap penalties
--s_blastnsc <x>for -s, set the blastn minimum HSP score to consider to <x>, the default value for <x> is 50.0
--s_blastntkfor -s, set blastn option -task blastn
--s_blastnxd <n>for -s, set the blastn -xdrop_gap_final parameter to <n>, the default value for <n> is 110
--s_minsgmlen <n>for -s, set minimum length of ungapped region in HSP seed to <n>, the default value for <n> is 10
--s_allsgmfor -s, keep full HSP as seed, do not enforce a minimum segment length
--s_ungapsgmfor -s, only keep max length ungapped segment of HSP, this was default behavior for vadr v1.1 to v1.3
--s_startstopfor -s, allow seed to include gaps in start/stop codons
--s_overhang <n>for -s, set the length, in nt, of overlap between the 5' and 3' regions that are aligned with cmalign and the seed region to <n>, the default value for <n> is 100

v-annotate.pl options for deriving seeds from minimap2 as an alternative to blastn

.........option.........explanation
--minimap2use minimap2 insead of blastn to derive seeds, also requires -s and --glsearch
--mm2_asm5use the option -x asm5 with minimap2, instead of -x asm20 which is used by default
--mm2_asm10use the option -x asm10 with minimap2, instead of -x asm20 which is used by default
--mm2_k <n>use the option -k <n> with minimap2 to set the minimizer k-mer length to <n>, instead of using -x asm20, which is used by default and sets the k-mer length to 19
--mm2_w <n>use the option -w <n> with minimap2 to set the minimizer window size to <n>, instead of using -x asm20, which is used by default and sets the window size to 10

The -r option adds a pre-processing step to v-annotate.pl in which stretches of Ns are identified in each sequence and replaced with the expected nucleotides at the corresponding positions, when possible. This can sidestep problems with annotation that the Ns would normally cause. However, this option should be used with caution because it is based on the assumption that the missing regions match exactly to the expected nucleotide sequence that correspond to those missing regions.

Regions of Ns are identified using blastn and examining regions between hits for content of Ns. Ns in regions that satisfy the following three criteria are then replaced with the expected nucleotide at each corresponding position:

  • missing sequence region must be at least 5 nt (controllable with --r_minlen option)

  • length of missing sequence region must equal length of missing model region

  • missing sequence region must be >= 0.25 fraction Ns if it includes the 5' end or 3' end of the sequence, or >= 0.50 fraction Ns if it does not (controllable with --r_minfract5, --r_minfract3 and --r_minfracti options).

Additionally, as of v1.4, regions for which the length of the missing sequence region and missing model region are not identical are also potentially replaced if the following criteria are met:

  • length of missing model region is 10 nt or less longer than the length of missing sequence region (controllable with --r_diffmaxdel option) OR length of missing model region is 10 nt or less shorter than the length of missing sequence region (controllable with --r_diffmaxins option)

  • at least 1 of the nt in the missing sequence region is not an N (controllable with --r_diffminnonn option)

  • fraction of non-N nt in sequence region that match expected nt after "aligning" sequence region by flushing left or right with respect to differently length model region is at least 0.75 (controllable with --r_diffminfract option)

  • --r_diffno option is not used

When -r is used, an additional output file with suffix .rpn is created, with format described here.

...........option...........explanation
-rturn on the replace-N strategy: replace stretches of Ns with expected nucleotides, where possible
--r_minlen <n>for -r, set minimum length subsequence to possibly replace Ns in to <n>, the default value for <n> is 5
--r_minfract5 <f>for -r, set the minimum fraction of nucleotides in a subsequence at the 5' end to trigger N replacement to <x>, the default value for <x> is 0.25
--r_minfract3 <f>for -r, set the minimum fraction of nucleotides in a subsequence at the 3' end to trigger N replacement to <x>, the default value for <x> is 0.25
--r_minfracti <f>for -r, set the minimum fraction of nucleotides in an internal subsequence to trigger N replacement to <x>, the default value for <x> is 0.5
--r_diffnodo not try replacement of N rich regions if sequence and model regions are of different lengths, the default is to try if criteria defined by other --r_diff* options are met
--r_diffmaxdelmaxium allowed length difference b/t sequence and model regions (when model length > sequence length) to try replacement is <n> nt, the default value for <n> is 10
--r_diffmaxinsmaxium allowed length difference b/t sequence and model regions (when sequence length > model length) to try replacement is <n> nt, the default value for <n> is 10
--r_diffminnonnminimum number of non-N nts in replacement region when model and sequence region are different lengths to try replacement is <n>, the default value for <n> is 1
--r_diffminfractminimum allowed fraction of non-N nts that must match expected nt from reference model in replacement region when model and sequence region are different lengths is <f>, the default value for <f> is 0.75
--r_fetchrfor -r, fetch features to fasta files from sequences with Ns replaced, instead of original input sequences without Ns replaced
--r_cdsmprfor -r, identify CDS- and mat_peptide-specific alerts using subsequences fetched from sequences with Ns replaced, instead of original input sequences without Ns replaced
--r_pvorigfor -r, use original input sequences without Ns replaced in protein validation stage, instead of sequences with Ns replaced
--r_proffor -r, use slower profile methods, not blastn, to identify Ns to replaced
--r_listfor -r, only use models listed in file <s> for N replacement stage
--r_only <s>for -r, only use model named <s> for N replacement stage
--r_blastnws <n>for -r, set the blastn -word_size parameter to <n>, the default value for <n> is 7
--r_blastnrw <n>for -r, set the blastn -reward parameter to <n>, the default value for <n> is 1
--r_blastnpn <n>for -r, set the blastn -penalty parameter to <n>, the default value for <n> is -2
--r_blastngo <n>for -r, set the blastn -gapopen parameter to <n>, the default value for <n> is 2
--r_blastnge <n>for -r, set the blastn -gapextend parameter to <n>, the default value for <n> is 1
--r_blastndffor -r, do not use -gapopen/-gapextend options with blastn, use default values for gap penalties
--r_blastnsc <x>for -r, set the blastn minimum HSP score to consider to <x>, the default value for <x> is 50.0
--r_blastntkfor -r, set blastn option -task blastn
--r_blastnxd <n>for -r, set the blastn -xdrop_gap_final parameter to <n>, the default value for <n> is 110
--r_lowsimokfor -r, do not report lowsim{5,3,i}s alerts for low similarity regions within N-rich regions that were identified during N-replacement (even for N-rich regions that were not replaced)
--r_lowsimmffor -r, with --r_lowsimok lowsim{5,3,i}s must be within an N-rich region of at least <x> fraction Ns to not be reported, default value for <x> is 0.75
--r_lowsimxlfor -r, with --r_lowsimok lowsim{5,3,i}s must be within an N-rich region with length of at most <n> nt to not be reported, default value for <x> is 5000
--r_lowsimxdfor -r, with --r_lowsimok lowsim{5,3,i}s must be within an N-rich region that differs from expected length by at most <n> nt to not be reported, default value for <x> is 200

The --split option specifies that v-annotate.pl should split up the input file into chunks and processing each chunk separately and then combining results at the end after all chunks have been processed. This limits total memory usage for large input sequence files as explained more here.

........option........explanation
--splitsplit input fasta sequence file into chunks of <n> Kb where <n> is from --nkb <n> (300 Kb, by default) and run each chunk separately
--cpu <n>with --split or --glsearch, parallelize across <n> CPU threads/workers (requires --split oor --glsearch)
--sidx <n>start sequence indexing at <n> for output files, not intended to be set by user except when debugging
........option........explanation
-prun in parallel mode so that classification, and each per-model coverage determination and alignment step is split into multiple jobs and run in parallel on a cluster
-q <s>read cluster information file from file <s> instead of from the default file $VADRSCRIPTSDIR/vadr.qsubinfo
--errcheckconsider any output to STDERR from a parallel job as an indication the job has failed, this will cause v-annotate.pl to exit, default is to ignore output to STDERR
........option........explanation
--wait <n>set the total number of minutes to wait for all jobs to finish at each stage to <n>, if any job is not finished this many minutes after being submitted (as indicated by the existence of an expected output file) then v-annotate.pl will exit in error, default <n> is 500
--maxnjobs <n>set the maximum number of jobs at each stage to <n>, default <n> is 2500

v-annotate.pl options for skipping stages

......option......explanation
--pv_skipdo not perform protein validation stage for CDS
--align_skipskip the cmalign stage, use results from previous run, this is mostly useful for debugging purposes
--val_onlyvalidate CM and other input files and exit

v-annotate.pl options for optional output files

.......option.......explanation
--out_stkcreate additional per-model output stockholm alignments with .stk suffix
--out_afacreate additional per-model output aligned fasta alignments with .afa suffix
--out_rpstkwith -r, create additional per-model output stockholm alignments with sequences with Ns replaced with .rpstk suffix
--out_rpafacreate additional per-model output aligned fasta alignments with sequences with Ns replaced with .rpafa suffix
--out_fsstkoutput frameshift stockholm alignment files with .frameshift.stk suffix
--out_allfastaoutput fasta files of predicted features
--out_nofastaminimize total size of output; do not output fasta files of all passing and all failing sequences
--out_noftrfastawith --keep, do not output fasta file for each feature
--out_debugcreate additional output files with information on various data structures

Other v-annotate.pl expert options

.........option.........explanation
--execname <s>in banner and usage output, replace v-annotate.pl with <s>
--alicheckfor debugging purposes, check aligned sequence versus input sequence for identity
--noseqnamemaxdo not enforce the GenBank maximum length of 50 characters for sequence names
--minbit <x>set minimum cmsearch/cmscan bit score threshold to <x>, the default value for <x> is -10
--origfado not copy the input fasta file into output directory prior to analysis, use the original
--msub <s>specify that file <s> lists models to substitute, each line should contain two space-delimited tokens, model listed in token 2 will substitute as best-matching model for all sequences classified as the model listed in token 1
--xsub <s>specify that file <s> lists blastx dbs to substitute, each line should contain two space-delimited tokens, blastx db for model listed in token 2 will substitute as blastx db for all sequences classified as the model listed in token 1
--nodcrnever doctor alignments to shift gaps to correct start/stop codon annotation
--forcedcrinsforce insert type alignment doctoring, requires --cmindi, mainly useful for debugging/testing
--xnoidignore blastx hits that are full length and 100% identical, mainly useful for testing
--intlen <n>define intron as any gap >= <n> nucleotides between segments in a CDS, only relevant for identifying canonical splice sites, the default value for <n> is 40

Information on v-annotate.pl alerts

To see a table with information on alerts, use the --alt_list option, like this:

v-annotate.pl --alt_list 

The table below contains the same information as in the --alt_list output, with sequences organized according to whether they are fatal or not. Always fatal alert codes are always fatal and cannot be changed using the --alt_pass options. All other alert codes can be changed from fatal to non-fatal by using the --alt_pass option, or from non-fatal to fatal using the --alt_fail option. An example is included below.

In the table below, the type column reports if each alert pertains to an entire sequence or a specific annotated feature within a sequence. The causes misc_feature, not failure (if in modelinfo file) shows which alerts are not fatal for non-essential features as described more below. The exception key and exception value type indicate the key string and type for defining exceptions in the model info file as described more below. These columns will be - for any alert for which exception ranges are not allowed.

Description of always fatal alert codes

alert codetypecauses misc_feature, not failure (if in modelinfo file)short description/error name.........long_description.........exception key (in modelinfo file)exception value type
noannotnsequenceneverNO_ANNOTATION no significant similarity detected--
revcomplsequenceneverREVCOMPLEM sequence appears to be reverse complemented--
unexdivgsequenceneverUNEXPECTED_DIVERGENCE sequence is too divergent to confidently assign nucleotide-based annotation--
noftrannsequenceneverNO_FEATURES_ANNOTATED sequence similarity to homology model does not overlap with any features--
noftrantsequenceneverNO_FEATURES_ANNOTATED all annotated features are too short to be output to feature table--
ftskipflsequenceneverUNREPORTED_FEATURE_PROBLEM only fatal alerts are for feature(s) not output to feature table--

Description of alerts that are fatal by default

alert codetypecauses misc_feature, not failure (if in modelinfo file)short description/error name.........long_description.........exception key (in modelinfo file)exception value type
incsbgrpsequenceneverINCORRECT_SPECIFIED_SUBGROUP score difference too large between best overall model and best specified subgroup model--
incgroupsequenceneverINCORRECT_SPECIFIED_GROUP score difference too large between best overall model and best specified group model--
lowcovrgsequenceneverLOW_COVERAGE low sequence fraction with significant similarity to homology model--
dupreginsequenceneverDUPLICATE_REGIONS similarity to a model region occurs more than oncedupregin_exccoords-only
discontnsequenceneverDISCONTINUOUS_SIMILARITY not all hits are in the same order in the sequence and the homology model--
indfstrnsequenceneverINDEFINITE_STRAND significant similarity detected on both strandsindfstr_exccoords-only
lowsim5ssequenceneverLOW_SIMILARITY_START significant similarity not detected at 5' end of the sequencelowsim_exccoords-only
lowsim3ssequenceneverLOW_SIMILARITY_END significant similarity not detected at 3' end of the sequencelowsim_exccoords-only
lowsimissequenceneverLOW_SIMILARITY internal region without significant similaritylowsim_exccoords-only
nmiscftrsequenceneverTOO_MANY_MISC_FEATURES too many features reported as misc_features--
deletinssequenceneverDELETION_OF_FEATURE internal deletion of a complete feature--
mutstartfeatureyesMUTATION_AT_START expected start codon could not be identified--
mutendcdfeatureyesMUTATION_AT_END expected stop codon could not be identified, predicted CDS stop by homology is invalid--
mutendnsfeatureyesMUTATION_AT_END expected stop codon could not be identified, no in-frame stop codon exists 3' of predicted valid start codon--
mutendexfeatureyesMUTATION_AT_END expected stop codon could not be identified, first in-frame stop codon exists 3' of predicted stop position--
unexlengfeatureyesUNEXPECTED_LENGTH length of complete coding (CDS or mat_peptide) feature is not a multiple of 3--
cdsstopnfeatureyesCDS_HAS_STOP_CODON in-frame stop codon exists 5' of stop position predicted by homology to reference--
cdsstoppfeatureyesCDS_HAS_STOP_CODON stop codon in protein-based alignment--
fsthicftfeatureyesPOSSIBLE_FRAMESHIFT_HIGH_CONF high confidence possible frameshift in CDS (frame not restored before end) (not reported if --glsearchfst_exccoords-only
fsthicfifeatureyesPOSSIBLE_FRAMESHIFT_HIGH_CONF high confidence possible frameshift in CDS (frame restored before end) (not reported if --glsearch)fst_exccoords-only
fstukcf3featureyesPOSSIBLE_FRAMESHIFT possible frameshift in CDS (frame not restored before end) (only reported if --glsearch)fst_exccoords-only
fstukcfifeatureyesPOSSIBLE_FRAMESHIFT possible frameshift in CDS (frame restored before end) (only reported if --glsearch)fst_exccoords-only
mutspst5featureyesMUTATION_AT_SPLICE_SITE expected splice site at 5' end of intron (GT) could not be identified (only reported for CDS with canon_splice_sites set to 1 in .minfo file)--
mutspst3featureyesMUTATION_AT_SPLICE_SITE expected splice site at 3' end of intron (AG) could not be identified (only reported for CDS with canon_splice_sites set to 1 in .minfo file)--
peptransfeatureyesPEPTIDE_TRANSLATION_PROBLEM mat_peptide may not be translated because its parent CDS has a problem--
pepadjcyfeatureyesPEPTIDE_ADJACENCY_PROBLEM predictions of two mat_peptides expected to be adjacent are not adjacent--
indfantpfeaturenoINDEFINITE_ANNOTATION protein-based search identifies CDS not identified in nucleotide-based search--
indfantnfeaturenoINDEFINITE_ANNOTATION nucleotide-based search identifies CDS not identified in protein-based search--
indf5gapfeatureyesINDEFINITE_ANNOTATION_START alignment to homology model is a gap at 5' boundary--
indf5lcnfeatureyesINDEFINITE_ANNOTATION_START alignment to homology model has low confidence at 5' boundary for feature that does not match a CDS--
indf5plgfeatureyesINDEFINITE_ANNOTATION_START protein-based alignment extends past nucleotide-based alignment at 5' end--
indf5pstfeatureyesINDEFINITE_ANNOTATION_START protein-based alignment does not extend close enough to nucleotide-based alignment 5' endpoint--
indf3gapfeatureyesINDEFINITE_ANNOTATION_END alignment to homology model is a gap at 3' boundary--
indf3lcnfeatureyesINDEFINITE_ANNOTATION_END alignment to homology model has low confidence at 3' boundary for feature that does not match a CDS--
indf3plgfeatureyesINDEFINITE_ANNOTATION_END protein-based alignment extends past nucleotide-based alignment at 3' end--
indf3pstfeatureyesINDEFINITE_ANNOTATION_END protein-based alignment does not extend close enough to nucleotide-based alignment 3' endpoint--
indfstrpfeaturenoINDEFINITE_STRAND strand mismatch between protein-based and nucleotide-based predictionsindfstr_exccoords-only
insertnpfeaturenoINSERTION_OF_NT too large of an insertion in protein-based alignmentinsertn_exccoords-value
deletinpfeatureyesDELETION_OF_NT too large of a deletion in protein-based alignmentdeletin_exccoords-value
deletinffeaturenoDELETION_OF_FEATURE_SECTION internal deletion of a complete section in a multi-section feature with other section(s) annotated--
lowsim5nfeatureyesLOW_FEATURE_SIMILARITY_START region overlapping annotated feature that does not match a CDS at 5' end of sequence lacks significant similaritylowsim_exccoords-only
lowsim5lfeaturenoLOW_FEATURE_SIMILARITY_START long region overlapping annotated feature that does not match a CDS at 5' end of sequence lacks significant similaritylowsim_exccoords-only
lowsim3nfeatureyesLOW_FEATURE_SIMILARITY_END region overlapping annotated feature that does not match a CDS at 3' end of sequence lacks significant similaritylowsim_exccoords-only
lowsim3lfeaturenoLOW_FEATURE_SIMILARITY_END long region overlapping annotated feature that does not match a CDS at 3' end of sequence lacks significant similaritylowsim_exccoords-only
lowsiminfeatureyesLOW_FEATURE_SIMILARITY region overlapping annotated feature that does not match a CDS lacks significant similaritylowsim_exccoords-only
lowsimilfeaturenoLOW_FEATURE_SIMILARITY long region overlapping annotated feature that does not match a CDS lacks significant similaritylowsim_exccoords-only

Description of alerts that are non-fatal by default

alert codetypecauses misc_feature, not failure (if in modelinfo file)short description/error name.........long_description.........exception key (in modelinfo file)exception value type
qstsbgrpsequenceneverQUESTIONABLE_SPECIFIED_SUBGROUP best overall model is not from specified subgroup--
qstgroupsequenceneverQUESTIONABLE_SPECIFIED_GROUP best overall model is not from specified group--
ambgnt5ssequenceneverAMBIGUITY_AT_START first nucleotide of the sequence is an ambiguous nucleotide--
ambgnt3ssequenceneverAMBIGUITY_AT_END final nucleotide of the sequence is an ambiguous nucleotide--
indfclassequenceneverINDEFINITE_CLASSIFICATION low score difference between best overall model and second best model (not in best model's subgroup)--
lowscoresequenceneverLOW_SCORE score to homology model below low threshold--
biasdseqsequenceneverBIASED_SEQUENCE high fraction of score attributed to biased sequence composition--
extrant5sequenceneverEXTRA_SEQUENCE_START extra sequence detected 5' of expected sequence start--
extrant3sequenceneverEXTRA_SEQUENCE_END extra sequence detected 3' of expected sequence end--
unjoinblsequenceneverUNJOINABLE_SUBSEQ_ALIGNMENTS inconsistent alignment of overlapping region between ungapped seed and flanking region--
deletinasequenceneverDELETION_OF_FEATURE allowed internal deletion of a complete feature (feature with is_deletable flag set to 1 in .minfo file)--
ambgntrpsequenceneverN_RICH_REGION_NOT_REPLACED N-rich region of unexpected length not replaced during N replacement region (only possibly reported if -r)--
fstlocftfeatureyesPOSSIBLE_FRAMESHIFT_LOW_CONF low confidence possible frameshift in CDS (frame not restored before end) (not reported if --glsearch)fst_exccoords-only
fstlocfifeatureyesPOSSIBLE_FRAMESHIFT_LOW_CONF low confidence possible frameshift in CDS (frame restored before end) (not reported if --glsearch)fst_exccoords-only
indf5lccfeatureyesINDEFINITE_ANNOTATION_START alignment to homology model has low confidence at 5' boundary for feature that is or matches a CDS--
indf3lccfeatureyesINDEFINITE_ANNOTATION_END alignment to homology model has low confidence at 3' boundary for feature that is or matches a CDS--
insertnnfeaturenoINSERTION_OF_NT too large of an insertion in nucleotide-based alignment of CDS featureinsertn_exccoords-value
deletinnfeatureyesDELETION_OF_NT too large of a deletion in nucleotide-based alignment of CDS featuredeletin_exccoords-value
lowsim5cfeaturenoLOW_FEATURE_SIMILARITY_START region overlapping annotated feature that is or matches a CDS at 5' end of sequence lacks significant similaritylowsim_exccoords-only
lowsim3cfeaturenoLOW_FEATURE_SIMILARITY_END region overlapping annotated feature that is or matches a CDS at 3' end of sequence lacks significant similaritylowsim_exccoords-only
lowsimicfeaturenoLOW_FEATURE_SIMILARITY region overlapping annotated feature that is or matches a CDS lacks significant similaritylowsim_exccoords-only
ambgnt5ffeaturenoAMBIGUITY_AT_FEATURE_START first nucleotide of non-CDS feature is an ambiguous nucleotide--
ambgnt3ffeaturenoAMBIGUITY_AT_FEATURE_END final nucleotide of non-CDS feature is an ambiguous nucleotide--
ambgnt5cfeaturenoAMBIGUITY_AT_CDS_START first nucleotide of CDS is an ambiguous nucleotide--
ambgnt3cfeaturenoAMBIGUITY_AT_CDS_END final nucleotide of CDS is an ambiguous nucleotide--
ambgcd5cfeaturenoAMBIGUITY_IN_START_CODON 5' complete CDS starts with canonical nt but includes ambiguous nt in its start codon--
ambgcd3cfeaturenoAMBIGUITY_IN_STOP_CODON 3' complete CDS ends with canonical nt but includes ambiguous nt in its stop codon--

Additional information on v-annotate.pl alerts

The table below has additional information on the alerts not contained in the --alt_list output. The relevant_options column lists command-line options that pertain to each alert. The relevant feature types column shows which feature types each alert can be reported for (this field is "-" for alerts that pertain to a sequence instead of a feature). The omitted in .tbl and .alt.list by column lists other alerts that, if present, will cause this alert to be omitted in the .tbl and .alt.list files to reduce redundant information reported to user, this is "-" for alerts that are never omitted from those files.

More information on always fatal alert codes

alert codeshort description/error namerelevant optionsrelevant feature typesomitted in .tbl and .alt.list by
noannotnNO_ANNOTATIONnone--
revcomplREVCOMPLEMnone--
unexdivgUNEXPECTED_DIVERGENCEnone--
noftrannNO_FEATURES_ANNOTATEDnone--
noftrantNO_FEATURES_ANNOTATEDnone--
ftskipflUNREPORTED_FEATURE_PROBLEMnone--

More information on alerts that are fatal by default

alert codeshort description/error namerelevant_optionsrelevant feature typesomitted in .tbl and .alt.list by
incsbgrpINCORRECT_SPECIFIED_SUBGROUP--incspec--
incgroupINCORRECT_SPECIFIED_GROUP--incspec--
lowcovrgLOW_COVERAGE--lowcov--
dupreginDUPLICATE_REGIONS--dupreg--
discontnDISCONTINUOUS_SIMILARITYnone--
indfstrnINDEFINITE_STRAND--indefstr--
lowsim5sLOW_SIMILARITY_START--lowsim5seq--
lowsim3sLOW_SIMILARITY_END--lowsim3seq--
lowsimisLOW_SIMILARITY--lowsimint--
nmiscftrTOO_MANY_MISC_FEATURES--nmiscftrthrall-
deletinsDELETION_OF_FEATUREnoneall-
mutstartMUTATION_AT_START--atgonlyCDS-
mutendcdMUTATION_AT_ENDnoneCDScdsstopn, mutendex, mutendns
mutendnsMUTATION_AT_ENDnoneCDS-
mutendexMUTATION_AT_ENDnoneCDS-
unexlengUNEXPECTED_LENGTHnoneCDS, mat_peptide-
cdsstopnCDS_HAS_STOP_CODONnoneCDS-
cdsstoppCDS_HAS_STOP_CODONnoneCDS-
fsthicftPOSSIBLE_FRAMESHIFT_HIGH_CONF--fsthighthr, --fstminnttCDS-
fsthicfiPOSSIBLE_FRAMESHIFT_HIGH_CONF--fsthighthr, --fstminntiCDS-
fstukcftPOSSIBLE_FRAMESHIFT--glsearch, --fstminnttCDS-
fstukcfiPOSSIBLE_FRAMESHIFT--glsearch, --fstminntiCDS-
mutspst5MUTATION_AT_SPLICE_SITE--ignore_canonss, --force-canonss, --intlenCDS-
mutspst3MUTATION_AT_SPLICE_SITE--ignore_canonss, --force-canonss, --intlenCDS-
peptransPEPTIDE_TRANSLATION_PROBLEMnonemat_peptide-
pepadjcyPEPTIDE_ADJACENCY_PROBLEMnonemat_peptide-
indfantpINDEFINITE_ANNOTATION--xlonescoreCDS-
indfantnINDEFINITE_ANNOTATIONnoneCDS-
indf5gapINDEFINITE_ANNOTATION_STARTnoneall-
indf5lcnINDEFINITE_ANNOTATION_START--indefann, --indefann_mpall except CDS and any gene or mat_peptide with identical start coordinate to a CDS-
indf5plgINDEFINITE_ANNOTATION_STARTnoneCDS-
indf5pstINDEFINITE_ANNOTATION_START--xalntolCDS-
indf3gapINDEFINITE_ANNOTATION_ENDnoneall-
indf3lcnINDEFINITE_ANNOTATION_END--indefann, --indefann_mpall except CDS and any gene with identical stop coordinate to CDS-
indf3plgINDEFINITE_ANNOTATION_ENDnoneCDS-
indf3pstINDEFINITE_ANNOTATION_END--xalntolCDS-
indfstrpINDEFINITE_STRANDnoneCDS-
insertnpINSERTION_OF_NT--xmaxinsCDS-
insertnnINSERTION_OF_NT--nmaxinsCDS-
deletinpDELETION_OF_NT--xmaxdelCDS-
deletinfDELETION_OF_FEATURE_SECTIONnoneall-
lowsim5nLOW_FEATURE_SIMILARITY_START--lowsim5ftrall except CDS, mat_peptide and any feature with identical coordinates to a CDS or mat_peptide-
lowsim3nLOW_FEATURE_SIMILARITY_END--lowsim3ftrall except CDS, mat_peptide and any feature with identical coordinates to a CDS or mat_peptide-
lowsiminLOW_FEATURE_SIMILARITY--lowsimiftrall except CDS, mat_peptide and any feature with identical coordinates to a CDS or mat_peptide-
lowsim5lLOW_FEATURE_SIMILARITY_START--lowsim5lftrall except CDS, mat_peptide and any feature with identical coordinates to a CDS or mat_peptide-
lowsim3lLOW_FEATURE_SIMILARITY_END--lowsim3lftrall except CDS, mat_peptide and any feature with identical coordinates to a CDS or mat_peptide-
lowsimilLOW_FEATURE_SIMILARITY--lowsimilftrall except CDS, mat_peptide and any feature with identical coordinates to a CDS or mat_peptide-

More information on alerts that are non-fatal by default

alert codeshort description/error namerelevant_optionsrelevant feature typesomitted in .tbl and .alt.list by
qstsbgrpQUESTIONABLE_SPECIFIED_SUBGROUPnone--
qstgroupQUESTIONABLE_SPECIFIED_GROUPnone--
ambgnt5sAMBIGUITY_AT_STARTnone--
ambgnt3sAMBIGUITY_AT_ENDnone--
indfclasINDEFINITE_CLASSIFICATION--indefclas--
lowscoreLOW_SCORE--lowsc--
biasdseqBIASED_SEQUENCE--biasfrac--
extrant5EXTRA_SEQUENCE_START--extrant5--
extrant3EXTRA_SEQUENCE_END--extrant3--
unjoinblUNJOINABLE_SUBSEQ_ALIGNMENTSnone-
deletinaDELETION_OF_FEATURE--ignore_isdelall-
ambgntrpN_RICH_REGION_NOT_REPLACED--r_diffno, --r_diffmaxdel, --r_diffmaxins, --r_diffminnonn, --r_diffminfract--
fstlocftPOSSIBLE_FRAMESHIFT_LOW_CONF--fstlothr, --fstminnttCDS-
fstlocfiPOSSIBLE_FRAMESHIFT_LOW_CONF--fstlothr, --fstminntiCDS-
indf5lccINDEFINITE_ANNOTATION_START--indefann, --indefann_mpCDS and any gene or mat_peptide with identical start coordinate to a CDS-
indf3lccINDEFINITE_ANNOTATION_END--indefann, --indefann_mpCDS and any gene with identical stop coordinate to CDS-
insertnnINSERTION_OF_NT--nmaxinsCDS-
deletinnDELETION_OF_NT--nmaxdelCDS-
lowsim5cLOW_FEATURE_SIMILARITY_START--lowsim5ftrCDS, mat_peptide and any feature with identical coordinates to a CDS or mat_peptide-
lowsim3cLOW_FEATURE_SIMILARITY_END--lowsim3ftrCDS, mat_peptide and any feature with identical coordinates to a CDS or mat_peptide-
lowsimicLOW_FEATURE_SIMILARITY--lowsimiftrCDS, mat_peptide and any feature with identical coordinates to a CDS or mat_peptide-
ambgnt5fAMBIGUITY_AT_FEATURE_STARTnone--
ambgnt3fAMBIGUITY_AT_FEATURE_ENDnone--
ambgnt5cAMBIGUITY_AT_CDS_STARTnoneCDS-
ambgnt3cAMBIGUITY_AT_CDS_ENDnoneCDS-
ambgcd5cAMBIGUITY_IN_START_CODONnoneCDS-
ambgcd3cAMBIGUITY_IN_STOP_CODONnoneCDS-

Non-essential features: allowing sequences to pass despite fatal alerts for specific features

It is possible to specify that certain features are non-essential and so have relaxed requirements. Some alerts that are normally fatal are not fatal for non-essential features. If any such alerts are reported for an non-essential feature that feature will be turned into a misc_feature in the output feature table .pass.tbl file, but the sequence will still pass, as long as it has zero fatal alerts for all other (essential) features and zero fatal sequence alerts.

The default set of specific alerts that a non-essential feature can have without failing its sequence are listed with 'yes' in the 'causes misc_feature, not failure (if in modelinfo file)' column in the tables describing alerts above as well as in the --alt_list output. This set can be changed using the --alt_mnf_yes <s1> option to specify that alert codes in the comma-separated string <s1> be added to the set, and the --alt_mnf_no <s2> option to specify that alert codes in the comma-separated string <s2> be removed from the set.

Non-essential features are specified in the .modelinfo file, with a key/value pair string: misc_not_feature:"1" in the FEATURE line for the corresponding feature.

For example, the sequence JN975492.1 is the one sequence in the example above that fails. It matches best to the NC_008311 model. It fails due to the fatal alerts mutendcd, cdsstopn, cdsstopp, and indf3pst for the VF1 CDS feature, and indf5pst fatal alert for the VP2 CDS as shown above. If the VF1 and VP2 features were defined as non-essential using the misc_not_failure:"1" key/value pair in the .minfo file as they are in the included example file vadr.mnf-example.minfo, then the sequence would have passed.

The relevant excerpt from the $VADRSCRIPTSDIR/documentation/annotate-files/vadr.mnf-example.minfo file:

FEATURE NC_008311 type:"gene" coords:"5069..5710:+" parent_idx_str:"GBNULL" gene:"ORF4" misc_not_failure:"1"
FEATURE NC_008311 type:"CDS" coords:"5069..5710:+" parent_idx_str:"GBNULL" gene:"ORF4" product:"VF1" misc_not_failure:"1"
FEATURE NC_008311 type:"gene" coords:"6681..7307:+" parent_idx_str:"GBNULL" gene:"ORF3" misc_not_failure:"1"
FEATURE NC_008311 type:"CDS" coords:"6681..7307:+" parent_idx_str:"GBNULL" gene:"ORF3" product:"VP2" misc_not_failure:"1"

Note that in addition to the two CDS features, the two gene features that correspond to them also have misc_not_failure:"1" key/value pairs. When a CDS is made non-essential, it often makes sense to make any corresponding gene features non-essential too. However, gene features are an exception in that they do not get turned into a misc_feature if they have alerts that are normally fatal, as per GenBank convention, but it is still relevant to mark them as non-essential because some alerts in them will not cause the sequence to fail.

To rerun the example using this new .minfo file, execute:

v-annotate.pl -i $VADRSCRIPTSDIR/documentation/annotate-files/vadr.mnf-example.minfo $VADRSCRIPTSDIR/documentation/annotate-files/noro.9.fa va-mnf-noro.9

The output will indicate that all sequences now pass:

# Summary of classified sequences:
#
#                                      num   num   num
#idx  model      group      subgroup  seqs  pass  fail
#---  ---------  ---------  --------  ----  ----  ----
1     NC_008311  Norovirus  GV           2     2     0
2     NC_029645  Norovirus  GIII         2     2     0
3     NC_039477  Norovirus  GII          2     2     0
4     NC_044854  Norovirus  GI           2     2     0
5     NC_001959  Norovirus  GI           1     1     0
#---  ---------  ---------  --------  ----  ----  ----
-     *all*      -          -            9     9     0
-     *none*     -          -            0     0     0
#---  ---------  ---------  --------  ----  ----  ----

But the output .pass.tbl will not include the VF1 and VP2 CDS features for JN975492.1, instead it will include misc_feature features with note qualifiers that indicate the regions are similar to their respective CDS:

Relevant excerpt from va-mnf-noro.9.vadr.pass.tbl:

5044	5685	gene
			gene	ORF4
5044	5685	misc_feature
			note	similar to VF1
6656	7282	gene
			gene	ORF3
6656	7282	misc_feature
			note	similar to VP2

Note that if only the VF1 CDS or VP2 CDS feature lines included the misc_not_failure:"1" key/value pairs in the modelinfo file, the sequence would have failed.

Two important caveats above non-essential features and misc_feature-ization:

  1. As mentioned above, features with type gene, 5'UTR, 3'UTR or operon are never converted to misc_feature values as per GenBank convention.

  2. misc_feature-ization occurs in .pass.tbl output files for non-essential features as explained above even when the option --nomisc is used. (The --nomisc option causes misc_features not to be reported in .fail.tbl files.)


Alert exceptions: ignoring alerts in specific model position ranges

For some alerts, it is possible to specify model position ranges as exceptions - alert instances that occur within these regions will not be reported. This can be useful for alerts that are permissible, or even expected, in a given sequence region. For example, a known repeat region of a sequence may consistently trigger a dupregin (DUPLICATE_REGIONS) alert that we do not want to cause a sequence to fail. However, we may still want other dupregin alerts outside of the repeat region to be reported. To ignore (and not report) any dupregin alerts completely within the model position range 37 to 100 on the top (+) strand, add the string dupregin_exc:"37..100:+" to the relevant MODEL line of the model info file. In this case, dupregin_exc is the exception key and "37..100:+" is the exception value. You'll want the strand of the exception to match the strand of the alert, and for negative strand, the start position is greater than the stop position. To exclude positions 100 to 37 on the negative strand the exception value would be "100..37:-". If you are able to have v-annotate.pl output an alert that you want to make an exception for using a test sequence, you can check the mdl coords field of the .alt output file to determine the relevant model coordinates to use for the exception value.

The table below lists all alert codes for which exceptions are allowed along with their specific exception keys and value types, and whether they pertain to a specific feature and should be added to the corresponding feature line (lines starting with FEATURE) in the model info file or are not specific to a feature and should be added to the model line (line starting with MODEL):

alert codeshort descriptionexception keyexception value typemodel info file line type
dupreginDUPLICATE_REGIONSdupregin_exccoords-onlymodel
indfstrnINDEFINITE_STRANDindfstr_exccoords-onlymodel
indfstrpINDEFINITE_STRANDindfstr_exccoords-onlymodel
insertnpINSERTION_OF_NTinsertn_exccoords-valuefeature (CDS)
insertnnINSERTION_OF_NTinsertn_exccoords-valuefeature (CDS)
deletinnDELETION_OF_NTdeletin_exccoords-valuefeature (CDS)
deletinpDELETION_OF_NTdeletin_exccoords-valuefeature (CDS)
lowsim5sLOW_SIMILARITY_STARTlowsim_exccoords-onlymodel
lowsim3sLOW_SIMILARITY_ENDlowsim_exccoords-onlymodel
lowsimisLOW_SIMILARITYlowsim_exccoords-onlymodel
lowsim5nLOW_FEATURE_SIMILARITY_STARTlowsim_exccoords-onlymodel
lowsim5lLOW_FEATURE_SIMILARITY_STARTlowsim_exccoords-onlymodel
lowsim3nLOW_FEATURE_SIMILARITY_ENDlowsim_exccoords-onlymodel
lowsim3lLOW_FEATURE_SIMILARITY_ENDlowsim_exccoords-onlymodel
lowsiminLOW_FEATURE_SIMILARITYlowsim_exccoords-onlymodel
lowsimilLOW_FEATURE_SIMILARITYlowsim_exccoords-onlymodel
lowsim5cLOW_FEATURE_SIMILARITY_STARTlowsim_exccoords-onlymodel
lowsim3cLOW_FEATURE_SIMILARITY_ENDlowsim_exccoords-onlymodel
lowsimicLOW_FEATURE_SIMILARITYlowsim_exccoords-onlymodel
fsthicftPOSSIBLE_FRAMESHIFT_HIGH_CONFfst_exccoords-onlyfeature (CDS)
fsthicfiPOSSIBLE_FRAMESHIFT_HIGH_CONFfst_exccoords-onlyfeature (CDS)
fstukcftPOSSIBLE_FRAMESHIFTfst_exccoords-onlyfeature (CDS)
fstukcfiPOSSIBLE_FRAMESHIFTfst_exccoords-onlyfeature (CDS)
fstlocftPOSSIBLE_FRAMESHIFT_LOW_CONFfst_exccoords-onlyfeature (CDS)
fstlocfiPOSSIBLE_FRAMESHIFT_LOW_CONFfst_exccoords-onlyfeature (CDS)
extrant5EXTRA_SEQUENCE_STARTextrant5_exccoords-value*model
extrant3EXTRA_SEQUENCE_ENDextrant3_exccoords-value*model

If you specify a given exception key and value in the model info file, it will mean that all alerts with that specific key will have exceptions in that region. For example, a indfstr_exc exception will prevent reporting of both indfstrn and indfstrp in the region specified.

There are two types of alert 'exception value types', differentiated by the required format of the value string in the model info file:

  1. 'coords-only' exception keys have value strings that are VADR coordinate (coords) strings. Our previous example of dupregin exception is an example of a 'coords-only' exception. If you want to allow exceptions for multiple regions, they be separated by commas, for example to additionally exclude positions 181 to 333 the string to add would be dupregin_exc:"37..100:+,181..333:+".

  2. 'coords-value' exception keys have value strings that are VADR coordinate (coords) strings appended with :<d>, where <d> is a number relevant to the alert. For example, to increase the maximum allowed insertion length without a insertnn or insertnp alert after model (nucleotide) position 367 or 368 for a CDS feature encoded on the top (+) strand from the default value of 27 to 36, add the following string to the FEATURE line for that CDS feature in the model info file: insertn_exc:"367..368:+:36". As with coords-only keys, to add multiple position ranges and values, separate with commas.

The alert codes which allow exception ranges can also be viewed by running v-annotate.pl with the --alt_list option.

For extrant5_exc the coords must be 1..1:+. For extrant3_exc, the coords must be <mdllen>..<mdllen>:+ where <mdllen> is the length of the reference model.

Prior to VADR version 1.6, some alert exceptions in model info files were permitted in different formats. As of version 1.6, the formats above are enforced, but the formats present in publicly available model files created prior to v1.6 are also compatible with v1.6+.


Other ways to modify default behavior for features by manually changing the .minfo file

In addition to the misc_not_failure key and alert exceptions listed above, there are additional key/value pairs which can be added to to FEATURE lines in the modelinfo (.minfo) that modify how a feature is treated. These are listed in this table.


Limiting memory usage and parallelization with multi-threading

The v-annotate.pl script, in particular the alignment step, is memory intensive. For Norovirus and Dengue virus, it is recommended to have 16G of RAM available. For larger viruses, such as the roughly 30Kb SARS-CoV-2 virus, 64G of available RAM is recommended. However, the --glsearch and --split options can be used to reduce the memory requirements.

The --glsearch option causes the glsearch program from the FASTA software package to be used instead of Infernal's memory intensive cmalign program. However, --glsearch has only been extensively tested for SARS-CoV-2 sequences, for which it is now recommended due to the high 64G memory recommendation with cmalign.

With --glsearch the amount of required memory is roughly 2G of RAM for small input fasta files with 2000 sequences or less, but can exceed 2G for very large input files. Required memory will increase with the size of the input file.

Using the --split option removes the dependence of required memory on input file size as it causes splitting of the input fasta file into independent chunks with each chunk processed separately and results from all chunks combined at the end.

Also, in combination with the --glsearch and --split options, the user can specify multi-threading with <n> CPUs by using the --cpu <n> option. It is recommended that at least 2G * <n> total RAM is available when using this option.

In summary, the following combination of options are recommended to reduce memory usage and speed-up processing for SARS-CoV-2 annotation, provided you are running on a machine with 8 available cores and 16G of total RAM: --glsearch --split --cpu 8.

For more information on SARS-CoV-2 annotation with VADR see https://github.com/ncbi/vadr/wiki/Coronavirus-annotation


Alternative parallelization using a cluster

Alternatively, if you have access to a cluster and want to parallelize but do not want to use glsearch, you can use the -p option. Importantly, the -p option will not work with --glsearch and will not reduce the memory requirements like --glsearch does.

Using -p will parallelize the most time-consuming stages of v-annotate.pl (classification, coverage determination and alignment) on a cluster by splitting up the input sequence file randomly into multiple files, and running each as a separate job. This is most beneficial for large input sequence files.

With -p, by default, v-annotate.pl will consult the file $VADRSCRIPTSDIR/vadr.qsubinfo to read the command prefix and suffix for submitting jobs to the cluster. This file is set up to use Univa Grid Engine (UGE 8.5.5) and specific flags used on the NCBI system, but you can either modify this file to work with your own cluster or create a new file <s> and use the option -q <s> to read that file. The $VADRSCRIPTSDIR/vadr.qsubinfo has comments at the top that explain the format of the file. Email eric.nawrocki@nih.gov for help.

To repeat the above v-annotate.pl run in the example usage section, use this command:

v-annotate.pl -p $VADRSCRIPTSDIR/documentation/annotate-files/noro.9.fa va-parallel-noro.9

The output will look very similar to the run without -p, but with additional lines of output explaining that jobs have been submitted and are running on the compute farm:

# Submitting 1 cmscan classification job(s) to the farm                               ... 
# Waiting a maximum of 500 minutes for all farm jobs to finish                        ... 
#	   0 of    1 jobs finished (0.2 minutes spent waiting)
#	   0 of    1 jobs finished (0.5 minutes spent waiting)
#	   0 of    1 jobs finished (0.8 minutes spent waiting)
#	   0 of    1 jobs finished (1.0 minutes spent waiting)
#	   1 of    1 jobs finished (1.2 minutes spent waiting)
# done. [   75.7 seconds]
# Submitting 1 cmsearch coverage determination job(s) (NC_001959: 1 seqs) to the farm ... 
# Waiting a maximum of 500 minutes for all farm jobs to finish                        ... 
#	   1 of    1 jobs finished (0.2 minutes spent waiting)
# done. [   15.2 seconds]

Usage of -p will not affect the output of v-annotate.pl other than these lines about the status of jobs, but it can make processing of large sequence files significantly faster depending on how busy the cluster is.


Questions, comments or feature requests? Send a mail to eric.nawrocki@nih.gov.